├── images
│   ├── f1.png
│   ├── iou.png
│   ├── kkt.png
│   ├── ss2.png
│   ├── L2_reg.png
│   ├── convex.png
│   ├── l1_reg.png
│   ├── newton.png
│   ├── normal.png
│   ├── polyak.png
│   ├── saddle.png
│   ├── shared.png
│   ├── Link to uniform_distribution.png
│   ├── adagrad.png
│   ├── bagging.png
│   ├── easy_int.gif
│   ├── ensemble.png
│   ├── gaussian.jpg
│   ├── momentum.png
│   ├── nesterov.png
│   ├── plateau.png
│   ├── residual.png
│   ├── rmsprop.png
│   ├── softmax.png
│   ├── taylor1.png
│   ├── taylor2.png
│   ├── u-curve.png
│   ├── L2_scaling.png
│   ├── Lagrangian.png
│   ├── adam_update.png
│   ├── advance_int.jpg
│   ├── adversarial.png
│   ├── batch_norm.png
│   ├── data_meme.jpg
│   ├── decrease_lr.png
│   ├── fbeta-score.jpg
│   ├── inception.png
│   ├── lagrangian.png
│   ├── linear_reg.png
│   ├── lr_high_low.png
│   ├── lr_update.png
│   ├── math_meme.jpg
│   ├── nesterov.jpeg
│   ├── nonconvex.png
│   ├── polyak_exp.png
│   ├── pr_equation.png
│   ├── properties.png
│   ├── renormalize.png
│   ├── train_dev.png
│   ├── conjugate_gd.png
│   ├── deep_shallow.png
│   ├── directed-pgm.png
│   ├── greedy_final.png
│   ├── hessian_as_gd.png
│   ├── iou_examples.png
│   ├── lasso_result.png
│   ├── local_minima.gif
│   ├── model_output.png
│   ├── newton_point.png
│   ├── rep_sparsity.png
│   ├── saddle_point.png
│   ├── ss2-original.png
│   ├── weight_update.png
│   ├── adagrad_problem.png
│   ├── condition_number.png
│   ├── critical_points.png
│   ├── early_stopping.png
│   ├── google_mistake.jpg
│   ├── gradient_descent.png
│   ├── loss_comparison.png
│   ├── momentum_update.png
│   ├── polyak_intuition.png
│   ├── pool_invariance.png
│   ├── precision_recall.png
│   ├── regularization.png
│   ├── sgd_convergence.png
│   ├── undirected-pgm.png
│   ├── workflow_final.png
│   ├── workflow_final2.png
│   ├── conjugate_gradient.png
│   ├── conv_equivariance.png
│   ├── coordinate_descent.png
│   ├── eigendecomposition.png
│   ├── gradient_clipping.png
│   ├── grid_random_search.png
│   ├── imagenet_progress.png
│   ├── momentum_ball_roll.gif
│   ├── momentum_ball_roll.png
│   ├── ng_error_analysis.png
│   ├── partial_derivative.png
│   ├── random_search_dist.png
│   ├── transfer learning.png
│   ├── batch-normalization.png
│   ├── coordinate_descent2.png
│   ├── hyperparam_capacity.png
│   ├── model_identifiability.png
│   ├── optimizers_comparison.gif
│   ├── optimizers_comparison.png
│   ├── uniform_distribution.png
│   └── exploding_vanishing_gradient.jpg
├── 13 - Linear Factor Models.ipynb
├── README.md
├── Appendix.ipynb
├── 04 - Numerical Optimization.ipynb
├── 11 - Practical Methodology.ipynb
├── 07 - Regularization for Deep Learning.ipynb
└── 02 - Linear Algebra.ipynb
/images/f1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/f1.png
--------------------------------------------------------------------------------
/images/iou.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/iou.png
--------------------------------------------------------------------------------
/images/kkt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/kkt.png
--------------------------------------------------------------------------------
/images/ss2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/ss2.png
--------------------------------------------------------------------------------
/images/L2_reg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/L2_reg.png
--------------------------------------------------------------------------------
/images/convex.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/convex.png
--------------------------------------------------------------------------------
/images/l1_reg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/l1_reg.png
--------------------------------------------------------------------------------
/images/newton.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/newton.png
--------------------------------------------------------------------------------
/images/normal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/normal.png
--------------------------------------------------------------------------------
/images/polyak.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/polyak.png
--------------------------------------------------------------------------------
/images/saddle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/saddle.png
--------------------------------------------------------------------------------
/images/shared.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/shared.png
--------------------------------------------------------------------------------
/images/Link to uniform_distribution.png:
--------------------------------------------------------------------------------
1 | /media/aman/BE66ECBA66EC7515/Tutorials/DL Book/images/uniform_distribution.png
--------------------------------------------------------------------------------
/images/adagrad.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/adagrad.png
--------------------------------------------------------------------------------
/images/bagging.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/bagging.png
--------------------------------------------------------------------------------
/images/easy_int.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/easy_int.gif
--------------------------------------------------------------------------------
/images/ensemble.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/ensemble.png
--------------------------------------------------------------------------------
/images/gaussian.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/gaussian.jpg
--------------------------------------------------------------------------------
/images/momentum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/momentum.png
--------------------------------------------------------------------------------
/images/nesterov.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/nesterov.png
--------------------------------------------------------------------------------
/images/plateau.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/plateau.png
--------------------------------------------------------------------------------
/images/residual.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/residual.png
--------------------------------------------------------------------------------
/images/rmsprop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/rmsprop.png
--------------------------------------------------------------------------------
/images/softmax.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/softmax.png
--------------------------------------------------------------------------------
/images/taylor1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/taylor1.png
--------------------------------------------------------------------------------
/images/taylor2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/taylor2.png
--------------------------------------------------------------------------------
/images/u-curve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/u-curve.png
--------------------------------------------------------------------------------
/images/L2_scaling.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/L2_scaling.png
--------------------------------------------------------------------------------
/images/Lagrangian.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/Lagrangian.png
--------------------------------------------------------------------------------
/images/adam_update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/adam_update.png
--------------------------------------------------------------------------------
/images/advance_int.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/advance_int.jpg
--------------------------------------------------------------------------------
/images/adversarial.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/adversarial.png
--------------------------------------------------------------------------------
/images/batch_norm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/batch_norm.png
--------------------------------------------------------------------------------
/images/data_meme.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/data_meme.jpg
--------------------------------------------------------------------------------
/images/decrease_lr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/decrease_lr.png
--------------------------------------------------------------------------------
/images/fbeta-score.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/fbeta-score.jpg
--------------------------------------------------------------------------------
/images/inception.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/inception.png
--------------------------------------------------------------------------------
/images/lagrangian.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/lagrangian.png
--------------------------------------------------------------------------------
/images/linear_reg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/linear_reg.png
--------------------------------------------------------------------------------
/images/lr_high_low.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/lr_high_low.png
--------------------------------------------------------------------------------
/images/lr_update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/lr_update.png
--------------------------------------------------------------------------------
/images/math_meme.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/math_meme.jpg
--------------------------------------------------------------------------------
/images/nesterov.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/nesterov.jpeg
--------------------------------------------------------------------------------
/images/nonconvex.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/nonconvex.png
--------------------------------------------------------------------------------
/images/polyak_exp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/polyak_exp.png
--------------------------------------------------------------------------------
/images/pr_equation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/pr_equation.png
--------------------------------------------------------------------------------
/images/properties.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/properties.png
--------------------------------------------------------------------------------
/images/renormalize.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/renormalize.png
--------------------------------------------------------------------------------
/images/train_dev.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/train_dev.png
--------------------------------------------------------------------------------
/images/conjugate_gd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/conjugate_gd.png
--------------------------------------------------------------------------------
/images/deep_shallow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/deep_shallow.png
--------------------------------------------------------------------------------
/images/directed-pgm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/directed-pgm.png
--------------------------------------------------------------------------------
/images/greedy_final.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/greedy_final.png
--------------------------------------------------------------------------------
/images/hessian_as_gd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/hessian_as_gd.png
--------------------------------------------------------------------------------
/images/iou_examples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/iou_examples.png
--------------------------------------------------------------------------------
/images/lasso_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/lasso_result.png
--------------------------------------------------------------------------------
/images/local_minima.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/local_minima.gif
--------------------------------------------------------------------------------
/images/model_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/model_output.png
--------------------------------------------------------------------------------
/images/newton_point.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/newton_point.png
--------------------------------------------------------------------------------
/images/rep_sparsity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/rep_sparsity.png
--------------------------------------------------------------------------------
/images/saddle_point.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/saddle_point.png
--------------------------------------------------------------------------------
/images/ss2-original.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/ss2-original.png
--------------------------------------------------------------------------------
/images/weight_update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/weight_update.png
--------------------------------------------------------------------------------
/images/adagrad_problem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/adagrad_problem.png
--------------------------------------------------------------------------------
/images/condition_number.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/condition_number.png
--------------------------------------------------------------------------------
/images/critical_points.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/critical_points.png
--------------------------------------------------------------------------------
/images/early_stopping.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/early_stopping.png
--------------------------------------------------------------------------------
/images/google_mistake.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/google_mistake.jpg
--------------------------------------------------------------------------------
/images/gradient_descent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/gradient_descent.png
--------------------------------------------------------------------------------
/images/loss_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/loss_comparison.png
--------------------------------------------------------------------------------
/images/momentum_update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/momentum_update.png
--------------------------------------------------------------------------------
/images/polyak_intuition.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/polyak_intuition.png
--------------------------------------------------------------------------------
/images/pool_invariance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/pool_invariance.png
--------------------------------------------------------------------------------
/images/precision_recall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/precision_recall.png
--------------------------------------------------------------------------------
/images/regularization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/regularization.png
--------------------------------------------------------------------------------
/images/sgd_convergence.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/sgd_convergence.png
--------------------------------------------------------------------------------
/images/undirected-pgm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/undirected-pgm.png
--------------------------------------------------------------------------------
/images/workflow_final.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/workflow_final.png
--------------------------------------------------------------------------------
/images/workflow_final2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/workflow_final2.png
--------------------------------------------------------------------------------
/images/conjugate_gradient.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/conjugate_gradient.png
--------------------------------------------------------------------------------
/images/conv_equivariance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/conv_equivariance.png
--------------------------------------------------------------------------------
/images/coordinate_descent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/coordinate_descent.png
--------------------------------------------------------------------------------
/images/eigendecomposition.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/eigendecomposition.png
--------------------------------------------------------------------------------
/images/gradient_clipping.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/gradient_clipping.png
--------------------------------------------------------------------------------
/images/grid_random_search.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/grid_random_search.png
--------------------------------------------------------------------------------
/images/imagenet_progress.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/imagenet_progress.png
--------------------------------------------------------------------------------
/images/momentum_ball_roll.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/momentum_ball_roll.gif
--------------------------------------------------------------------------------
/images/momentum_ball_roll.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/momentum_ball_roll.png
--------------------------------------------------------------------------------
/images/ng_error_analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/ng_error_analysis.png
--------------------------------------------------------------------------------
/images/partial_derivative.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/partial_derivative.png
--------------------------------------------------------------------------------
/images/random_search_dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/random_search_dist.png
--------------------------------------------------------------------------------
/images/transfer learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/transfer learning.png
--------------------------------------------------------------------------------
/images/batch-normalization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/batch-normalization.png
--------------------------------------------------------------------------------
/images/coordinate_descent2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/coordinate_descent2.png
--------------------------------------------------------------------------------
/images/hyperparam_capacity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/hyperparam_capacity.png
--------------------------------------------------------------------------------
/images/model_identifiability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/model_identifiability.png
--------------------------------------------------------------------------------
/images/optimizers_comparison.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/optimizers_comparison.gif
--------------------------------------------------------------------------------
/images/optimizers_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/optimizers_comparison.png
--------------------------------------------------------------------------------
/images/uniform_distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/uniform_distribution.png
--------------------------------------------------------------------------------
/images/exploding_vanishing_gradient.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dalmia/Deep-Learning-Book-Chapter-Summaries/HEAD/images/exploding_vanishing_gradient.jpg
--------------------------------------------------------------------------------
/13 - Linear Factor Models.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Deep Learning Book (Simplified)\n",
8 | "## Part II - Modern Practical Deep Networks\n",
9 | "*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)\n",
10 | "where we are attempting to provide a summary of each chapter highlighting the concepts that we found to be the most important so that other people can use it as a starting point for reading the chapters, while adding further explanations on few areas that we found difficult to grasp. Please refer [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on \n",
11 | "notation.*\n",
12 | "\n",
13 | "\n",
14 | "## Chapter 13: Linear Factor Models"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": []
23 | }
24 | ],
25 | "metadata": {
26 | "kernelspec": {
27 | "display_name": "Python 2",
28 | "language": "python",
29 | "name": "python2"
30 | },
31 | "language_info": {
32 | "codemirror_mode": {
33 | "name": "ipython",
34 | "version": 2
35 | },
36 | "file_extension": ".py",
37 | "mimetype": "text/x-python",
38 | "name": "python",
39 | "nbconvert_exporter": "python",
40 | "pygments_lexer": "ipython2",
41 | "version": "2.7.12"
42 | }
43 | },
44 | "nbformat": 4,
45 | "nbformat_minor": 2
46 | }
47 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep-Learning-Book-Chapter-Summaries
2 | This repository provides a summary for each chapter of the Deep Learning [book](http://deeplearningbook.org) by Ian Goodfellow, Yoshua Bengio and Aaron Courville, and attempts to explain some of the concepts in greater detail. Some of the tougher chapters have blog posts dedicated to them, which can be found at http://medium.com/inveterate-learner.
3 |
4 | ## Chapters
5 |
6 | - **Part I: Applied Math and Machine Learning Basics**
7 | - Chapter 2: Linear Algebra [[chapter](http://www.deeplearningbook.org/contents/linear_algebra.html)]
8 | - Chapter 3: Probability and Information Theory [[chapter](http://www.deeplearningbook.org/contents/prob.html)]
9 | - Chapter 4: Numerical Computation [[chapter](http://www.deeplearningbook.org/contents/numerical.html)]
10 | - Chapter 5: Machine Learning Basics [[chapter](http://www.deeplearningbook.org/contents/ml.html)]
11 |
12 | - **Part II: Modern Practical Deep Networks**
13 | - Chapter 6: Deep Feedforward Networks [[chapter](http://www.deeplearningbook.org/contents/mlp.html)]
14 | - Chapter 7: Regularization for Deep Learning [[chapter](http://www.deeplearningbook.org/contents/regularization.html)]
15 | - Chapter 8: Optimization for Training Deep Models [[chapter](http://www.deeplearningbook.org/contents/optimization.html)]
16 | - Chapter 9: Convolutional Networks [[chapter](http://www.deeplearningbook.org/contents/convnets.html)]
17 | - Chapter 10: Sequence Modeling: Recurrent and Recursive Nets [[chapter](http://www.deeplearningbook.org/contents/rnn.html)]
18 | - Chapter 11: Practical Methodology [[chapter](http://www.deeplearningbook.org/contents/guidelines.html)]
19 | - Chapter 12: Applications [[chapter](http://www.deeplearningbook.org/contents/applications.html)]
20 |
21 | - **Part III: Deep Learning Research**
22 | - Chapter 13: Linear Factor Models [[chapter](http://www.deeplearningbook.org/contents/linear_factors.html)]
23 | - Chapter 14: Autoencoders [[chapter](http://www.deeplearningbook.org/contents/autoencoders.html)]
24 | - Chapter 15: Representation Learning [[chapter](http://www.deeplearningbook.org/contents/representation.html)]
25 | - Chapter 16: Structured Probabilistic Models for Deep Learning [[chapter](http://www.deeplearningbook.org/contents/graphical_models.html)]
26 | - Chapter 17: Monte Carlo Methods [[chapter](http://www.deeplearningbook.org/contents/monte_carlo.html)]
27 | - Chapter 18: Confronting the Partition Function [[chapter](http://www.deeplearningbook.org/contents/partition.html)]
28 | - Chapter 19: Approximate Inference [[chapter](http://www.deeplearningbook.org/contents/inference.html)]
29 | - Chapter 20: Deep Generative Models [[chapter](http://www.deeplearningbook.org/contents/generative_models.html)]
30 |
31 | ## Contributors
32 | - [Aman Dalmia](https://github.com/dalmia)
33 | - [Ameya Godbole](https://github.com/ameyagodbole)
34 |
35 | ## Contributing
36 |
37 | Please feel free to open a Pull Request to contribute a summary for Chapters 5, 6 and 12, as we might not be able to cover them owing to other commitments. Also, if you think any section requires more or better explanation, please use the issue tracker to let us know.
38 |
39 | ## Support
40 |
41 | If you like this repo and find it useful, please consider starring it (★, top right of the page) so that it can reach a broader audience.
42 |
--------------------------------------------------------------------------------
/Appendix.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Appendix \n",
8 | "This is an appendix notebook for explanations that we skipped in the blogs by [Inveterate Learner](https://medium.com/inveterate-learner) either because it would make it too long or was a bit repetitive."
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "### 1. Deep Learning Book: Chapter 7 - Regularization for Deep Learning ([Link to post](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c))\n",
16 | "\n",
17 | "**Explanation for the weight update of $L^1$ regularization**\n",
18 | "\n",
19 | "$$ \\bigtriangledown_w \\tilde{J}(\\theta; X, y) = \\bigtriangledown_w J(\\theta; X, y) + \\alpha * sign(w) $$\n",
20 | "$$ \\bigtriangledown_w J(\\theta; X, y) = H(w - w^*)$$\n",
21 | "\n",
22 | "Now, we'll have to look at each unit and not the entire *w* vector:\n",
23 | "- Case 1: sign($w_i$*) > 0\n",
24 | "\n",
25 | "Equating the gradient to zero, we get:\n",
26 | "$$ H_{i, i}(w_i - w_i^*) + \\alpha = 0$$\n",
27 | "$$ \\Rightarrow w_i = w_i^* - \\frac {\\alpha}{H_{i, i}} $$\n",
28 | "\n",
29 | "Similarly,\n",
30 | "- Case 2: sign($w_i$*) > 0\n",
31 | "$$ \\Rightarrow w_i = w_i^* + \\frac {\\alpha}{H_{i, i}} $$\n",
32 | "$$ \\Rightarrow w_i = - (-w_i^* -\\frac {\\alpha}{H_{i, i}}) $$\n",
33 | "\n",
34 | "Therefore, overall we get the following:\n",
35 | "\n",
36 | "$$ w_i = sign(w_i*)(|w_i^*| - \\frac {\\alpha}{H_{i, i}}) $$\n",
37 | "\n",
38 | "The explanation for why `max` occurs is given in the post itself."
39 | ]
40 | },
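A quick numeric sanity check of this closed-form update (a minimal sketch, not part of the original notebook; it assumes NumPy and uses arbitrary toy values for $H_{i,i}$, $w_i^*$ and $\alpha$, with the `max` form discussed in the post):

```python
# Verify w_i = sign(w_i*) * max(0, |w_i*| - alpha / H_ii) by brute force.
import numpy as np

H_ii, w_star, alpha = 2.0, 0.7, 0.5          # made-up values
w = np.linspace(-2.0, 2.0, 200001)
loss = 0.5 * H_ii * (w - w_star) ** 2 + alpha * np.abs(w)
w_numeric = w[np.argmin(loss)]               # grid minimizer of the penalized loss
w_closed = np.sign(w_star) * max(0.0, abs(w_star) - alpha / H_ii)
print(w_numeric, w_closed)                   # both ~= 0.45
```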
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### 2. Deep Learning Book: Chapter 8- Optimization For Training Deep Models ([Link to post (TODO)](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c))\n",
46 | "\n",
47 | "**Standard Error of the mean estimated from N samples**\n",
48 | "\n",
49 | "Assume that each of the samples is normally distributed, i.e. each $X_i \\sim \\mathcal{N}(\\mu, \\sigma^2)$, where $\\mu$ is the mean and $\\sigma^2$ is the variance.\n",
50 | "Then, estimated mean, $\\hat{\\mu}$ is given by:\n",
51 | "\n",
52 | "$$ \\hat{\\mu} = \\frac{\\sum_{i=1}^{n} X_i}{n} $$\n",
53 | "\n",
54 | "Therefore, remembering that: $var(\\frac{x}{n}) = \\frac{var(x)}{n^2}$, the variance of $\\hat{\\mu}$ is given by:\n",
55 | "\n",
56 | "$$ var(\\hat{\\mu}) = \\sum_{i=1}^{n} var(\\frac{X_i}{n}) $$\n",
57 | "$$ \\Rightarrow var(\\hat{\\mu}) = \\sum_{i=1}^{n} \\frac{\\sigma^2}{n^2} $$\n",
58 | "$$ \\Rightarrow var(\\hat{\\mu}) = n \\frac{\\sigma^2}{n^2} $$\n",
59 | "$$ \\Rightarrow var(\\hat{\\mu}) = \\frac{\\sigma^2}{n} $$ \n",
60 | "\n",
61 | "Now, Standard Error (S.E.) of any variable X is given by $\\sqrt{var(X)}$. Therefore:\n",
62 | "\n",
63 | "$$ S.E.(\\hat{\\mu}) = \\frac{\\sigma}{\\sqrt{n}} $$"
64 | ]
65 | },
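A minimal Monte Carlo sketch of this result, assuming NumPy and arbitrary toy values for $n$ and $\sigma$:

```python
# The empirical std of the sample mean should match sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 100, 2.0, 20000
means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
print(means.std(), sigma / np.sqrt(n))  # both ~= 0.2
```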
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "**Critical point of 2nd Order Taylor Series Approximation of J($\\theta$)**\n",
71 | "\n",
72 | "$$ J'(\\theta) = J'(\\theta_0) + \\frac{d}{d\\theta} \\left[(\\theta - \\theta_0)^T \\triangledown_{\\theta}J(\\theta_0) \\right]+ \\frac{d}{d\\theta} \\left[\\frac{1}{2} (\\theta - \\theta_0)^T H (\\theta - \\theta_0) \\right] $$\n",
73 | "\n",
74 | "$$ J'(\\theta_0) = 0 \\hspace{.5cm} \\text{as it is a constant}$$\n",
75 | "\n",
76 | "$$ \\frac{d}{d\\theta} \\left[(\\theta - \\theta_0)^T \\triangledown_{\\theta}J(\\theta_0) \\right] = \\triangledown_{\\theta}J(\\theta_0) * \\frac{d}{d\\theta} (\\theta - \\theta_0)^T + (\\theta - \\theta_0)^T * \\frac{d}{d\\theta} \\triangledown_{\\theta}J(\\theta_0) \\hspace{.5cm} \\text{using the u-v method of differentiation} $$\n",
77 | "\n",
78 | "$$ \\hspace{4cm} = \\triangledown_{\\theta}J(\\theta_0) + 0 \\hspace{.5cm} \\text{as } \\triangledown_{\\theta}J(\\theta_0) \\text{ is a constant} $$\n",
79 | "\n",
80 | "$$ \\frac{d}{d\\theta} \\left[\\frac{1}{2} (\\theta - \\theta_0)^T H (\\theta - \\theta_0) \\right] = \\frac{1}{2} H (\\theta - \\theta_0) * \\frac{d}{d\\theta} (\\theta - \\theta_0)^T + \\frac{1}{2} (\\theta - \\theta_0)^T * \\frac{d}{d\\theta} H (\\theta - \\theta_0) \\hspace{.5cm} \\text{similarly} $$\n",
81 | "\n",
82 | "$$ \\hspace{4cm} = H (\\theta - \\theta_0) \\text{ property of matrix differentiation} $$\n",
83 | "\n",
84 | "So, overall:\n",
85 | "\n",
86 | "$$ J'(\\theta) = \\triangledown_{\\theta}J(\\theta_0) + H (\\theta - \\theta_0) $$\n",
87 | "\n",
88 | "At the critical point, $\\theta^*$, $J'(\\theta^*)$ = 0. Therefore:\n",
89 | "\n",
90 | "$$ 0 = \\triangledown_{\\theta}J(\\theta_0) + H (\\theta^* - \\theta_0) $$\n",
91 | "$$ \\theta^* = \\theta_0 - H ^ {-1}\\triangledown_{\\theta}J(\\theta_0) $$\n"
92 | ]
93 | },
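For a quadratic $J(\theta) = \frac{1}{2}\theta^T H \theta - b^T \theta$, a single step of this update lands exactly at the minimum $H^{-1}b$. A minimal sketch with made-up values for $H$, $b$ and $\theta_0$ (assuming NumPy):

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])       # positive definite toy Hessian
b = np.array([1.0, -1.0])
theta0 = np.array([5.0, -3.0])
grad = H @ theta0 - b                        # gradient of J at theta0
theta_star = theta0 - np.linalg.solve(H, grad)
print(theta_star, np.linalg.solve(H, b))     # identical: the exact minimum
```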
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "**Explanation of how negative curvature results in standard gradient computation**\n",
99 | "\n",
100 | "If the eigenvalues of H are too negative, $\\alpha$ needs to be very high to compensate for that, in which case the term (H + $\\alpha$ I) is dominated by $\\alpha$ I.\n",
101 | "\n",
102 | "$$ \\mathcal{H} + \\alpha I \\approx \\alpha I $$\n",
103 | "\n",
104 | "$$ \\Rightarrow \\theta^* \\approx \\theta_0 - [\\alpha I]^{-1} \\bigtriangledown_{\\theta} f(\\theta_0)$$\n",
105 | "\n",
106 | "$$ \\Rightarrow \\theta^* \\approx \\theta_0 - \\frac{\\bigtriangledown_{\\theta} f(\\theta_0)}{\\alpha}$$\n",
107 | "\n",
108 | "whereas, the standard gradient descent update would be given by:\n",
109 | "\n",
110 | "$$ \\Rightarrow \\theta^* \\approx \\theta_0 - \\epsilon \\bigtriangledown_{\\theta} f(\\theta_0) \\hspace{.5cm} \\text{with } \\epsilon \\text{ being the learning rate} $$"
111 | ]
112 | },
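A small numerical sketch of this limit, using a made-up indefinite $H$ and a large $\alpha$ (assuming NumPy):

```python
import numpy as np

H = np.array([[-4.0, 0.0], [0.0, 1.0]])   # toy Hessian with a negative eigenvalue
g = np.array([1.0, 2.0])                  # toy gradient
alpha = 100.0                             # large enough to dominate H
step_reg = np.linalg.solve(H + alpha * np.eye(2), g)
print(step_reg, g / alpha)                # nearly equal: a plain gradient step
```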
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "**Explanation of how large weights cause symmetry breaking during initialization**\n",
118 | "\n",
119 | "Suppose the eigen-value decomposition of W is given by: $ W = Q V Q^{-1}$ where V is the diagonal matrix of eigen values. Now, if a noise of $\\epsilon$ is added to the input, upon doing W \\* x an extra term W * $\\epsilon$ appears at the output. This $\\epsilon$ term scales the diagonal matrix V. So, if the eigenvalues of W are $\\lambda_1$, $\\lambda_2$, etc., it becomes $\\lambda_1 \\epsilon$, $\\lambda_2 \\epsilon$, etc. Thus, if W had similar eigenvalues for all its eigen directions, i.e. $\\lambda_1 \\approx \\lambda_2$, etc., then $\\lambda_1 \\epsilon \\approx \\lambda_2 \\epsilon$, which means that using different eigen directions didn't give anything extra. However, if the eigen values differ a lot, then multiplication with $\\epsilon$ will increase that difference. This is making a much better use of different eigen directions and thus, has a symmetry breaking effect./"
120 | ]
121 | },
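A toy illustration of this argument, assuming a diagonal $W$ so that the eigendirections are simply the coordinate axes:

```python
import numpy as np

eps = np.array([1.0, 1.0])            # the same noise along both eigendirections
W_similar = np.diag([2.0, 2.0])       # lambda_1 = lambda_2
W_distinct = np.diag([0.1, 10.0])     # eigenvalues differ a lot
print(W_similar @ eps)                # [2.  2. ]  -> directions indistinguishable
print(W_distinct @ eps)               # [0.1 10.]  -> directions well separated
```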
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": []
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": []
135 | }
136 | ],
137 | "metadata": {
138 | "kernelspec": {
139 | "display_name": "Python 2",
140 | "language": "python",
141 | "name": "python2"
142 | },
143 | "language_info": {
144 | "codemirror_mode": {
145 | "name": "ipython",
146 | "version": 2
147 | },
148 | "file_extension": ".py",
149 | "mimetype": "text/x-python",
150 | "name": "python",
151 | "nbconvert_exporter": "python",
152 | "pygments_lexer": "ipython2",
153 | "version": "2.7.12"
154 | }
155 | },
156 | "nbformat": 4,
157 | "nbformat_minor": 2
158 | }
159 |
--------------------------------------------------------------------------------
/04 - Numerical Optimization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Deep Learning Book (Simplified)\n",
8 | "## Part I - Applied Math and Machine Learning basics\n",
9 | "*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org) where we are attempting to provide a summary of each chapter highlighting the concepts that we found to be most important so that other people can use it as a starting point for reading the chapters, while adding further explanations on few areas that we found difficult to grasp. Please refer [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on notation.*\n",
10 | "\n",
11 | "## Chapter 4: Numerical Computation\n",
12 | "\n",
13 | "Since you are here, there's a high probability that you must have heard of **Gradient Descent**. It is that part of a Deep Learning pipeline which leads to the model being *trained*. This chapter outlines the various kinds of numerical computations generally utilized by Machine Learning algorithms and also describes various optimization algorithms (e.g. Gradient Descent, Newton's method), which are those class of algorithms that update the estimates of the solution iteratively, rather than solving it analytically to provide a closed-form solution.\n",
14 | "\n",
15 | "The sections present in this chapter are listed below:
\n",
16 | "\n",
17 | "**1. Overflow and Underflow?**
\n",
18 | "**2. Poor Conditioning**
\n",
19 | "**3. Gradient-Based Optimization**
\n",
20 | "**4. Constrained Optimization**
"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "### 1. Overflow and Underflow\n",
28 | "\n",
29 | "There is a fundamental problem with representing infinitely many real numbers on a digital computer with a finite number of bit patterns, which is: it leads to rounding errors. Such rounding errors compound over certain operations and cause many theoretically correct algorithms to fail in practise. There are primarily two damaging forms of rounding errors:\n",
30 | "\n",
31 | "- **Underflow**: Underflow occurs when numbers near to zero are rounded down to zero.
\n",
32 | "The behaviour of certain functions like $\\frac{1}{x}$ , $log$, etc. can change dramatically due to this.\n",
33 | "\n",
34 | "- **Overflow**: Overflow occurs when a large number is approximated as $\\infty$ (or $-\\infty$).\n",
35 | "\n",
36 | "*Example* - Softmax\n",
37 | "\n",
38 | "\n",
39 | "Assume every $x_i$ is equal to some $c$.
\n",
40 | "\n",
41 | "**Problems**:\n",
42 | "- $c$ is very negative: This leads to underflow when computing $exp(c)$ and thus $0$ in the denominator.\n",
43 | "- $c$ is very positive: This leads to overflow when computing $exp(c)$.\n",
44 | "\n",
45 | "**Solution**: \n",
46 | "\n",
47 | "Instead of computing $softmax(\\mathbf{x})$, we compute $softmax(\\mathbf{z})$, where $\\mathbf{z} = \\mathbf{x} - \\max_i x_i$. It can be proven that the value doesn't change after subtracting the same value from each of the elements. Now, the maximum value in $\\mathbf{z}$ is $0$, thus preventing overflow. Also, this ensures that atleast one element in the denominator is $1$, preventing underflow.\n",
48 | "\n",
49 | "*Food for thought*: This still doesn't prevent underflow in the numerator. Think of the case when the output from the softmax function is passed as input to another function, e.g., $log$."
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "### 2. Poor Conditioning\n",
57 | "\n",
58 | "Conditioning measures how rapidly the output of a function changes with small changes in the input. Large conditioning means poor conditioning as rounding errors can lead to large changes in output.\n",
59 | "For e.g., let's observe: $ f(x) = A^{-1}x$. Given that $A \\in \\mathbb{R}^{n \\hspace{.1cm} \\text{x} \\hspace{.1cm} n}$ has an eigen value decomposition, its **condition number** is given by:\n",
60 | "\n",
61 | "\n",
62 | "\n",
63 | "which is equal to the ratio of the largest and the smallest eigen values. Having a large condition number signifies that matrix inversion is highly sensitive to errors in the input."
64 | ]
65 | },
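A short sketch of computing this condition number for a made-up ill-conditioned matrix (assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1e-4]])
lam = np.linalg.eigvals(A)
kappa = np.abs(lam).max() / np.abs(lam).min()
print(kappa)  # 10000.0: inverting A roughly amplifies relative input errors by this factor
```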
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### 3. Gradient Based Optimization\n",
71 | "\n",
72 | "Most optimization problems are phrased in terms of minimizing $f(x)$. Maximization can be achieved via a minimization algorithm by minimizing $-f(x)$.\n",
73 | "\n",
74 | "The **derivative** of a function $f$, denoted as $f'(x)$, specifies how a small change in input reflects as a change in output: $f(x + \\epsilon) \\approx f(x) + \\epsilon * f'(x)$. The derivative is useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. E.g. for a small enough $\\epsilon$, $f\\Big(x - \\epsilon\\, sign\\big(f'(x)\\big) \\Big)$ will be smaller than $f(x)$. This technique is called **gradient descent**.\n",
75 | "\n",
76 | "\n",
77 | "\n",
78 | "Points where $f'(x)=0$ are called **critical** or **stationary** points. Types of critical points:\n",
79 | "\n",
80 | "\n",
81 | "\n",
82 | "For functions with multiple inputs, **partial derivative** $\\frac{\\delta}{\\delta x_i}f(x)$ measures how $f$ changes as only the variable $x_i$ changes at point $x$. The **gradient** of $f$ is a vector containing all partial derivatives denoted $\\nabla_x\\, f(x)$. The **directional derivative** in a direction ***u*** (unit vector) is the slope of $f$ in the direction *u*.\n",
83 | "\n",
84 | "i.e. the directional derivative is the value of $\\frac{\\delta}{\\delta \\alpha}f(x+\\alpha*u)$ evaluated as $\\alpha \\rightarrow 0$.\n",
85 | "\n",
86 | "Using the chain rule:\n",
87 | "\n",
88 | "$\\frac{\\delta}{\\delta \\alpha}f(x+\\alpha*u) = \\big(\\frac{\\delta}{\\delta \\alpha}(x+\\alpha*u)\\big)^T\\frac{\\delta}{\\delta(x+\\alpha*u)}f(x+\\alpha*u)$\n",
89 | "\n",
90 | "as $\\alpha$ tends to 0 the expression reduces to $u^T\\nabla_x\\, f(x)$. To minimize $f$ we need to find the direction *u* in which $f$ decreases the fastest i.e.:\n",
91 | "\n",
92 | "\n",
93 | "\n",
94 | "Ignoring terms not relating to *u* we see that function *f* is decreased most when $cos\\theta = -1$ i.e. we move in the direction opposite to the gradient. This is the method of **steepest descent** or **gradient descent**. Steepest descent proposes the new point: $x' = x - \\epsilon \\nabla_x\\, f(x)$ where $\\epsilon$ is the **learning rate**. $\\epsilon$ can be a small constant or can be solved analytically to make the gradient vanish. Another approach is to try different values of $\\epsilon$ and choose the value that causes the most decrease (**line search**).\n",
95 | "\n",
96 | "The general concept of repeatedly making a small move in the locally best direction can be generalized to discrete spaces (**hill climbing**)."
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "#### 3.1 Jacobian and Hessian Matrices\n",
104 | "\n",
105 | "If we have a function $f: \\mathbb{R}^m \\rightarrow \\mathbb{R}^n$, then the **Jacobian** matrix $J \\in \\mathbb{R}^{n\\, \\times\\, m}$ of $f$ is defined such that $J_{i,j} = \\frac{\\delta}{\\delta x_j}f_i(x)$\n",
106 | "\n",
107 | "The **second derivative** tells us how the first derivative changes with small changes in input. It is a measurement of **curvature**. The **Hessian** matrix **H**(f)(**x**) is defined such that\n",
108 | "\n",
109 | "$H(f)(x)_{i,j} = \\frac{\\delta^2}{\\delta x_i \\delta x_j}f(x)$\n",
110 | "\n",
111 | "Hessian is the Jacobian of the gradient.\n",
112 | "\n",
113 | "Anywhere that the second partial derivatives are continuous, the differential operators are commutative. This means that $H_{i,j} = H_{j,i}$ and the Hessian is symmetric. This is the most common case in the deep learning regime.\n",
114 | "\n",
115 | "Hessian matrix is real and symmetric $\\Rightarrow$ it can be decomposed into a set of real eigenvalues and orthogonal eigenvector basis\n",
116 | "\n",
117 | "Second derivative in a specific direction **d**(unit vector) is $\\mathbf{d^THd}$.\n",
118 | "\n",
119 | "When **d** is an eigenvector of H, the second derivative is the corresponding eigenvalue. In the general case, the second derivative is given by the weighted average of eigenvalues.\n",
120 | "\n",
121 | "Second-order Taylor series approximation of $f(\\mathbf{x})$ around the point $\\mathbf{x^{(0)}}$:\n",
122 | "\n",
123 | "$f(\\mathbf{x}) \\approx f(\\mathbf{x^{(0)}}) + (\\mathbf{x}-\\mathbf{x^{(0)}})^T\\mathbf{g} + \\frac{1}{2}(\\mathbf{x}-\\mathbf{x^{(0)}})^T\\mathbf{H}(\\mathbf{x}-\\mathbf{x^{(0)}})$ where **g** is the gradient and **H** is the Hessian\n",
124 | "\n",
125 | "Using gradient descent, the new point will be $(\\mathbf{x^{(0)}} - \\epsilon \\mathbf{g})$. Substituting in the above equation:\n",
126 | "\n",
127 | "$f(\\mathbf{x^{(0)}} - \\epsilon \\mathbf{g}) \\approx f(\\mathbf{x^{(0)}}) - \\epsilon \\mathbf{g}^T\\mathbf{g} + \\frac{1}{2}\\epsilon^2\\mathbf{g}^T\\mathbf{H}\\mathbf{g}$\n",
128 | "\n",
129 | "Breakdown: Orignal function - expected decrease due to gradient + correction due to function curvature. When the last term is large, the update actually moves the point uphill. When it is zero or negative, the equation gives that larger $\\epsilon$ will always decrease the function value, however, moving too far from $\\mathbf{x^{(0)}}$ will invalidate the Taylor approximation. When $\\mathbf{g}^T\\mathbf{H}\\mathbf{g}$ is positive, optimal step size is given by:\n",
130 | "\n",
131 | "$\\epsilon^* = \\frac{\\mathbf{g}^T\\mathbf{g}}{\\mathbf{g}^T\\mathbf{H}\\mathbf{g}}$\n",
132 | "\n",
133 | "In the worst case, **g** aligns with an eigenvector of **H** corresponding to the largest eigenvalue ($\\lambda_{max}$) and the optimal step size will be $\\frac{1}{\\lambda_{max}}$. The eigenvalues of H give the scale of learning rate.\n",
134 | "\n",
135 | "At critical point, where $f'(x) = 0$, the second derivative test for the univariate case is given as:\n",
136 | "\n",
137 | "| $f''(x)\\, $ | conclusion |\n",
138 | "| --- | --- |\n",
139 | "| $>0$ | local minimum |\n",
140 | "| $<0$ | local maximum |\n",
141 | "| $=0$ | inconclusive |\n",
142 | "\n",
143 | "In multiple dimensions, for Hessian matrix:\n",
144 | "\n",
145 | "| eigenvalue | conclusion |\n",
146 | "| --- | --- |\n",
147 | "| all positive | local minimum |\n",
148 | "| all negative | local maximum |\n",
149 | "| atleast one positive and negative each | saddle |\n",
150 | "| all non-zero same sign, atleast one zero | inconclusive |\n",
151 | "\n",
152 | "\n",
153 | "\n",
154 | "Explanation of `all_positive` (rest follow similarly): When all the eigenvalues are positive, the Hessian is positive definite. Hence, the directional second derivative in each direction is positive, and infering from the univariate second derivative test, we get that the critical point is a local minimum. The image above shows a saddle point.\n",
155 | "\n",
156 | "When the Hessian has a poor condition number, gradient descent performs poorly, as it is confused between one direction where the gradient increases significantly, and another direction where it increases slowly. Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.\n",
157 | "\n",
158 | "Solution to the issue: use information from the Hessian matrix. An example is the **Newton's method** based on second degree Taylor expansion.\n",
159 | "\n",
160 | "$f(\\mathbf{x}) = f(\\mathbf{x^{(0)}}) + (\\mathbf{x}-\\mathbf{x^{(0)}})^T\\nabla_xf(\\mathbf{x^{(0)}}) + \\frac{1}{2}(\\mathbf{x}-\\mathbf{x^{(0)}})^T\\mathbf{H}(f)(\\mathbf{x^{(0)}})(\\mathbf{x}-\\mathbf{x^{(0)}})$\n",
161 | "\n",
162 | "Taking gradient wrt **x** and setting L.H.S. to zero:\n",
163 | "\n",
164 | "$0 = \\nabla_xf(\\mathbf{x^{(0)}}) + \\mathbf{H}(f)(\\mathbf{x^{(0)}})\\mathbf{x} - \\mathbf{H}(f)(\\mathbf{x^{(0)}})\\mathbf{x^{(0)}}$\n",
165 | "\n",
166 | "$\\mathbf{x} = \\mathbf{x^{(0)}} - \\mathbf{H}(f)(\\mathbf{x^{(0)}})^{-1}\\nabla_xf(\\mathbf{x^{(0)}})$\n",
167 | "\n",
168 | "This method consists of iteratively jumping to the minimum of a locally approximated quadratic function $\\rightarrow$ converges faster than gradient descent. However, unlike gradient descent, solution of Newton's method is attracted to saddle points as well.\n",
169 | "\n",
170 | "To treat functions in deep learning, we assume that they are Lipschitz continuous or have lipschitz continuous derivatives. (weak constraint) A **Lipschitz continuous** function satisfies for a Lipschitz constant $\\mathcal{L}$ the bound:\n",
171 | "\n",
172 | "$\\forall \\mathbf{x}\\, ,\\forall \\mathbf{y}\\, ,|f(\\mathbf{x}) - f(\\mathbf{y})| \\leq \\mathcal{L}||\\mathbf{x}-\\mathbf{y}||_2$\n",
173 | "\n",
174 | "This property is useful because it enables us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output.\n",
175 | "\n",
176 | "**Convex optimization** algorithms are able to provide many more guarantees by making stronger restrictions. These algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. It is sometimes used as a subroutine in deep learning algorithms."
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "### 4. Constrained Optimization\n",
184 | "\n",
185 | "It might be the case that although we want to maximize (or minimize) $f(x)$, but aren't allowed to use all possible values of $x$, say $x \\in \\mathbb{S}$, for some set $\\mathbb{S}$. This now becomes a problem of **Constrained Optimization**. The points $\\mathbf{x}$ in $S$ are called **feasible points**. \n",
186 | "\n",
187 | "An example of such a constraint can be the L2-norm constraint, e.g. $|| \\hspace{.1cm} x \\hspace{.1cm}||^2 < 1$. This is useful as we often want the values for our weights to be small (i.e. close to $0$).\n",
188 | "\n",
189 | "*Approach*: Design a separate, unconstrained optimization problem, whose solution can be converted to the original constrained optimization problem. E.g. in the above described constrained optimization problem, we could instead minimize:\n",
190 | "$$g(\\theta) = f([\\cos\\theta, \\sin\\theta]^T)$$\n",
191 | "\n",
192 | "with respect to $\\theta$ and return ($\\cos\\theta, \\sin\\theta$).\n",
193 | "\n",
194 | "\n",
195 | "General solution: **Karush–Kuhn–Tucker(KKT)** approach which introduces a **generalized Lagrangian**.\n",
196 | "\n",
197 | "Approach: \n",
198 | "\n",
199 | "We use $m$ functions $g^{(i)}(x)$ and $n$ functions $h^{(j)}(x)$ to describe $\\mathbb{S}$, such that any element $x \\in \\mathbb{S}$ satisfies: \n",
200 | "$$g^{(i)}(x) = 0 \\hspace{.1cm} \\text{and} \\hspace{.1cm} h^{(j)}(x) \\leq 0 \\hspace{.1cm} \\forall \\hspace{.1cm} i, j$$\n",
201 | "\n",
202 | "There are two constraints specified here. I'll explain them with an example. Let's take $g(x)$ as $x - 2$ and $h(x)$ as $x-3$.
\n",
203 | "Then for $x = 2$, we have the following:\n",
204 | "\n",
205 | "- **Equality constraints**: $g^{(i)}(x) = 0$. Here, $g(2) = 0$. Hence, $x = 2$ satisfies the equality constraints.\n",
206 | "- **Inequality constraints**: $h^{(i)}(x) \\leq 0$. Here, $h(2) = -1 < 0$. Hence, $x = 2$ satisfies the inequality constraints.\n",
207 | "\n",
208 | "Note that for $x = 3$, $h(x)$ is an equality constraint that it satisfies whereas $g(x)$ is neither.\n",
209 | "\n",
210 | "New paramaters (called KKT multipliers): $\\lambda_i$, $\\alpha_j$ for each constraint.
\n",
211 | "Generalized Lagrangian:\n",
212 | "\n",
213 | "\n",
214 | "\n",
215 | "\n",
216 | "Now, let: $Y =\\max\\limits_{\\alpha} \\max\\limits_{\\lambda} L(x, \\lambda, \\alpha)$\n",
217 | "Then, $\\min\\limits_x(f(x)) = \\min\\limits_x(Y)$\n",
218 | "\n",
219 | "This is because, if the constraints are satisfied, $Y = f(x)$ and if it isn't, $Y = \\infty$. This ensures that only feasible points are optimal. For finding the maximum of f(x), we can use the same generalized Lagrangian applied on $-f(x)$. \n",
220 | "\n",
221 | "The inequality constraints need to be observed more closely. Suppose the optimal point comes out to be $x^*$. If $h^{(i)}(x^*) = 0$, then the constraint is said to be **active**. However, if the constraint is inactive, i.e. $h^{(i)}(x^*) < 0$, then even if we remove the constraint, $x^*$ continues to be a local solution. Also, by definition, an inactive $h^{(i)}$ is negative and hence $\\max\\limits_{\\alpha} \\max\\limits_{\\lambda} L(x, \\lambda, \\alpha) \\Rightarrow \\alpha_i = 0$. Thus, either $\\alpha_i = 0$ or $h^{(i)}(x^*) = 0$ (in the case of active constraint). Hence, $\\mathbf{\\alpha} \\odot h{(x)} = 0$.\n",
222 | "\n",
223 | "Intuition: \n",
224 | "\n",
225 | "The relation of the optimal point can satisfy only of these two conditions:\n",
226 | "\n",
227 | "- The point is at the boundary of the constraint (i.e. active), then the corresponding KKT multiplier should be used.\n",
228 | "\n",
229 | "- The constraint has no influence in the evaluation of the point and hence, the corresponding KKT multiplier is zeroed out.\n",
230 | "\n",
231 | "The optimal points satisfy the following KKT conditions, which are necessary but not always sufficient:\n",
232 | "\n",
233 | ""
234 | ]
235 | }
236 | ],
237 | "metadata": {
238 | "kernelspec": {
239 | "display_name": "Python 2",
240 | "language": "python",
241 | "name": "python2"
242 | },
243 | "language_info": {
244 | "codemirror_mode": {
245 | "name": "ipython",
246 | "version": 2
247 | },
248 | "file_extension": ".py",
249 | "mimetype": "text/x-python",
250 | "name": "python",
251 | "nbconvert_exporter": "python",
252 | "pygments_lexer": "ipython2",
253 | "version": "2.7.12"
254 | }
255 | },
256 | "nbformat": 4,
257 | "nbformat_minor": 2
258 | }
259 |
--------------------------------------------------------------------------------
/11 - Practical Methodology.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Deep Learning Book (Simplified)\n",
8 | "## Part II - Modern Practical Deep Networks\n",
9 | "*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)\n",
10 | "where we are attempting to provide a summary of each chapter highlighting the concepts that we found to be the most important so that other people can use it as a starting point for reading the chapters, while adding further explanations on few areas that we found difficult to grasp. Please refer [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on \n",
11 | "notation.*\n",
12 | "\n",
13 | "\n",
14 | "## Chapter 11: Practical Methodology\n",
15 | "\n",
16 | "We are excited to say that this is going to be the last chapter that we cover before entering the Deep Learning Research section of the book which is, for the most part, unfamiliar terrains for us. A lot of what we'd been talking about till now covered the theoretical aspects of Deep Learning. However, there's a large gap between theory and what works in practice. This chapter is specifically dedicated to practitioners and people who are looking to apply Deep Learning for building cool applications and solving real-world problems. \n",
17 | "\n",
18 | "The various choices that one might need to make include which type of data to gather, where would they find that data, should they gather more data, change model complexities, change (add/remove) regularization, improve optimization, debug the software implementation, etc. The recommended practical design process is as follows:\n",
19 | "\n",
20 | "- Decide on a a single number metric to evaluate your model. This represents the final goal and you need to set a specific target that you want to achieve. Coming from Andrew Ng's Machine Learning Yearning and also from personal experience, most teams forget to decide upon this only to realize the mistake very late in the process that setting this up would have gave them a clear guide on what they wanted to improve.\n",
21 | "\n",
22 | "- Get an end-to-end pipeline working as soon as possible, including the evaluation of the required metrics. This will, more often than not, require that you use a very simple model that can accept the inputs correctly and produce the outputs in the right format that can be further used for training / evaluation / analysis. The major benefit here is that now you can solely focus on improving the model and on doing any specific change, you can instantly get the final results and check whether that change improved the model or not.\n",
23 | "\n",
24 | "\n",
25 | "\n",
26 | "- Instrument the system well to determine bottlenecks in performance which requires diagnosing which components perform worse than expected and understanding the reason behind poor performance - overfitting, underfitting, modelling, problems in data, software implementation errors, etc.\n",
27 | "\n",
28 | "- Based on the diagnosis above, keep improving the algorithm iteratively either by adding more data, increasing the capacity of the model, tuning hyperparameters or improving the quality of data by better annotation, etc.\n",
29 | "\n",
30 | "The chapter is organized as follows:\n",
31 | "\n",
32 | "**1. Performance Metrics**
\n",
33 | "**2. Default Baseline Models**
\n",
34 | "**3. Determining Whether to Gather More Data**
\n",
35 | "**4. Selecting Hyperparameters**
\n",
36 | "**5. Debugging Strategies**"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## 1. Performance Metrics\n",
44 | "\n",
45 | "As mentioned above, it's extremely important to decide on which error metric to use as that will ultimately guide you on how to make progress. It should be sufficiently representative of the end goal that you are trying to achieve. Let me give you an example. Suppose you are working on a [Semantic Segmentation](http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review) problem where we want to assign a class to each pixel of the input image. The image below demonstrates how the output should look like given an input image:\n",
46 | "\n",
47 | "\n",
48 | "\n",
49 | "In the figure above, all the pixels constituting the man has been marked as one class, those representing the bicycle as another class and the remaining ones as the background class.\n",
50 | "\n",
51 | "To simplify, let's consider a binary segmentation task where class 1 represents \"man\" and class 0 represents the background class. Thus, the expected output now becomes:\n",
52 | "\n",
53 | "\n",
54 | "\n",
55 | "Notice that the pixels belonging to the bicycle class are also labelled as 0 as we are considering only a binary semantic segmentation task. Thus, the bicycle class now comes under the background class\n",
56 | "\n",
57 | "Now, what would be a reasonable metric to choose here that would be representative of the final goal here? - Think before looking down.\n",
58 | "\n",
59 | "A default metric to start off with is [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision), which indicates the percentage of pixels where our model predicted the right class. Although the example above has a fairly equal number of class 1 and class 0 pixels, this need not be the case. There can be images where there is a single person in the image or there can be a lot of images with no people. In such cases where there is a high [class imbalance](http://www.chioka.in/class-imbalance-problem/), a very simple way to achieve a high accuracy could be to always predict the class 0. However, it's clearly not a good classifier although it might get 90% accuracy. You'd ideally like a metric that is not dependent on the distribution of the classes in the dataset. For this reason, the most commonly used metric for semantic segmentation is **Intersection over Union (IoU)**, which is defined as follows:\n",
60 | "\n",
61 | "\n",
62 | "\n",
63 | "The image below shows why IoU is a good metric:\n",
64 | "\n",
65 | "\n",
66 | "\n",
67 | "In the first case, although the red box is almost entirely within the green one, but their union is high and that makes the IoU low. However, in the other two cases, the intersection nears the union more and more\n",
68 | "\n",
69 | "One more thing that I'd like to point out before moving on from this example is that it is equally useful in images where there is no object, i.e. the entire image represents the background class. This is achieved by adding a small $\\epsilon$ to both the numerator and denominator during the calculation of IoU. Now, if the ground truth contained only background class and our model predicted that as well, the intersection as well as the union is 0. Thus, the IoU becomes (0 + $\\epsilon$) / (0 + $\\epsilon$) = $\\epsilon$ / $\\epsilon$ = 1. This indicates the importance of choosing the right metric.\n",
70 | "\n",
71 | "Then there can be problems where one type of mistake is more costly than another. In the case of spam detection, classifying a spam mail as non-spam is much less costlier than classifying a non-spam message as spam. In such cases, instead of measuring the error rate, we might be interested in observing some form of total cost which is representative of our problem.\n",
72 | "\n",
73 | "Similar to the semantic segmentation problem described above, there are many cases where there is a large class imbalance. For example, in a particular sample of population, one out of a 1000 people might have cancer. Thus, 9999 people don't have cancer. If I simply use a classifier that classifier everyone as not having cancer, I can achieve an accuracy of 99.99%. But would you be willing to use this classifier for testing yourself?\n",
74 | "\n",
75 | "Definitely not. In such a case, accuracy is a bad metric. We instead use **precision** and **recall** to evaluate our classifier. I generally use this figure to remember what both of them mean:\n",
76 | "\n",
77 | "\n",
78 | "\n",
79 | "Precision represents the fraction of detections that were actually true, whereas Recall stands for the the fraction of true events that were successfully detected.\n",
80 | "\n",
81 | "\n",
82 | "\n",
83 | "Now, consider that if a detector says that all the cases are not cancer will achieve the perfect precision, but 0 recall. Many a times, it's actually desirable to have a single metric to judge on, rather than have a trade-off between two of them. F1-score, which is the Harmonic Mean of Precision & Recall is a widely accepted metric:\n",
84 | "\n",
85 | "\n",
86 | "\n",
87 | "However, F1-score gives equal weightage to both precision and recall. There can be cases where you want to weigh one over the other and hence, we have the more general, F-beta score:\n",
88 | "\n",
89 | "\n",
90 | "\n",
91 | "Also, in some cases the machine learning algorithm can refuse to make any decision at all in cases where it's not very confident about it's decision. This can be important in situations where a misclassification can be harmful and it'd be much better for a human to have a look. Then again, a ML system is useful only when it significantly reduces the number of instances that a human operator must process. A natural performance metric here is **coverage**, which stands for the fraction of images that where the machine learning system is able to produce a response.\n",
92 | "\n",
93 | "In most applications, it might not be possible to achieve absolute zero error even after having infinite data either due to the features not being sufficiently representative or the system being intrinsically stochastic. The minimum amount of error possible for a system is called the Bayes' error for the system.\n",
94 | "\n",
95 | "A major bottleneck to performance is often the fact that training data is limited. Now, once away from standard datasets like [MNIST](http://yann.lecun.com/exdb/mnist/) into more real-world problems, you'll realize that getting accurate data is much more harder than it initially seems and in most often, doesn't come for free as well. So, you really need to analyze how much is additional data going to improve your performance metric. I'll try to explain this with an example. As mentioned in Andrew Ng's book, [Machine Learning Yearning](http://www.mlyearning.org), a standard method of error analysis to actually observe (say) 100 examples where your model is failing and then checking which classes actually account for the maximum % of error:\n",
96 | "\n",
97 | "\n",
98 | "\n",
99 | "From the figure above, it can be clearly seen that collecting more Dog images might improve the error rate by 8% at max. However, collecting more Blurry images might potentially improve the error rate by 61%, which is very significant. Thus, it makes sense to spend time on collecting more blurry images and doing this exercise would save you the embarrassment of spending months on collecting better Dog images only to see the error rate improve by 8%. For more such tips, refer to the book linked above.\n",
100 | "\n",
101 | "The bottom line being that you need to decide what is realistic desired error rate for your intended application beforehand and use that to guide your design decisions in the future."
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "## 2. Default Baseline models\n",
109 | "\n",
110 | "As mentioned at the start, it is very important to establish a working end-to-end system as soon as possible. Depending on the complexity of the problem, we might even choose to begin with a very simplistic model like logistic regression. However, if the problem that you intend to solve falls under the [\"AI-complete\"](https://en.wikipedia.org/wiki/AI-complete) category like Image Classification, Speech Recognition, etc., starting off with a deep learning model would almost always be better.\n",
111 | "\n",
112 | "You first begin with choosing the general category of model to use based on the structure of your data. If your data consists of fixed-size vectors and you intend to perform a supervised learning task, use a multi-layer perceptron. If your data has a fixed topological structure, using a [Convolutional Neural Network](https://medium.com/inveterate-learner/deep-learning-book-chapter-9-convolutional-networks-45e43bfc718d) might be the best way forward. Similarly, if your data has a sequential pattern, [Recurrent Neural Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) would be the ideal starting point. However, the [speed](https://www.technologyreview.com/s/513696/deep-learning/) at which Deep Learning as a field is progressing, default algorithms are likely to change. For example, 3–4 years ago [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) would have been the ideal starting point for image-based tasks. However, now [ResNets](https://arxiv.org/abs/1512.03385) are the widely accepted default choice.\n",
113 | "\n",
114 | "\n",
115 | "ImageNet top-5 error progress over the years. AlexNet had 8 layers where the more powerful ResNet has more than 150 layers. Source: https://medium.com/@RaghavPrabhu/cnn-architectures-lenet-alexnet-vgg-googlenet-and-resnet-7c81c017b848\n",
116 | "\n",
117 | "For training the model, a reasonable starting point is to use the Adam optimizer. Apart from this, SGD with momentum and a learning rate decay is widely used too where the learning rate is decayed exponentially until a point and then, reduced linearly by a factor of 2–10 each time validation error plateaus. [Batch-Normalization](https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135#5bbf) would, in general, always improve performance by providing stability and allowing the use of larger learning rate thereby helping to reach convergence faster. \n",
118 | "\n",
119 | "As you increase your model complexity, you'll eventually become prone to overfitting since your training data is limited. Thus, it's always advised to add some [regularization](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c) to your model as well. Common choices include L2-penalty to the loss function, Dropout between layers, Early stopping and Batch-Normalization. Using Batch-Normalization [allows the omission](http://forums.fast.ai/t/batch-normalisation-vs-dropout/5172) of Dropout. If you missed our post on regularization, feel free to go through it where all of these have been [explained in detail](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c).\n",
120 | "\n",
121 | "If your task is reasonably similar to any other task where prior work has been done, it is advised to just copy the model (along with the weights) from the latter and use that as an initialization point for your task. This way of training is called **transfer learning**. For example, in the famous [Dogs Vs Cats Image Classification challenge on Kaggle](https://www.kaggle.com/c/dogs-vs-cats), a model pretrained on ImageNet which contained similar images, was used as the starting point to achieve the best performance, rather than training the model from scratch.\n",
122 | "\n",
123 | "\n",
124 | "Here, the large dataset of object images refers to ImageNet. Source: https://towardsdatascience.com/transfer-learning-using-differential-learning-rates-638455797f00\n",
125 | "\n",
126 | "Finally, some domains like Natural Language Processing (NLP) benefit tremendously from using unsupervised learning methods during initialization. In the current trend of Deep Learning applied to NLP, it's common to represent each word as an embedding (vector) and there exist unsupervised learning methods like [word2vec](https://en.wikipedia.org/wiki/Word2vec) and [GLoVe](https://nlp.stanford.edu/projects/glove/) for learning these word embeddings."
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "## 3. Determining Whether to Gather more data\n",
134 | "\n",
135 | "A rookie mistake that a lot of people make is that they keep trying different algorithms to improve the performance of their models, whereas simply improving the data they have or gathering more data can be the best source of improvement. We touched upon the topic of how to decide when to get more data, but since data is the most integral part of getting an AI solution working, we'll explore this in a bit more detail now.\n",
136 | "So, how do you decide when to get more data? Firstly, if the performance of your model on your training set is poor, it is not making full use of the information present in your data and in this case, you need to increase the complexity of your model by adding more layers or increasing the number of hidden units in each layer. Also, hyperparameter tuning is an important step to perform. You'd be surprised how large an effect choosing the right hyperparameters can have in getting your model working. For example, learning rate is THE [most important](https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2#7da2) hyperparameter that you need to tune. Setting the right value of the learning rate for your problem can save you loads of hours of wasted effort. However, if your model is reasonably complex and optimization carefully tuned but still the performance is not up to the desired level, the problem might be the quality of data instead, in which you have to go back to square one and start collecting cleaner data.\n",
137 | "If the training error is low but the validation error is much higher, then you can safely assume that your best would be to say:\n",
138 | "\n",
139 | "\n",
140 | "\n",
141 | "The specific situation mentioned above, where training error is low but test error is high, is called overfitting and is one of the most commonly occurring problems in training deep models, in which case regularization might help. To reinforce the importance of data in the modern deep networks, for those who might not be aware, the reason that Deep Learning started gaining attention was the ImageNet competition where a deep learning model outperformed the previous best model by a significantly large margin in 2012. ImageNet consists of millions of annotated images and the creation of similar large labelled datasets is the reason that extremely complex problems like object detection have become solved problems today.\n",
142 | "Finally, it's generally observed that adding a small fraction of the total number of examples won't have a noticeable effect on the performance. Thus, we need to monitor how much the performance of a model improves as the dataset size increases and it should be monitored at a logarithmic scale.\n",
143 | "\n",
144 | "\n",
145 | "\n",
146 | "As can been seen from the plot above, the training error will generally increase as you increase the dataset size. This is because the model will find it harder to fit to all the datapoints exactly now. Also, by increasing the dataset size, your validation (dev) error will decrease as the model would learn to be more generalized now."
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "## 4. Selecting Hyperparameters\n",
154 | "\n",
155 | "Most deep learning algorithms have a lot of hyperparameters that need to be chosen correctly. Different hyperparameters control different aspects of the model. Some affect the memory cost like the number of layers to use while others affect the performance like the keep probability for Dropout, learning rate, momentum, etc. Broadly, there are two approaches to choosing these hyperparameters. The first one is to choose them manually which involves understanding what the hyperparameters do and how they affect training and generalization. The other approach is to choose the hyperparameters automatically, which reduces the complexity a lot but comes at the cost of compute power. We'll discuss these two approaches in more details now:\n",
156 | "\n",
157 | "**i) Manual Hyperparameter Tuning:**\n",
158 | "\n",
159 | "As briefly mentioned above, manual hyperparameter tuning requires a lot of domain knowledge and fundamental understanding of training error, generalization error, learning theory, etc. The primary aim of manual hyperparameter tuning is to achieve effective capacity to match the complexity of the task by trading off memory and runtime. Factors influencing the effective capacity are representational capacity of the model, ability of the learning algorithm to minimize the cost function used to train the model and the degree to which the cost function and the training procedure regularize the model. \n",
160 | "\n",
161 | "The generalization error typically follows a U-shaped curve as shown below:\n",
162 | "\n",
163 | "\n",
164 | "On the extreme left, we are in the underfitting regime where the capacity of the model is low and both training and generalization errors are high. On the extreme right, we enter the overfitting regime where the training error is low but the gap between the training error and test error is high. The optimal spot is somewhere in the middle where we trade-off a slightly higher training error for the lowest possible generalization error. \n",
165 | "\n",
166 | "Many hyperparameters affect overfitting (or underfitting) and in different ways. For e.g. increasing certain hyperparameters like, the number of hidden units, increases the chances of overfitting, whereas increasing others like weight decay reduces. Some of them are discrete like the number of hidden units, whereas others might be binary like whether to use Batch Normalization or not. Some hyperparameters have bounds that implicitly restrict them, like the weight decay coefficient which can only *reduce* capacity. Thus, if the model is underfitting, you can't get it to overfit by adding weight decay.\n",
167 | "\n",
168 | "As mentioned before, if you can tune only one hyperparameter, tune the learning rate. The effective capacity of the model is the highest at the right learning rate, neither too high nor too low. We discuss in more details about the effect of learning rate on the training process in one of our [earlier posts](https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2#7da2), but to summarize: setting the learning rate too low slows training and might even cause the algorithm to get stuck in local minima; setting it too high might make the training unstable due to wild oscillations.\n",
169 | "\n",
170 | "\n",
171 | "\n",
172 | "If the training error is high, general approach is to add more layers or more hidden units to increase the capacity. If the training error is low but the test error is high, you need to reduce the gap between the train and test errors without increasing the training error too much. Usually, a sufficiently large model which is well-regularized (for e.g. by using Dropout, Batch Normalization, weight decay, etc.) works the best.\n",
173 | "\n",
174 | "The two broad approaches to achieve the final goal of a low generalization are: adding regularization to the model and increasing the dataset size. The table below shows how each hyperparameter affects capacity:\n",
175 | "\n",
176 | "\n",
177 | "\n",
178 | "**ii) Automatic Hyperparameter Optimization Algorithms:** Hyperparameter tuning can be viewed as an optimization process itself which optimizes an objective function, such as the validates, sometimes under constraints like training time, memory limits, etc. Thus, we can design *Hyperparameter Optimization (HO)* algorithms that wrap a learning algorithm and choose its hyperparameters. Unfortunately, these HO algorithms have their own set of hyperparameters, but these are generally easier to choose as would be discussed now:\n",
179 | "\n",
180 | "*Grid Search*: For grid search, first pick a range of values that you feel is suitable for each hyperparameter. Then, you train the model for each possible combination of the values of the hyperparameters. To simplify, if you have 2 hyperparameters and pick a range of N values for each of them, you'll need to train the model for all the possible $N^2$ combinations. You generally set the maximum and minimum of the range based on your understanding (and/or experience) and then choose the values in between, generally, on a logarithmic scale. For e.g. possible values for learning rate: {0.1, 0.01, 0.001}, number of hidden units: {50, 100, 200, 400}, etc.\n",
181 | "\n",
182 | "Also, Grid search works best when performed repeatedly. E.g. if the range that you set was {0.1, 0, 1} and the best performing one was 1, you probably set the range wrong and you should check again for a higher range like {1, 2, 3}. In case the best performing value comes out as 0, then you should do a more refined search between {-0.1, 0, 0.1}. \n",
183 | "The main problem with Grid Search is the computational cost. If there are m hyperparameters to be tuned, and each of them can take N values, the number of training and evaluation trials grows as O($N^m$).\n",
184 | "\n",
185 | "*Random Search*: A better and faster approach is something known as random search. In this case, you define some distribution over the choice of values, e.g.binomial for binary, multinomial for discrete, uniform on a log-scale for say, learning rate:\n",
186 | "\n",
187 | "\n",
188 | "\n",
189 | "Then, for each run, randomly sample the value of each hyperparameter based on its distribution. This can prove to be exponentially more efficient than grid search. The figure below explains this:\n",
190 | "\n",
191 | "\n",
192 | "\n",
193 | "To make it clearer, the main reason that random search reduces validation error faster than grid search is that it doesn't perform any wasted computation. Since grid search goes over all possible combinations, it'll evaluate cases where the value of only one hyperparameter changes, with the values of the rest being the same. Now, if this hyperparameter doesn't affect the performance too much, then grid search has performed a wasted evaluation. However, in the case of random search, for different values of a hyperparameter, the values of the rest of the hyperparameters would most likely also be different. Thus, random search doesn't do any wasted evaluation.\n",
194 | "\n",
195 | "*Model-based Hyperparameter Optimization*: As mentioned briefly above, hyperparameter tuning can be viewed as an optimization process. In simplified settings, it might be possible to take the gradient of some differentiable error measure on the validation set with respect to the hyperparameters and simply use Gradient Descent. However, in most practical settings, this gradient is not unavailable. To compensate for this, you can build a model of the validation error and perform optimization on this model. A general approach is to build a Bayesian regression model to estimate the expected value of the validation error along with the uncertainty around this estimate. Bayesian Hyperparameter Optimization (BHO) is still nascent and not sufficiently reliable. One major drawback compared to random search would be that BHO requires each experiment to go till completion to be able to extract any information out of it, whereas in many cases it might be clearly visible at the initial stages itself that that particular set of hyperparameters doesn't work."
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "## 5. Debugging Strategies\n",
203 | "\n",
204 | "- *Visualize the model in action*: This is one of the best ways to verify if the training is going correctly and also, understanding which areas might need improvement. Once the training starts, visualize the output of your model after a few epochs. If you're working on a semantic segmentation problem, look at the segmentation output. If you're training a generative model of speech, listen to a few sample of speech that it produces. Also, it's common to have bugs in the evaluation metric as they might need corner-case handling which you might not have taken care of. Evaluation bugs are the hardest ones to catch and they fool you into believing that your model is performing/not performing well.\n",
205 | "\n",
206 | "\n",
207 | "\n",
208 | "- *Visualize the worst mistakes*: Going back to the semantic segmentation problem above, suppose we run the model on our test set. Based on the IoU scores, we can sort the samples to identify where our model performed the worst. Visualizing those examples where the model fails terribly, is a great way to identify errors in data processing/annotation. In the case where you infer that the problem had been with the annotation of data, the best way to improve performance would be to actually correct the annotations, even manually if required, as the payoff of having the correct data is very high.\n",
209 | "\n",
210 | "\n",
211 | "\n",
212 | "Google misclassified the photo of humans as that of gorillas. It came under some scrutiny for having this bias in its algorithms\n",
213 | "\n",
214 | "- *Fit a tiny dataset*: Before starting to train on your entire training set, always fit your model to a small subset of the entire dataset. Even very simple models will overfit to a handful of examples. Taking the extreme case of a single example, it's very easy to correctly fit to it by setting the weights to zero and the biases appropriately. From my practical experience too, if you're making a modification or trying something different, first make sure that it can overfit on a small enough dataset. If it can't, then there's a high probability that there's been a software bug in setting up the training process.\n",
215 | "\n",
216 | "- *Monitor histograms of activations and gradients*: It can be useful to monitor the pre-activation values of hidden units in case there is a problem in training. What to monitor depends on the type of activation function used. For example, in the case of ReLU (commonly used between layers), we can check how often is the unit off (which would happen if the pre-activation value is < 0). In the case of sigmoidal units, it can be useful to check how often does it stay in the saturated regions, i.e. either too positive or too negative. Also, if the gradients grow or vanish too quickly, it can be a problem during training. It has advised in the book that the magnitude of the gradient should be approximately 1% of the magnitude of the parameter, neither too high (50%) nor too low (0.001%). Thus, comparing the two magnitudes can be a good approach for debugging too. \n",
217 | "Finally, it can be shown (covered in later chapters) that some optimization algorithms provide certain guarantees, like the objective function not increasing after each epoch, all the gradients being zero at convergence, etc. and we can ensure that these guarantees are met."
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": []
226 | }
227 | ],
228 | "metadata": {
229 | "kernelspec": {
230 | "display_name": "Python 2",
231 | "language": "python",
232 | "name": "python2"
233 | },
234 | "language_info": {
235 | "codemirror_mode": {
236 | "name": "ipython",
237 | "version": 2
238 | },
239 | "file_extension": ".py",
240 | "mimetype": "text/x-python",
241 | "name": "python",
242 | "nbconvert_exporter": "python",
243 | "pygments_lexer": "ipython2",
244 | "version": "2.7.12"
245 | }
246 | },
247 | "nbformat": 4,
248 | "nbformat_minor": 2
249 | }
250 |
--------------------------------------------------------------------------------
/07 - Regularization for Deep Learning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Deep Learning Book (Simplified)\n",
8 | "## Part II - Modern Practical Deep Networks\n",
9 | "*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)\n",
10 | "where we are attempting to provide a summary of each chapter highlighting the concepts \n",
11 | "that we found to be most important so that other people can use it as a starting point\n",
12 | "for reading the chapters, while adding further explanations on few areas that we found difficult to grasp. Please refer [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on \n",
13 | "notation.*\n",
14 | "\n",
15 | "\n",
16 | "## Chapter 7: Regularization for Deep Learning\n",
17 | "\n",
18 | "Recalling from Chapter 5, **overfitting** is said to occur when the training error keeps decreasing but the test error (or the generalization error) starts increasing. **Regularization** is the modification we make to a learning algorithm that reduces its generalization error, but not its training error. There are various ways of doing this, some of which include restriction on parameter values or adding terms to the objective function, etc.\n",
19 | "\n",
20 | "These constraints are designed to encode some sort of prior knowledge, with a preference towards simpler models to promote generalization (See [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor)). The sections present in this chapter are listed below:
\n",
21 | "\n",
22 | "**1. Parameter Norm Penalties**
\n",
23 | "**2. Norm Penalties as Constrained Optimization**
\n",
24 | "**3. Regularization and Under-Constrained Problems**
\n",
25 | "**4. Dataset Augmentation**
\n",
26 | "**5. Noise Robustness**
\n",
27 | "**6. Semi-Supervised Learning**
\n",
28 | "**7. Mutlitask Learning**
\n",
29 | "**8. Early Stopping**
\n",
30 | "**9. Parameter Tying and Parameter Sharing**
\n",
31 | "**10. Sparse Representations**
\n",
32 | "**11. Bagging and Other Ensemble Methods**
\n",
33 | "**12. Dropout**
\n",
34 | "**13. Adversarial Training**
\n",
35 | "**14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier**
"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### 1. Parameter Norm Penalties\n",
43 | "\n",
44 | "The idea here is to limit the capacity (the space of all possible model families) of the model \n",
45 | "by adding a parameter norm
\n",
46 | "penalty, $\\Omega(\\theta)$, to the objective function, $J$:\n",
47 | "\n",
48 | "$$ \\tilde{J}(\\theta; X, y) = J(\\theta; X, y) + \\lambda \\Omega(\\theta)$$\n",
49 | "\n",
50 | "Here, $\\theta$ represents only the weights and not the biases, the reason being that the biases require much less data to fit and do not add much variance.\n",
51 | "\n",
52 | "**1.1 $L^2$ Parameter Regularization**\n",
53 | "\n",
54 | "Here, the parameter norm penalty:\n",
55 | "$$\\Omega(\\theta) = \\frac {||w||_2^2} {2}$$\n",
56 | "\n",
57 | "This makes the objective function:\n",
58 | "\n",
59 | "$$ \\tilde{J} (\\theta; X, y) = J(\\theta; X, y) + \\alpha \\frac {w^T w} {2} $$\n",
60 | "\n",
61 | "Applying the 2nd order Taylor-Series approximation at the point $w^*$ where $\\tilde{J} (\\theta; X, y)$ assumes the minimum value, i.e., $\\bigtriangledown_w \\tilde {J} (w^*) = 0$:\n",
62 | "\n",
63 | "$$ \\hat{J}(w) = J(w^*) + \\frac{(w - w^*)^T H(J(w^*))(w - w^*)} {2} $$\n",
64 | "\n",
65 | "Finally, $\\bigtriangledown_w \\hat{J}(w) = H(J(w^*))(w - w^*)$ and the overall gradient of the objective function becomes:\n",
66 | "\n",
67 | "$$ \\bigtriangledown_w \\tilde{J}(w) = H(J(w^*))(\\tilde{w} - w^*) + \\alpha \\tilde{w} = 0$$\n",
68 | "$$ \\tilde{w} = (H + \\alpha I)^{-1} H w^* $$\n",
69 | "\n",
70 | "As $\\alpha$ approaches 0, $w$ comes closer to $w^*$. Finally, since $H$ is real and symmetric, it can be decomposed into a diagonal matrix $\\wedge$ and an orthonormal set of eigenvectors, $Q$. That is, $H = Q^T\\wedge Q$.\n",
71 | "\n",
72 | "\n",
73 | "\n",
74 | "Because of the marked term, the value of each weight is rescaled along the eigenvectors of $H$. The value of the weights along the $i^{th}$ eigenvector is rescaled by $\\frac {\\lambda_i}{\\lambda_i + \\alpha}$, where $\\lambda_i$ represents the eigenvalue corresponding to the $i^{th}$ eigenvector.\n",
75 | "\n",
76 | "| Condition| Effect of regularization|\n",
77 | "| --- | --- |\n",
78 | "| $\\lambda_i >> \\alpha$ | Not much |\n",
79 | "| $\\lambda_i << \\alpha$ | The weight value almost shrunk to zero |\n",
80 | "\n",
81 | "The diagram below illustrates this well.\n",
82 | "\n",
83 | "\n",
84 | "\n",
85 | "To look at its application to Machine Learning, we have to look at linear regression. The objective function there is exactly quadratic, given by:\n",
86 | "\n",
87 | "\n",
88 | "\n",
89 | "**1.2 $L^1$ parameter regularization**\n",
90 | "\n",
91 | "Here, the parameter norm penalty:\n",
92 | "$$\\Omega(\\theta) = ||w||_1 $$\n",
93 | "\n",
94 | "Making the gradient of the overall objective function:\n",
95 | "\n",
96 | "$$ \\bigtriangledown_w \\tilde{J}(\\theta; X, y) = \\bigtriangledown_w J(\\theta; X, y) + \\alpha * sign(w) $$\n",
97 | "\n",
98 | "Now, the last term, sign(w), create a difficulty that the gradient no longer scales linearly with $w$. This leads to a few complexities in arriving at the optimal solution (which I am going to skip):\n",
99 | "\n",
100 | "\n",
101 | "Our current interpretation of the `max` term is that, there shouldn't be a zero crossing, as the gradient of the absolute value function is not differentiable at zero.\n",
102 | "\n",
103 | "\n",
104 | "\n",
105 | "\n",
106 | "Thus, $L^1$ regularization has the property of sparsity, which is its fundamental distinguishing feature from $L^2$. Hence, $L^1$ is used for feature selection as *LASSO*."
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "### 2. Norm penalties as constrained optimization\n",
114 | "\n",
115 | "From chapter 4's section 4, we know that to minimize any function under some constraints, we can construct a generalized Lagrangian function containing the objective function along with the penalties. Suppose we wanted $\\Omega(\\theta)) < k$, then we could construct the following Lagrangian:\n",
116 | "\n",
117 | "\n",
118 | "Thus, $\\theta^* = argmin_{\\theta} (max_{\\alpha, \\alpha >= 0} \\hspace{.2cm} \\mathcal{L}(\\theta, \\alpha; X, y))$. If $\\Omega(\\theta) > k$, then $\\alpha$ should be large to reduce its value below k.
\n",
119 | "Likewise, if $\\Omega(\\theta) < k$, then $\\alpha$ should be small. Assuming $\\alpha$ to be a constant $\\alpha^{*}$:\n",
120 | "\n",
121 | "$$ \\theta^* = argmin_{\\theta} \\hspace{.2cm} J(\\theta; X, y) + \\alpha^* \\Omega(\\theta)$$\n",
122 | "\n",
123 | "This is now similar to the parameter norm penalty regularized objective function. Thus, parameter norm penalties naturally impose a constraint, like the L2-regularization defining a constrained L2-ball. Larger $\\alpha$ means a smaller constrained region and vice versa. The idea of constraints over penalties, is important for several reasons. Penalties might cause non-convex optimization algorithms to get stuck in local minima due to small values of $\\theta$, leading to the formation of so-called `dead cells`, as the weights entering and leaving them are too small to have an impact. Constraints don't enforce the weights to be near zero, rather being confined to a constrained region.\n",
124 | "\n",
125 | "Another reason is that constraints induce higher stability. With higher learning rates, there might be a large weight, leading to a large gradient, which could go on iteratively leading to numerical overflow in the value of $\\theta$. Constrains along with reprojection (to the corresponding ball) prevent the weights from becoming too large, thus, maintaining stability. \n",
126 | "\n",
127 | "A final suggestion made by Hinton was to restrict the individual column norms of the weight matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit from having a large weight."
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "### 3. Regularized & Under-constrained problems\n",
135 | "\n",
136 | "Underdetermined problems are those problems that have infinitely many solutions. In some machine learning problems, regularization is necessary. For e.g., many algorithms (e.g. PCA) require the inversion of $X^TX$, which might be singular. In such a case, we can use a regularized form instead. $(X^TX + \\alpha I)$ is guaranteed to be invertible. A logistic regression problem having linearly separable classes with $w$ as a solution, will always have $2w$ as a solution and so on.\n",
137 | "\n",
138 | "Regularization can solve underdetermined problems. For e.g. the Moore-Pentose pseudoinverse defined in a previous chapter is given as:\n",
139 | "\n",
140 | "\n",
141 | "This can be seen as performing a linear regression with L2-regularization. "
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "### 4. Data augmentation\n",
149 | "\n",
150 | "Having more data is the most desirable thing to improving a machine learning model's performance. In many cases, it is relatively easy to artifically generate data. For a classification task, we desire for the model to be invariant to certain types of transformations, and we can generate the corresponding $(x, y)$ pairs by translating the input $x$. But for certain problems, like density estimation, we can't apply this directly unless we have already solved the density estimation problem. \n",
151 | "\n",
152 | "However, caution needs to be mentioned while data augmentation to make sure that the class doesn't change. For e.g., if the labels contain both \"b\" and \"d\", then horizontal flipping would be a bad idea for data augmentation. Add random noise to the inputs is another form of data augmentation, while adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.\n",
153 | "\n",
154 | "Finally, when comparing machine learning models, we need to evaluate them using the same hand-designed data augmentation schemes or else it might happen that algorithm A outperforms algorithm B, just because it was trained on a dataset which had more / better data augmentation."
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "### 5. Noise Robustness\n",
162 | "\n",
163 | "Noise with infinitesimal variance imposes a penalty on the norm of the weights. Noise added to hidden units is very important and is discussed later in **12. Dropout**. Noise can even be added to the weights. This has several interpretations. One of them is that adding noise to weights is a stochastic implementation of Bayesian inference over the weights, where the weights are considered to be uncertain, with the uncertainty being modelled by a probability distribution. It is also interpreted as a more traditional form of regularization by ensuring stability in learning. \n",
164 | "\n",
165 | "For e.g. in the linear regression case, we want to learn the mapping $y(x)$ for each feature vector $x$, by reducing the mean square error.\n",
166 | "\n",
167 | "$$ J = E_{p(x, y)} [\\hat{y} (x) - y] ^ 2 $$\n",
168 | "\n",
169 | "Now, suppose a random noise $\\epsilon_w \\in \\mathcal{N}(\\epsilon; 0, \\eta I)$ is added to the weights, we get the output $\\hat{y}_{\\epsilon_w}(x)$ and still want to learn this through reducing the mean square. Minimizing the loss after adding noise to the weights, is equivalent to adding another regularization term, $\\eta E_{p(x, y)}(\\bigtriangledown_w \\hat{y}(x))$, which makes sure that small perturbations in the weight values don't affect the predictions much, thus stabilising training.\n",
170 | "\n",
171 | "**5.1 Injecting noise at output targets**\n",
172 | "\n",
173 | "Sometimes we may have the wrong output labels, in which case maximizing $p(y \\hspace{.1cm} | \\hspace{.1cm} x)$ may not be a good idea. In such a case, we can add noise to the labels by assigning a probability of (1 - $\\epsilon$) that the label is correct and a probability of $\\epsilon$ that it is not. In the latter case, all the other labels are equally likely. **Label Smoothing** regularizes a model with $k$ softmax outputs by assigning the classification targets as (1 - $\\epsilon$) and $\\frac {\\epsilon} {k-1}$."
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "### 6. Semi-Supervised Learning\n",
181 | "\n",
182 | "`P(x, y)` denotes the joint distribution of *x* and *y*, i.e., corresponding to training sample *x*, I have a label *y*. `P(x)` denotes just the distribution of *x*, i.e., just the training examples without any labels. In **Semi-supervised Learning**, we use both `P(x, y)` and `P(x)` to estimate `P(y | x)`. We want to learn some representation `h = f(x)` such that samples from the same class have similar distributions and a linear classfier in the new space achieves better generalization error.\n",
183 | "\n",
184 | "Instead of separating the supervised and unsupervised criteria, we can instead have a generative model of `P(x)` or `P(x, y)` which shares parameters with the discriminative model, where the shared parameters encode the prior belief that `P(x)` (or `P(x, y)`) is connected to `P(y | x)`."
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "### 7. Multitask Learning\n",
192 | "\n",
193 | "The idea is to improve the generalization error by pooling together examples from multiple tasks. Similar to how more data leads to more generalizability, using a part of the model for different tasks constrains that part to learn good values. There are two types of model parts:\n",
194 | "\n",
195 | "- Task specific parameters: These parameters benefit only from that particular task.\n",
196 | "- Generic parameters, shared across all tasks: These are the ones which benefit from learning through various tasks.\n",
197 | "\n",
198 | "\n",
199 | "\n",
200 | "Multitask learning leads to better generalization when there is actually some relationship between the tasks, which actually happens in the context of Deep Learning where some of the factors, which explain the variation observed in the data, are shared across different tasks. "
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "### 8. Early Stopping\n",
208 | "\n",
209 | "After a certain point of time during training, for a model with extremely high representational capacity, the training error continues to decrease but the validation error begins to increase. In such a scenario, a better idea woud be to return back to the point where the validation error was the least. Thus, we need to keep calculating the validation accuracy after each epoch and if there is any improvement, we store that parameter setting and upon termination of training, we return the last *saved* parameters.
\n",
210 | "\n",
211 | "\n",
212 | "\n",
213 | "The idea of **Early Stopping** is that if the validation error doesn't improve over a certain fixed number of iterations, we terminate the algorithm. This effectively reduces the capacity of the model by reducing the number of steps required to fit the model. The evaluation on the validation set can be done both on another GPU in parallel or done after the epoch. A drawback of weight decay was to manually tweak the weight decay coefficient, which, if chosen wrongly, can lead the model to local minimia by squashing the weight values too much. In Early Stopping, no such parameter needs to be tweaked which affects the model dynamics.\n",
214 | "\n",
215 | "However, since we are setting aside some part of the training data for validation, we are not using the complete training set. So, once Early Stopping is done, a second phase of training can be done where the complete training set is used. There are two choices here:\n",
216 | "\n",
217 | "- Train from scratch for the same number of steps as in the Early Stopping case.\n",
218 | "- Use the weights learned from the first phase of training and retrain using the complete data.\n",
219 | "\n",
220 | "Other than lowering the number of training steps, it reduces the computational cost also by regularizing the model without having to add additional penalty terms. It affects the optimization procedure by restricting it to a smal volume of the parameter space, in the neighbourhood of the initial parameters ($\\theta_0$). Suppose $\\tau$ and $\\epsilon$ represent the number of iterations and the learning rate respectively. Then, $\\epsilon\\tau$ effectively represents the capacity of the model. Intuitively, this can be seen as the inverse of the weight decay co-efficient $\\lambda$. When $\\epsilon\\tau$ is small (or $\\lambda$ is large), the parameter space is small and vice versa. We show this equivalence holds true for a linear model with quadratic cost function (initial parameters $w^{(0)} = 0$). Taking the Taylor Series Approximation of $J(w)$ around the empirically optimal weights $w^*$:\n",
221 | "\n",
222 | "\n",
223 | "\n",
224 | "\n",
225 | "\n",
226 | "$$ w^{(\\tau)} - w^* = (I - \\epsilon Q \\wedge Q^T)(w^{(\\tau - 1)} - w^*) $$\n",
227 | "\n",
228 | "multiplying with $Q^T$ on both sides and using the fact that $Q^TQ = I$\n",
229 | "\n",
230 | "$$ Q^T(w^{(\\tau)} - w^*) = (Q^TI - \\epsilon \\wedge Q^T)(w^{(\\tau - 1)} - w^*) $$ \n",
231 | "\n",
232 | "\n",
233 | "$$ Q^T(w^{(\\tau)} - w^*) = (I - \\epsilon \\wedge)Q^T(w^{(\\tau - 1)} - w^*) $$\n",
234 | "\n",
235 | "Assuming $\\epsilon$ to be small enough that $|1 - \\epsilon \\lambda_i| < 1$, the parameter trajectory after $\\tau$ steps of training:\n",
236 | "\n",
237 | "$$ Q^Tw^{(\\tau)} = (I - (I - \\epsilon \\wedge)^{\\tau})Q^Tw^* $$\n",
238 | "\n",
239 | "The equation for $L2$ regularization is given by:\n",
240 | "\n",
241 | "$$ Q^Tw^{(\\tau)} = (\\wedge + \\alpha I)^{-1} \\wedge Q^Tw^* $$\n",
242 | "\n",
243 | "$$ (\\wedge + \\alpha I)^{-1} (\\wedge + \\alpha I) = I $$\n",
244 | "\n",
245 | "$$ \\Rightarrow (\\wedge + \\alpha I)^{-1} \\wedge = I - (\\wedge + \\alpha I)^{-1} \\alpha$$\n",
246 | "\n",
247 | "$$ \\Rightarrow Q^Tw^{(\\tau)} = (I - (\\wedge + \\alpha I)^{-1} \\alpha) Q^Tw^* $$\n",
248 | "\n",
249 | "Thus, if the hyperparameters $\\epsilon$, $\\alpha$ & $\\tau$ such that:\n",
250 | "\n",
251 | "$$ (\\wedge + \\alpha I)^{-1} \\alpha = (I - \\epsilon \\wedge)^{\\tau} $$\n",
252 | " \n",
253 | "L2-regularization can be seen as equivalent to Early Stopping and on further simplification, we get, $\\epsilon \\tau \\approx \\frac {1} {\\lambda}$"
254 | ]
255 | },
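    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "Below is a minimal sketch of the patience-based loop described above. The `train_epoch`, `evaluate`, `get_params` and `set_params` callables are hypothetical placeholders, not part of the book:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import copy\n",
    | "\n",
    | "def early_stopping(model, train_epoch, evaluate, patience=10):\n",
    | "    # train_epoch(model) runs one pass over the training set;\n",
    | "    # evaluate(model) returns the current validation error.\n",
    | "    best_error = float('inf')\n",
    | "    best_params = None\n",
    | "    epochs_without_improvement = 0\n",
    | "    while epochs_without_improvement < patience:\n",
    | "        train_epoch(model)\n",
    | "        error = evaluate(model)\n",
    | "        if error < best_error:\n",
    | "            best_error = error\n",
    | "            best_params = copy.deepcopy(model.get_params())  # store this setting\n",
    | "            epochs_without_improvement = 0\n",
    | "        else:\n",
    | "            epochs_without_improvement += 1\n",
    | "    model.set_params(best_params)  # restore the last *saved* parameters\n",
    | "    return model, best_error"
    | ]
    | },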
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "### 9. Parameter Tying and Parameter Sharing\n",
261 | "\n",
262 | "Till now, most of the methods focused on bringing the weights to a fixed point, e.g. 0 in the case of norm penalty. However, there might be situations where we might have some prior knowledge on the kind of dependencies that the model should encode. Suppose, two models A and B, perform a classification task on similar input and output distributions. In such a case, we'd expect the parameters ($W_a$ and $W_b$) to be similar to each other as well. We could impose a norm penalty on the distance between the weights, but a more popular method is to **force** the set of parameters to be equal. This is the essence behind **Parameter Sharing**. A major benefit here is that we need to store only a subset of the parameters (e.g. storing only $W_a$ instead of both $W_a$ and $W_b$) which leads to large memory savings. In the example of Convolutional Neural Networks or CNNs (discussed in Chapter 9), the feature is computed across different regions of the image and hence, a cat is detected irrespective of whether it is at position `i` or `i+1`."
263 | ]
264 | },
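    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A minimal numpy sketch of the two options above: a norm penalty pulling $W_a$ and $W_b$ towards each other (parameter tying) versus literally sharing one array (parameter sharing). The shapes and the coefficient are arbitrary illustrations:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "W_a = rng.randn(4, 3)\n",
    | "W_b = rng.randn(4, 3)\n",
    | "\n",
    | "# Parameter tying: add a penalty on the distance between the two weight sets.\n",
    | "alpha = 0.1  # illustrative penalty coefficient\n",
    | "omega = alpha * np.sum((W_a - W_b) ** 2)   # added to the training objective\n",
    | "grad_W_a = 2 * alpha * (W_a - W_b)         # its gradient w.r.t. W_a\n",
    | "\n",
    | "# Parameter sharing: both models point to the same array, so only\n",
    | "# one copy of the weights is ever stored.\n",
    | "W_shared = W_a\n",
    | "model_b_weights = W_shared\n",
    | "print(model_b_weights is W_a)   # True"
    | ]
    | },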
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "### 10. Sparse Representations\n",
270 | "\n",
271 | "We can place penalties on even the activation values of the units which indirectly imposes a penalty on the parameters. This leads to representational sparsity, where many of the activation values of the units are zero. \n",
272 | "\n",
273 | "\n",
274 | "\n",
275 | "Another idea could be to average the activation values across various examples and push it towards some value. An example of getting representational sparsity by imposing hard constraint on the activation value is the **Orthogonal Matching Pursuit (OMP) ** algorithm, where a representation `h` is learned for the input `x` by solving the constrained optimization problem:\n",
276 | "\n",
277 | "$$ arg min_{h, ||h||_b < k} ||x - Wh||^2 $$\n",
278 | "\n",
279 | "where $||h||_b$ indicates the number of non-zero entries. The problem can be solved efficiently when $W$ is restricted to be orthogonal.\n"
280 | ]
281 | },
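    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "When $W$ is orthogonal, the problem above has a simple closed-form solution: project $x$ onto the dictionary and keep only the $k$ largest-magnitude coefficients. A minimal numpy sketch, with a random orthogonal dictionary purely for illustration:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "W, _ = np.linalg.qr(rng.randn(8, 8))   # a random orthogonal dictionary\n",
    | "x = rng.randn(8)\n",
    | "k = 3                                  # allowed number of non-zero entries\n",
    | "\n",
    | "# With orthogonal W, the optimal h is W^T x with all but the\n",
    | "# k largest-magnitude coefficients zeroed out.\n",
    | "h = W.T.dot(x)\n",
    | "h[np.argsort(np.abs(h))[:-k]] = 0.0\n",
    | "\n",
    | "print(np.count_nonzero(h))            # 3\n",
    | "print(np.linalg.norm(x - W.dot(h)))   # reconstruction error"
    | ]
    | },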
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
286 | "### 11. Bagging and Other Ensemble Methods\n",
287 | "\n",
288 | "The techniques which train multiple models and take the maximum vote across those models for the final prediction are called ensemble methods. The idea is that it's highly unlikely that multiple models would make the same error in the test set. \n",
289 | "\n",
290 | "Suppose that we have `K` regression models, with the $i^{th}$ model making an error $\\epsilon_i$ on each example, where $\\epsilon_i$ is drawn from a zero mean, multivariate normal distribution such that: $ \\mathbb{E}(\\epsilon_i^2) = v$ and $\\mathbb{E} (\\epsilon_i \\epsilon_j) = c$. The error on each example is then the average across all the models: $\\frac {\\sum_i \\epsilon_i} {K}$.\n",
291 | "\n",
292 | "The mean of this average error is 0 (as the mean of each of the individual $\\epsilon_i$ is 0). The variance of the average error is given by:\n",
293 | "\n",
294 | "\n",
295 | "$$ \\mathbb{E} \\Big( \\frac {\\sum_i \\epsilon_i} {K} \\Big)^2 = \\frac { \\mathbb{E} (\\sum_i \\epsilon_i^2 + \\sum_i \\sum_{j \\neq i} \\epsilon_i \\epsilon_j)} {K^2}$$\n",
296 | "\n",
297 | "$$ \\Rightarrow \\mathbb{E} \\Big( \\frac {\\sum_i \\epsilon_i} {K} \\Big)^2 = \\frac { \\mathbb{E} \\sum_i \\epsilon_i^2} {K^2} + \\frac {\\sum_i \\sum_{j \\neq i} \\mathbb{E}(\\epsilon_i \\epsilon_j)} {K^2}$$\n",
298 | "\n",
299 | "$$ \\Rightarrow \\mathbb{E} \\Big( \\frac {\\sum_i \\epsilon_i} {K} \\Big)^2 = \\frac {K * v} {K^2} + \\frac {K * (K-1) c} {K^2}$$\n",
300 | "\n",
301 | "$$ \\Rightarrow \\mathbb{E} \\Big( \\frac {\\sum_i \\epsilon_i} {K} \\Big)^2 = \\frac {v} {K} + \\frac {(K-1) c} {K}$$\n",
302 | "\n",
303 | "Thus, if `c = v`, then there is no change. If `c = 0`, then the variance of the average error decreases with K. There are various ensembling techniques. In the case of Bagging (Bootstrap Aggregating), the same training algorithm is used multiple times. The dataset is broken into K parts by sampling with replacement (see figure below for clarity) and a model is trained on each of those K parts. Because of sampling with replacement, the K parts have a few similarities as well as a few differences. These differences cause the difference in the predictions of the K models. Model averaging is a very strong technique.\n",
304 | "\n",
305 | ""
306 | ]
307 | },
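    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A quick numerical check of the variance formula above, using correlated Gaussian errors; the values of $K$, $v$ and $c$ are arbitrary:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "K, v, c = 10, 1.0, 0.3    # 10 models, error variance 1.0, covariance 0.3\n",
    | "rng = np.random.RandomState(0)\n",
    | "\n",
    | "# Covariance matrix with v on the diagonal and c everywhere else.\n",
    | "cov = np.full((K, K), c) + (v - c) * np.eye(K)\n",
    | "errors = rng.multivariate_normal(np.zeros(K), cov, size=200000)\n",
    | "\n",
    | "print(errors.mean(axis=1).var())   # empirical variance of the averaged error\n",
    | "print(v / K + (K - 1) * c / K)     # predicted: 0.37"
    | ]
    | },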
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "### 12. Dropout\n",
313 | "\n",
314 | "**Dropout** is a computationally inexpensive yet, powerful regularization technique. The problem with bagging is that we can't train an exponentially large number of models and store them for prediction later. Dropout makes bagging practical by making an inexpensive approximation. In a simplistic view, dropout trains the ensemble of all sub-networks formed by randomly removing a few non-output units by multiplying their outputs by $0$. For every training sample, a mask is computed for all the input and hidden units independently. For clarification, suppose we have $h$ hidden units in some layer. Then, a mask for that layer refers to a $h$ dimensional vector with values either $0$ (remove the unit) or $1$ (keep the unit).\n",
315 | "\n",
316 | "There are a few differences from bagging though:\n",
317 | "\n",
318 | "- In bagging, the models are independent of each other, whereas in dropout, the different models share parameters, with each model taking as input, a sample of the total parameters.\n",
319 | "\n",
320 | "- In bagging, each model is trained till convergence, but in dropout, each model is trained for just one step and the parameter sharing makes sure that subsequent updates ensure better predictions in the future.\n",
321 | "\n",
322 | "At test time, we combine the predictions of all the models. In the case of bagging with K models, this was given by the arithmetic mean, $\\frac {\\sum_i p^i (y \\hspace{.1cm} | \\hspace{.1cm} x)} {K}$. In case of dropout, the probability that a model is chosen is given by $p(\\mu)$, with $\\mu$ denoting the mask vector. The prediction then becomes $ {\\sum_{\\mu} p(\\mu) p (y \\hspace{.1cm} | \\hspace{.1cm} x, \\mu)} $. This is not computationally feasible, and there's a better method to compute this in one go, using the geomtric mean instead of the arithmetic mean.\n",
323 | "\n",
324 | "We need to take care of two main things when working with geometric mean:\n",
325 | "- None of the probabilities should be zero.\n",
326 | "- Re-normalization to make sure all the probabilities sum to 1.\n",
327 | "\n",
328 | "\n",
329 | "\n",
330 | "\n",
331 | "The advantage for dropout is that $\\tilde{p}_{ensemble} (y^{'} | x)$ can be approximate in one pass of the complete model by dividing the weight values by the keep probability (**weight scaling inference rule**). The motivation behind this, is to capture the right expected values from the output of each unit, i.e. the total expected input to a unit at train time is equal to the total expected unit at test time. A big advantage of dropout then, is that it doesn't place any restricted of the *type* of model or training procedure to use.\n",
332 | "\n",
333 | "**Points to note**:\n",
334 | "- Reduces the representational capacity of the model and hence, the model should be large enough to begin with.\n",
335 | "- Works better with more data.\n",
336 | "- Equivalent to L2 for linear regression, with different weight decay coefficient for each input feature.\n",
337 | "\n",
338 | "However, stochasticity is not necessary for regularization (see Fast Dropout), neither sufifficient (see Dropout Boosting). \n",
339 | "\n",
340 | "**Biological Interpretration**: During sexual reproduction, genes could be swapped between organisms if they are unable to correctly adapt to the unusual features of any organism. Thus, the units in dropout learn to perform well regardless of the presence of other hidden units, and also in many different contexts.\n",
341 | "\n",
342 | "Adding noise in the hidden layer is more effective than adding noise in the input layer. For e.g. if some unit $h_i$ learns to detect a nose in a face recognition task. Now, if this $h_i$ is removed, then some other unit either learns to redundantly detect a nose or associates some other feature (like mouth) for recognising a face. In either way, the model learns to make more use of the information in the input. On the other hand, adding noise to the input won't completely removed the nose information, unless the noise is so large as to remove most of the information from the input.\n"
343 | ]
344 | },
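    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A minimal numpy sketch of dropout on one layer, including the weight scaling inference rule described above. The layer sizes, keep probability and ReLU activation are arbitrary choices for illustration:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "keep_prob = 0.8\n",
    | "W = rng.randn(5, 4)    # one layer: 5 inputs -> 4 hidden units\n",
    | "x = rng.randn(5)\n",
    | "\n",
    | "# Training time: sample a binary mask and drop units by multiplying by 0.\n",
    | "mask = (rng.rand(5) < keep_prob).astype(float)\n",
    | "h_train = np.maximum(0, (x * mask).dot(W))\n",
    | "\n",
    | "# Test time (weight scaling inference rule): no mask; scale the weights\n",
    | "# by keep_prob so each unit's expected total input matches training time.\n",
    | "h_test = np.maximum(0, x.dot(W * keep_prob))"
    | ]
    | },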
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "### 13. Adversarial Training\n",
350 | "\n",
351 | "Deep Learning has outperformed humans in the task of Image Recognition (Reference: ImageNet), which might lead us to believe that these models have acquired a human-level understanding of an image. However, experimentally searching for an $x^{'}$ for a given $x$, such that prediction made by the model changes, shows other wise. As shown in the image below, although the newly formed image (adversarial image) looks almost exactly the same to a human, the model classifies it wrongly and with very high confidence. \n",
352 | "\n",
353 | "\n",
354 | "\n",
355 | "**Adversarial training** refers to training on images which are adversarially generated and it has been shown to reduce the error rate. The main factor attributed to the above mentioned behaviour is the linearity of the model, caused by the main building blocks being primarily linear. Thus, a small change of $\\epsilon$ in the input causes a drastic change of $W\\epsilon$ in the output. The idea of adversarial training is to avoid this jumping and induce the model to be locally constant in the neighborhood of the training data.\n",
356 | "\n",
357 | "This can also be used in semi-supervised learning. For an unlabelled sample $x$, we can assign the label $\\hat{y}(x)$ using our model. Then, we find an adversarial example, $x^{'}$, such that $y(x^{'}) \\neq \\hat{y}(x)$ (an adversary found this way is called virtual adversarial example). The objective then is to assign the same class to both $x$ and $x^{'}$. The idea behind this is that different classes are assumed to lie on disconnected manifolds and a little push from one manifold shouldn't land in any other manifold."
358 | ]
359 | },
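    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "One standard construction of adversarial examples (the fast gradient sign method) perturbs the input in the direction of the sign of the loss gradient, exploiting exactly the linearity discussed above. A minimal numpy sketch on a logistic regression model with random placeholder weights and data:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "w, b = rng.randn(10), 0.0    # placeholder linear model\n",
    | "x, y = rng.randn(10), 1.0    # one example with label 1\n",
    | "\n",
    | "def sigmoid(z):\n",
    | "    return 1.0 / (1.0 + np.exp(-z))\n",
    | "\n",
    | "# Gradient of the logistic loss w.r.t. the input x is (p - y) * w.\n",
    | "grad_x = (sigmoid(w.dot(x) + b) - y) * w\n",
    | "\n",
    | "# Take a small step of size epsilon that maximally increases the loss\n",
    | "# under an L-infinity constraint: x' = x + epsilon * sign(grad_x).\n",
    | "epsilon = 0.1\n",
    | "x_adv = x + epsilon * np.sign(grad_x)"
    | ]
    | },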
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "### 14. Tangent Distance, Tangent Prop and manifold Tangent Classifier\n",
365 | "\n",
366 | "Many ML models assume the data to lie on a low dimensional manifold to overcome the curse of dimensionality. The inherent assumption which follows is that small perturbations that cause the data to move along the manifold (it originally belonged to), shouldn't lead to different class predictions. The idea of the **tangent distance** algorithm to find the K-nearest neighbors using the distance metric as the distance between manifolds. A manifold $M_i$ is approximated by the tangent plane at $x_i$, hence, this technique needs tangent vectors to be specified.\n",
367 | "\n",
368 | "\n",
369 | "\n",
370 | "The **tangent prop** algorithm proposed to a learn a neural network based classifier, $f(x)$, which is invariant to known transformations causing the input to move along its manifold. Local invariance would require that $\\bigtriangledown_x f(x)$ is perpendicular to the tangent vectors $V^{(i)}$. This can also be achieved by adding a penalty term that minimizes the directional directive of $f(x)$ along each of the $V(i)$.\n",
371 | "\n",
372 | "$$ \\Omega(f) = \\sum_i (\\bigtriangledown_x f(x))^T V(i) $$\n",
373 | "\n",
374 | "It is similar to data augmentation in that both of them use prior knowledge of the domain to specify various transformations that the model should be invariant to. However, tangent prop only resists infinitesimal perturbations while data augmentation causes invariance to much larger perturbations.\n",
375 | "\n",
376 | "**Manifold Tangent Classifier** works in two parts:\n",
377 | "- Use Autoencoders to learn the manifold structures using Unsupervised Learning.\n",
378 | "- Use these learned manifolds with tangent prop."
379 | ]
380 | },
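    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A minimal numpy sketch of the tangent prop penalty above, for a toy scalar classifier whose gradient is approximated by finite differences; the function and the tangent vectors are made up for illustration:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "def f(x):\n",
    | "    return np.tanh(x).sum()   # toy classifier output, stands in for a network\n",
    | "\n",
    | "def grad_f(x, eps=1e-6):\n",
    | "    # Central finite-difference approximation of the gradient of f at x.\n",
    | "    g = np.zeros_like(x)\n",
    | "    for i in range(len(x)):\n",
    | "        d = np.zeros_like(x)\n",
    | "        d[i] = eps\n",
    | "        g[i] = (f(x + d) - f(x - d)) / (2 * eps)\n",
    | "    return g\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "x = rng.randn(4)\n",
    | "tangents = [rng.randn(4), rng.randn(4)]   # made-up tangent vectors V^(i)\n",
    | "\n",
    | "g = grad_f(x)\n",
    | "omega = sum(g.dot(v) ** 2 for v in tangents)   # the tangent prop penalty\n",
    | "print(omega)"
    | ]
    | },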
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {},
385 | "outputs": [],
386 | "source": []
387 | }
388 | ],
389 | "metadata": {
390 | "kernelspec": {
391 | "display_name": "Python 2",
392 | "language": "python",
393 | "name": "python2"
394 | },
395 | "language_info": {
396 | "codemirror_mode": {
397 | "name": "ipython",
398 | "version": 2
399 | },
400 | "file_extension": ".py",
401 | "mimetype": "text/x-python",
402 | "name": "python",
403 | "nbconvert_exporter": "python",
404 | "pygments_lexer": "ipython2",
405 | "version": "2.7.12"
406 | }
407 | },
408 | "nbformat": 4,
409 | "nbformat_minor": 2
410 | }
411 |
--------------------------------------------------------------------------------
/02 - Linear Algebra.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Deep Learning Book (Simplified)\n",
8 | "## Part I - Applied Math and Machine Learning basics\n",
9 | "*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org) where we are attempting to provide a summary of each chapter highlighting the concepts that we found to be most important so that other people can use it as a starting point for reading the chapters, while adding further explanations on few areas that we found difficult to grasp. Please refer [this](http://www.deeplearningbook.org/contents/notation.html) for more clarity on notation.*\n",
10 | "\n",
11 | "## Chapter 2: Linear Algebra\n",
12 | "A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. The chapter aims to teach you enough for understanding deep learning and hence, omits many key topics in Linear Algebra that are not relevant.\n",
13 | "\n",
14 | "The sections present in this chapter are listed below. Feel free to navigate as you like:
\n",
15 | "\n",
16 | "**1. Scalars, Vectors, Matrices and Tensors**
\n",
17 | "**2. Multiplying Matrices and Vectors**
\n",
18 | "**3. Identity and Inverse Matrices**
\n",
19 | "**4. Linear Dependence and Span**
\n",
20 | "**5. Norms**
\n",
21 | "**6. Special Kinds of Matrices and Vectors**
\n",
22 | "**7. Eigendecomposition**
\n",
23 | "**8. Singular Value Decomposition (SVD)**
\n",
24 | "**9. The Moore Penrose Pseudoinverse**
\n",
25 | "**10. The Trace Operator**
\n",
26 | "**11. The Determinant**
"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "### 1. Scalars, Vectors, Matrices and Tensors\n",
34 | "\n",
35 | "\n",
36 | "The study of linear algebra involves several types of mathematical objects:\n",
37 | "\n",
38 | "\n",
39 | "**Scalars**: A scalar is just a single number which may take up different types of values, e.g. real-valued (slope of a line), natural number (number of units), etc.\n",
40 | "\n",
41 | "\n",
42 | "**Vectors**: A vector is an array of numbers of the same type (e.g. $x$ ∈ ℝ), arranged in order and indexed as $x_1, x_2$, etc. They are expressed as:\n",
43 | "\n",
44 | "\n",
45 | "$$x= \\begin{bmatrix}x_1 \\\\ x_2 \\\\ . \\\\ . \\\\ . \\\\ x_n\\end{bmatrix}$$\n",
46 | "\n",
47 | "**Matrices**: A matrix is a 2-D array of elements, each element being indexed by two numbers. A real-valued matrix **A** of height *m* and width *y* is represented as $ A \\in ℝ ^ {mxn} $. The element in the $i^{th}$ row and $j^{th}$ column is indexed as $A_{i, j}$. $f(A)_{i, j}$ represents the element $(i, j)$ of the matrix computed by applying the function $f$ to $A$.\n",
48 | "\n",
49 | "\n",
50 | "**Tensors**: An array of numbers arranged in a regular grid with a variable number of axes is known as a tensor. The element at coordinates $(i, j, k)$ of a tensor $A$ is represented as $A_{i, j, k}$.\n",
51 | "\n",
52 | "An important operation on any matrix $A$ is the **transpose** (denoted by $A^T$) which is its mirror image across the [main diagonal](https://www.wikiwand.com/en/Main_diagonal). It can be better understood through this animation:\n",
53 | "\n",
54 | "\n",
55 | "A column vector of n-dimension can be thought of as a matrix of shape $NX1$. Thus, the transpose of a row-vector gives a column vector: $x = [x_1 \\hspace{.1cm} x_2 ... \\hspace{.1cm} x_n]^T$. We can add matrices of the same size ($C_{i, j} = A_{i,j} + B_{i,j}$) and add or multiply a matrix by a scalar ($D = a.B + c$). We can even add a vector to a matrix: $C = A + b$ where $C_{i, j} = A_{i, j} + b_{j}$, i.e. the vector $b$ is added to each row of $A$, an operation known as [broadcasting](http://deeplearning.net/software/theano/tutorial/broadcasting.html). The diagram below provides a visual explanation for broadcasting:
\n",
56 | ""
57 | ]
58 | },
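    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A quick numpy illustration of broadcasting, adding a vector to each row of a matrix:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "A = np.array([[1, 2, 3],\n",
    | "              [4, 5, 6]])\n",
    | "b = np.array([10, 20, 30])\n",
    | "\n",
    | "C = A + b   # b is (virtually) copied onto each row of A\n",
    | "print(C)    # [[11 22 33], [14 25 36]]"
    | ]
    | },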
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### 2. Multiplying Matrices and Vectors\n",
64 | "Matrix multiplication is one of the important operations involving operations. For a matrix multiplication between two matrices $A_{mn}$ and $B_{kp}$ to exist, we must have $n == k$. The resulting matrix $C (= AB)$ has the shape $m$ x $p$.
\n",
65 | "**Element-wise product** is given by $A \\odot B$ and has the same shape as $A$ (and $B$).
\n",
66 | "**Dot-product** between two vectors $x$ and $y$ of the same dimension is a scalar given by $x^Ty$.\n",
67 | "\n",
68 | "Useful properties:\n",
69 | "\n",
70 | "1. $A(B+C) = AB + AC$ (Distributive)\n",
71 | "2. $A(BC) = (AB)C$ (Associative)\n",
72 | "3. $AB \\ne BA$ (not commutative, in general)\n",
73 | "4. $(AB)^T = B^TA^T$\n",
74 | "5. $x^Ty = (x^Ty)^T = y^Tx$\n",
75 | "\n",
76 | "System of linear equations:\n",
77 | "\n",
78 | "$ Ax = B \\tag{1} $\n",
79 | "\n",
80 | "where $A \\in ℝ^{mxn}$ & $b \\in ℝ^{m}$ are known and $x \\in ℝ^{n}$ is to be found. We have $m$ linear equations where the $i^{th}$ equation is given by:
\n",
81 | "\n",
82 | "$A_{i,1}x_1 + A_{i,2}x_2 + ... + A_{i,n}x_n = b_i$"
83 | ]
84 | },
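    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A few of the properties above, checked numerically (`np.allclose` is used because of floating-point rounding):"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "A, B, C = rng.randn(2, 3), rng.randn(3, 4), rng.randn(3, 4)\n",
    | "x, y = rng.randn(5), rng.randn(5)\n",
    | "\n",
    | "print(np.allclose(A.dot(B + C), A.dot(B) + A.dot(C)))   # distributive\n",
    | "print(np.allclose(A.dot(B).T, B.T.dot(A.T)))            # (AB)^T = B^T A^T\n",
    | "print(np.allclose(x.dot(y), y.dot(x)))                  # x^T y = y^T x"
    | ]
    | },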
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "### 3. Identity and Inverse Matrices\n",
90 | "\n",
91 | "**Identity Matrix** is a matrix which doesn't change a vector when multiplied by the vector. The entries along its main diagonal is 1. All other entries are zero.\n",
92 | "\n",
93 | "\n",
94 | "Now we define the inverse of a matrix, $A^{-1}$ as:\n",
95 | "\n",
96 | "$A^{-1}A = AA^{-1} = I_n$\n",
97 | "\n",
98 | "We use this to solve the previously defined system of equations:\n",
99 | "\n",
100 | "$$ Ax = b \\\\ \n",
101 | "A^{-1}Ax = A^{-1}b \\\\\n",
102 | "I_nx = A^{-1}b \\\\ \n",
103 | "x = A^{-1}b \\\\\n",
104 | "$$\n",
105 | "\n",
106 | "The next section covers the conditions required for $A^{-1}$ to exist."
107 | ]
108 | },
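    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "In practice we rarely form $A^{-1}$ explicitly; `np.linalg.solve` computes $x$ directly, which is faster and numerically more stable:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "A = np.array([[3.0, 1.0],\n",
    | "              [1.0, 2.0]])\n",
    | "b = np.array([9.0, 8.0])\n",
    | "\n",
    | "x = np.linalg.solve(A, b)   # solves Ax = b\n",
    | "print(x)                    # [2. 3.]\n",
    | "print(np.allclose(np.linalg.inv(A).dot(b), x))"
    | ]
    | },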
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "### 4. Linear Dependence and Span\n",
114 | "For $A^{-1}$ to exist, equation (1) must have exactly one solution for every real value of $b$. The system can have infinitely many solutions for some values of b, but can't have a finite number of solutions greater than one.\n",
115 | "\n",
116 | "**Linear Combination**\n",
117 | "\n",
118 | "We know:\n",
119 | "\n",
120 | "$$ Ax = \\begin{bmatrix} {A_{1,1}x_1 + A_{1, 2}x_2 + ... A_{1, n}x_n} \\\\ {A_{2,1}x_1 + A_{2, 2}x_2 + ... A_{2, n}x_n } \\\\ . \\\\ . \\\\ . \\\\ {A_{m,1}x_1 + A_{m, 2}x_2 + ... A_{m, n}x_n} \\end{bmatrix} $$\n",
121 | "\n",
122 | "This can now be written as:\n",
123 | "$$ \\begin{bmatrix} {A_{1,1}x_1 + A_{1, 2}x_2 + ... A_{1, n}x_n} \\\\ {A_{2,1}x_1 + A_{2, 2}x_2 + ... A_{2, n}x_n } \\\\ . \\\\ . \\\\ . \\\\ {A_{m,1}x_1 + A_{m, 2}x_2 + ... A_{m, n}x_n} \\end{bmatrix} = x_1 \\begin{bmatrix} A_{1,1} \\\\ A_{2,1}\\\\ . \\\\ . \\\\ . \\\\ A_{m,1} \\end{bmatrix} + x_2 \\begin{bmatrix} A_{1,2} \\\\ A_{2,2}\\\\ . \\\\ . \\\\ . \\\\ A_{m,2} \\end{bmatrix} + ... x_n \\begin{bmatrix} A_{1,n} \\\\ A_{2,n}\\\\ . \\\\ . \\\\ . \\\\ A_{m,n} \\end{bmatrix} = \\sum_{i=1}^{n} x_iA_{:, i}$$
where $A_{:, i}$ is the $i^{th}$ column of $A$. Thus:\n",
124 | "$$ Ax = \\sum_{i=1}^{n} x_iA_{:, i} $$\n",
125 | "\n",
126 | "This kind of operation is called a linear combination.\n",
127 | "\n",
128 | "**Span** of a set of vectors is the set of all points that can be obtained from the linear combination of the vectors. The span of the columns of $A$ is called the column space or range of $A$. For equation (1) to have a solution, $b$ must belong to this column space. Hence, in order for the system to have a solution for all $b ∈ ℝ^m$, the column space of $A$ must be all of $ℝ^m$. Thus, the number of columns (n) should be >= m. This is a necessary condition, but not a sufficient one as we might have redundant columns and that doesn't add to the column space. Here, the concept of **linear dependency** is introduced. A set of vectors are **linearly independent** if none of the vectors is a linear combination of the other vectors. Thus, for the column space of $A$ to be all of $ℝ^m$, it must contain atleast one *set* of m linearly independent columns. The system must have *at most* one solution for each value of $b$, hence, $A$ must have at most m columns, or else there are multiple ways of parameterizing each solution. But we started this section stating that we need exactly one solution for $A^{-1}$ to exist. Hence, m = n, i.e., the matrix must be **square** and its columns are linearly independent for its inverse to exist. Such a matrix is said to be **singular**."
129 | ]
130 | },
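    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "`np.linalg.matrix_rank` counts the number of linearly independent columns, so it tells us directly whether a square matrix is singular:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "A = np.array([[1.0, 2.0],\n",
    | "              [2.0, 4.0]])   # second column = 2 x first column\n",
    | "B = np.array([[1.0, 2.0],\n",
    | "              [3.0, 4.0]])\n",
    | "\n",
    | "print(np.linalg.matrix_rank(A))   # 1 -> linearly dependent columns -> singular\n",
    | "print(np.linalg.matrix_rank(B))   # 2 -> full rank -> invertible"
    | ]
    | },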
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### 5. Norms\n",
136 | "\n",
137 | "The size of a vector is given by functions called norms. We can have different types of norms, more commonly used, is the $L^p$ norm given by:\n",
138 | "$$ ||\\mathbf{x}||_p = (\\sum_{i} |x_i|^p)^{\\frac{1}{p}} $$\n",
139 | "\n",
140 | "A norm function $f$ satisfies the following properties:\n",
141 | "\n",
142 | "- $f(x) = 0 \\Rightarrow x = 0$\n",
143 | "- $f(x+y) \\leq f(x) + f(y)$ (The **triangle inequality**)\n",
144 | "- $\\forall \\alpha \\in ℝ, \\hspace{.1cm} f(\\alpha x) = |\\alpha|f(x)$\n",
145 | "\n",
146 | "Different types of norms:\n",
147 | "\n",
148 | "- **Euclidean Norm**: This is the $L^2$ norm, which is heavily used in machine learning, and can be also calculated as $x^Tx$.\n",
149 | "- **L1 Norm**: It is used when the difference between the zero and non-zero elements is very important.\n",
150 | "- **Max Norm (also known as the $L^{\\infty}$ norm)**: Absolute value of the largest magnitude in the vector:\n",
151 | "$||x||_{\\infty} = \\displaystyle \\max_{i}|x_i|$\n",
152 | "- **Frobenius Norm**: Used to measure the size of a matrix (similar to the $L^2$ norm):\n",
153 | "$||A||_F = \\sqrt{\\displaystyle \\sum_{i,j} A_{i,j}^2} $"
154 | ]
155 | },
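    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "Computing the norms above with numpy; `np.linalg.norm` covers all of them through its `ord` argument:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "x = np.array([3.0, -4.0])\n",
    | "A = np.array([[1.0, 2.0], [3.0, 4.0]])\n",
    | "\n",
    | "print(np.linalg.norm(x))           # L2 norm: 5.0\n",
    | "print(np.sqrt(x.dot(x)))           # L2 norm again, via sqrt(x^T x)\n",
    | "print(np.linalg.norm(x, 1))        # L1 norm: 7.0\n",
    | "print(np.linalg.norm(x, np.inf))   # max norm: 4.0\n",
    | "print(np.linalg.norm(A, 'fro'))    # Frobenius norm of a matrix"
    | ]
    | },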
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "### 6. Special Kinds of Matrices and Vectors\n",
161 | "\n",
162 | "- **Diagonal Matrices**: These matrices have non-zero entries only along the main diagonal, e.g. the identity matrix $I_n$. Some of the key features are:
\n",
163 | " - A square diagonal matrix can be represented as: $diag(v)$ where the vector $v$ represents the elements along tha main diagonal.\n",
164 | " - Multiplying by a diagonal matrix is computationally efficient. $Dx$ can be calculated by simply scaling each $x_i$ by $v_i$.\n",
165 | " - A diagonal matrix need not be square.\n",
166 | " \n",
167 | "- **Symmetric Matrix**: $A = A^T$\n",
168 | "- **Unit vector**: A vector which has **unit norm**, i.e. $||x||_2 = 1$.\n",
169 | "- **Orthogonal vectors**: Two vectors $x$ and $y$ are orthogonal if $x^Ty = 0$, which means that if both of them have non-zero norm, these vectors are at a 90 degree angle to each other. Orthogonal vectors having unit norm are called **orthonormal vectors**.\n",
170 | "- **Orthogonal Matrix**: A matrix whose rows are mutually orthonormal (and columns too). Thus:\n",
171 | "$$ A^TA = AA^T = I \\Rightarrow A^{-1} = A^T $$\n",
172 | "For orthogonal matrices, the inverse is cheap to compute."
173 | ]
174 | },
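    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A numerical check that the inverse of an orthogonal matrix is just its transpose, using a 2-D rotation matrix:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "theta = 0.3\n",
    | "Q = np.array([[np.cos(theta), -np.sin(theta)],\n",
    | "              [np.sin(theta),  np.cos(theta)]])   # rotations are orthogonal\n",
    | "\n",
    | "print(np.allclose(Q.T.dot(Q), np.eye(2)))    # Q^T Q = I\n",
    | "print(np.allclose(np.linalg.inv(Q), Q.T))    # Q^{-1} = Q^T"
    | ]
    | },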
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "### 7. Eigendecomposition\n",
180 | "Matrices can be understood better by breaking them into constituent parts (e.g. breaking a number like 12 into its prime factors: 12 = 2 x 2 x 3) which are universal and not obvious from their representation. One of the widely used matrix decomposition is called **eigendecomposition**, where we decompose a matrix into its **eigenvectors** and **eigenvalues**.\n",
181 | "\n",
182 | "An **eigenvector** $v$ of a matrix $A$, is a non-zero vector satisfying the following equation:\n",
183 | "\n",
184 | "$$ Av = \\lambda v $$\n",
185 | "\n",
186 | "The scalar $\\lambda$ is called the **eigenvalue** corresponding to the eigenvector. Any scaled version of an eigenvector is also an eigenvector with the same eigenvalue, hence we focus on only unit eigenvectors.\n",
187 | "\n",
188 | "The eigendecomposition of a matrix $A$, having $n$ linearly independent eigenvectors represented as a matrix $V = [v^{(1)}, ... , v^{(n)}]$ with the corresponding eigenvalues given by the vector $\\lambda = [\\lambda_1, ... , \\lambda_n]$ is given by:\n",
189 | "\n",
190 | "$$ A = Vdiag(\\lambda)V^{-1} $$\n",
191 | "\n",
192 | "For our purpose, in the case of $A$ being real-valued, $V$ is an orthogonal matrix. $A$ can be thought of as scaling space by $\\lambda_i$ in the direction $v^{(i)}$. We generally sort the entries of $\\lambda$ in decreasing order. Two types of matrices are specially useful:\n",
193 | "\n",
194 | "- **Positive definite**: All eigenvalues are positive and $x^TAx = 0 \\Rightarrow x = 0$.\n",
195 | "- **Positive semidefinite**: All eigenvalues are non-zero and $\\forall x, \\hspace{0.1cm} x^TAx \\geq 0$."
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 14,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "import numpy as np\n",
205 | "import matplotlib.pyplot as plt\n",
206 | "%matplotlib inline"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 30,
212 | "metadata": {},
213 | "outputs": [],
214 | "source": [
215 | "A = np.array([[1, 2], [4, 3]])"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 55,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "w, v = np.linalg.eig(A)"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 58,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "def normalize(u):\n",
234 | " for i, x in enumerate(u):\n",
235 | " norm = np.sqrt(sum([val**2 for val in x]))\n",
236 | " for j, y in enumerate(x):\n",
237 | " u[i][j] = y / norm"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 59,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "normalize(v)"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 46,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "x_values = np.linspace(-1, 1, 1000)\n",
256 | "y_values = np.array([np.sqrt(1 - (x**2)) for x in x_values])\n",
257 | "\n",
258 | "x_values = np.concatenate([x_values, x_values])\n",
259 | "y_values = np.concatenate([y_values, -y_values])"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 47,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "u = np.array([[x, y] for x, y in zip(x_values, y_values)])"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 48,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "trans = np.dot(A, u.T).T"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 62,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "data": {
287 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAgwAAAFkCAYAAABMyWOlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3XecVOXZ//HPBdJEgyYGsMWCUZAOgoIKiho1UVCwrZqA\nGsujooEnMT81UWNNMWJsMZggsWRjYcVoEkksixpRmqCjGGPEFmWwPRiRonD//jgzOrt7zpTdOWVm\nvu/X67xg7jln5mKZmb3mLtdtzjlERERE8mkXdwAiIiKSfEoYREREpCAlDCIiIlKQEgYREREpSAmD\niIiIFKSEQURERApSwiAiIiIFKWEQERGRgpQwiIiISEFKGERERKSgUBMGMzvfzOab2Udmljaz+8xs\n1yKuO9rMlpnZGjNbamaHhhmniIiI5Bd2D8O+wPXAnsCBQAfgb2bWJegCMxsB/AG4BRgEzAZmm9nu\nIccqIiIiASzKzafMbCtgJTDKOfdkwDl/BDZ1zo3NaZsHPOucOzOaSEVERCRX1HMYtgAc8EGec0YA\nDzdrm5NpFxERkRhsEtUTmZkB1wJPOudezHNqTyDdrC2dafd73K8ABwOvAWvbHqlITesM7AjMcc69\nH3MsgfS+Fymbot/zkSUMwE3A7sDerbjW8Hom/BwM3NnaoETE1wl4c4mSSu97kfIq+J6PJGEwsxuA\nbwL7OufeKXD6CqBHs7butOx1yHoN4I477qBPnz5tCbNoU6ZMYdq0aZE8V6kUW+skNbao41q2bBkn\nnngiZN5XCfYalP99n9TXQXOKs/wqJdZyx1nKez70hCGTLIwDRjvn3ijiknnAAcB1OW0HZdr9rAXo\n06cPQ4YMaUuoRevWrVtkz1UqxdY6SY0txriS3s0fyvs+qa+D5hRn+VVKrCHGWfA9H2rCYGY3AXXA\nWGC1mWV7DlY559Zmzvk98B/n3AWZ+34FzDWzqcCfM9cPBU4NM1YREREJFvYqiTOALwGNwNs5xzE5\n52xPzoRG59w8vCThNGAJMB4YV2CipIiIiIQo1B4G51zBhMQ5N8anbRYwK5SgREREpGTaS6IV6urq\n4g4hkGJrnaTGltS4qlWl/LwVZ/lVSqxxxhlppccwmNkQYNGiRYsqYsKKSJItXryYoUOHAgx1zi2O\nO54get+LlEcp73n1MIiIiEhBShhERESkICUMIiIiUpASBhERESlICYOIiIgUpIRBREREClLCICIi\nIgUpYRAREZGClDCIiIhIQUoYREREpCAlDCIiIlKQEgYREREpSAmDiIiIFKSEQURERApSwiAiIiIF\nKWEQkdiZ2TZmdruZvWdmn5jZUjMbEndcIvKFTeIOQERqm5ltAfwDeAQ4GHgP+DrwYZxxiUhTShhE\nJG7/D3jDOffdnLbX4wpGRPxpSEJE4nY4sNDM7jaztJktNrPvFrxKRCKlhEFE4rYz8D/AP4FvADcD\n15nZibFGJSJNaEhCROLWDpjvnPtx5vZSM+uLl0TcEV9YIpJLCYOIxO0dYFmztmXA+EIXTpkyhW7d\nujVpq6uro66urnzRiVSJ+vp66uvrm7StWrWq6OuVMIhI3P4B7NasbTeKmPg4bdo0hgzR6kuRYvgl\n04sXL2bo0KFFXa85DCISt2nAXmZ2vpn1MrPjge8CN8Qcl4jkUMIgIrFyzi0EjgTqgOeBC4FznXN/\njDUwEWlCQxIiEjvn3F+Av8Qdh4gEC7WHwcz2NbM/mdl/zGyjmY0tcP7ozHm5xwYz6x5mnCIiIpJf\n2EMSXYElwFmAK/Iah1cWtmfm2No5tzKc8ERERKQYoQ5JOOceAh4CMDMr4dJ3nXMfhROViIiIlCqJ\nkx4NWGJmb5vZ38xsZNwBiYiI1LqkJQzvAKcDE/CKtrwJNJrZoFijEhERqXGJWiXhnHsZeDmn6Wkz\n6wVMASbmu1YV30RK09aqbyJSWxKVMASYD+xd6CRVfBMpTVurvolIbUnakISfQXhDFSIiIhKTUHsY\nzKwrsAveREaAnc1sIPCBc+5NM7sK2MY5NzFz/rnAcuAFoDNwKrA/cFCYcYqIiEh+YQ9J7AE8hldb\nwQG/zLT/HjgZr87C9jnnd8ycsw3wCfAccIBz7vGQ4xQREZE8wq7DMJc8wx7OuZOa3f4F8IswYxIR\nEZHSVcIcBhEREYmZEgYREREpSAmDiIiIFKSEQUREasqMGWCW/zj11LijTJ5KKNwkIiLSZqkUDB0K\n69cXPve3v/X+vOWWcGOqJOphEBGRqtfQAP37F5csZGWTBvEoYRARkarW0AATJrTu2oNUNvBzShhE\nRKRqtSVZAHj44fLFUumUMIiISFUqNlmYObNtSUWtUMIgIiJVp7GxcBJw3XXgHEycCPfeC506RRJa\nxVLCICIiVaWxEfbfP/85c+fC5MlN2954w//cE08sS1gVTwmDiIhUjVQqf7Jg5p0zalTL+7p397/m\nzjvLE1ulUx0GERGpCuk0DBwYfH+XLvDaa8GJgeSnHgYREal46TT06gUbN/rfv+mmxSULmvwYTAmD\niIhUvMMPh9Wr/e/bdFNYvry4noV77y1vXNVECYOIiFS0VAoWLPC/r2vX4pMFyU8Jg4iIVKx0GgYN\n8r+vXTt49VUlC+WihEFERCpSOg277AIbNrS8r107eO45JQvlpFUSIiJSkSZMgI8/btnevj28/baS\nhXJTD4OIiFScdBoWLvS/b+nS1icL6XTrY6p2ShhEJFHM7Hwz22hm18QdiyRTOg1f/zqsW9fyvuHD\noW/f1j/20KGtv7baKWEQkcQws2HAqcDSuGOR5JowAf7735btw4bBAw+07bH/85+2XV/NlDCISCKY\n2WbAHcB3gf+LORxJsHfeadm2994wf77mLYRJCYOIJMWNwAPOuUfjDkSSbeutm97efHNvK+uwnH9+\neI9dSbRKQkRiZ2bHAYOAPeKORZIrlYKRI2HNGm8lxNZbww47eMlCmD0LV14Z3mNXEiUMIhIrM9sO\nuBY4yDn3adzxSHKNHNl07sKqVfDkk/HFU2uUMIhI3IYCXwUWmZll2toDo8zsbKCTc875XThlyhS6\ndevWpK2uro66urow45WYrFmT/7bkV19fT319fZO2VatWFX29EgYRidvDQP9mbTOBZcBPg5IFgGnT\npjFkyJAQQ5Mk6dKlaQ9Dly7lffxUqryPlzR+yfTixYsZWuRa0lAnPZrZvmb2JzP7T2Zd9dgirtnP\nzBaZ2Voze9nMJoYZo4jEyzm32jn3Yu4BrAbed84tizs+iVc6Dfvs421dvcsusNlmsMkm3kTHefPK\n+1yqwZBf2D0MXYElwAxgVqGTzWxH4EHgJuB44EDgt2b2tnPu7+GFKSIJE9irILVlwgT4xz++uL33\n3uHNW1i/PpzHrRahJgzOuYeAhwByxibz+R/gVefceZnb/zSzfYApgBIGkRrhnBsTdwwSP7/yz341\nGCQaSavDsBfeeGauOcCIGGKRI
kybBmblP+6/P+5/mYjEKaj8c/MaDGG77rpony/JkpYw9ASab/2R\nBr5kZp1iiEfw3rg77uj/i33q1HCe84gj/J/v+uvDeT4RSRa/8s+dOoVboMnP5MnRPl+SVcIqiexQ\nRt4xTS2vKp9UCoYMgU8TuCL+nHO8I+uKK+CCC+KLp5K1dYmVSFiCdqLcY4/wCjRFnYhUoqQlDCuA\nHs3augMfOefyTkfR8qq2mTYtvN6CMF14oXdkzZ4N48bFF08laesSK5EwBA1FhF3+ecKE8B67WiRt\nSGIecECztm9k2qXMcucfVGKy4Cd3KEPzIEQqz9ix/kMRr7yijaXiFnYdhq5mNtDMBmWads7c3j5z\n/1Vm9vucS24GepnZz8xsNzM7EzgKuCbMOGtJY2N4ScIVV4BzrTtSKejQobzxZJOHHXaAlSvL+9gi\nUn7pNCxY0LI9zKEIKV7YPQx7AM8Ci/DmIPwSWAz8JHN/T2D77MnOudeAb+HVX1iCt5zyFOdc85UT\nUqIZM7xfnvvv3/bHmjnT/5d+W+YS9O3rrYFu/pjptPcLvy3eeAN69FCvg0iSpdNeYabmdT3N4ptf\nMHduPM+bVKEmDM65uc65ds659s2OkzP3n9R8vXXmmqHOuS7Oua87524PM8Zqlx12OOWU1j/G7NlN\nf4lPjLD2Zvfu8NprTZ+/Lcucsr0Ov/994XNFJDpjx8LHH7dsj6J3YcYM//ZRo8J93kqTtDkMUiaX\nXtr6YYeddvKy/ewv6KRNIpw8ue1DGZMmKXEQSYqgoQgzePDB8J+/LV+oaokShiqT7VG4+OLSrstN\nEl59tXLGC3OHMlrTfZhNHDRUIRKfQw9tORQBmruQNEoYqkR2jkIpPQpm3i/ZSksSgowa9UXPw8yZ\npV17xBHQrh08/ngooYlIgHQann22Zftmm0XTuxBku+3ie+6kUsJQ4VIpb8lRKV1qX/ua9ybduLF6\nx+gmTiy918E5GD0arrwyvLhEpKmxAXsY//vf0XyJaWz0b1+0KPznrjRKGCrYtGnQv3/xO6xdd533\nS/H11yu/N6FY2V6HVAo6dizumgsv1PwGkSikUjB/fsv2rl2j+4waE7DNWa18RpZCCUMFytZSKHb4\nIZso1HJN9L59vcpxpSQOkybBzjurhoNIGNJpGDjQ/75nnokuDr+5E+JPCUOFmTGj+FoK2YmMtZwo\nNJdNHIodqli+3KvhoLkNIuU1dqw3LNrcsGHe+1SSRwlDhUinYfvti5ur0KmT9026GiYyhiU7VFFs\n4jB6tHbKFCmXoGWUEO1Ex6D5CyrY5E8JQwVoaICePeGttwqfO3MmrF2rDL1Y2cShmGJQ55wDnTvD\nCy+EH5dINQtaRjloULRfcoLmL1TrZPC2UsKQcDNmFLeLWnaeQpRVGKvJ5MnFzW9Ytw769VPdBpHW\nSqX8l1GawZw50cai+QulUcKQYNOmFTcEMXu25imUQ3Z+QzG9DUccoaRBpDVGjPBvV5Gm5FPCkEDp\nNOy4Y+FVENl6Ckkr3VzpJk8ubgzziCM0r0GkFKmU/34R7dpFX6QpaEMrzV8IpoQhYVIp2Hprr1ZC\nPjNn1lY9haiNGlXcTpnnnKOkQaRYe+3l3/7cc9F/lgUN9Wr+QjAlDAmSSnmFmAqNq82dq7kKUcju\nlDl7dv7zzjlH1SFFCkmlYPXqlu2DBmmSdqVQwpAQ2WQhn06dvG+9yoCjNW5c4W7KCy9UT4NIPkG9\nC1FPdMyn2KJutUoJQwIUkyzstBO88YaGIOIyapT3/2QWfI56GkT8BfUubLppPJ9pM2b4ty9eHG0c\nlUYJQ8yKSRZmzlQRpiTo2xdWrMg/r0E9DSItBfUu+O0jEYWg1WcaGslPCUOM0mkYMCD/OTNnar5C\nkmTnNVxxRfA5mggp8oXGxuDeBf2CrixKGGK0zz75JzjOnq1kIakuuCB/vYZzztFulyLpdPDeN3H1\nLqRS/u29ekUbRyVSwhCTadPglVeC7585U/UVkm7y5Pw9DZMmqbiT1LZDD/Vvj7N3Ydgw//annoo2\njkqkhCEGDQ35izKpZ6FyFOppOOII7XQptWvJEv/2uHoXwNtrx4/miBWmhCFijY3594ZQz0LlmTw5\nf9IwejSsXBldPCJJ0NjoP+Tat6/mLlQqJQwRSqWCx/NAExwr2eTJ3k6WQUaOjC4Wkbjlm7vw6KPR\nxpIraDmlykEXRwlDhILGzsBbqqdkobItXBh837//rUmQUjvyzV2Is+s/aDmliuEVRwlDRGbMCB47\nM4t3TE/Ko2/f4BnY4E2C1HwGf2Z2vpnNN7OPzCxtZveZ2a5xxyWlS6f9t68Gfc5VOiUMEWhszL9N\n9fPPa8JNtejbN//eE5rPEGhf4HpgT+BAoAPwNzPrEmtUUrKg3oW45y40Nvq3z5wZZRSVTQlDyPKN\n5YE3dqYJQNVl3Lj8H0Kaz9CSc+6bzrnbnXPLnHPPA5OArwFD441MSpGvdyHOuQsAY8b4t2souHiR\nJAxmdpaZLTezNWb2tJkFjuab2UQz22hmGzJ/bjSzT6KIMwz77Rd838yZGjurVhMnBk+C/Pe/NTRR\nhC0AB3wQdyBSvKTOXYDCuwBLYaEnDGZ2LPBL4GJgMLAUmGNmW+W5bBXQM+fIU70/uRob4aWX/O/r\n1UuZbbXLNwlSQxPBzMyAa4EnnXMvxh2PFCfJcxeC5hZ16hRtHJVukwieYwrwG+fcbQBmdgbwLeBk\n4OcB1zjn3LsRxBaqAw4Ivk9Vxapf377ekNPo0f73jxoVnFDWuJuA3YG9C504ZcoUunXr1qStrq6O\nurq6kEKTIEms6pgVtEJt0aJo44hbfX099fX1TdpWrVpV9PWhJgxm1gFvDPLzTX+dc87MHgZG5Ll0\nMzN7Da8HZDFwQaV902hogI0b/e+bPTv+7jmJxqhR8PWvw7/+1fK+f/7TG5rQsNQXzOwG4JvAvs65\ndwqdP23aNIYMGRJ+YFJQEqs6ZgWtUIs7kYmaXzK9ePFihg4tbqpQ2EMSWwHtgXSz9jTeUIOff+L1\nPowFTsCL8Skz2zasIMstlQqu5rjbbqrkWGuefDL4vqDeh1qUSRbGAfs7596IOx4pXlBVxyT0LqSb\n//aRVotrlYThTWhqwTn3tHPuDufcc865J4DxwLvAaVEG2BZ77hl8nya71Z7u3fNXktNrAszsJrwv\nCMcDq82sR+bIUz9TkiCJO1Lm2mcf/3ZVdyxd2HMY3gM2AD2atXenZa+DL+fcZ2b2LLBLvvOSMpaZ\nSsEnAWs6NBRRu0aNgt69/ecsjB7tfehG/dpo63hmmZ2B9yWisVn7ScBtkUcjRQuau2AWf+8CBO8K\nrKHA0oWaMDjnPjWzRcABwJ/g8xnQBwB5tuv5gpm1A/oBf8l3XlLGMoN6Fzp31lBErZs7F3o0T50z\nxozJXyUyDG0dzywn55xqwlSooLkLQYWSoqThiPKK4k16DXCamX3HzHoDNwObAjMBzOw2M/
t8UqSZ\n/djMDjKzncxsMHAn3rLK30YQa5vk613It8ROakP37sFVIF94IdpYRMohaO5Cly7J+AYfVAcnXzVW\nCRb6skrn3N2ZmguX4g1NLAEOzlk2uR3wWc4lWwLT8SZFfggsAkY45xK/AC2od6FLl2R0zUn8xo2D\ndu38V9BoxYRUmqCl4wsWRBtHkKBly+rtbZ0o6jDgnLsJb221331jmt2eCkyNIq5yyte7kJQ3jyTD\nY4/5r44YPVrV6KRyNDb6J75Jmbug4Yjy07hhmQT1LsS94Yokz6hRXi+DH62YkEoR1LuQhLkLELw6\nQsMRraeEoQzS6eDehbg3XJFkeuwx/3bVZZBKkEoFF6ZLyrBa0OoIDUe0nhKGMsi3nauWUYqffL0M\n2mNCki6oRzUptQ00HBEOJQxlELSsSL0Lkk9QL0PQNrwiSZCvRzUpvQtaHREOJQxtlEoFLytS74Lk\nE/ThqiWWkmQHHujfPmhQtHHko9UR4VDC0EZBXXNaGSHF6NfPv12THyWpggqMzZkTbRxBNBwRHiUM\nbeTXNZeUZUWSfI884t8e1KUqEqegFRBJ6lHV6ojwKGFog6BMOynLiiT5unf3dvRrTvUYJGnybTKV\npB5VrY4IjxKGNthrL//2pEz8kcoQtKOf5jJIkgTNXYDk9Kjqy1q4lDC0werVLdu6dIk+DqlsQR+2\nw4dHG4dIPkE9qklZSgnBxaSSFGMlU8LQSkFvniR1zUnl6Nq1ZVvQ0jWRqOXbSTVJPapJLyZV6ZQw\ntFLQcERSuuaksjzzjH+7hiUkCZJeqAmChyM6d440jKqmhKGV/IYj/CaviRQjKNEM+qAWiUq+jfWS\n9M09aDhi4cJo46hmShjKKGjymkgx/IYl/BJTkSjl21gvSYKGI5IWZyVTwtAKQYVB9MKUtggalhCJ\nS6VsrNfQ4N/eq1e0cVQ7JQytELTZlEhbBCWcmscgcQlaSpmkQk0AEyb4tz/1VLRxVDslDK3gt9mU\nX3eySKnMWrZpHoPEpRJWg+UrBZ2kpKYaKGFoBb8qfOpOlnLw28BH8xgkDvnKQCdp+DWojPpuu0Ua\nRk1QwlAmSXoDSeV66KG4IxDxBK06SFLvAgTvTKkN3MpPCUOJ8hUwEWkrdaFKEqRSlbHqIF8paL2X\nyk8JQ4lGjmzZ1rFj9HGIiIQlqCx50nZ8VCnoaClhKNGaNS3bBg6MPg6pLStXxh2B1Ip02v9zDpK3\n46NKQUdLCUOJ/F6gDz4YfRxSvTbbrGXbYYdFH4fUpkqZRBhUe0GloMOjhKFE7ds3vd2hg8bKpLye\nfrpl29Kl0cchtalSJhEG1V5QKejwKGEo0Wef5b8t0lZ9+7asx/Dpp/HEIrUl3wZOSfpilG/yeZIm\nZVYbJQwl6tAh/22RctDrTOJQKRs4VcqkzGqjhKFE+iCXKDRfeaOVOBK2dLoyllJC5UzKrDZKGEq0\n5Zb5b4uUwxZb5L8tUm5Bkx2TlizkGzaRcEWSMJjZWWa23MzWmNnTZjaswPlHm9myzPlLzSwx2z2t\nWpX/tkg51OLrrNTPCSmvoMmOSdqVEipn2KQahZ4wmNmxwC+Bi4HBwFJgjpltFXD+COAPwC3AIGA2\nMNvMdg871mJ8+cv5b4uUQ629zkr9nJDyqpSKiZU0bFKNouhhmAL8xjl3m3PuJeAM4BPg5IDzzwX+\n6py7xjn3T+fcxcBi4OwIYi1ou+3y3xYphxp8nZX6OSFlVCkVEyulRkS1CjVhMLMOwFDgkWybc84B\nDwMjAi4bkbk/15w850cmlfpiPbwZDB4cXDxEpC0uv/yLmh/t28OVV8YbT5ha+TkhZZJv34ikVUys\nlBoR1SrsHoatgPZA8x3L00DPgGt6lnh+ZEaOhI8/9v7uHLzySrK665Js1apVOL99wcXX2LGwYYP3\n9w0bqr7SY2s+J6RM9tzTvz1pXfyVMmxSzTaJ6XkNKOW3R8Hzp0yZQrdu3Zq01dXVUVdXV3p0AZov\n5Qla2iNfePfdd5k6dSqPPPII2267LX//+9/ZQlP+C4ritVZfX099fX2TtlXJml2ZiPd9tfvkE//2\nSpnsmLRhkyRr83veORfaAXQAPgXGNmufCdwXcM3rwDnN2i4Bng04fwjgFi1a5MK2+ebOeX0L3rH5\n5qE/ZcXauHGjmzlzpvvyl7/sAHfAAQc4wB122GFuw4YNcYeXeHG91hYtWuTwfkkPcSF+Nrim7+HW\nfE5E9r6vZo891vR1lj06d447sqZWrPCP0xu8krYo5T0f6pCEc+5TYBHweW5oZpa5/VTAZfNyz884\nKNMeq3nzYPPNYZNNvD/nxR5RMr3yyisceOCBTJo0ia222orGxkYefvhhzj//fB588EEuu+yyuENM\nvFp6rbXyc0LKoFKWKGqyYzJEsUriGuA0M/uOmfUGbgY2xfv2gJndZma5U7p+BRxqZlPNbDczuwRv\nQtQNEcSaV9++8K9/eWN+X/0qnH66th3O9emnn3LVVVfRv39/nnjiCS666CKWLl3K6NGjAbjssss4\n6KCDuOSSS3hQW3zmtdVWMGAAfO1r3p9f/WrcEYUu7+eElF++yY5Jm7+gyY4JUagLohwHcCbwGrAG\nr6dgj5z7HgVmNDt/AvBS5vzngIPzPHakXZN77920O2zvvSN52sSbN2+e69evnwPc3nvv7V544QXf\n89577z234447um7durmXX3454igrR1yvsziGJLJHvs8Jn3M1JNFGXbr4d/HvtlvckTUVNGyi4Yjy\nSMyQRJZz7ibn3I7OuS7OuRHOuYU5941xzp3c7PxZzrnemfMHOOfmRBFnMd58s+ntt96KJ46k+Oij\njzj77LMZOXIkb7zxBjfffDOPP/44u+/uX2frK1/5Cg0NDaxbt47x48fzcXbZiTTR/HVVC6+zfJ8T\nUn5BE2mT9q1dkx2TQ3tJlOjDD5ve/uCDeOJIgtmzZ7P77rtz4403MmHCBJYtW8bpp59Ou3b5X1aD\nBw9m+vTppFIpTjnllOw3RsnR/HVVy68zKb9K2cY6X2XHpNWIqAVKGErUbAVXi9u14D//+Q/jx4/n\nyCOPxMy4//77ueeee9hmm22Kfoxvf/vbnH322dx9991cc801IUZbmfQ6kzBV+mTHpM2xqBVKGErU\nvIeh+e1qtnHjRm666Sb69OnD7NmzOffcc3nxxRcZO3Zsqx7vl7/8Jfvssw/nnXcejyZt0XfM/u//\n8t8Waa1K2o+hUjbEqhVKGEr06af5b1erVCrFPvvsw1lnncVOO+3E008/zbXXXsvmm2/e6sfs2LEj\n99xzDz169ODYY4/ljTfeKGPElW39+vy3RVqrUpYoVsqwSS1RwlCiWvsgX7t2LT/60Y8YPHgwzz77\nLD/72c9YuHAhw4cPL8vj9+zZk1mzZrFq1SomTJjA2rVry/K4lSydbvm6qpXEVMJXKUsUK2XYpJYo\nYZBAjz32GAMGDOCKK65g//33J5VKcd5559GhQ
4eyPs+IESP41a9+xcKFCznrrLNqfhKk3whPmX/k\nUqNSqeD7kvStvZJqRNQSJQwlMmvZVm3Fm95//31OOukkxowZw4cffsjtt9/OnDlz6NWrV2jPecYZ\nZzBp0iRmzJjB9OnTQ3ueSpDdETXXwIHRxyHVJ6hjMGlLFCtlQ6xao4ShRB07tmyrlp0EnXPceeed\n9O7dm5kzZzJx4kSWLVvGiSeeiPllSmVkZtx0000MHTqUyZMnM6+aayEX4DfMpcKYUg5BtReStEQx\nna6cDbFqjRKGEvl90/P7RlhpXn31VQ455BBOPPFEunXrxsMPP8zMmTPZaqutIouhS5cuzJo1iy99\n6UscddRRrFixIrLnThK/EZkkdRdLZco3iTBJDjzQv71LF70P4qaEoUQPPNCyrZInPn722Wf84he/\noF+/fjz66KOcf/75PP/88xwQNOMoZDvssAN//OMfWbFiBccccwyfarafSFmMGePfnrRJhEHzLBYs\niDYOaUkJQ4mqKcNdsGABw4YN47zzzmPAgAEsXryYK6+8ki5dusQa14EHHshPf/pTnnjiCb7//e/H\nGkvU0um4I5BqlE7791xBsuYFBPWCQLLirFVKGMrkhRfijqB4H3/8MVOmTGGvvfbi3//+NzfccAP/\n+Mc/6N8jlAB1AAAgAElEQVS/f9yhfe773/8+Rx99NNdddx133HFH3OFE5tBD445AqlGl1F7QvhHJ\npoShFfzm/5WpLEHo/vznP7P77rtz7bXXcvjhh/Piiy9y1lln0b59+7hDa8LMmDFjBrvvvjunnXYa\nS5YsiTukSPj9M7t2jT4OqS6VUHsh31LKJE3KrGVKGFph0KCWbUGzepNixYoVHHvssRx22GFs2LCB\nhoYGZs+ezXbbbRd3aIE222wz7rvvPjp06MD48eP5oAZ2YPLrNn7mmejjkOpRKbUXtJQy+ZQwtMJD\nD8UdQfE2btzI9OnT6d27N/fccw9nnnkmL774IkceeWTcoRVl11135fbbb2f58uUcf/zxbNiwIe6Q\nQhP0wa4PTGmLSqi9oKWUlUEJQysEZeVJm8ewbNkyRo8ezemnn852223Hk08+yY033ki3Ctv6cOzY\nsVx00UXMmTOHiy66KO5wQrPXXnFHINWoEmov5NuVMkm9ILVOCUMZJWUew7p167jkkksYOHAgCxYs\n4PLLL2fx4sWMHDky7tBa7eKLL+ab3/wmV155Jffdd1/c4YRi9eqWbX7DXyLFCuq16tQp2jjySae1\nK2WlUMLQSptt1rItCfMYHn/8cQYNGsRPfvIT9tlnH5577jkuvPBCOvqVqKwg7dq144477qBXr15M\nnDiRl4I+YSpU0Af7nDnRxiHVZdgw//ZFi6KNIx8VaqocShha6emn/dvjmnX84YcfctpppzF69GhW\nrlzJrbfeyiOPPMKuu+4aT0Ah2HLLLWloaGDDhg0ceeSRfPTRR3GHVDZBE770gSltEbT5a5LmxahQ\nU+VQwtBKQW+4oLG4sDjnuOuuu+jTpw+33HILJ5xwAsuWLWPSpEmh7/8QhwEDBvC73/2Ol156iUmT\nJlXNzpZ+vVObbhp9HFI9Ghr825M0HJGvXHWSkhrxKGFoA7/18VH+/nr99dc5/PDDOe644+jSpQsP\nPfQQd9xxB92r/Gvpcccdx9SpU7nvvvv46U9/Gnc4bRb0DWv+/GjjkOoyYYJ/e5KGI4IKNSWtXLV4\nlDC0QdD6+LCHJTZs2MC1115L3759eeihh/jBD35AKpXi4IMPDveJE+RnP/sZ++23HxdeeCF/+9vf\n4g6nTbT+XKKUlNdVvkJNSYlRmlLC0AZBL+rRo8N7zmeffZY999yTKVOm0Lt3bxYsWMDPf/5zutZY\nOcBNNtmEu+66i2233Za6ujqWL18ed0itkkr5D0fEvJ2HVLig4YgklYIOSpSTVB9CmlLC0EZBv6fL\nXZNh9erV/OAHP2DYsGG89NJLTJs2jWeeeYbBgweX94kqSPfu3Zk1axYff/wx48eP55MkLFMpUVDt\nBU34krYIGo5ISinooEQZklUfQppSwtBGQcMSQcuZWmPOnDn069ePq6++mkMOOYQXXniB733ve4nb\n/yEOw4cP58Ybb2TJkiWcccYZFTcJ0q/2AqhLVsKRlOlNGoarTEoY2qhvX2jn81NcswZWrmzbY69c\nuZITTjiBQw45hDVr1nDXXXfxwAMPsMMOO7TtgavMd7/7XU499VRuv/12brjhhrjDKVrQDHF9aEpb\nBA1HzJwZaRiBVAa6cilhKIPHHvNvHzOmdY/nnOPWW2+ld+/e/OEPf+C0005j2bJlHHPMMVW5VLIc\nrr/+eoYPH87UqVN54okn4g6nKEEzxPWhKW0RNBwxcWK0cQQJ2sJdZaCTTwlDGQSNub3wQulzGV5+\n+WXGjBnDySefTI8ePXj88cf5zW9+w5Zbbtn2QKtYp06dmDVrFl/+8pc5+uijefvtt+MOKa+GBv8Z\n4mb60JTqFrRTvRLl5As1YTCzLc3sTjNbZWYfmtlvzSzvdH4zazSzjTnHBjO7Kcw4yyGoyE6xcxnW\nr1/P5ZdfzoABA3jqqae45JJLWLJkCfvuu2/5gqxy2223HXfffTfvvfceRx11FOvXr487pEBB3wKD\nhilEijFjhn97UoYjGhv9a9WoDHRlCLuH4Q9AH+AA4FvAKOA3Ba5xwHSgB9AT2Bo4L8QYyyKoyM6a\nNf69DB988MHnf3/qqacYMmQIP/7xjxk+fDhLlizh4osvplOSSrJViNGjR3P11Vczb948vve978Ud\njq98SYFmiEtbnHKKf3tShiOChuG0KqgyhJYwmFlv4GDgFOfcQufcU8Bk4Dgz61ng8k+cc+8651Zm\njo/DirNc+vYN7mXYY4+mt9955x2OPPJIVq1axZlnnsk+++zDW2+9xfTp02lsbKRPnz7hB1zFzj33\nXI4//nh+/etfc+utt8YdThPpNOy/v/99Wn8u1SyoUJOZJvpWijB7GEYAHzrnns1pexivByFgUc3n\nTjCzd83seTO70swqooxNUC/D2rVfrH/++OOPOeyww3jqqafo06cPv/71rzn66KN56aWXOPXUU2nn\nt+RCSmJmTJ8+nQEDBvA///M/LExQndl8e43UYu+Cme2QGap81cw+MbN/mdklZtYh7tgqTdJXRwQt\npdQwXOUI87dTT6DJwkLn3Abgg8x9Qe4ETgT2A64Evg3cHk6I5dW3L/Tr53/f6NHw2Wefcdxxx7F4\n8WI+++wz3n33XRoaGrjrrrvo2bNQp4uUomvXrjQ0NNClSxfGjx/Pu+++G3dIpNMQtCt3Dfcu9AYM\nOBXYHZgCnAFcEWdQlSjJqyMaG1WoqRpsUuoFZnYV8MM8pzi8eQuBD5E5x/9i536bc/MFM1sBPGxm\nOznnAuv/TpkyhW7dujVpq6uro66uLk8o5ffII9Cjh989jsMPn85DD/3585bPPvuM6dOnc+CBB7L5\n5ptH
FmOt6NWrF3/4wx/41re+xXHHHcecOXPYZJOSX/Jlk7Tehfr6eurr65u0rVq1KtIYnHNzgDk5\nTa+Z2dV4SUPi5y5JcYLmLtRwolyZnHMlHcBXgF0LHJsAJwHvN7u2PfApMK6E59sU2AgcFHD/EMAt\nWrTIJUXv3s55c4GbHxtdp07bu0MOOcRde+21btmyZW7jxo1xh1v1LrvsMge4H/zgB7HF8NhjQa8J\n5+bOjS2sFhYtWuTwEvohrsTPhnIdwOXA/ALnJO59H6eg11evXnFH5tzzz/vHZhZ3ZOJcae/5kr9u\nOefeB94vdJ6ZzQO2MLPB7ot5DAfg9TAEFFT2NTjzj3mn1FjjMnduUC+Dse22y/nrX1XSOUoXXHAB\nCxYs4Be/+AV77LEHxxxzTKTPn0oFT3Ts3FldsrnMbBfgbGBq3LFUkqAicU89FW0cfoL2S9HchcoT\n2hwG59xLeF2Nt5jZMDPbG7geqHfOrQAws23MbJmZ7ZG5vbOZ/cjMhmQmQ40Ffg/Mdc6lwoq13Lp3\nD+5qe/XV9tx/f7Tx1Lp27dpx2223seuuu3LyySeTSkX7Uho+PPi+BM3HLCszu6pZPZXmxwYz27XZ\nNdsCfwXucs4FVBQQP0FbqMRd2yCVCt4vRYly5Ql7QPd44Aa81REbgXuBc3Pu74A3hJFdkLgeODBz\nTlfgTeAeKnAC1KhR3h4TfsuIjjjCeyNpKVF0unXrxn333cfw4cMZP3488+fPZ4sttgj9eRsavFoc\nfnbbrapfA1cDhda0vpr9i5ltAzwKPOmcO73YJ0nK3KU4BeW/SSjjEtS7MGhQtHGIp83zlgqNWST9\nIMFjmXPnBo9bd+4cd3S16Z577nGAO+yww9yGDRtCfa6gsdvskU6H+vStEsccBmBb4J/AHYAVeU1i\n3/dR69zZ//WVSsUbV77XfxJf+7WqlPe8Fv2HaNSo4DXQa9fC738faTgCHHXUUfzwhz/kwQcf5PLL\nLw/1ufKVBZ87N/7u4iQws62BRuANvFUR3c2sh5n5zgKSltau9W+Pu/cqX++CXvuVSQlDyCZO9N/+\nGmDSpC8KOkl0Lr/8cg488EAuueQS/vznPxe+oBVmzAj+IN9tN43f5vgGsDMwBm8I8m28Cc7J3j0s\nIdLpuCPwl2/uwpw5/u2SfEoYIhC0/TV4BZ1K3dFS2maTTTahvr6er33ta5xwwgm88sorZX38hobg\nmv5mShJzOed+75xr3+xo55zTUqIi7LOPf3vc9Q3Uu1CdlDBEYNQomD07+P5+/ZQ0RG2rrbaioaGB\ndevWceSRR7I66OtQiRobgyvuATz/vD4wpXyCct04e7DUu1C9lDBEZNw4b819kKFDo4tFPEOGDOHm\nm28mlUpxyimnZCfTtVq+egvgzWeJe1xZJGzqXaheShgilG/N/bp1cP310cUinokTJ3LWWWdx1113\nMW3atFY/TjoNAwYE39+rVzJq+kv1CNpsKl9vZtiCehfatVPvQjVQwhChvn3zjy2ec46Shjhcc801\njBw5kvPOO4/H8k04CZBOw447BhfPgWRU3JPqEjT0NW5ctHHkCupdeO459S5UAyUMESs0n+Gcc7Tc\nMmodO3bknnvu4atf/SrHHnssb775ZtHXZpOFoBURoCWUUhvyzV3QUFx1UMIQg3Hj8u9RP2mSkoao\nbbPNNtx77718+OGHTJgwgbVr17Jy5cqC8xpGjCicLGgJpZRb0HLKHXaINo5cqupY/ZQwxGTiRLju\nuuD7lTREb++99+baa69lwYIFnH322Vx22WXMDRhDyvYsLA/ccN1LCpUsSBhGjPBvnz8/2jiyNHeh\nNoS9l4TkMXkyTJsW/Etn0iT46CPvPInGmWeeyfz58/nd735Hu3bteP/999lvv/2anJNOe9/k1q0L\nfpyZMzXJUcIT9JkR19CX5i7UBvUwxOzpp/Mvt9REyGjNmjWL5557DoCNGzcya9Ys3n//i93cUynY\nemslCyJZQb0Lm26quQvVRglDzLp3h9df9yoABjnnHC+pUHGn8I0fP55vf/vbdOzYEYD169dz2223\nAV5Rpv7986+GuO46JQsSrsZG//a4llMG9S7ENTwi4VHCkADdu3sVAPMlDevWqSJkFNq1a8fUqVNZ\nuHAh/fv3B+CWW26hocHlLcoEXrKg4SMJ25gx/u1xLKdU70JtUcKQEH37wooVhfew79cP7r8/mphq\nWf/+/Zk/fz7/+7//y7Jlw/KWewYlCxKdNhYkLaukTb6UcClhSJDu3eGNNwovjTriCM1riMIrr3Tm\n+uuvBmYCwd0/M2cqWZDak0rBxx+3bFfvQvVSwpAw3bvDa6/lX3IJ3ryGTp00RBGWhgZvvsL69VAo\nWdCcBYlKUDnoqHenTKdh4ED/+9S7UL2UMCTU5Mn5izuB98usXz/Vayi3Sy/Nv+Nk1ty5ShYkWkGv\ny6jrfRx6KGzc2LJ90CD1LlQzJQwJNnFicTOfJ03SKopyaGjwJp5efHHhc1XBUWpVOg3PPut/n4o0\nVTclDAk3blxx3Y3ZVRSa21C6bNXGYnoVdtrJO1/JgiRFZgVwZA491L9d21dXPyUMFWDUKG9m9BVX\nFD73nHO8b8laSVGcGTOgZ0+vFkYhM2fCq6/qQ1HiETR/YfHi6GII6l1QCejaoIShglxwQfHFWY44\nQsMU+WSHH045pfC5HTt6M8I1X0HiFNQDFtWcgXQadtnF/z6VgK4NShgqzLhx3i+vYrohs8MUO+8M\nK1eGH1slaGz0EoVihh/AW62ybp0mcomMHeu/jLJrV70/aoUShgrUt6/3S6zY3obly6FHj9pOHLKJ\nQqFqjVlm3twR1VeQJIty/sLSpf7tzzwTXQwSLyUMFWzcuC92TixGNnFo1w4efzzc2JJixozSEgXw\nehU2btTERkmOVMq/Par5C6mU/4Zrw4apd6GWKGGocNlCT6UUbnEORo/2fpFWYw2H7KqHYucoZGVX\nQKhXQZJmjz3826P4ZZ1OeysgmjODBx8M//klOZQwVInsSopCxZ6amzTJe+NXQ69Dtjeh2FUPWdnh\nB62AkKTKt5162MaOhQ0bWrbvsYfeL7VGCUOVmTjRSxwKlZZuLrfXoZKWZWaThFJ7E7Jmz9bwg0iQ\ndBoWLGjZ3r69ehdqUWgJg5ldYGb/MLPVZvZBCdddamZvm9knZvZ3MwtYyCP5TJ7cuh6HrCOO+OIX\ncZImS2YnL7YlSQAvUXAuni2BRcqh2EnPbTF2rP/umEuXqnehFoXZw9ABuBv4dbEXmNkPgbOB04Hh\nwGpgjplFXMusemR7HNqyOU12smT2l3SUwxe5PQilTl5srlMnb/KWEgWpJI2N/u1hv4aDehc00bF2\nhZYwOOd+4pz7FfB8CZedC1zmnHvAOZcCvgNsAxwRRoy1JDvHIZ2G7bZr22M1H77IPVrTG9G816Ac\nPQi5rrvOi3ntWn3QSeUZMyae550woWXvgiY61rZN4g4gy8x2A
noCj2TbnHMfmdkzwAi83gppo+7d\n4c03vb/ff7839FBO2d6IuHXqBIsWKUGQyuc3JBC2dBoWLmzZromOtS1Jkx57Ag5IN2tPZ+6TMhs3\nzvswauuQRVJ87WveB516E0RaL52Gr3+95cqMzTdX70KtK6mHwcyuAn6Y5xQH9HHOvdymqJo9beZx\n85oyZQrdunVr0lZXV0ddXV0ZQ6le2SGLrOuv9zaySrrZszUfobXq6+upr69v0rZq1aqYopFShJng\nT5gA//1v07ZOneCVV9S7UOtKHZK4Gri1wDmvtjKWFXjJQQ+a9jJ0BwJ2X//CtGnTGDJkSCufWpqb\nPLlpAaPHH/fmLcRJwwzl5ZdQL168mKFDh8YUkRQrrGXAGoqQfEoaknDOve+ce7nA8VlrAnHOLcdL\nGg7ItpnZl4A9gada85hSPtkeCL+j1JoPhWSXPDY/NMxQ3cyso5ktMbONZjYg7niSYNq0aJ9vwgT/\noYigrbWltoRZh2F7MxsI7AC0N7OBmaNrzjkvmVluh/K1wI/M7HAz6w/cBrwFVEgZodqUrflQrkND\nDDXr53jv9xim+SXT1KnRPt877zS9raEIyRXmKolL8ZZFZmW3SdkfyK7i/zrw+cQD59zPzWxT4DfA\nFsATwKHOufUhxikiMTOzQ4GDgAnAN2MOp2ZtvbVXIj1LQxGSK7SEwTl3EnBSgXPa+7RdAlwSTlQi\nkjRm1gOYDowF1sQcTuK1tY5Kc6kUjBwJa9ZA584weDCsWuUlDxqKkFyJqcMgIjXrVuAm59yzZlbk\nZu21a9Gi8j7eyJFfrIr4+GNvCOKjj8r7HFIdlDCISNkVuwQbOATYHPhZ9tJSnqcWl1OXe4hgzZr8\nt6V6tHUptRIGEQlDMUuwl+PNadoLWGfWJFdYaGZ3ZoY2A2k5ddt16dK07kKXLvHFIuFq61JqJQwi\nUnbOufeB9wudZ2aTgQtzmrYB5gDHAPPDiU7Aq7kwYQJ06waffOK1bbopzJsXb1ySXEoYRCQ2zrm3\ncm+b2Wq8YYlXnXNvxxNVMpx1VniPnS3/nNuzsPfe8OST4T2nVL4k7SUhIgKqwwDATTeF87h+yQK0\nrMEg0px6GEQkMZxzrwMtlltLeQQlC+AtoxTJRz0MIiIV4vzz23b94Yf7Jwsq/yzFUMIgIlIhrryy\n9demUrBgQcv2zTdX+WcpjhIGEZEql0pB//4t282ULEjxlDCIiFSxoGQBYNgwJQtSPCUMIiJVKp2G\ngQP972vfHh54INp4pLIpYRARqULpNPTqBRs3+t+/dKl6F6Q0ShhERCrEypXFnZdOw847w+rV/ven\nUtC3b/niktqghEFEpELsvnvhc7LJQrbcc3ODBytZkNZR4SYRkQrxfoHdOdJp2Gmn4B0n27WDhx4q\nf1xSG9TDICJSBVIp2Gab/NtTP/ec5i1I6ylhEBFJoO9+17/98cdbtjU0eEsngyY4Asydq6EIaRsl\nDCIiCXTLLf7to0c3vX3ppd421fnMnQujRpUnLqldmsMgIlJhRoyAQw+Fiy8ufO7s2UoWpDyUMIiI\nVJinn/aOQmbPhnHjwo9HaoOGJEREEqrQUEM+Shak3NTDICKSUPfe620QVSrNWZAwqIdBRCTBzj+/\n+HO33tqrxaBkQcKghEFEJMGuvBKuu67weXPnwttvq86ChEcJg4hIwk2eDM7lP9SrIGFTwiAiIiIF\nhZYwmNkFZvYPM1ttZh8Uec2tZrax2fGXsGIUERGR4oS5SqIDcDcwDzi5hOv+CkwCsnOD15U3LBER\nESlVaAmDc+4nAGY2scRL1znn3g0hJBEREWmlJM5h2M/M0mb2kpndZGZfjjsgERGRWpe0wk1/BWYB\ny4FewFXAX8xshHPOxRqZiIhIDSspYTCzq4Af5jnFAX2ccy+3Jhjn3N05N18ws+eBfwP7AY+15jFF\nRESk7UrtYbgauLXAOa+2MpYWnHPLzew9YBcKJAxTpkyhW7duTdrq6uqoq6srVzgiVaW+vp76+vom\nbatWrYopGhFJupISBufc+8D7IcXSgpltB3wFeKfQudOmTWPIkCHhByVSJfwS6sWLFzN06NCYIhKR\nJAuzDsP2ZjYQ2AFob2YDM0fXnHNeMrNxmb93NbOfm9meZraDmR0AzAZeBuaEFaeIiIgUFuakx0uB\n7+TcXpz5c3/g8czfvw5kxxE2AAMy12wBvI2XKFzknPs0xDhFRESkgDDrMJwEnFTgnPY5f18LHBJW\nPCIiItJ6SazDICI1xsy+ZWZPm9knZvaBmTXEHZOINJW0OgwiUmPMbAIwHfh/wKN4ZeX7xRqUiLSg\nhEFEYmNm7YFrgf91zs3MueuleCISkSAakhCROA0BtgEws8Vm9raZ/cXMdo85LhFpRgmDiMRpZ7yd\naS/GW1n1LeBDYK6ZbRFnYCLSlIYkRKTsii0jzxdfWi53zs3OXHsS8BZwNHBLvudRhVeR4rW1uqsS\nBhEJQ7Fl5LfJ/H1ZttE5t97MXgW+VuhJVOFVpHhtre6qhEFEyq7YMvJmtghYB+wGPJVp6wDsCLwe\nYogiUiIlDCISG+fcf83sZuAnZvYWXpJwHt6QxT2xBiciTShhEJG4fR/4FLgN6AI8A4xxzmnrTJEE\nUcIgIrFyzm3A61U4L+5YRCSYllWKiIhIQUoYREREpCAlDCIiIlKQEgYREREpSAmDiIiIFKSEQURE\nRApSwiAiIiIFKWEQERGRgpQwiIiISEFKGERERKQgJQwiIiJSkBIGERERKUgJg4iIiBSkhEFEREQK\nUsIgIiIiBSlhEBERkYKUMLRCfX193CEEUmytk9TYkhpXtaqUn7fiLL9KiTXOOENLGMxsBzP7rZm9\namafmNm/zOwSM+tQ4LpOZnajmb1nZv81s3vNrHtYcbZGkl9Yiq11khpbUuOqVpXy81ac5VcpsVZl\nwgD0Bgw4FdgdmAKcAVxR4LprgW8BE4BRwDbArPDCFBERkUI2CeuBnXNzgDk5Ta+Z2dV4ScN5fteY\n2ZeAk4HjnHNzM20nAcvMbLhzbn5Y8YqIiEiwqOcwbAF8kOf+oXhJzCPZBufcP4E3gBHhhiYiIiJB\nQuthaM7MdgHOBqbmOa0nsN4591Gz9nTmPj+dAZYtW9bmGIu1atUqFi9eHNnzlUKxtU5SY4s6rpz3\nUefInrR1QnnfJ/V10JziLL9KibXccZb0nnfOlXQAVwEb8xwbgF2bXbMt8C/gNwUeuw5Y49M+H7gy\n4JrjAadDh46yHseX+tkQ5YHe9zp0lPso+J5vTQ/D1cCtBc55NfsXM9sGeBR40jl3eoHrVgAdzexL\nzXoZuuP1MviZA5wAvAasLfD4IpJfZ2BHms4/SiK970XKo+j3vGWy9VCY2bZ4ycIC4NuuwJNlJj2+\nizfp8b5M267AS8BemvQoIiISj9ASBjPbGngc7xvARLyhCgCcc+nMOdvgTXD8tnNuYabtJuBQ4CTg\nv8B1wEbn3L6hBCoi
IiIFhTnp8RvAzpnjzUyb4Y2VtM/c7gDsCmyac90UvOTiXqAT8BBwVohxioiI\nSAGhDkmIiIhIddBeEiIiIlKQEgYREREpSAlDEczsAjP7h5mtNrN8lSqbX3epmb2d2Xzr75niVeWO\nbUszu9PMVpnZh5kNv7oWuKbRzDbmHBsyk03bGstZZrbczNaY2dNmNqzA+Ueb2bLM+UvN7NC2xlCO\n2MxsYs7PJfsz+iSEmPY1sz+Z2X8yzzG2iGv2M7NFZrbWzF42s4nljku+YGYdzWxJ5v9nQNzxNNfa\nTf4iiq2kz4Oomdn5ZjbfzD4ys7SZ3ZdZlZdombg3mtk1UT+3EobidADuBn5d7AVm9kO8ypanA8OB\n1cAcM+tY5tj+APQBDsDbtGsU8JsC1zhgOtADr4Lm1gTs71EsMzsW+CVwMTAYWIr3790q4PwRmdhv\nAQYBs4HZZrZ7W+IoR2wZq/B+Ntljh3LHBXQFluBN6i04mcjMdgQexFtZNBD4FfBbMzsohNjE83Pg\nLYr4/4lJazf5C1Ur33NR2xe4HtgTOBDvc/5vZtYl1qjyyCRdp+L9PKMXd8W2Sjrwlod+UOS5bwNT\ncm5/CVgDHFPGeHrjVdccnNN2MPAZ0DPPdY8B15T5Z/M08Kuc24b3QXtewPl/BP7UrG0ecFMI/2+l\nxlb0/3MZY9wIjC1wzs+A55q11QN/iTLWWjnwlne/kPM+GxB3TEXG/X3glZhjKOk9l4QD2Crz/7xP\n3LEExLcZ8E9gTBif4cUc6mEIgZnthPetNHcTrY+AZyjvJlojgA+dc8/mtD2M921ozwLXnmBm75rZ\n82Z2ZVuy6kz351Ca/ntdJpagf++IzP255uQ5P8rYADYzs9fM7A0zC6XnoxX2IoKfmYCZ9cDrhTsR\nL9GvJIU2+QtVG95zcdsC77Mztp9dATcCDzjnHo0rgMg2n6oxPfFeeM3LWefbRKu1z7Myt8E5tyEz\nzyLf89wJvI7XCzIAr9t1V+CoVsaxFV5tDb9/724B1/QMOL+cPx9oXWz/xNtm/TmgG/AD4Ckz6+uc\n+0+Z4ytF0M/sS2bWyTm3LoaYqtWteL1dz5pZGMNRoShyk7+wteY9FyszM+BavC0MXow7nubM7Di8\nods94oyjZnsYzOyqZhP/mh8bQpgAky1cFXZseZ/HOfdb59zfnXMvOOfqge8AR2Z6RsqpqH9vG85v\ni4qxegUAAAONSURBVMDncs497Zy7wzn3nHPuCWA8Xsny0yKKrRSW+TOpY+yJUez7yszOATbHGwKC\nL37GiYu12TXbAn8F7nLOzYg65iJE+f4u1U14c0COizuQ5sxsO7xk5kTn3KdxxlLLPQwlbaJVohV4\nb44eNM2yuwPP+l7RVLGxrcg85ufMrD2wJcGbdfl5Bi/eXYDlJVyX9R5edc4ezdrzbRq2osTzW6s1\nsTXhnPvMzJ7F+/nEKehn9pFzbn0M8VSaYt5Xy4H98YZ/1nlfPD+30MzudM6dFFJ8ucLc5C9sbX7P\nRcnMbgC+CezrnHsn7nh8DAW+CiyyL16Q7YFRZnY20Ckz5BO6mk0YnHPvA++H9NjLzWwF3sqF5+Dz\njbX2xBuHKktsZjYP2MLMBufMYzgA75f/MyWEPBgv82/Vm8U596mZLco8958ysVnm9nUBl83zuf+g\nTHvZtDK2JsysHdAP+Es5Y2uFeXgT8XJ9gzL/zKpVCe+rycCFOU3b4M0VOQaIZAO8Uj6frOkmfyeH\nGVcxyvGei0omWRgHjHbOvRF3PAEeBvo3a5sJLAN+GlWyAGiVRDEHsD3eMraL8JbbDcwcXXPOeQkY\nl3P7PLw3/OF4/9mzgX8BHcsc21+AhcAwYG+88ffbc+7fBu+FtUfm9s7Aj4AheEsFxwKvAI+2MY5j\n8CaHfQdvVvlvMv/+r2buvw24Muf8EcB6vLHW3YBL8LYp3j2E/79SY/sxXvKyE14yVY+3LLZ3mePq\nmnkdDcKbnf29zO3tM/dfBfw+5/wdgY/xusp3A87M/AwPjPs9Us1H5n2SyFUSeEui/wX8PfNe75E9\nYo4r73suCQfeMMSHeMsre+QcneOOrYjYY1klEfs/vBIOvK7BDT7HqJxzNgDfaXbdJXgTCz/B+4ay\nSwixbQHcgZfIfIhX12DTnPt3yI0V2A5oxBuT/wQvwbgK2KwMsZyJtzvpGrxvvXvk3PcoMKPZ+RPw\nEq01eD0xB4f4f1h0bMA1eF3TazL/fw+E8csCGJ35RdT8dTUj53X3qM81izKx/Qtvp9fY3yPVfOS8\nh5KYMEz0ef1sBDYkILbA91wSjoD3XovP8SQemc+syBMGbT4lIiIiBdXsKgkREREpnhIGERERKUgJ\ng4iIiBSkhEFEREQKUsIgIiIiBSlhEBERkYKUMIiIiEhBShhERESkICUMIiIiUpASBhERESlICYOI\niIgU9P8B/EcwhVzEWEEAAAAASUVORK5CYII=\n",
288 | "text/plain": [
289 | ""
290 | ]
291 | },
292 | "metadata": {},
293 | "output_type": "display_data"
294 | }
295 | ],
296 | "source": [
297 | "fig = plt.figure()\n",
298 | "arrow_dir = np.array([[0, 0, v[0][0], v[0][1]], [0, 0, v[1][0], v[1][1]]])\n",
299 | "X, Y, U, V = zip(*arrow_dir)\n",
300 | "ax = fig.add_subplot(1, 2, 1)\n",
301 | "plt.plot(x_values, y_values, '.')\n",
302 | "ax.quiver(X, Y, U, V, angles='xy', scale_units='xy', scale=1)\n",
303 | "ax.axis('equal')\n",
304 | "\n",
305 | "plt.xlim([-2, 2])\n",
306 | "plt.ylim([-2, 2])\n",
307 | "\n",
308 | "\n",
309 | "ax = fig.add_subplot(1, 2, 2)\n",
310 | "ax.axis('equal')\n",
311 | "plt.plot(trans[:, 0], trans[:, 1], '.')\n",
312 | "\n",
313 | "plt.xlim([-5, 5])\n",
314 | "plt.ylim([-5, 5])\n",
315 | "plt.show()"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "### 8. Singular Value Decomposition (SVD)\n",
323 | "\n",
324 | "**SVD** is another way of factorizing a matrix to give **singular values** and **singular vectors**. However, it is more generally applicable than eigendecomposition, e.g. eigendecomposition is not defined for a non-square matrix and we opt for SVD there. Here, we write $A$ as:\n",
325 | "\n",
326 | "$$A = UDV^T $$\n",
327 | "\n",
328 | "**Shapes**:\n",
329 | "- $A$: $m$ x $n$\n",
330 | "- $U$: $m$ x $m$\n",
331 | "- $D$: $m$ x $n$\n",
332 | "- $V$: $n$ x $n$\n",
333 | "\n",
334 | "**Properties**:\n",
335 | "- $U$ and $V$ are defined to be orthogonal matrices. \n",
336 | "- $D$ is a diagonal matrix (not necessarily square), diagonal elements of which are called **singular values**.\n",
337 | "- The columns of $U$ are known as **left-singular vectors** (eigenvectors of $AA^T$) and those of $V$ are called **right-singular vectors** (eigenvectors of $A^TA$).\n",
338 | "- Most useful feature is to extend matrix inversion to non-square matrices."
339 | ]
340 | },
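    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "Computing the SVD with numpy and verifying the reconstruction. Note that `np.linalg.svd` returns the singular values as a vector, which has to be embedded into the $m$ x $n$ matrix $D$:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "A = rng.randn(3, 2)\n",
    | "\n",
    | "U, s, Vt = np.linalg.svd(A)   # U: 3x3, s: the 2 singular values, Vt: 2x2\n",
    | "D = np.zeros_like(A)\n",
    | "D[:2, :2] = np.diag(s)        # place the singular values on the diagonal of D\n",
    | "\n",
    | "print(np.allclose(A, U.dot(D).dot(Vt)))   # A = U D V^T"
    | ]
    | },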
341 | {
342 | "cell_type": "markdown",
343 | "metadata": {},
344 | "source": [
345 | "### 9. The Moore-Penrose Pseudoinverse\n",
346 | "Suppose we want a left-inverse $B$ of a matrix $A$ ($m$ x $n$)to solve a linear equation:\n",
347 | "$$ Ax = y \\Rightarrow x = By$$ \n",
348 | "\n",
349 | "We define the pseudoinverse of $A$ as:\n",
350 | "\n",
351 | "$$A^+ = \\lim\\limits_{\\alpha \\rightarrow 0} (A^TA + \\alpha I)^{-1}A^T$$\n",
352 | "\n",
353 | "However, for practical algorithms its defined as:\n",
354 | "\n",
355 | "$$ A^+ = VD^+U^T $$\n",
356 | "\n",
357 | "where $U$, $D$ and $V$ are the SVD of $A$ and $D^+$ is obtained by taking the reciprocal of all non-zero elements of D and then taking the transpose of the resulting matrix.\n",
358 | "\n",
359 | "**Case 1**: m <= n\n",
360 | "\n",
361 | "Using $A^+$, gives one of many possible solutions, with the minimal **Euclidean norm**:\n",
362 | "\n",
363 | "$$ x = A^{+}y $$\n",
364 | "\n",
365 | "**Case 2**: m > n\n",
366 | "\n",
367 | "It is possible for there to be no solution and $A^+$ gives the $x$ such that $Ax$ is as close to $y$ in terms of the **Euclidean norm** $||Ax - y||$."
368 | ]
369 | },
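    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "`np.linalg.pinv` implements exactly this recipe. For an overdetermined system ($m > n$, Case 2) it returns the least-squares solution:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "# Overdetermined: 3 equations, 2 unknowns, so in general no exact solution.\n",
    | "A = np.array([[1.0, 0.0],\n",
    | "              [0.0, 1.0],\n",
    | "              [1.0, 1.0]])\n",
    | "y = np.array([1.0, 2.0, 4.0])\n",
    | "\n",
    | "x = np.linalg.pinv(A).dot(y)\n",
    | "print(x)                              # the x minimizing ||Ax - y||\n",
    | "print(np.linalg.norm(A.dot(x) - y))   # residual of the least-squares fit"
    | ]
    | },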
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "### 10. The Trace Operator\n",
375 | "\n",
376 | "The trace operator gives the sum of all the diagonal elements.\n",
377 | "\n",
378 | "$$ Tr(A) = \\sum_{i}A_{i,i}$$\n",
379 | "\n",
380 | "Properties:\n",
381 | "\n",
382 | "- $||A||_F = \\sqrt{Tr(AA^T)} $ (**Frobenius Norm**)\n",
383 | "- $Tr(A) = Tr(A^T)$ (**Transpose Invariance**)\n",
384 | "- $Tr(ABC) = Tr(CAB) = Tr(BCA)$ (**Cyclical Invariance** given that the individual matrix multiplications are defined)"
385 | ]
386 | },
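    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "The trace identities above, checked numerically:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "rng = np.random.RandomState(0)\n",
    | "A, B = rng.randn(3, 4), rng.randn(4, 3)\n",
    | "\n",
    | "# Frobenius norm via the trace.\n",
    | "print(np.allclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.dot(A.T)))))\n",
    | "\n",
    | "# Cyclic invariance: Tr(AB) = Tr(BA), even though AB is 3x3 and BA is 4x4.\n",
    | "print(np.allclose(np.trace(A.dot(B)), np.trace(B.dot(A))))"
    | ]
    | },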
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "### 11. The Determinant\n",
392 | "\n",
393 | "The determinant of a square matrix (denoted by $det(A)$) maps matrices to real scalars. It is equal to the product of all the eigenvalues of the matrix. It denotes how much multiplication by the matrix expands or contracts space. If the value is 0, then space is contracted completely atleast along one dimension causing it to lose all its volume. If the value is 1, then the transformation preserves volume."
394 | ]
395 | }
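    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "A quick check that the determinant equals the product of the eigenvalues, reusing the matrix from the eigendecomposition demo above:"
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "import numpy as np\n",
    | "\n",
    | "A = np.array([[1.0, 2.0], [4.0, 3.0]])\n",
    | "eigenvalues, _ = np.linalg.eig(A)\n",
    | "\n",
    | "print(np.linalg.det(A))       # -5.0\n",
    | "print(np.prod(eigenvalues))   # 5.0 * -1.0 = -5.0"
    | ]
    | }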
396 | ],
397 | "metadata": {
398 | "kernelspec": {
399 | "display_name": "Python 2",
400 | "language": "python",
401 | "name": "python2"
402 | },
403 | "language_info": {
404 | "codemirror_mode": {
405 | "name": "ipython",
406 | "version": 2
407 | },
408 | "file_extension": ".py",
409 | "mimetype": "text/x-python",
410 | "name": "python",
411 | "nbconvert_exporter": "python",
412 | "pygments_lexer": "ipython2",
413 | "version": "2.7.12"
414 | }
415 | },
416 | "nbformat": 4,
417 | "nbformat_minor": 2
418 | }
419 |
--------------------------------------------------------------------------------