├── .gitignore
├── LICENSE
├── README.md
├── images
    ├── L_v1_v2.png
    ├── cmc_gif.gif
    ├── cookie_given_a.jpg
    ├── cookie_given_aa.jpg
    ├── edge_to_node.png
    ├── gnn_graph.png
    ├── hierarchical_models1.jpg
    ├── hierarchical_models2.jpg
    ├── hierarchical_models3.jpg
    ├── hierarchical_models4.jpg
    ├── hierarchical_models5.jpg
    ├── learning_from_many_views.png
    ├── node_to_edge.png
    ├── redshift_fig5.png
    ├── redshift_table1.png
    ├── size_principle.svg
    ├── understanding_betavae.png
    └── value_vs_perceived_value.jpg
├── notebooks
    ├── Noise_Contrastive_Estimation_Experiments.ipynb
    └── Transformer - Illustration and code.ipynb
├── notes
    ├── SimCLR.md
    ├── amdim.md
    ├── betavae.md
    ├── classify_without_labels.md
    ├── cmc_notes.md
    ├── contrastive_predictive_coding.md
    ├── cvpr2024
    │   └── neural_redshift.md
    ├── deepinfomax.md
    ├── iic.md
    ├── mine.md
    ├── moco.md
    ├── moco_v2.md
    ├── on_mi_maximization.md
    ├── probmods
    │   ├── Chapter 10_ Learning with a language of thought.md
    │   ├── Chapter 11_ Hierarchical models.md
    │   ├── Chapter 12_ Occam's Razor.md
    │   ├── Chapter 13_ Learning (deep) continuous functions.md
    │   ├── Chapter 14_ Mixture models.md
    │   ├── Chapter 15_ Social Cognition.md
    │   ├── Chapter 3_ Conditioning.md
    │   ├── Chapter 4_ Causal and Statistical Dependence.md
    │   ├── Chapter 5_ Conditional dependence.md
    │   ├── Chapter 6_ Bayesian data analysis.md
    │   ├── Chapter 7_ Algorithms for inference.md
    │   └── Chapter 9_ Learning as conditional inference.md
    ├── understanding_betavae.md
    └── unsupervised_disentanglement.md
└── pdfs
    ├── 1502.05767.pdf
    └── autodiff.pdf


/.gitignore:
--------------------------------------------------------------------------------
1 | notebooks/.ipynb_checkpoints/*
2 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Vinay Sisodia
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Links to some important research papers or links. I plan to add notes as I go through each topic one by one.
  2 | 
  3 | 
  4 | ## ✔ 1. Information theory based (unsupervised) learning
  5 | * [x] [Invariant Information Clustering](https://arxiv.org/abs/1807.06653)
  6 |    * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/iic.md)
  7 | * [x] [Mutual Information Neural Estimation](https://arxiv.org/abs/1801.04062)
  8 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/mine.md)
  9 | * [x] [Deep Infomax](https://arxiv.org/abs/1808.06670)
 10 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/deepinfomax.md)
 11 | * [x] [Learning Representations by Maximizing Mutual Information Across Views](https://arxiv.org/abs/1906.00910)
 12 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/amdim.md)
 13 | * [x] How Google decoupled MI maximization and representation learning: [On Mutual Information Maximization for Representation Learning](https://arxiv.org/abs/1907.13625)
 14 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/on_mi_maximization.md)
 15 | 
 16 | ---
 17 | 
 18 | ## ✔ 2. Disentangled representations
 19 | * [x] [Quick overview by Google](https://ai.googleblog.com/2019/04/evaluating-unsupervised-learning-of.html)
 20 |   *  [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/unsupervised_disentanglement.md)
 21 | * [x] [β-VAE, pdf](https://openreview.net/pdf?id=Sy2fzU9gl)
 22 |   *  [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/betavae.md)
 23 | * [x] [Understanding disentangling in β-VAE](https://arxiv.org/abs/1804.03599)
 24 |   *  [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/understanding_betavae.md)
 25 | 
 26 | ```
 27 | Decided not to delve deeper into this topic. It is not mature yet.
 28 | * [Disentangling Disentanglement in Variational Autoencoders](https://arxiv.org/abs/1812.02833)
 29 | * [Isolating Sources of Disentanglement in Variational Autoencoders](https://arxiv.org/abs/1802.04942)
 30 | * [InfoGAN-CR: Disentangling Generative Adversarial Networks with Contrastive Regularizers](https://arxiv.org/abs/1906.06034)
 31 | * [Disentangling by Factorising, pdf](https://www.cs.toronto.edu/~amnih/papers/disentangling_nips_ws.pdf)
 32 | ```
 33 | ---
 34 | 
 35 | ## ✔ 3. Self Supervised Learning
 36 | * [x] [Representation Learning with Contrastive Predictive Coding](https://arxiv.org/abs/1807.03748)
 37 | * [x] [Data-Efficient Image Recognition with Contrastive Predictive Coding](https://arxiv.org/abs/1905.09272)
 38 |   *  [__Combined notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/contrastive_predictive_coding.md)
 39 | * [x] [Contrastive Multiview Coding](https://arxiv.org/abs/1906.05849)
 40 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/cmc_notes.md)
 41 | * [x] [Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/abs/1911.05722)
 42 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/moco.md)
 43 | * [x] [SimCLR](https://arxiv.org/abs/2002.05709)
 44 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/SimCLR.md)
 45 | * [x] [MoCo v2: Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
 46 |   * The shortest paper I have seen. The authors modified MoCo based on some ideas borrowed from SimCLR.
 47 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/moco_v2.md)
 48 | * [ ] Emerging Properties in Self-Supervised Vision Transformers [PDF](https://arxiv.org/pdf/2104.14294.pdf) | [Official Code](https://github.com/facebookresearch/dino)
 49 | 
 50 | ---
 51 | 
 52 | ## ✔ 4. Automatic differentiation
 53 | * [x] [Automatic differentiation in machine learning: a survey](https://arxiv.org/abs/1502.05767)
 54 |   * [__Annotated pdf__](https://github.com/vinsis/math-and-ml-notes/blob/master/pdfs/1502.05767.pdf)
 55 | * [x] [Automatic Reverse-Mode Differentiation: Lecture Notes](http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf)
 56 |   * [__Annotated pdf__](https://github.com/vinsis/math-and-ml-notes/blob/master/pdfs/autodiff.pdf)
 57 | * [x] [Reverse mode automatic differentiation](https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation)
 58 | 
 59 | ---
 60 | 
 61 | ## 5. NNs and ODEs
 62 | * [ ] [Neural Ordinary Differential Equations](https://arxiv.org/pdf/1806.07366.pdf)
 63 |   * [ ] [Adjoint tutorial](https://cs.stanford.edu/~ambrad/adjoint_tutorial.pdf)
 64 | * [ ] [Augmented Neural ODEs](https://arxiv.org/abs/1904.01681)
 65 | * [ ] [Invertible ResNets](https://arxiv.org/pdf/1811.00995.pdf)
 66 | * [ ] [Universal Differential Equations for Scientific Machine Learning](https://arxiv.org/abs/2001.04385)
 67 | 
 68 | ---
 69 | 
 70 | ## 6. Probabilistic Programming
 71 | * [x] [Probabilistic models of cognition](http://probmods.org/)
 72 |   * [__Notes__](https://github.com/vinsis/math-and-ml-notes/blob/master/notes/probmods/)
 73 | * [ ] [The Design and Implementation of Probabilistic Programming Languages](http://dippl.org)
 74 | * [ ] [Composition in Probabilistic Language Understanding](http://gscontras.github.io/ESSLLI-2016/)
 75 | 
 76 | ---
 77 | 
 78 | ## 7. Miscellaneous
 79 | ### Memorization in neural networks
 80 | * [Blog post by BAIR](https://bair.berkeley.edu/blog/2019/08/13/memorization/)
 81 | * [Identity Crisis: Paper](https://arxiv.org/abs/1902.04698)
 82 | 
 83 | ### Online Learning
 84 | * [A Modern Introduction to Online Learning](https://arxiv.org/abs/1912.13213)
 85 | 
 86 | ### Graph Neural Networks
 87 | * [ ] [A Gentle Introduction to Deep Learning for Graphs, pdf](https://arxiv.org/pdf/1912.12693.pdf)
 88 | 
 89 | ### Normalizing Flows
 90 | * [x] [Tutorial and implementations for UCB DUL course](https://github.com/TinyVolt/normalizing-flows)
 91 | 
 92 | ### Transformers
 93 | * [x] [Attention is all you need](https://arxiv.org/abs/1706.03762)
 94 |   * [x] [A nice visual introduction](http://jalammar.github.io/illustrated-transformer/)
 95 |   * [x] [Annonated paper](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
 96 |   * [__Python notebook with illustrations and working code__](https://github.com/vinsis/math-and-ml-notes/blob/master/notebooks/Transformer%20-%20Illustration%20and%20code.ipynb)
 97 | * [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451)
 98 | 
 99 | ### Others
100 | * [Zero-shot knowledge transfer](https://arxiv.org/abs/1905.09768)
101 | * [SpecNet](https://arxiv.org/abs/1905.10915)
102 | * [Deep Learning & Symbolic Mathematics](https://arxiv.org/abs/1912.01412)
103 | * [Deep Equilibrium Models](https://papers.nips.cc/paper/8358-deep-equilibrium-models)
104 | 
105 | ---
106 | 
107 | ## 8. Theory of neural networks
108 | ### Lottery tickets
109 | * [Lottery ticket hypothesis](http://news.mit.edu/2019/smarter-training-neural-networks-0506)
110 | * [Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask](https://arxiv.org/abs/1905.01067)
111 | * [Rigging the Lottery: Making All Tickets Winners](https://arxiv.org/abs/1911.11134)
112 | 
113 | ### Others
114 | * [What's Hidden in a Randomly Weighted Neural Network?](https://arxiv.org/abs/1911.13299)
115 | * [Topological properties of the set of functions generated by neural networks of fixed size](https://arxiv.org/abs/1806.08459)
116 | * [YOUR  CLASSIFIER  IS  SECRETLY  AN  ENERGY  BASED MODEL AND YOU  SHOULD TREAT IT LIKE ONE](https://arxiv.org/abs/1912.03263)
117 | * [Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology](https://openreview.net/pdf?id=ByxkijC5FQ)
118 | 
119 | ---
120 | 
121 | ## 9. Advanced Variational Inference
122 | * [Amortized Population Gibbs Samplers with Neural Sufficient Statistics](https://arxiv.org/abs/1911.01382)
123 | * [Evaluating Combinatorial Generalization in Variational Autoencoders](https://arxiv.org/abs/1911.04594)
124 | 
125 | ---
126 | 
127 | ## 10. Causal Inference
128 | * [Elements of Causal Inference, pdf](https://www.dropbox.com/s/gkmsow492w3oolt/11283.pdf)
129 | * [Causal Inference: What If, pdf](https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/2019/10/ci_hernanrobins_1oct19.pdf)
130 | * [Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution](https://arxiv.org/abs/1801.04016)
131 | 


--------------------------------------------------------------------------------
/images/L_v1_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/L_v1_v2.png


--------------------------------------------------------------------------------
/images/cmc_gif.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/cmc_gif.gif


--------------------------------------------------------------------------------
/images/cookie_given_a.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/cookie_given_a.jpg


--------------------------------------------------------------------------------
/images/cookie_given_aa.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/cookie_given_aa.jpg


--------------------------------------------------------------------------------
/images/edge_to_node.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/edge_to_node.png


--------------------------------------------------------------------------------
/images/gnn_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/gnn_graph.png


--------------------------------------------------------------------------------
/images/hierarchical_models1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/hierarchical_models1.jpg


--------------------------------------------------------------------------------
/images/hierarchical_models2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/hierarchical_models2.jpg


--------------------------------------------------------------------------------
/images/hierarchical_models3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/hierarchical_models3.jpg


--------------------------------------------------------------------------------
/images/hierarchical_models4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/hierarchical_models4.jpg


--------------------------------------------------------------------------------
/images/hierarchical_models5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/hierarchical_models5.jpg


--------------------------------------------------------------------------------
/images/learning_from_many_views.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/learning_from_many_views.png


--------------------------------------------------------------------------------
/images/node_to_edge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/node_to_edge.png


--------------------------------------------------------------------------------
/images/redshift_fig5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/redshift_fig5.png


--------------------------------------------------------------------------------
/images/redshift_table1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/redshift_table1.png


--------------------------------------------------------------------------------
/images/size_principle.svg:
--------------------------------------------------------------------------------
1 | <svg class="marks" width="267" height="255" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="translate(60,12)"><g class="mark-group"><g transform="translate(0,0)"><rect class="background" width="1" height="1" style="fill: none;"></rect><g class="mark-group root"><g transform="translate(0,0)"><rect class="background" width="200" height="200" style="fill: none;"></rect><g class="mark-group"><g transform="translate(0,200)"><rect class="background" width="0" height="0" style="pointer-events: none; fill: none;"></rect><g class="mark-rule" style="pointer-events: none;"><line transform="translate(0.5,-200)" x2="0" y2="200" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(57.5,-200)" x2="0" y2="200" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(114.5,-200)" x2="0" y2="200" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(171.5,-200)" x2="0" y2="200" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line></g><g class="mark-rule" style="pointer-events: none;"><line transform="translate(0.5,0)" x2="0" y2="6" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(57.5,0)" x2="0" y2="6" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(114.5,0)" x2="0" y2="6" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(171.5,0)" x2="0" y2="6" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line></g><g class="mark-rule" style="pointer-events: none;"></g><g class="mark-text"><text text-anchor="middle" transform="translate(0.5,19)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0</text><text text-anchor="middle" transform="translate(57.5,19)" style="font: 11px sans-serif; fill: #000; opacity: 1;">2</text><text text-anchor="middle" transform="translate(114.5,19)" style="font: 11px sans-serif; fill: #000; opacity: 1;">4</text><text text-anchor="middle" transform="translate(171.5,19)" style="font: 11px sans-serif; fill: #000; opacity: 1;">6</text></g><g class="mark-path" style="pointer-events: none;"><path transform="translate(0.5,0.5)" d="M0,6V0H200V6" style="fill: none; stroke: #000; stroke-width: 1;"></path></g><g class="mark-text"><text text-anchor="middle" transform="translate(100,35)" style="font: bold 11px sans-serif; fill: #000;">numDataPoints</text></g></g></g><g class="mark-group"><g transform="translate(0,0)"><rect class="background" width="0" height="0" style="pointer-events: none; fill: none;"></rect><g class="mark-rule" style="pointer-events: none;"><line transform="translate(0,200.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,180.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,160.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,140.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,120.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,100.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,80.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,60.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,40.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,20.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line><line transform="translate(0,0.5)" x2="200" y2="0" style="fill: none; stroke: #000; stroke-opacity: 0.15; opacity: 1;"></line></g><g class="mark-rule" style="pointer-events: none;"><line transform="translate(-6,200.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,180.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,160.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,140.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,120.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,100.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,80.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,60.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,40.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,20.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line><line transform="translate(-6,0.5)" x2="6" y2="0" style="fill: none; stroke: #000; stroke-width: 1; opacity: 1;"></line></g><g class="mark-rule" style="pointer-events: none;"></g><g class="mark-text"><text text-anchor="end" transform="translate(-9,203.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.0</text><text text-anchor="end" transform="translate(-9,183.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.050</text><text text-anchor="end" transform="translate(-9,163.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.10</text><text text-anchor="end" transform="translate(-9,143.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.15</text><text text-anchor="end" transform="translate(-9,123.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.20</text><text text-anchor="end" transform="translate(-9,103.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.25</text><text text-anchor="end" transform="translate(-9,83.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.30</text><text text-anchor="end" transform="translate(-9,63.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.35</text><text text-anchor="end" transform="translate(-9,43.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.40</text><text text-anchor="end" transform="translate(-9,23.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.45</text><text text-anchor="end" transform="translate(-9,3.5)" style="font: 11px sans-serif; fill: #000; opacity: 1;">0.50</text></g><g class="mark-path" style="pointer-events: none;"><path transform="translate(0.5,0.5)" d="M-6,0H0V200H-6" style="fill: none; stroke: #000; stroke-width: 1;"></path></g><g class="mark-text"><text text-anchor="middle" transform="translate(-48,100) rotate(-90) translate(0,3)" style="font: bold 11px sans-serif; fill: #000;">P(Big | data)</text></g></g></g><g class="mark-line marks"><path d="M200,197L143,188L86,156L29,67L0,0" style="fill: none; stroke: #4682b4; stroke-width: 2;"></path></g></g></g></g></g></g></svg>


--------------------------------------------------------------------------------
/images/understanding_betavae.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/understanding_betavae.png


--------------------------------------------------------------------------------
/images/value_vs_perceived_value.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/images/value_vs_perceived_value.jpg


--------------------------------------------------------------------------------
/notes/SimCLR.md:
--------------------------------------------------------------------------------
  1 | Code samples are taken from [here](https://github.com/wilson1yan/cs294-158-ssl/blob/master/deepul_helper/tasks/simclr.py) and [here](https://github.com/sthalles/SimCLR/blob/master/data_aug/contrastive_learning_dataset.py#L15-L22).
  2 | 
  3 | The SimCLR framework has four major components:
  4 | 
  5 | ### 1. A stochastic data augmentation module:
  6 | 
  7 | Taken from [here](https://github.com/sthalles/SimCLR/blob/master/data_aug/contrastive_learning_dataset.py#L15-L22):
  8 | 
  9 | ```python
 10 |         color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
 11 |         data_transforms = transforms.Compose([transforms.RandomResizedCrop(size=size),
 12 |                                               transforms.RandomHorizontalFlip(),
 13 |                                               transforms.RandomApply([color_jitter], p=0.8),
 14 |                                               transforms.RandomGrayscale(p=0.2),
 15 |                                               GaussianBlur(kernel_size=int(0.1 * size)),
 16 |                                               transforms.ToTensor()])
 17 | ```
 18 | 
 19 | ### 2. A neural network _base encoder `f(·)`_ that extracts representation vectors from augmented data examples:
 20 | 
 21 | 
 22 | ```python
 23 | from torchvision import models
 24 | import torch
 25 | import torch.nn as nn
 26 | import torch.nn.functional as F
 27 | ```
 28 | 
 29 | 
 30 | ```python
 31 | backbone = models.resnet18(pretrained=False, num_classes=50)
 32 | [name for (name,_) in backbone.named_children()]
 33 | ```
 34 | 
 35 | 
 36 | 
 37 | 
 38 |     ['conv1',
 39 |      'bn1',
 40 |      'relu',
 41 |      'maxpool',
 42 |      'layer1',
 43 |      'layer2',
 44 |      'layer3',
 45 |      'layer4',
 46 |      'avgpool',
 47 |      'fc']
 48 | 
 49 | 
 50 | 
 51 | 
 52 | ```python
 53 | backbone.fc
 54 | ```
 55 | 
 56 | 
 57 | 
 58 | 
 59 |     Linear(in_features=512, out_features=50, bias=True)
 60 | 
 61 | 
 62 | 
 63 | 
 64 | ```python
 65 | backbone(torch.randn(4,3,224,224)).size()
 66 | ```
 67 | 
 68 | 
 69 | 
 70 | 
 71 |     torch.Size([4, 50])
 72 | 
 73 | 
 74 | 
 75 | ### 3. A small neural network _projection head `g(·)`_ that mapsrepresentations to the space where contrastive loss is applied
 76 | 
 77 | We use a MLP with one hidden layer to obtain:
 78 | 
 79 | $$ z_i = g(h_i) = W_2 (\sigma (W_1(h_i))) $$
 80 | 
 81 | We find it beneficial to define the contrastive loss on $z_i’s$ rather than $h_i’s$
 82 | 
 83 | 
 84 | [Source](https://github.com/wilson1yan/cs294-158-ssl/blob/master/deepul_helper/tasks/simclr.py#L30-L36)
 85 | 
 86 | ```python
 87 |         self.proj = nn.Sequential(
 88 |             nn.Linear(self.latent_dim, self.projection_dim, bias=False),
 89 |             BatchNorm1d(self.projection_dim),
 90 |             nn.ReLU(inplace=True),
 91 |             nn.Linear(self.projection_dim, self.projection_dim, bias=False),
 92 |             BatchNorm1d(self.projection_dim, center=False)
 93 |         )
 94 | ```
 95 | 
 96 | ### 4. A contrastive loss function defined for a contrastive prediction task
 97 | 
 98 | > We randomly sample a minibatch of `N` examples and define the contrastive prediction task on pairs of augmented examples derived from the minibatch, resulting in `2N` data points. We do not sample negative examples explicitly.  Instead, given a positive pair, we treatthe other `2(N−1)` augmented examples within a minibatch as negative examples.
 99 | 
100 | No wonder you need such a huge batch size to train.
101 | 
102 | > To keep it simple, we do not train the model with a memory bank. Instead, we vary the training batch size `N` from `256` to `8192`.  A batch size of `8192` gives us `16382` negative examples per positive pair from both augmentation views. 
103 | 
104 | Define $l(i,j)$ as:
105 | 
106 | $$ l(i,j) = -log \frac{exp(sim(i,j)/\tau)}{\sum_{1,k\ne i}^{2N} exp(sim(i,k/\tau))} $$
107 | 
108 | Then the loss is defined as:
109 | 
110 | $$ \frac{1}{2N} \sum_{k=1}^{N} [ l(2k-1,2k) + l(2k,2k-1) ] $$
111 | 
112 | where adjacent images at indices `2k` and `2k-1` are augmentations of the same image.
113 | 
114 | > Training with large batch size may be unstable when using standard SGD/Momentum with linear learning rate scaling. To stabilize the training, we use the LARS optimizer for all batch sizes.  We train our model with CloudTPUs, using 32 to 128 cores depending on the batch size.
115 | 
116 | 
117 | ```python
118 | class SimCLR(torch.nn.Module):
119 |     def __init__(self, base_encoder, output_dim=128):
120 |         super(self, SimCLR).__init__()
121 |         self.temperature = 0.5
122 |         self.output_dim = output_dim
123 |         
124 |         latent_dim = base_encoder.fc.out_features
125 |         
126 |         self.proj = nn.Sequential(
127 |             nn.Linear(latent_dim, self.output_dim, bias=False),
128 |             nn.BatchNorm1d(self.output_dim),
129 |             nn.ReLU(),
130 |             nn.Linear(self.output_dim, self.output_dim, bias=False),
131 |             nn.BatchNorm1d(self.output_dim, center=False)
132 |         )
133 |         
134 |     def forward(self, images):
135 |         N = images[0].shape[0]
136 |         xi, xj = images
137 |         hi, hj = self.base_encoder(xi), self.base_encoder(xj) # (N, latent_dim)
138 |         zi, zj = self.proj(hi), self.proj(hj) # (N, output_dim)
139 |         zi, zj = F.normalize(zi, dim=-1), F.normalize(zj, dim=-1)
140 |         
141 |         # Each training example has 2N - 2 negative samples
142 |         # Thus we have 2N * (2N-2) negative samples and 4N positive samples
143 |         all_features = torch.cat([zi,zj], dim=0) # (2N, output_dim)
144 |         sim_mat = (all_features @ all_features.T) / self.temperature # (2N,2N)
145 |         # set all diagonal entries to -inf
146 |         sim_mat[torch.arange(0,2*N), torch.arange(0,2*N)] = torch.tensor(-float('inf'))
147 |         # image i should match with image N+i
148 |         # image N+i should match with image i
149 |         labels = torch.cat( [N + torch.arange(N), torch.arange(N)] ).long()
150 |         loss = F.cross_entropy(sim_mat, labels, reduction='mean')
151 |         return loss
152 | ```
153 | 
154 | > We conjecture that one serious issue when using only random cropping as data augmentation is that most patches from an image share a similar color distribution. Figure 6 shows that color histograms alone suffice to distinguish images. Neural nets may exploit this shortcut to solve the predictive task. Therefore, it is critical to compose cropping with color distortionin order to learn generalizable features.
155 | 
156 | ### Contrastive learning needs stronger data augmentation than supervised learning
157 | 
158 | > ## A nonlinear projection head improves the representation quality of the layer before it
159 | 
160 | 
161 | ```python
162 | 
163 | ```
164 | 


--------------------------------------------------------------------------------
/notes/amdim.md:
--------------------------------------------------------------------------------
 1 | ## Learning Representations by Maximizing Mutual Information Across Views
 2 | 
 3 | This is a build up on top of deep infomax:
 4 | 
 5 | > Our model, which we call Augmented Multiscale DIM (AMDIM), extends the local version of DeepInfoMax introduced by Hjelm et al. [2019] in several ways. First, we maximize mutual information between features extracted from independently-augmented copies of each image, rather than between features extracted from a single, unaugmented copy of each image. Second, we maximize mutual information __between multiple feature scales simultaneously, rather than between a single global and local scale__. Third, we use a more powerful encoder architecture. Finally, we introduce mixture-based representations. We now describe local DIM and the components added by our new model.
 6 | 
 7 | * Global features are replaced with _antecedent features_. Local features are replaced with _consequent features_. We want to predict consequent features conditioned on antedecent features.
 8 | 
 9 | > Intuitively, the task of the antecedent feature is to pick its true consequent out of a large bag of distractors.
10 | 


--------------------------------------------------------------------------------
/notes/betavae.md:
--------------------------------------------------------------------------------
 1 | ## β-VAE
 2 | 
 3 | ### Key idea
 4 | 
 5 | A variational autoencoder tries to learn the distribution of a set of latent variables `z` which encode an image `x`: q<sub>ϕ</sub>(z | x). To encourage disentanglement we try to constrain `q` to be close to a multivariate unit Gaussian distribution `N(0, I)` denoted by `p(z)`. We use KL divergence as a loss function for this:
 6 | 
 7 | minimize D<sub>KL</sub>(q(z|x), p(z))
 8 | 
 9 | We can think of `z` as a union of two disjoint subsets `v` and `w` where
10 | * `v` is a set of latent variables which are disentangled
11 | * `w` is a set of latent variables which may be entangled
12 | 
13 | `v`s are assumed to be conditionally independent: p(v|x) = Π<sub>i</sub>p(v<sub>i</sub> | x)
14 | 
15 | Given a distribution over latent variables, we want to maximize the log-likelihood of images in our dataset: E<sub>q(z|x)</sub>[log p<sub>θ</sub>(x|z)]
16 | 
17 | Thus we want to maximize:
18 | 
19 | - E<sub>q(z|x)</sub>[log p<sub>θ</sub>(x|z)] - β D<sub>KL</sub>(q<sub>ϕ</sub>(z|x), p(z))
20 | 
21 | > Varying β changes the degree of applied learning pressure during training, thus encouraging different learnt representations.β-VAE where β = 1 corresponds to the original VAE formulation of (Kingma & Welling, 2014).
22 | 
23 | > Since the data `x` is generated using at least some conditionally independent ground truth factors `v`, and the D<sub>KL</sub> term of the β-VAE objective function encourages conditional independence  in q<sub>φ</sub>(z|x),  we  hypothesize  that  higher  values  of β should  encourage  learning  a disentangled representation of `v`.
24 | 
25 | ---
26 | 
27 | ### How is the disentanglement metric evaluated?
28 | 
29 | Let's say we want to evaluate the effectiveness of disentangled factor `k` (could be scale, color etc)
30 | 
31 | * Sample two sets of latent representations v<sub>1</sub> and v<sub>2</sub>. Enforce v<sub>1</sub>[k] = v<sub>2</sub>[k]
32 | * Create images `x` and `y` from v<sub>1</sub> and v<sub>2</sub>.
33 | * Get latent representations of `x` and `y`: z<sub>1</sub> and z<sub>2</sub>.
34 | * Calculate the absolute difference |z<sub>1</sub> - z<sub>2</sub>|.
35 | * Train a linear classifier to classify `k`.
36 | 
37 | If the encoder learnt disentanglement effectively, z[k] will have low variance compared to z[i≠k] and the linear classifier should learn easily how to classify.
38 | 
39 | > The accuracy of this classifier over multiple batches is used as our disentanglement metric score.We choose a linear classifier with low VC-dimension in order to ensure it has no capacity to perform non-linear disentangling by itself.
40 | 
41 | ---
42 | 
43 | ### Questions
44 | 
45 | * How do we know which index corresponds to a given disentanglement factor?
46 | * I believe the only way is to change each index and look at the results. Something similar is done [here](https://github.com/1Konny/Beta-VAE/blob/master/solver.py#L346) where the index is termed `loc=1`.
47 | 
48 | By changing the value at `loc`, we get a new `z` which is passed to the decoder and the resulting output image is collected:
49 | 
50 | [Source](https://github.com/1Konny/Beta-VAE/blob/master/solver.py#L398-L402)
51 | ```
52 | for val in interpolation:
53 |     z[:, row] = val
54 |     sample = F.sigmoid(decoder(z)).data
55 |     samples.append(sample)
56 |     gifs.append(sample)
57 | ```
58 | 
59 | ---
60 | 
61 | ### Other notes from the paper
62 | 
63 | > The most informative latent units z<sub>m</sub> of β-VAE have the highest KL divergence from the unit Gaussian prior `(p(z) = N(0,I))`, while the uninformative latents have KL divergence close to zero. (on 2D shapes dataset with β=4)
64 | 
65 | > We found that larger latent `z` layer sizes `m` require higher constraint pressures (higher β values). Furthermore, the relationship of β for a given m is characterized by an inverted U curve. Whenβis too low or too high the model learns an entangled latent representation due to either too much or too little capacity in the latent `z` bottleneck.
66 | 
67 | > We also note that VAE reconstruction quality is a poor indicator of learnt disentanglement. Good disentangled representations often lead to blurry reconstructions due to the restricted capacity of the latent information channel `z`, while entangled representations often result in the sharpest reconstructions.
68 | 


--------------------------------------------------------------------------------
/notes/classify_without_labels.md:
--------------------------------------------------------------------------------
  1 | ### 1. Use K-means to group representations
  2 | 
  3 | Here `group_vectors` are is a numpy array with shape `(num_image, dimension_length)`.
  4 | 
  5 | ```python
  6 | import sklearn
  7 | from sklearn.cluster import KMeans
  8 | 
  9 | n_clusters = 3
 10 | kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(group_vectors)
 11 | ```
 12 | 
 13 | Now let's define a function to get nearest neighbors for a given index:
 14 | 
 15 | ```python
 16 | def get_nearest_neighbors(index, num_neighbors=5):
 17 |     source_vector = group_vectors[index]
 18 |     distances = [np.linalg.norm(source_vector-vector) for vector in group_vectors]
 19 |     return np.argsort(distances)[1:1+num_neighbors]
 20 | ```
 21 | 
 22 | Now let's create a model. The loss function is implemented within the model itself.
 23 | 
 24 | ```python
 25 | class Model(nn.Module):
 26 |     def __init__(self, output_dims = 2):
 27 |         super(Model, self).__init__()
 28 |         self.sequential = nn.Sequential(
 29 |             nn.Linear(512,10),
 30 |             nn.ReLU(True),
 31 |             nn.Linear(10, output_dims)
 32 |         )
 33 | 
 34 |     def forward(self, x):
 35 |         '''
 36 |         x is a batch consisting of an image and its nearest neighbors
 37 |         '''
 38 |         return self.sequential(x).softmax(1)
 39 | 
 40 |     def loss(self, output, weight=0.5):
 41 |         # output has size (B, num_neighbors, num_classes)
 42 |         batch_size = output.size(0)
 43 |         total_loss = torch.Tensor([0.0])
 44 |         for batch_index in range(batch_size):
 45 |             total_loss += self.loss_per_group(output[batch_index], weight)
 46 |         return total_loss / batch_size
 47 | 
 48 |     def loss_per_group(self, output, weight):
 49 |         # output has size (num_neighbors, num_classes)
 50 |         batch_size = output.size(0)
 51 |         assert batch_size > 1
 52 |         source_vector = output[0].unsqueeze(0)
 53 |         # dot product
 54 |         dot_product = torch.Tensor([0.0])
 55 |         for index in range(1, batch_size):
 56 |             vector = output[0].unsqueeze(0)
 57 |             dot_product += -(source_vector * vector).sum().log()
 58 |         # entropy term
 59 |         probs = output.mean(0)
 60 |         log_probs = probs.log()
 61 |         entropy_loss = (probs * log_probs).sum()
 62 | 
 63 |         total_loss = dot_product + weight * entropy_loss
 64 |         return total_loss
 65 | ```
 66 | 
 67 | The loader looks like this:
 68 | 
 69 | ```python
 70 | class Loader(Dataset):
 71 |     def __init__(self, vectors, num_neighbors, num_labels):
 72 |         self.vectors = vectors
 73 |         self.num_neighbors = num_neighbors
 74 |         self.num_vectors = len(vectors)
 75 |         self.kmeans = self.perform_kmeans(vectors, num_labels)
 76 |         self.labels = self.kmeans.labels_
 77 | 
 78 |     def get_nearest_neighbors(self, index, vectors, num_neighbors=None):
 79 |         if num_neighbors is None:
 80 |             num_neighbors = self.num_neighbors
 81 |         source_vector = self.vectors[index]
 82 |         distances = [np.linalg.norm(source_vector-vector) for vector in vectors]
 83 |         return np.argsort(distances)[1:1+num_neighbors]
 84 | 
 85 |     def perform_kmeans(self, vectors, n_clusters):
 86 |         kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors)
 87 |         return kmeans
 88 | 
 89 |     def __getitem__(self, index):
 90 |         source_vector = self.vectors[index]
 91 |         source_label = self.labels[index]
 92 |         vectors_with_same_label = self.vectors[self.labels == source_label]
 93 |         indices = self.get_nearest_neighbors(index, vectors_with_same_label)
 94 |         result = [source_vector]
 95 |         for i in indices:
 96 |             result.append(self.vectors[i])
 97 |         result = np.stack(result)
 98 |         return torch.from_numpy(result).float()
 99 | 
100 |     def __len__(self):
101 |         return self.num_vectors
102 | ```
103 | 
104 | Then define the loader object like so:
105 | 
106 | ```python
107 | loader = DataLoader(Loader(group_vectors, num_neighbors=3, num_labels=3), batch_size=16, shuffle=True)
108 | ```
109 | 
110 | Then just train a model:
111 | 
112 | ```python
113 | model = Model(output_dims=3)
114 | optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
115 | 
116 | for epoch in range(2):
117 |     for x in loader:
118 |         output = model(x)
119 |         loss = model.loss(output, weight=10)
120 |         optimizer.zero_grad()
121 |         loss.backward()
122 |         optimizer.step()
123 |         print(loss.item())
124 | ```
125 | 


--------------------------------------------------------------------------------
/notes/cmc_notes.md:
--------------------------------------------------------------------------------
  1 | ## Contrastive Multiview Coding
  2 | 
  3 | Code snippets are used from [the official implementation](https://github.com/HobbitLong/CMC).
  4 | 
  5 | ### Key idea
  6 | Different transformations of image (changing color space, segmentation, depth view etc) still have the same semantic content. Hence their representations should also be similar. Thus:
  7 | 
  8 | > Given a pair of sensory views, a deep representation is learnt by bringing views of the same scene together in embedding space, while pushing views of different scenes apart.
  9 | > 
 10 | 
 11 | ---
 12 | 
 13 | ### Contrastive objective vs cross-view prediction
 14 | 
 15 | Cross-view prediction is the standard encoder decoder architecture where the loss is measured pixel-wise between the constructed output and the input. Pixel-wise loss doesn't care about which pixels are important and which pixels are not.
 16 | 
 17 | In constrastive objective two different inputs representing the same semantic content create two representations. The loss is measured between the two representations. This way the model has a change to learn which information to keep and which to discard while _encoding_ an image. Thus the learned representation is better as it ignores all the noise and retains all the important information.
 18 | 
 19 | ---
 20 | 
 21 | ### Contrastive learning with two views
 22 | 
 23 | `V1` is a dataset of images with one kind of transformation (or view). `V2` is a dataset of the same images but seen in a different view. One view is sampled from `V1` and one view is sampled from `V2`. If both the views belong to the same image, we want a critic h<sub>θ</sub>(.) to give a high value. If they don't, the critic will give a low value. Here is how the visual looks like:
 24 | 
 25 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/cmc_gif.gif)
 26 | 
 27 | The loss function is constructed like so:
 28 | 
 29 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/L_v1_v2.png)
 30 | 
 31 | ---
 32 | 
 33 | ### Implementing the critic
 34 | 
 35 | > To extract compact latent representations of v<sub>1</sub> and v<sub>2</sub>, we employ two encoders fθ<sub>1</sub>(·) and fθ<sub>2</sub>(·) with parameters θ<sub>1</sub> and θ<sub>2</sub> respectively. The latent representions are extracted as z1=f<sub>θ1</sub>(v<sub>1</sub>), z2=f<sub>θ2</sub>(v<sub>2</sub>). On top of these features, the score is computed as the exponential of a bivariate function of z<sub>1</sub> and z<sub>1</sub>, e.g., a bilinear function parameterized by W<sub>12</sub>.
 36 | > 
 37 | 
 38 | We make the loss between the views symmetric:
 39 | 
 40 | L(V<sub>1</sub>, V<sub>2</sub>) = L<sub>Contrast</sub><sup>V<sub>1</sub>V<sub>2</sub></sup> + L<sub>Contrast</sub><sup>V<sub>2</sub>V<sub>1</sub></sup>
 41 | 
 42 | > ... we use the representation z<sub>1</sub>, z<sub>2</sub>, or the concatenation of both, [z<sub>1</sub>,z<sub>2</sub>], depending on our paradigm
 43 | > 
 44 | 
 45 | ### An example of a critic
 46 | 
 47 | The critic takes images from two views: L space and AB space. It has f<sub>θ1</sub> = `l_to_ab` and f<sub>θ2</sub> = `ab_to_l`. I find these names misleading since `self.l_to_ab` does not map `l` to `ab`. It maps `l` to a vector of dimension = `feat_dim`. The same applies to `ab_to_l`. 
 48 | 
 49 | [Source](https://github.com/HobbitLong/CMC/blob/58d06e9a82f7fea2e4af0a251726e9c6bf67c7c9/models/alexnet.py#L7)
 50 | 
 51 | ```python
 52 | class alexnet(nn.Module):
 53 |     def __init__(self, feat_dim=128):
 54 |         super(alexnet, self).__init__()
 55 | 
 56 |         self.l_to_ab = alexnet_half(in_channel=1, feat_dim=feat_dim)
 57 |         self.ab_to_l = alexnet_half(in_channel=2, feat_dim=feat_dim)
 58 | 
 59 |     def forward(self, x, layer=8):
 60 |         l, ab = torch.split(x, [1, 2], dim=1)
 61 |         feat_l = self.l_to_ab(l, layer)
 62 |         feat_ab = self.ab_to_l(ab, layer)
 63 |         return feat_l, feat_ab
 64 | ```
 65 | 
 66 | ---
 67 | 
 68 | ### Connection with mutual information
 69 | 
 70 | It can be shown that the optical critic is proportional to the density ratio between p(z<sub>1</sub>, z<sub>2</sub>) and p(z<sub>1</sub>)p(z<sub>2</sub>).
 71 | 
 72 | It can also be shown that 
 73 | I(z<sub>1</sub>; z<sub>2</sub>) >= log(k) - L<sub>Contrast</sub>
 74 | 
 75 | where k `is the number of negative pairs in sample set`.
 76 | 
 77 | > Hence minimizing the objective `L` maximizes the lower bound on the mutual information I(z<sub>1</sub>; z<sub>2</sub>), which is bounded above by I(v<sub>1</sub>; v<sub>2</sub>) by the data processing inequality. The dependency on `k` also suggests that using more negative samples can lead to an improved representation; we show that this is indeed the case.
 78 | > 
 79 | ---
 80 | 
 81 | ### Contrastive learning with more than two views
 82 | 
 83 | There are two ways to do so:
 84 | 
 85 | #### Core graph view
 86 | Given `M` views V<sub>1</sub>, ..., V<sub>M</sub>, we can choose to optimize over one view only. What this means is that the model will learn best how to learn representations of image in that particular view.
 87 | 
 88 | If we want to optimize over the first view, the loss function is defined as:
 89 | 
 90 | L(V<sub>1</sub>) = Σ<sub>j</sub> L(V<sub>1</sub>, V<sub>j</sub>)
 91 | 
 92 | A more general equation is:
 93 | 
 94 | L(V<sub>i</sub>) = Σ<sub>j</sub> L(V<sub>i</sub>, V<sub>j</sub>)
 95 | 
 96 | #### Full graph view
 97 | Here you optimize over all views by choosing all possible `(i,j)` pairs for creating a loss function. There are <sup>M</sup>C<sub>2 </sub> ways to do so.
 98 | 
 99 | > Both these formulations have the effect that information is prioritized in proportion to the numberof views that share that information. This can be seen in the information diagrams visualized below:
100 | > 
101 | 
102 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/learning_from_many_views.png)
103 | 
104 | This in the core graph view, the mutual information between say V<sub>2</sub> and V<sub>3</sub> is discarded but not in the case of full graph view.
105 | 
106 | > Under both the core view and full graph objectives, a factor,like “presence of dog”, that is common to all views will be preferred over a factor that affects fewerviews, such as “depth sensor noise”.
107 | > 
108 | 
109 | ---
110 | 
111 | ### Approximating the softmax distribution with noise-contrastive estimation
112 | 
113 | Let's revisit the function below:
114 | 
115 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/L_v1_v2.png)
116 | 
117 | If `k` in the above formula is large, computing the full softmax loss will be expensive. 
118 | 
119 | This problem is solved by using noise contrastive estimation trick. Assume that the `m` negative samples are distributed uniformly i.e. p<sub>n</sub> is uniform. Then we have: 
120 | 
121 | P(D=1|v<sub>2</sub>; v<sub>1</sub><sup>i</sup> ) = p<sub>d</sub>(v<sub>2</sub> | v<sub>1</sub><sup>i</sup>) / \[ p<sub>d</sub>(v<sub>2</sub> | v<sub>1</sub><sup>i</sup>) + m*p<sub>n</sub>(v<sub>2</sub> | v<sub>1</sub><sup>i</sup>) \]
122 | 
123 | The distribution of positive samples p<sub>d</sub> is unknown. p<sub>d</sub> is approximated by an unnormalized density h<sub>θ</sub>(.).
124 | 
125 | In this paper, h<sub>θ</sub>(v<sub>1</sub><sup>i</sup>, v<sub>2</sub><sup>i</sup>) = <v<sub>1</sub><sup>i</sup>, v<sub>2</sub><sup>i</sup>> where <.,.> stands for dot product.
126 | 
127 | ---
128 | 
129 | ### Implementation of the loss function
130 | 
131 | This section mostly explains the implementation of the NCE Loss [here](https://github.com/HobbitLong/CMC/tree/58d06e9a82f7fea2e4af0a251726e9c6bf67c7c9/NCE).
132 | 
133 | [This file](https://github.com/HobbitLong/CMC/blob/58d06e9a82f7fea2e4af0a251726e9c6bf67c7c9/NCE/alias_multinomial.py) simply uses [a simple trick](https://lips.cs.princeton.edu/the-alias-method-efficient-sampling-with-many-discrete-outcomes/) to allow faster sampling. I won't go into the details here since it is not relevant to the central idea of the paper. In short it creates a class called `AliasMethod` which is used in lieu of multinomial sampling:
134 | 
135 | `self.multinomial = AliasMethod(self.unigrams)`
136 | 
137 | The implementation converts RGB image into LAB and splits it into two views: `L` and `AB`.
138 | 
139 | #### Storing representations in memory bank
140 | 
141 | > We maintain a memory bank to store latent features for each training sample. Therefore, we can efficiently retrieve `m` noise samples from the memory bank to pair with each positive sample without recomputing their features. The memory bank is dynamically updated with features computed on the fly.
142 | 
143 | These representations are stored in the same file that calculates the NCELoss: `NCEAverage.py`. This is done using the `register_buffer` property of PyTorch `nn.module`s:
144 | 
145 | ```python
146 | self.register_buffer('memory_l', torch.rand(outputSize, inputSize)
147 | self.register_buffer('memory_ab', torch.rand(outputSize, inputSize)
148 | ```
149 | 
150 | Here `outputSize` is the size of the dataset and `inputSize` is the size of representations (128 in case of Alexnet).
151 | 
152 | We want these representations to have a unit size on average. The way these are initialized is by uniform sampling from the interval `[-a,a]` such that the expected value of L2 norm of vector with size `inputSize` is 1. In other words,
153 | 
154 | Σ<sub>i</sub> E[x<sub>i</sub><sup>2</sup>] = 1
155 | which means
156 | Σ<sub>i</sub> Var[x<sub>i</sub>] + (E[x<sub>i</sub>])<sup>2</sup> = 1
157 | 
158 | Solving this gives us:
159 | a = `1. / math.sqrt(inputSize / 3)`.
160 | 
161 | Thus the actual initilization of memory bank looks like:
162 | 
163 | ```python
164 | stdv = 1. / math.sqrt(inputSize / 3)
165 | self.register_buffer('memory_l', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
166 | self.register_buffer('memory_ab', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
167 | ```
168 | 
169 | The following code below then does the following:
170 | 
171 | 1. randomly sample negative samples and get the values from the memory bank
172 | 2. copy the values of the positive samples in the first index
173 | 3. calculate dot product using batch matrix multiplication (`bmm`)
174 | 
175 | ```python
176 | 
177 | # score computation
178 | if idx is None:
179 |     idx = self.multinomial.draw(batchSize * (self.K + 1)).view(batchSize, -1)
180 |     idx.select(1, 0).copy_(y.data)
181 | # sample
182 | weight_l = torch.index_select(self.memory_l, 0, idx.view(-1)).detach()
183 | weight_l = weight_l.view(batchSize, K + 1, inputSize)
184 | out_ab = torch.bmm(weight_l, ab.view(batchSize, inputSize, 1))
185 | # sample
186 | weight_ab = torch.index_select(self.memory_ab, 0, idx.view(-1)).detach()
187 | weight_ab = weight_ab.view(batchSize, K + 1, inputSize)
188 | out_l = torch.bmm(weight_ab, l.view(batchSize, inputSize, 1))
189 | ```
190 | 
191 | Finally the memory bank is updated using momentum [here](https://github.com/HobbitLong/CMC/blob/58d06e9a82f7fea2e4af0a251726e9c6bf67c7c9/NCE/NCEAverage.py#L70-L83).
192 | 
193 | ```python
194 | l_pos = torch.index_select(self.memory_l, 0, y.view(-1))
195 | l_pos.mul_(momentum)
196 | l_pos.add_(torch.mul(l, 1 - momentum))
197 | l_norm = l_pos.pow(2).sum(1, keepdim=True).pow(0.5)
198 | updated_l = l_pos.div(l_norm)
199 | self.memory_l.index_copy_(0, y, updated_l)
200 | ```
201 | 


--------------------------------------------------------------------------------
/notes/contrastive_predictive_coding.md:
--------------------------------------------------------------------------------
  1 | ## Data Efficient Image Recognition with Constrastive Predictive Coding
  2 | 
  3 | ### The basic idea
  4 | Take two different overlapping patches from a single image x<sub>1</sub> and x<sub>2</sub>. Since they came from the same image and are close to each other (since they overlap), they are related. We use a neural network (f<sub>θ</sub>) to find a representation for each of these patches, say z<sub>1</sub> and z<sub>2</sub>. Since the patches are related, z<sub>1</sub> and z<sub>2</sub> should also be related. In other words, z<sub>1</sub> should be able to _predict_ z<sub>2</sub>.
  5 | 
  6 | ### But what do we mean by a vector z<sub>1</sub> predicting another vector z<sub>2</sub>?
  7 | 
  8 | Let's say we take random patches from another image, x<sub>3</sub>, x<sub>4</sub> ... x<sub>10</sub>. We calculate z<sub>3</sub> = f<sub>θ</sub>(x<sub>3</sub>) ... z<sub>10</sub> = f<sub>θ</sub>(x<sub>10</sub>). Since z<sub>1</sub> is related to z<sub>2</sub> but not to z<sub>i</sub> for i > 2, it should be able to pick out z<sub>2</sub> from a set of vectors z<sub>i</sub> for i > 1.
  9 | 
 10 | ### But what do we mean by a vector picking out another particular vector from a set of vectors?
 11 | It is a two step process:
 12 | Step 1: A vector at time step t, v<sub>t</sub>, defines a _context_ c<sub>t</sub>. This can be done by passing v<sub>t</sub> to an autoregressive model g<sub>ar</sub>. Sometimes not just the vector at last time step but all vectors from time 0 to time t are used to define a _context_.
 13 | 
 14 | (v<sub>0</sub>, v<sub>1</sub>, ..., v<sub>t</sub>) → [ g<sub>ar</sub> ] → c<sub>t</sub>
 15 | 
 16 | g<sub>ar</sub> could be a GRU, LSTM or CNN.
 17 | 
 18 | Step 2: The context vector at time t, c<sub>t</sub> can predict encoded vectors k steps ahead of time, z<sub>t+k</sub> where k>0. This is done by a simple linear transformation of c<sub>t</sub>. We use a separate linear transformation W<sub>k</sub> for predict
 19 | z<sub>t+k</sub>.
 20 | 
 21 | In other words, W<sub>1</sub> is used to predict c<sub>t+1</sub>, W<sub>2</sub> is used to predict c<sub>t+2</sub>, W<sub>3</sub> is used to predict c<sub>t+3</sub> and so on.
 22 | 
 23 | c<sub>t</sub> → [ W<sub>1</sub> ] → z<sub>t+1</sub>
 24 | 
 25 | c<sub>t</sub> → [ W<sub>2</sub> ] → z<sub>t+2</sub>
 26 | 
 27 | c<sub>t</sub> → [ W<sub>3</sub> ] → z<sub>t+3</sub>
 28 | <br>...<br>
 29 | c<sub>t</sub> → [ W<sub>k</sub> ] → z<sub>t+k</sub>
 30 | 
 31 | ### How do we measure the accuracy of prediction?
 32 | Simple: a dot product. If z<sub>t+k</sub> came from the same image, the dot product should have a high value. If it came from a different image, it should have a low value. This can be turned into a loss by passing the dot product to sigmoid function and then calculating binary cross entropy loss.
 33 | 
 34 | ### Code
 35 | The Keras implementation is lifted straight from [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py).
 36 | 
 37 | #### f<sub>θ</sub> (image patch x<sub>t</sub> → encoded vector z<sub>t</sub>)
 38 | It is a [simple CNN](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L14-L36):
 39 | ```python
 40 | def network_encoder(x, code_size):
 41 | 
 42 |     ''' Define the network mapping images to embeddings '''
 43 | 
 44 |     x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
 45 |     x = keras.layers.BatchNormalization()(x)
 46 |     x = keras.layers.LeakyReLU()(x)
 47 |     x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
 48 |     x = keras.layers.BatchNormalization()(x)
 49 |     x = keras.layers.LeakyReLU()(x)
 50 |     x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
 51 |     x = keras.layers.BatchNormalization()(x)
 52 |     x = keras.layers.LeakyReLU()(x)
 53 |     x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
 54 |     x = keras.layers.BatchNormalization()(x)
 55 |     x = keras.layers.LeakyReLU()(x)
 56 |     x = keras.layers.Flatten()(x)
 57 |     x = keras.layers.Dense(units=256, activation='linear')(x)
 58 |     x = keras.layers.BatchNormalization()(x)
 59 |     x = keras.layers.LeakyReLU()(x)
 60 |     x = keras.layers.Dense(units=code_size, activation='linear', name='encoder_embedding')(x)
 61 | 
 62 |     return x
 63 | ```
 64 | 
 65 | #### g<sub>ar</sub>
 66 | It is a [simple GRU](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L39-L47). I have modified the docstring to make it clearer:
 67 | ```python
 68 | def network_autoregressive(x):
 69 | 
 70 |     ''' x is a iterable of vectors z1, z2, ... zt '''
 71 | 
 72 |     # x = keras.layers.GRU(units=256, return_sequences=True)(x)
 73 |     # x = keras.layers.BatchNormalization()(x)
 74 |     x = keras.layers.GRU(units=256, return_sequences=False, name='ar_context')(x)
 75 | 
 76 |     return x
 77 | ```
 78 | 
 79 | #### Implementation of W<sub>k</sub> transformations
 80 | Note that `k` is the number of time steps ahead in future the encoded vectors at which are predicted. We need to define a `Dense` (or `Linear` if you are coming from PyTorch) layer for each value to k. This is implemented [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L50-L63)
 81 | 
 82 | ```python
 83 | def network_prediction(context, code_size, predict_terms):
 84 | 
 85 |     ''' `predict_terms` is only used to determine the number of time steps ahead of time. '''
 86 | 
 87 |     outputs = []
 88 |     for i in range(predict_terms):
 89 |         outputs.append(keras.layers.Dense(units=code_size, activation="linear", name='z_t_{i}'.format(i=i))(context))
 90 | 
 91 |     if len(outputs) == 1:
 92 |         output = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(outputs[0])
 93 |     else:
 94 |         output = keras.layers.Lambda(lambda x: K.stack(x, axis=1))(outputs)
 95 | 
 96 |     return output
 97 | ```
 98 | 
 99 | #### Taking the sigmoid of dot product
100 | [Implementation here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L66-L86):
101 | 
102 | ```python
103 | class CPCLayer(keras.layers.Layer):
104 | 
105 |     ''' Computes dot product between true and predicted embedding vectors '''
106 | 
107 |     def __init__(self, **kwargs):
108 |         super(CPCLayer, self).__init__(**kwargs)
109 | 
110 |     def call(self, inputs):
111 | 
112 |         # Compute dot product among vectors
113 |         preds, y_encoded = inputs
114 |         dot_product = K.mean(y_encoded * preds, axis=-1)
115 |         dot_product = K.mean(dot_product, axis=-1, keepdims=True)  # average along the temporal dimension
116 | 
117 |         # Keras loss functions take probabilities
118 |         dot_product_probs = K.sigmoid(dot_product)
119 | 
120 |         return dot_product_probs
121 | 
122 |     def compute_output_shape(self, input_shape):
123 |         return (input_shape[0][0], 1)
124 | ```
125 | 
126 | All of the above functions are bundled into a single model [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L89).
127 | 
128 | ---
129 | 
130 | ## Data-Efficient Image Recognition with Contrastive Predictive Coding
131 | 
132 | This paper basically made improvements on the previous implementation of CPC.
133 | 
134 | > We revisit CPC in terms of its architecture and training methodology, and arrive at a new implementation  with  a  dramatically-improved  ability  to  linearly  separate  image  classes.
135 | 
136 | Here is how they made these improvements:
137 | 
138 | 1. Increasing model capacity: `the  original  CPC  model  used  only the   first   3   stacks   of   a   ResNet-101 ... we converted the third residual stack of ResNet-101  to  use  46  blocks  with4096-dimensional feature maps and 512-dimensional bottleneck layers`.
139 | 
140 | 2. Replacing batch normalization with layer normalization: `We hypothesize that batch normalization allows these models to find a trivial solution to CPC: it introduces  a  dependency  between  patches  (through  the  batch  statistics)  that  can  be  exploited  to bypass the constraints on the receptive field.   Nevertheless we find that we can reclaim much of batch normalization’s training efficiency using layer normalization`.
141 | 
142 | 3. Predicting not just top to bottom from from all directions: `we repeatedly predict the patch using context frombelow,  the right and the left,  resulting in up to four times as many prediction tasks.`
143 | 
144 | 4. Augmenting image patches better: `The originalCPC model spatially jitters individual patches independently. We further this logic by adopting the ‘color dropping’ method of [14],  which randomly drops two of the three color channels in each patch, and find it to deliver systematic gains (+3% accuracy).  We therefore continued by adding a fixed, generic augmentation scheme using the primitives from Cubuk et al. [10] (e.g. shearing, rotation, etc), as well as random elastic deformations and color transforms [11] (+4.5% accuracy).`
145 | 
146 | There is also some material on data-efficiency but I am going to skip it.
147 | 


--------------------------------------------------------------------------------
/notes/cvpr2024/neural_redshift.md:
--------------------------------------------------------------------------------
 1 | # Neural Redshift: Random Networks are not Random Functions
 2 | 
 3 | - [Arxiv PDF](https://arxiv.org/pdf/2403.02241)
 4 | 
 5 | "The success of deep learning is thus not a product primarily of GD, nor is it universal to all architectures. This paper propose an explanation compatible with all above observations. It builds on the growing evidence that NNs benefit from their parametrization and the structure of their weight space."
 6 | 
 7 | ## Contributions
 8 | 1. NNs are biased to implement functions of a particular level of complexity (not necessarily low) determined
 9 | by the architecture.
10 | 2. This preferred complexity is observable in networks with random weights from an uninformed prior.
11 | 3. Generalization is enabled by popular components like ReLUs setting this bias to a low complexity that often aligns with the target function.
12 | 
13 | > ... the parameter space of NNs is biased towards functions of low frequency, one of the measures of complexity used in this work.
14 | 
15 | That is why it is called Neural _Redshift_. They examine various architectures with random weights. They use three measures of complexity:
16 | 1. decompositions in Fourier series (low frequency)
17 | 2. in the bases of orthogonal polynomials (low order)
18 | 3. compressibility with an approximation of Kolmogorov complexity (compressibility)
19 | They are collectively referred to as _simplicity_.
20 | 
21 | > We show that the simplicity bias is not universal but depends on common components (ReLUs, residual connections, layer normalizations). ReLU networks also have the unique property of maintaining their simplicity bias for any depth and weight magnitudes. It suggests that the historical importance of ReLUs in the development of deep learning goes beyond the common narrative about vanishing gradients.
22 | 
23 | ### Analyzing random networks
24 | - Weights and biases are uniformly sampled (not so important)
25 | - The model maps to a function $R^2 -> R$. A 2D grid of points is used as input. This allows visualization of the function as a grayscale image.
26 | 
27 | - ReLU-like activations (GELU, Swish, SELU [16]) are also biased towards low complexity. Unlike ReLUs, close examination in Appendix F shows that increasing depth or weight magnitudes slightly increases the complexity.
28 | - Others activations (TanH, Gaussian, sine) show completely different behaviour. Depth and weight magnitude cause a dramatic increase in complexity. Unsurprisingly, these activations are only used in special applications [58] with careful initializations [68]. Networks with these activations have no fixed preferred complexity independent of the weights’ or activations’ magnitudes.
29 | 
30 | ![](https://github.com/vinsis/math-and-ml-notes/blob/30d38f7320ade8e4860d9166d2e0b157c0a7636b/images/redshift_table1.png)
31 | 
32 | - Width has no impact on complexity, perhaps surprisingly. Additional neurons change the capacity of a model (what can be represented after training) but they do not affect its inductive biases.
33 | - __Layer normalization__: We place layer normalizations before each activation. Layer normalization has the significant effect of removing variations in complexity with the weights’ magnitude for all activations (Figure 5). The weights can now vary (e.g. during training) without directly affecting the preferred complexity of the architecture.
34 | - Residual connections: This has the dramatic effect of forcing the preferred complexity to some of the lowest levels for all activations regardless of depth.
35 | - __Multiplicative interactions__: They refer to multiplications of internal representations with one another [39] as in attention layers, highway networks, dynamic convolutions, etc. We place them in our MLPs as gating operations, such that each hidden layer corresponds to: x← ϕ(Wx+b) ⊙ σ(W\′x+b\′)
36 | where `σ(·)` is the logistic function. This creates a clear increase in complexity dependent on depth and weight magnitude, even when ReLU is used.
37 | 
38 | ![](https://github.com/vinsis/math-and-ml-notes/blob/fa0163f771c352a665fcd3c4f3cb3607eb5d7451/images/redshift_fig5.png)
39 | 
40 | - Unbiased model: This is build by creating a uniform bias over frequencies. The inverse Fourier transform is a weighted sum of sine waves, so this architecture can be implemented as a one-layer MLP with sine activations and fixed input weights representing each one Fourier component. This architecture behaves very differently from standard MLPs (Figure 4). With random weights, its Fourier spectrum is uniform, which gives a high complexity for any weight magnitude (depth is fixed). Functions implemented by this architecture look like white noise.
41 | 
42 | ## Indictive biases in trained models
43 | There is a strong correlation between the complexity at initialization (i.e. with random weights as examined in the previous section) and in the trained model. We will also see that unusual architectures with a bias towards high complexity can improve generalization on tasks where the standard “simplicity bias” is suboptimal.
44 | 
45 | ### Learning complex functions
46 | 
47 | ## Experimental setup
48 | The input to our task is a vector of integers $x ∈ [0, N-1]^d$ and output is $ \Sigma{x_i} \le (M/2)\text{mod M} $. "We consider three versions with N = 16 and M ={10, 7, 4} that correspond to increasingly higher frequencies in the target function". ReLU MLP solves only the low-frequency version of this task.
49 | 
50 | > We then train MLPs with other
51 | activations (TanH, Gaussian, sine) whose preferred complexity is sensitive to the activations’ magnitude. We also introduce a constant multiplicative prefactor before each activation function to modulate this bias without changing the weights’ magnitude, which could introduce optimization side effects. Some of these models succeed in learning all versions of the task when the prefactor is correctly tuned. For higher-frequency versions, the prefactor needs to be larger to shift the bias towards higher complexity.
52 | 
53 | ## Impact on shortcut learning
54 | ### Experimental setup
55 | > We consider a regression task similar to Colored-MNIST. Inputs are images of handwritten digits juxtaposed with a uniform band of colored pixels that simulate spurious features. The labels in the training data are values in [0, 1] proportional to the digit value as well as to the color intensity. Therefore, a model can attain high training accuracy by relying either on the simple linear relation with the color, or the more complex recognition of the digits (the target task). To measure the reliance of a model on color or digit, we use two test sets where either the color or digit is correlated with the label while the other is randomized.
56 | 
57 | > We see in Figure 9 that the LZ complexity at initialization increases with prefactor values for TanH, Gaussian, and sine activations. Most interestingly, the accuracy on the digit and color also varies with the prefactor. The color is learned more easily with small prefactors (corresponding to a low complexity at initialization) while the digit is learned more easily at an intermediate value (corresponding to medium complexity at initialization). The best performance on the digit is reached at a sweet spot that we explain as the hypothesized “best match” between the complexity of the target function, and that preferred by the architecture. With larger prefactors, i.e. beyond this sweet spot, the accuracy on the digit decreases, and even more so with sine activations for which the complexity also increases more rapidly, further supporting the proposed explanation.


--------------------------------------------------------------------------------
/notes/deepinfomax.md:
--------------------------------------------------------------------------------
  1 | 
  2 | ### DeepInfoMax
  3 | 
  4 | As [mentioned here](https://www.microsoft.com/en-us/research/blog/deep-infomax-learning-good-representations-through-mutual-information-maximization/):
  5 | 
  6 | > DIM is based on two learning principles: mutual information maximization in the vein of the infomax optimization principle and self-supervision, an important unsupervised learning method that relies on intrinsic properties of the data to provide its own annotation.
  7 | 
  8 | 
  9 | ```python
 10 | import torch
 11 | import torch.nn as nn
 12 | import torch.nn.functional as F
 13 | ```
 14 | 
 15 | 
 16 | ```python
 17 | class Encoder(nn.Module):
 18 |     def __init__(self):
 19 |         super().__init__()
 20 |         self.c0 = nn.Conv2d(3, 64, kernel_size=4, stride=1)
 21 |         self.c1 = nn.Conv2d(64, 128, kernel_size=4, stride=1)
 22 |         self.c2 = nn.Conv2d(128, 256, kernel_size=4, stride=1)
 23 |         self.c3 = nn.Conv2d(256, 512, kernel_size=4, stride=1)
 24 |         self.l1 = nn.Linear(512*20*20, 64)
 25 | 
 26 |         self.b1 = nn.BatchNorm2d(128)
 27 |         self.b2 = nn.BatchNorm2d(256)
 28 |         self.b3 = nn.BatchNorm2d(512)
 29 | 
 30 |     def forward(self, x):
 31 |         h = F.relu(self.c0(x))
 32 |         features = F.relu(self.b1(self.c1(h)))
 33 |         h = F.relu(self.b2(self.c2(features)))
 34 |         h = F.relu(self.b3(self.c3(h)))
 35 |         encoded = self.l1(h.view(x.shape[0], -1))
 36 |         return encoded, features
 37 | ```
 38 | 
 39 | 
 40 | ```python
 41 | class GlobalDiscriminator(nn.Module):
 42 |     def __init__(self):
 43 |         super().__init__()
 44 |         self.c0 = nn.Conv2d(128, 64, kernel_size=3)
 45 |         self.c1 = nn.Conv2d(64, 32, kernel_size=3)
 46 |         self.l0 = nn.Linear(32 * 22 * 22 + 64, 512)
 47 |         self.l1 = nn.Linear(512, 512)
 48 |         self.l2 = nn.Linear(512, 1)
 49 | 
 50 |     def forward(self, y, M):
 51 |         h = F.relu(self.c0(M))
 52 |         h = self.c1(h)
 53 |         h = h.view(y.shape[0], -1)
 54 |         h = torch.cat((y, h), dim=1)
 55 |         h = F.relu(self.l0(h))
 56 |         h = F.relu(self.l1(h))
 57 |         return self.l2(h)
 58 | ```
 59 | 
 60 | 
 61 | ```python
 62 | class LocalDiscriminator(nn.Module):
 63 |     def __init__(self):
 64 |         super().__init__()
 65 |         self.c0 = nn.Conv2d(192, 512, kernel_size=1)
 66 |         self.c1 = nn.Conv2d(512, 512, kernel_size=1)
 67 |         self.c2 = nn.Conv2d(512, 1, kernel_size=1)
 68 | 
 69 |     def forward(self, x):
 70 |         h = F.relu(self.c0(x))
 71 |         h = F.relu(self.c1(h))
 72 |         return self.c2(h)
 73 | ```
 74 | 
 75 | 
 76 | ```python
 77 | class PriorDiscriminator(nn.Module):
 78 |     def __init__(self):
 79 |         super().__init__()
 80 |         self.l0 = nn.Linear(64, 1000)
 81 |         self.l1 = nn.Linear(1000, 200)
 82 |         self.l2 = nn.Linear(200, 1)
 83 | 
 84 |     def forward(self, x):
 85 |         h = F.relu(self.l0(x))
 86 |         h = F.relu(self.l1(h))
 87 |         return torch.sigmoid(self.l2(h))
 88 | ```
 89 | 
 90 | 
 91 | ```python
 92 | class Classifier(nn.Module):
 93 |     def __init__(self):
 94 |         super().__init__()
 95 |         self.l1 = nn.Linear(64, 15)
 96 |         self.bn1 = nn.BatchNorm1d(15)
 97 |         self.l2 = nn.Linear(15, 10)
 98 |         self.bn2 = nn.BatchNorm1d(10)
 99 |         self.l3 = nn.Linear(10, 10)
100 |         self.bn3 = nn.BatchNorm1d(10)
101 | 
102 |     def forward(self, x):
103 |         encoded, _ = x[0], x[1]
104 |         clazz = F.relu(self.bn1(self.l1(encoded)))
105 |         clazz = F.relu(self.bn2(self.l2(clazz)))
106 |         clazz = F.softmax(self.bn3(self.l3(clazz)), dim=1)
107 |         return clazz
108 | ```
109 | 
110 | ### Encoder takes as input an image of size `(32,32)` and returns features of size `(128,26,26)` and encoded output of length `(64)`
111 | 
112 | 
113 | ```python
114 | x = torch.randn(1,3,32,32)
115 | with torch.no_grad():
116 |     encoded, features = Encoder()(x)
117 | encoded.size(), features.size()
118 | ```
119 | 
120 | 
121 | 
122 | 
123 |     (torch.Size([1, 64]), torch.Size([1, 128, 26, 26]))
124 | 
125 | 
126 | 
127 | ### Global discriminator learns to discriminate whether or not `encoded` and `features` come from the same image. Note that `encoded` and `features` are concatenated in the linear layer.
128 | 
129 | 
130 | ```python
131 | with torch.no_grad():
132 |     out = GlobalDiscriminator()(encoded, features)
133 | out.size()
134 | ```
135 | 
136 | 
137 | 
138 | 
139 |     torch.Size([1, 1])
140 | 
141 | 
142 | 
143 | ### Local discriminator does the same thing but for each individual cell. Note that `encoded` and `features` are concatenated at the start at the convolutional layer.
144 | 
145 | 
146 | ```python
147 | encoded_expanded = encoded.unsqueeze(2).unsqueeze(3).expand(-1,-1,26,26)
148 | x = torch.cat((features, encoded_expanded), dim=1)
149 | 
150 | with torch.no_grad():
151 |     out = LocalDiscriminator()(x)
152 | out.size()
153 | ```
154 | 
155 | 
156 | 
157 | 
158 |     torch.Size([1, 1, 26, 26])
159 | 
160 | 
161 | 
162 | ### Prior discriminator simply learns to predict whether or not `encoded` comes from a uniform distribution
163 | 
164 | 
165 | ```python
166 | with torch.no_grad():
167 |     out = PriorDiscriminator()(encoded)
168 | out.size()
169 | ```
170 | 
171 | 
172 | 
173 | 
174 |     torch.Size([1, 1])
175 | 
176 | 
177 | 
178 | In the [official implementation](https://github.com/DuaneNielsen/DeepInfomaxPytorch/blob/master/train.py#L25-L49),
179 | 
180 | * `y` is `encoded`
181 | * `M` is `features`
182 | * `M_prime` is `features` from another image
183 | 
184 | The objective is to maximize the log likelihood for (`y`, `M`) and minimize that for (`y`, `M_prime`).
185 | 
186 | ### How is `y_prime` created?
187 | 
188 | In every batch, the sequence of the images [is changed](https://github.com/DuaneNielsen/DeepInfomaxPytorch/blob/master/train.py#L91-L92) (in a non-random way). This is different from the random way the sequence was changed in MINE.
189 | 
190 | ```python
191 |             y, M = encoder(x)
192 |             # rotate images to create pairs for comparison
193 |             M_prime = torch.cat((M[1:], M[0].unsqueeze(0)), dim=0)
194 | ```
195 | 


--------------------------------------------------------------------------------
/notes/iic.md:
--------------------------------------------------------------------------------
 1 | From the paper:
 2 | > Consider  now  a  pair  of  such  cluster  assignment  variables z and z′ for two inputs xand x′respectively. Their conditional  joint distribution is given by P(z=c, z′=c′|x,x′) = Φc(x)·Φc′(x′).This equation states that z and z′ are independent when conditioned on specific inputs x and x′; however, in general they are not independent after marginalization over a dataset of input pairs(xi,x′i), i= 1, . . . , n. 
 3 | 
 4 | I think this happens because information is lost when summing up observations. Correlation comes into picture in presence of uncertainty, which comes up when less information is available.
 5 | 
 6 | #### IIC Loss:
 7 | 
 8 | The official implementation is [here](https://github.com/xu-ji/IIC/blob/master/code/utils/cluster/IID_losses.py#L6-L33). Below are two (quite similar) implementations which are simplified.
 9 | 
10 | ```python
11 | def IIC(z, zt, C=10):
12 |     log = torch.log
13 |     EPS = 1e-5
14 |     P = (z.unsqueeze(2) * zt.unsqueeze(1)).sum(dim=0)
15 |     P = (P + P.t())/2 / P.sum()
16 |     P[ (P < EPS).data ] = EPS
17 |     Pi = P.sum(dim=1).view(C,1).expand(C,C)
18 |     Pj = P.sum(dim=0).view(1,C).expand(C,C)
19 |     return (P * ( log(Pi) + log(Pj) - log(P) )).sum()
20 | 
21 | def IICv2(z, zt, C=10):
22 |     log = torch.log
23 |     EPS = 1e-5
24 |     P = (z.unsqueeze(2) * zt.unsqueeze(1)).sum(dim=0)
25 |     P = (P + P.t())/2 / P.sum()
26 |     P[ (P < EPS).data ] = EPS
27 |     Pi = P.sum(dim=1).view(C,1).expand(C,C)
28 |     Pj = Pi.T
29 |     return (P * ( log(Pi) + log(Pj) - log(P) )).sum()
30 | ```
31 | 
32 | 
33 | > __Why  degenerate  solutions  are  avoided.__ Mutual information expands to I(z, z′) = H(z)−H(z|z′). Hence, maximizing this quantity trades-off minimizing the conditional cluster assignment entropy H(z|z′) and maximising individual cluster assignments  entropy H(z). The  smallest  value  of H(z|z′) is 0, obtained when the cluster assignments are exactly predictable from each other. The largest value of H(z) is lnC, obtained when all clusters are equally likely to be picked. This occurs when the data is assigned evenly between the clusters, equalizing their mass. Therefore the loss is not minimised if all samplesare assigned to a single cluster (i.e. output class is identicalfor all samples).
34 | 
35 | > __Meaning of mutual information.__  Firstly, due to the soft clustering, entropy alone could be maximised trivially by setting all prediction vectors Φ(x) to uniform distributions, resulting in no clustering. This is corrected by the conditional entropy component, which encourages deterministic one-hot predictions. For example, even for the degenerate case of identical pairs x=x′, the IIC objective encourages a deterministic clustering function (i.e.Φ(x) is a one-hot vector) as this results in null conditional entropy H(z|z′) = 0. Secondly, the objective of IIC is to find what is common between two data points that share redundancy,such as different images of the same object, explicitly encouraging  distillation of the common part while  ignoring the rest, i.e. instance details specific to one of the samples.This would not be possible without pairing samples.
36 | 
37 | > __Image clustering.__ IIC requires a source of paired samples (x,x′), which are often unavailable in unsupervised image clustering applications. In this case, we propose to use generated image pairs, consisting of image x and its randomly perturbed version x′=gx. The objective eq. (1) can thus be written as:maxΦ I(Φ(x),Φ(gx))
38 | 


--------------------------------------------------------------------------------
/notes/mine.md:
--------------------------------------------------------------------------------
  1 | ### Key idea
  2 | 
  3 | Gradient descent/ascent can be used to minimize/maximize any (differentiable) expression. If an expression E cannot be estimated directly then 
  4 | a) find a lower or upper bound for that expression, 
  5 | b) parameterize it 
  6 | c) maximize or minimize it
  7 | 
  8 | The idea behind variational autoencoder is the same. 
  9 | 
 10 | ### What is the inequality
 11 | 
 12 | We have two inequalities:
 13 | 1) Donsker Vardhan representation:<br>
 14 | D<sub>KL</sub>(P||Q) =   sup<sub>T:Ω→R</sub>E<sub>P</sub>[T]−log(E<sub>Q</sub>[e<sup>T</sup>])
 15 | 
 16 | 2) f-divergence representation:<br>
 17 | D<sub>KL</sub>(P||Q) =   sup<sub>T:Ω→R</sub>E<sub>P</sub>[T]E<sub>Q</sub>[e<sup>T-1</sup>])
 18 | 
 19 | Note that the first inequality is stricter/tighter.
 20 | 
 21 | ### Actual implementation
 22 | 
 23 | > ... the idea is to choose F to be the family of functions T<sub>θ</sub>:X×Z →R parametrized by a deep neural network with parameters θ∈Θ. We call this network the statistics network. We exploit the bound: <br>
 24 | > I(X;Z)≥I<sub>Θ</sub>(X,Z)<br>
 25 | >
 26 | > where I<sub>Θ</sub>(X,Z) is the _neural information measure_ defined as
 27 | >
 28 | > I<sub>Θ</sub>(X,Z) = sup<sub>θ∈Θ</sub>E<sub>P(x,z)</sub>[T<sub>Θ</sub>] - log( E<sub>P(x)P(z)</sub>[e<sup>TΘ</sup>] )<br>
 29 | 
 30 | #### Code
 31 | 
 32 | Lifted straight from [here](https://github.com/sungyubkim/MINE-Mutual-Information-Neural-Estimation-/blob/master/MINE.ipynb):
 33 | 
 34 | ```python
 35 | def mutual_information(joint, marginal, mine_net):
 36 |     t = mine_net(joint)
 37 |     et = torch.exp(mine_net(marginal))
 38 |     mi_lb = torch.mean(t) - torch.log(torch.mean(et))
 39 |     return mi_lb, t, et
 40 | ```
 41 | 
 42 | Here `mine_net` can be any neural network T<sub>Θ</sub>:
 43 | 
 44 | ```python
 45 | class Mine(nn.Module):
 46 |     def __init__(self, input_size=2, hidden_size=100):
 47 |         super().__init__()
 48 |         self.fc1 = nn.Linear(input_size, hidden_size)
 49 |         self.fc2 = nn.Linear(hidden_size, hidden_size)
 50 |         self.fc3 = nn.Linear(hidden_size, 1)
 51 |         nn.init.normal_(self.fc1.weight,std=0.02)
 52 |         nn.init.constant_(self.fc1.bias, 0)
 53 |         nn.init.normal_(self.fc2.weight,std=0.02)
 54 |         nn.init.constant_(self.fc2.bias, 0)
 55 |         nn.init.normal_(self.fc3.weight,std=0.02)
 56 |         nn.init.constant_(self.fc3.bias, 0)
 57 |         
 58 |     def forward(self, input):
 59 |         output = F.elu(self.fc1(input))
 60 |         output = F.elu(self.fc2(output))
 61 |         output = self.fc3(output)
 62 |         return output
 63 | 
 64 | 
 65 | mine_net = Mine()
 66 | ```
 67 | 
 68 | Pretty straight forward right? It is important to note how `joint` and `marginal` distributions are calculated.
 69 | 
 70 | > The expectations in Eqn. 10 are estimated using empirical samples from P(X,Z) and P(X)⊗P(Z) or by shuffling the samples from the joint distribution along the batch axis. 
 71 | 
 72 | When you look at samples `x`, you want the `y` samples to be independent since `y` has been marginalized out. This is what random shuffling achieves. Here is how it is implemented:
 73 | 
 74 | ```python
 75 | def sample_batch(data, batch_size=100, sample_mode='joint'):
 76 | 	'''
 77 | 	data is the entire dataset
 78 | 	'''
 79 |     if sample_mode == 'joint':
 80 |         indices = np.random.choice(range(data.shape[0]), size=batch_size, replace=False)
 81 |         batch = data[indices]
 82 |     else:
 83 |         indices1 = np.random.choice(range(data.shape[0]), size=batch_size, replace=False)
 84 |         indices2 = np.random.choice(range(data.shape[0]), size=batch_size, replace=False)
 85 |         # this step makes the two columns independent by shuffling
 86 |         batch = np.concatenate([data[indices1][:,0].reshape(-1,1), data[indices2][:,1].reshape(-1,1)], axis=1)
 87 |     return batch
 88 | 
 89 | ```
 90 | 
 91 | ### Correcting the bias from the stochastic gradients
 92 | 
 93 | Since the expectations are taken over the batch, they are biased.
 94 | 
 95 | > Fortunately, the bias can be reduced by replacing the estimate in the denominator by an exponential moving average. For small learning rates, this improved MINE gradient estimator can be made to have arbitrarily small bias. We found in our experiments that this improves all-around performance of MINE.
 96 | 
 97 | Biased loss:
 98 | ```python
 99 | loss = torch.mean(t) - torch.log(torch.mean(et))
100 | loss = loss * -1
101 | ```
102 | 
103 | Unbiasing the loss:
104 | ```python
105 | #ma_et is initialized to 1.0
106 | ma_et = (1-ma_rate)*ma_et + ma_rate*torch.mean(et)
107 | 
108 | loss = torch.mean(t) - torch.mean(et) / ma_et.mean().detach()
109 | loss = loss * -1
110 | ```
111 | 
112 | ### Other notes
113 | 
114 | * MINE is [strongly consistent](https://en.wikipedia.org/wiki/Consistent_estimator).
115 | 
116 | * MINE captures equitability: It is invariant to deterministic non-linear functions (as it should be).
117 | 
118 | > An important property of mutual information between random variables with relationship `Y=f(X) +σ*eps`, where `f` is a deterministic non-linear transformation and `eps` is random noise, is that it is invariant to the deterministic non-linear transformation, but should only depend on the amount of noise,`σ*eps`. 
119 | 
120 | In the experiments, `X∼U(−1,1)` and `Y=f(X) +σ*eps`,  where `f(x)∈ {x,x3,sin(x)}` and `eps∼N(0,I)`.
121 | 
122 | 


--------------------------------------------------------------------------------
/notes/moco.md:
--------------------------------------------------------------------------------
  1 | ## Momentum Contrast for Unsupervised Visual Representation Learning
  2 | 
  3 | - [Official implementation](https://github.com/facebookresearch/moco). I will use some code snippets from here.
  4 | 
  5 | ### Background
  6 | > (Contrastive loss based) methods can  be  thought  of  as  building dynamic dictionaries. The “keys” (tokens) in the dictionary are sampled from  data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up:  an encoded “query” should be similar to its matching key and dissimilar to others.
  7 | 
  8 | > From this perspective, we hypothesize that it is desirable to  build  dictionaries  that  are: (i) large and (ii) consistent as they evolve during training.
  9 | 
 10 | ### Two main lines of performing unsupervised/self-unsupervised learning:
 11 | 
 12 | 1. Loss functions: A model learns to predict a target. A target could be fixed (`reconstructing the input pixels using L1/L2 losses`) or moving.
 13 | 
 14 | - __Contrastive losses__: `Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network.`
 15 | 
 16 | - __Adversarial losses__: They `measure the difference between probability distributions.`
 17 | 
 18 | 2. Pretext tasks: `The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation.` Some examples are `denoising auto-encoders, context auto-encoders, or cross-channel auto-encoders (colorization).`
 19 | 
 20 | > __Contrastive  learning vs.  pretext  tasks__: Various  pretext tasks can be based on some form of contrastive loss functions. The instance discrimination method is related to the exemplar-based task and NCE. The pretext task in contrastive predictive coding (CPC) is a form of context auto-encoding, and in contrastive multi view coding (CMC) it is related to colorization.
 21 | 
 22 | ---
 23 | 
 24 | ### Key idea
 25 | 
 26 | You are given two neural networks (`encoder` aka `f_q` and `momentum encoder` aka `f_k`) as shown below:
 27 | 
 28 | ![](https://user-images.githubusercontent.com/11435359/71603927-0ca98d00-2b14-11ea-9fd8-10d984a2de45.png)
 29 | 
 30 | The query `q` matches exactly one of the keys (chosen to be `k0`). `encoder` is learnt through backprop. `momentum encoder` then copies the parameters of `encoder` but uses a moving average:
 31 | 
 32 | ```python
 33 | f_k.params = momentum * f_k.params + (1-momentum)*f_q.params
 34 | ```
 35 | 
 36 | ### Code and some minor details
 37 | 
 38 | `f_q` and `f_k` are build from the same encoder class (a ResNet) with `output_dim`=128:
 39 | 
 40 | ```python
 41 | self.encoder_q = base_encoder(num_classes=output_dim)
 42 | self.encoder_k = base_encoder(num_classes=output_dim)
 43 | ```
 44 | 
 45 | The initial parameters of the two encoders are [set to be the same](https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L34-L36):
 46 | ```python
 47 | for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
 48 |     param_k.data.copy_(param_q.data)  # initialize
 49 |     param_k.requires_grad = False  # not update by gradient
 50 | ```
 51 | 
 52 | The parameters update with has pseudocode
 53 | 
 54 | ```python
 55 | f_k.params = momentum * f_k.params + (1-momentum)*f_q.params
 56 | ```
 57 | 
 58 | is [implemented](https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L45-L50) like so:
 59 | 
 60 | ```python
 61 | for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
 62 |         param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)
 63 | ```
 64 | 
 65 | Here we have `K: queue size; number of negative keys (default: 65536)`.
 66 | 
 67 | The queue is [defined](https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L39-L40) as shown below. Each column of the matrix is a negative key (embedding). The embeddings are all normalized.
 68 | 
 69 | ```python
 70 | self.register_buffer("queue", torch.randn(dim, K))
 71 | self.queue = nn.functional.normalize(self.queue, dim=0)
 72 | ```
 73 | 
 74 | #### Tackling issues with batch normalization by shuffling
 75 | 
 76 | __I don't fully understand what the problem is and what the solution is. I will update this section once I understand.__
 77 | 
 78 | > ...  we found that using BN prevents the model from  learning  good  representations.
 79 | 
 80 | > We resolve this problem by shuffling BN. We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice). For the key encoder `f_k`, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back  after  encoding);  the  sample  order  of  the  mini-batch for  the  query  encoder `f_q` is  not  altered. This ensures  the batch statistics used to compute a query and its positive key come from two different subsets. This effectively tackles the cheating issue and allows training to benefit from BN.
 81 | 
 82 | ### Forward pass
 83 | 
 84 | In forward pass, you provide a batch of query and key images: `im_q` and `im_k`.
 85 | 
 86 | [Computation of query features](https://github.com/facebookresearch/moco/blob/master/moco/builder.py#L124-L126) is straight-forward:
 87 | 
 88 | ```python
 89 | # compute query features
 90 | q = self.encoder_q(im_q)  # queries: NxC
 91 | q = nn.functional.normalize(q, dim=1)
 92 | ```
 93 | 
 94 | Then the key encoder is updated and keys are calculated like so:
 95 | 
 96 | ```python
 97 | self._momentum_update_key_encoder()
 98 | k = self.encoder_k(im_k)  # keys: NxC
 99 | ```
100 | 
101 | Note that `k` are the positive samples for `q`. The negative samples come from the queue.
102 | 
103 | ```python
104 | l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)
105 | l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])
106 | logits = torch.cat([l_pos, l_neg], dim=1)
107 | logits /= self.T # self.T is the temperature (default value 0.07)
108 | ```
109 | 
110 | Finally the keys in the queue are updated:
111 | 
112 | ```python
113 | batch_size = k.shape[0]
114 | ptr = int(self.queue_ptr)
115 | self.queue[:, ptr:ptr + batch_size] = keys.T
116 | ptr = (ptr + batch_size) % self.K  # move pointer
117 | ```
118 | 
119 | Note that I skipped (for now) the shuffling and de-shuffling part since I don't clearly understand it.
120 | 


--------------------------------------------------------------------------------
/notes/moco_v2.md:
--------------------------------------------------------------------------------
1 | > We verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo — namely,  `using  an  MLP  projection  head`  and  `more  data augmentation` — we establish stronger baselines that outperform SimCLR and do not require large training batches.
2 | 
3 | ### MLP head
4 | > Using the default `τ=0.07`,  pre-training with the MLP head improves from 60.6% to 62.9%; switching to the optimal value for MLP (0.2), the accuracy increases to 66.2%.
5 | 
6 | ### Augmentation
7 | >  The extra augmentation alone (i.e. no MLP) improves the MoCo baseline on ImageNet by 2.8% to 63.4%.


--------------------------------------------------------------------------------
/notes/on_mi_maximization.md:
--------------------------------------------------------------------------------
  1 | ## On mutual information maximization for representation learning
  2 | 
  3 | ### Key idea
  4 | This paper challenged the (implicit) assumption that maximizing mutual information between encoder outputs leads to better representations. According to the paper, good representations are obtained when the encoder __discards__ some information (mostly noise). In other words, the encoder is non-invertible.
  5 | 
  6 | ### Setup
  7 | * Take an image from MNIST
  8 | * From each image, create two inputs: one from upper half and one from lower half
  9 | * Maximize the MI between the encoder outputs of the above two inputs
 10 | 
 11 | Note that exact MI is hard to estimate. But there are many estimates of MI. Some estimates have a _tight bound_ meaning they are more accurate than those with a _loose bound_.
 12 | 
 13 | ### Types of encoders used
 14 | * First they used encoders which could be invertible and non-invertible.
 15 | 
 16 | > We first consider encoders which are bijective by design. Even though the true MI is maximized for any choice of model parameters, the representation quality (measured by downstream linear classification accuracy) improves during training. Furthermore, there exist invertible encoders for which the representation quality is worse than using raw pixels, despite also maximizing MI.
 17 | 
 18 | They did this by using adversarial training. In other words, the encoder tries to come up with representations such that:
 19 | * the MI between representations is high
 20 | * the linear separability of representations is low
 21 | 
 22 | The fact that it is possible to successfully train such an encoder shows that high MI doesn't necessary mean high linear separability.
 23 | 
 24 | > We next consider encoders that can model both invertible and non-invertible functions. When the encoder can be non-invertible, but is initialized to be invertible, I<sub>EST</sub> still biases the encoders to be very ill-conditioned and hard to invert.
 25 | 
 26 | ### Bias towards hard to invert encoders
 27 | 
 28 | The authors wanted to measure how _non-invertible_ the encoders got during training. They used a metric called `condition number` to measure the level of non-invertibility. The higher this number, the harder it is to invert the encoder.
 29 | 
 30 | #### Condition number
 31 | Condition number is the ratio σ<sub>largest</sub>/σ<sub>smallest</sub> where
 32 | 
 33 | σ<sub>largest</sub> and σ<sub>smallest</sub> are the largest and smallest singular values of Jacobian of `g(x)` where `g()` is the function represented by the encoder and `x` is the input.
 34 | 
 35 | Actually they computed the log of condition number:<br />
 36 | log(σ<sub>largest</sub>) - log(σ<sub>smallest</sub>)
 37 | 
 38 | #### What if the data occupied only a subspace of the entire space?
 39 | In this case, the Jacobian matrix would be singular (and condition number would be really really big). However the encoder might still be able to invert the output if the transformation of the subspace was non-singular. To deal with this problem, they added the same noise vector to `x1` and `x2` to ensure that the inputs spanned the entire space.
 40 | 
 41 | Below are some snippets of code taken from the [official implementation](https://github.com/google-research/google-research/blob/master/mutual_information_representation_learning/mirl.ipynb):
 42 | 
 43 | ```python
 44 | from tensorflow.python.ops.parallel_for import gradients
 45 | x_1, x_2, _ = processed_train_data(data_dimensions, batch_size)
 46 | 
 47 | # to make sure x_1 and x_2 were not limited to a subspace
 48 | if noise_std > 0.0:
 49 |   assert x_1.shape == x_2.shape, "X1 and X2 shapes must agree to add noise!"
 50 |   noise = noise_std * tf.random.normal(x_1.shape)
 51 |   x_1 += noise
 52 |   x_2 += noise
 53 | 
 54 | code_1, code_2 = g1(x_1), g2(x_2)
 55 | if compute_jacobian:
 56 |     jacobian = gradients.batch_jacobian(code_1, x_1, use_pfor=False)
 57 |     singular_values = tf.linalg.svd(jacobian, compute_uv=False)
 58 | 
 59 | ...
 60 | ...
 61 | ...
 62 | 
 63 | for run_number, results in enumerate(results_all_runs):
 64 |       stacked_singular_values = np.stack(results.singular_values)
 65 |       sorted_singular_values = np.sort(stacked_singular_values, axis=-1)
 66 |       log_condition_numbers = np.log(sorted_singular_values[..., -1]) \
 67 |                               - np.log(sorted_singular_values[..., 0])
 68 |       condition_numbers_runs.append(log_condition_numbers)
 69 | ```
 70 | 
 71 | Here's what they found:
 72 | 
 73 | > Moreover, even though `g1` is initialized very close to the identity function (which maximizes the true MI), the condition number of its Jacobian evaluated at inputs randomly sampled from the data-distribution steadily deteriorates over time, suggesting that in practice (i.e. numerically)inverting the model becomes increasingly hard.
 74 | 
 75 | ### Critics
 76 | 
 77 | Critics are basically functions (neural networks) used to predict whether or not two representations (vectors) come from the same image. They compared three critic architectures:
 78 | `bilinear`, `separable` and `MLP`. Below are some implementations of critics:
 79 | 
 80 | ```python
 81 | class InnerProdCritic(tf.keras.Model):
 82 |   def call(self, x, y):
 83 |     return tf.matmul(x, y, transpose_b=True)
 84 | 
 85 | class BilinearCritic(tf.keras.Model):
 86 |   def __init__(self, feature_dim=100, **kwargs):
 87 |     super(BilinearCritic, self).__init__(**kwargs)
 88 |     self._W = tfkl.Dense(feature_dim, use_bias=False)
 89 | 
 90 |   def call(self, x, y):
 91 |     return tf.matmul(x, self._W(y), transpose_b=True)
 92 | 
 93 | class ConcatCritic(tf.keras.Model):
 94 |   def __init__(self, hidden_dim=200, layers=1, activation='relu', **kwargs):
 95 |     super(ConcatCritic, self).__init__(**kwargs)
 96 |     # output is scalar score
 97 |     self._f = MLP([hidden_dim for _ in range(layers)]+[1], False, {"activation": "relu"})
 98 | 
 99 |   def call(self, x, y):
100 |     batch_size = tf.shape(x)[0]
101 |     # Tile all possible combinations of x and y
102 |     x_tiled = tf.tile(x[None, :],  (batch_size, 1, 1))
103 |     y_tiled = tf.tile(y[:, None],  (1, batch_size, 1))
104 |     # xy is [batch_size * batch_size, x_dim + y_dim]
105 |     xy_pairs = tf.reshape(tf.concat((x_tiled, y_tiled), axis=2),
106 |                           [batch_size * batch_size, -1])
107 |     # Compute scores for each x_i, y_j pair.
108 |     scores = self._f(xy_pairs)
109 |     return tf.transpose(tf.reshape(scores, [batch_size, batch_size]))
110 | 
111 | 
112 | class SeparableCritic(tf.keras.Model):
113 |   def __init__(self, hidden_dim=100, output_dim=100, layers=1,
114 |                activation='relu', **kwargs):
115 |     super(SeparableCritic, self).__init__(**kwargs)
116 |     self._f_x = MLP([hidden_dim for _ in range(layers)] + [output_dim], False, {"activation": activation})
117 |     self._f_y = MLP([hidden_dim for _ in range(layers)] + [output_dim], False, {"activation": activation})
118 | 
119 |   def call(self, x, y):
120 |     x_mapped = self._f_x(x)
121 |     y_mapped = self._f_y(y)
122 |     return tf.matmul(x_mapped, y_mapped, transpose_b=True)
123 | ```
124 | 
125 | Here's what they found:
126 | > It can be seen that for both lower bounds, representations trained with the MLP critic barely outperform the baseline on pixel space, whereas the same lower bounds with bilinear and separable critics clearly lead to a higher accuracy than the baseline.
127 | 
128 | ### Connection to deep metric learning and triplet losses
129 | 
130 | After decoupling representation quality and MI maximization, the authors made a connection between representation quality and triplet losses.
131 | 
132 | #### The metric learning view
133 | > Given sets of triplets, namely an anchor point `x`, a positive instance `y`, and a negative instance `z`, the goal is to learn a representation `g(x)` such that the distances between `g(x)` and `g(y)` is smaller than the distance between `g(x)` and `g(z)`, for each triplet.
134 | 
135 | They make this association in two ways:
136 | a) mathematically formulating the critic objective function and drawing parallels with the triplet loss function
137 | 
138 | and
139 | 
140 | b)emphasizing the importance of negative sampling. I didn't spend too much time trying to understand it so will not provide a gist here.
141 | 


--------------------------------------------------------------------------------
/notes/probmods/Chapter 10_ Learning with a language of thought.md:
--------------------------------------------------------------------------------
 1 | ## Chapter 10: Learning with a language of thought
 2 | 
 3 | #### Key idea
 4 | How can we create complex hypotheses and representation spaces? Simply by using stochastic recursion. When the recursion is going to end is not deterministic, it is probabilistic. 
 5 | 
 6 | Here is how we can create an infinite amount of mathematical expressions:
 7 | ```javascript
 8 | var randomConstant = function() {
 9 |   return uniformDraw(_.range(10))
10 | }
11 | 
12 | var randomCombination = function(f,g) {
13 |   var op = uniformDraw(['+','-','*','/','^']);
14 |   return '('+f+op+g+')'
15 | }
16 | 
17 | // sample an arithmetic expression
18 | var randomArithmeticExpression = function() {
19 |   flip() ? 
20 |     randomCombination(randomArithmeticExpression(), randomArithmeticExpression()) : 
21 |     randomConstant()
22 | }
23 | 
24 | randomArithmeticExpression()
25 | ```
26 | 
27 | But more complex strings are less likely due to the way the function is defined.
28 | 
29 | ### Inferring an arithmetic expression
30 | We can do so by conditioning:
31 | ```javascript
32 |   var e = randomArithmeticExpression();
33 |   var s = prettify(e);
34 |   var f = runify(e);
35 |   condition(f(1) == 3 && f(2) == 4);
36 | 
37 |   return {s: s};
38 | ```
39 | 
40 | ### Grammar based induction
41 | > What is the general principle in the two above examples? We can think of it as the following recipe: we build hypotheses by stochastically choosing between primitives and combination operations, this specifies an infinite “language of thought”; each expression in this language in turn specifies the likelihood of observations. Formally, the stochastic combination process specifies a probabilistic grammar; which yields terms compositionally interpreted into a likelihood over data.
42 | > 
43 | 
44 | ---
45 | 
46 | ### ToDo: Example: Rational Rules
47 | I didn't quite get how this works. Required reading: [A Rational Analysis of Rule-based Concept Learning](https://onlinelibrary.wiley.com/doi/epdf/10.1080/03640210701802071).
48 | 
49 | [Online, non-pdf version](https://onlinelibrary.wiley.com/doi/full/10.1080/03640210701802071)


--------------------------------------------------------------------------------
/notes/probmods/Chapter 11_ Hierarchical models.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 11: Hierarchical models
  2 | 
  3 | #### Key idea
  4 | We learn generalized concepts naturally:
  5 | - poodle, Dalmatian, Labrador → dog
  6 | - sedan, coupe, convertible, wagon → car
  7 | 
  8 | How do we build models that can learn these _abstract_ concepts?
  9 | 
 10 | ### Example 1: Bags with colored balls
 11 | Each bag can learn its own categorical distribution. It explains previously observed data well but fails to generalize.
 12 | 
 13 | Let's say this is what we observe:
 14 | ```javascript
 15 | var observedData = [
 16 | {bag: 'bag1', draw: 'blue'},
 17 | {bag: 'bag1', draw: 'blue'},
 18 | {bag: 'bag1', draw: 'black'},
 19 | {bag: 'bag1', draw: 'blue'},
 20 | {bag: 'bag1', draw: 'blue'},
 21 | {bag: 'bag1', draw: 'blue'},
 22 | {bag: 'bag2', draw: 'blue'},
 23 | {bag: 'bag2', draw: 'green'},
 24 | {bag: 'bag2', draw: 'blue'},
 25 | {bag: 'bag2', draw: 'blue'},
 26 | {bag: 'bag2', draw: 'blue'},
 27 | {bag: 'bag2', draw: 'red'},
 28 | {bag: 'bag3', draw: 'blue'},
 29 | {bag: 'bag3', draw: 'orange'}
 30 | ]
 31 | ```
 32 | 
 33 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/hierarchical_models1.jpg)
 34 | 
 35 | Human observation: All bags have __blue__ as predominant color. This is an abstract (generalized) notion of distribution of colors in bags. The below approach does not work:
 36 | 
 37 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/hierarchical_models2.jpg)
 38 | 
 39 | As you can see, it predicts poorly the distribution of bags 3 and N.
 40 | 
 41 | But if we try to learn a shared prototype, it works:
 42 | 
 43 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/hierarchical_models3.jpg)
 44 | 
 45 | It predicts the distribution of an unseen bag N very well.
 46 | 
 47 | ---
 48 | 
 49 | ### Example 2: Learning generalized vs specific prototypes
 50 | 
 51 | > Suppose that we have a number of bags that all have identical prototypes: they mix red and blue in proportion 2:1. But the learner doesn’t know this. She observes only one ball from each of N bags. What can she learn about an individual bag versus the population as a whole as the number of bags changes?
 52 | > 
 53 | 
 54 | If the data comes from different bags, the generalized prototype learns well but the specific one does not:
 55 | ```javascript
 56 | var data = [{bag:'bag1', draw:'red'}, {bag:'bag2', draw:'red'}, {bag:'bag3', draw:'blue'},
 57 |             {bag:'bag4', draw:'red'}, {bag:'bag5', draw:'red'}, {bag:'bag6', draw:'blue'},
 58 |             {bag:'bag7', draw:'red'}, {bag:'bag8', draw:'red'}, {bag:'bag9', draw:'blue'},
 59 |             {bag:'bag10', draw:'red'}, {bag:'bag11', draw:'red'}, {bag:'bag12', draw:'blue'}]
 60 | ```
 61 | 
 62 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/hierarchical_models4.jpg)
 63 | 
 64 | But if all samples come from a single bag, the specific prototype learns well but the generalized one does not:
 65 | 
 66 | ```javascript
 67 | var data = [{bag:'bag1', draw:'red'}, {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'blue'},
 68 |             {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'blue'},
 69 |             {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'blue'},
 70 |             {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'red'}, {bag:'bag1', draw:'blue'}]
 71 | ```
 72 | 
 73 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/hierarchical_models5.jpg)
 74 | 
 75 | ---
 76 | 
 77 | ### Learning Overhypotheses: Abstraction at the Superordinate Level
 78 | 
 79 | > Suppose that we observe that `bag1` consists of all blue marbles, `bag2` consists of all green marbles, `bag3` all red, and so on. This doesn’t tell us to expect a particular color in future bags, but it does suggest that bags are very regular—that all bags consist of marbles of only one color.
 80 | > 
 81 | Suppose we have the following data:
 82 | ```javascript
 83 | var observedData = [
 84 | {bag: 'bag1', draw: 'blue'}, {bag: 'bag1', draw: 'blue'}, {bag: 'bag1', draw: 'blue'},
 85 | {bag: 'bag1', draw: 'blue'}, {bag: 'bag1', draw: 'blue'}, {bag: 'bag1', draw: 'blue'},
 86 | {bag: 'bag2', draw: 'green'}, {bag: 'bag2', draw: 'green'}, {bag: 'bag2', draw: 'green'},
 87 | {bag: 'bag2', draw: 'green'}, {bag: 'bag2', draw: 'green'}, {bag: 'bag2', draw: 'green'},
 88 | {bag: 'bag3', draw: 'red'}, {bag: 'bag3', draw: 'red'}, {bag: 'bag3', draw: 'red'},
 89 | {bag: 'bag3', draw: 'red'}, {bag: 'bag3', draw: 'red'}, {bag: 'bag3', draw: 'red'},
 90 | {bag: 'bag4', draw: 'orange'}]
 91 | ```
 92 | 
 93 | Note that we only have one sample from `bag4` and no sample from `bag N`.
 94 | - We can confidently say that all samples from `bag4` are orange.
 95 | - For bag N, any color is equally probable.
 96 | 
 97 | This can be modeled by defining our prototype as:
 98 | 
 99 | ```javascript
100 |   // the global prototype mixture:
101 |   var phi = dirichlet(ones([5, 1]))
102 |   // regularity parameters: how strongly we expect the global prototype to project
103 |   // (ie. determine the local prototypes):
104 |   var alpha = gamma(2,2)
105 |   var prototype = T.mul(phi, alpha)
106 | ```
107 | 
108 | After observing the data, `alpha` will end up being significantly smaller than 1. 
109 | > This means roughly that the learned prototype in phi should exert less influence on prototype estimation for a new bag than a single observation.
110 | > 
111 | 
112 | Now let's say we have the following data:
113 | ```javascript
114 | var observedData = [
115 | {bag: 'bag1', draw: 'blue'}, {bag: 'bag1', draw: 'red'}, {bag: 'bag1', draw: 'green'},
116 | {bag: 'bag1', draw: 'black'}, {bag: 'bag1', draw: 'red'}, {bag: 'bag1', draw: 'blue'},
117 | {bag: 'bag2', draw: 'green'}, {bag: 'bag2', draw: 'red'}, {bag: 'bag2', draw: 'black'},
118 | {bag: 'bag2', draw: 'black'}, {bag: 'bag2', draw: 'blue'}, {bag: 'bag2', draw: 'green'},
119 | {bag: 'bag3', draw: 'red'}, {bag: 'bag3', draw: 'green'}, {bag: 'bag3', draw: 'blue'},
120 | {bag: 'bag3', draw: 'blue'}, {bag: 'bag3', draw: 'black'}, {bag: 'bag3', draw: 'green'},
121 | {bag: 'bag4', draw: 'orange'}]
122 | ```
123 | 
124 | > The marble color is instead variable within bags to about the same degree that it varies in the population as a whole.
125 | > 
126 | In this case `alpha` is significantly greater than 1.
127 | 
128 | ---
129 | 
130 | ### Example: The Shape Bias
131 | 
132 | It is `the preference to generalize a novel label for some object to other objects of the same shape, rather than say the same color or texture.`
133 | 
134 | Let's say each object category has four attributes: `'shape', 'color', 'texture', 'size'`. Let's say the following data is observed:
135 | 
136 | ```javascript
137 | var observedData = [{cat: 'cat1', shape: 1, color: 1, texture: 1, size: 1},
138 |                     {cat: 'cat1', shape: 1, color: 2, texture: 2, size: 2},
139 |                     {cat: 'cat2', shape: 2, color: 3, texture: 3, size: 1},
140 |                     {cat: 'cat2', shape: 2, color: 4, texture: 4, size: 2},
141 |                     {cat: 'cat3', shape: 3, color: 5, texture: 5, size: 1},
142 |                     {cat: 'cat3', shape: 3, color: 6, texture: 6, size: 2},
143 |                     {cat: 'cat4', shape: 4, color: 7, texture: 7, size: 1},
144 |                     {cat: 'cat4', shape: 4, color: 8, texture: 8, size: 2},
145 |                     {cat: 'cat5', shape: 5, color: 9, texture: 9, size: 1}]
146 | 
147 | ```
148 | 
149 | Let's define the range of values of attributes:
150 | ```javascript
151 | var values = {shape: _.range(11), color: _.range(11), texture: _.range(11), size: _.range(11)};
152 | ```
153 | 
154 | > One needs to allow for more values along each dimension than appear in the training data so as to be able to generalize to novel shapes, colors, etc.
155 | > 
156 | 
157 | Here each `attr` has its own `phi` and `alpha`:
158 | 
159 | ```javascript
160 | var categoryPosterior = Infer({method: 'MCMC', samples: 10000}, function(){
161 | 
162 |   var prototype = mem(function(attr){
163 |     var phi = dirichlet(ones([values[attr].length, 1]))
164 |     var alpha = exponential(1)
165 |     return T.mul(phi,alpha)
166 |   })
167 | 
168 |   var makeAttrDist = mem(function(cat, attr){
169 |     var probs = dirichlet(prototype(attr))
170 |     return Categorical({vs: values[attr], ps: probs})
171 |   })
172 | 
173 |   var obsFn = function(datum){
174 |     map(function(attr){observe(makeAttrDist(datum.cat,attr), datum[attr])},
175 |         attributes)
176 |   }
177 | 
178 |   mapData({data: observedData}, obsFn)
179 | 
180 |   return {cat5shape: sample(makeAttrDist('cat5','shape')),
181 |           cat5color: sample(makeAttrDist('cat5','color')),
182 |           catNshape: sample(makeAttrDist('catN','shape')),
183 |           catNcolor: sample(makeAttrDist('catN','color'))}
184 | })
185 | ```
186 | 
187 | > The program above gives us draws from some novel category for which we’ve seen a single instance. In the experiments with children, they had to choose one of three choice objects which varied according to the dimension they matched the example object from the category. 
188 | > 
189 | 
190 | ---
191 | 
192 | ### Example: Beliefs about Homogeneity and Generalization
193 | [In this study](https://scholar.google.com/scholar?q=%22The%20use%20of%20statistical%20heuristics%20in%20everyday%20inductive%20reasoning.%22), the authors found that:
194 | > to what extent people generalise depends on beliefs about the homogeneity of the group that the object falls in with respect to the property they are being asked to generalize about.
195 | > 
196 | 
197 | Let's say on a new island you encounter one male person a tribe T.
198 | 
199 | __Obesity__: If he is obese, how likely are other male members of tribe T to be obese?
200 | __Intuition__: Not so likely because obesity is a feature with heterogenous distribution within a tribe.
201 | 
202 | __Skin color__: If he is brown, how likely are other male members of tribe T to be brown?
203 | __Intuition__: Quite likely because skin color varies across tribes but is uniform in a single tribe.
204 | 
205 | #### Analogy to bags with color balls
206 | Bag: tribe
207 | Color: obesity or skin color
208 | 
209 | Here is what they found:
210 | ![](https://probmods.org/assets/img/nisbett_model_humans.png)
211 | 
212 | Again, a compound Dirichlet-multinomial distribution was used to model this experiment.
213 | 
214 | ---
215 | 
216 | ### ToDo: One-shot learning of visual categories
217 | 
218 | > __Motivation__: Humans are able to categorize objects (in a space with a huge number of dimensions) after seeing just one example of a new category. For example, after seeing a single wildebeest people are able to identify other wildebeest, perhaps by drawing on their knowledge of other animals.
219 | 
220 | Read [this paper (pdf)](http://proceedings.mlr.press/v27/salakhutdinov12a/salakhutdinov12a.pdf).
221 | 
222 | ---
223 | 
224 | ### ToDo: Get some ideas using overhypotheses from [this](https://sci-hub.tw/https://www.cambridge.org/core/journals/journal-of-child-language/article/variability-negative-evidence-and-the-acquisition-of-verb-argument-constructions/D62EDBFF5A8F1ACC821451FEAD3C88FB) paper.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 12_ Occam's Razor.md:
--------------------------------------------------------------------------------
 1 | ## Chapter 12: Occam's Razor
 2 | 
 3 | Humans choose the least _complex_ hypothesis that _fits_ the data well.
 4 | 
 5 | - How is complexity measured? How is fitness measured?
 6 | If fitness is semantically measured and complexity is syntactically measured (eg description length of the hypothesis in some representation language, or a count of the number of free parameters used to specify the hypothesis), the two are incommensurable.
 7 | 
 8 | In Bayesian models both complexity and fitness are measured semantically. Complexity is measured by _flexibility_: the ability to generate a more diverse set of observations.
 9 | 
10 | ### Key Idea: The Law of Conservation of Belief
11 | Since all probabilities should add up to 1, a complex model spreads its probabilities over a larger number of possibilities whereas a simple model will have high probabilities for a smaller set of events. Hence:
12 | 
13 | `P(simple hypothesis | event) > P(complex hypothesis | event)`
14 | 
15 | ---
16 | 
17 | ### The Size Principle
18 | > Of hypotheses which generate data uniformly, the one with smallest extension that is still consistent with the data is the most probable.
19 | > 
20 | 
21 | Consider the two possible hypotheses:
22 | ```javascript
23 | Categorical({vs: ['a', 'b', 'c', 'd', 'e', 'f'], ps: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]}) :
24 | Categorical({vs: ['a', 'b', 'c'], ps: [1/3, 1/3, 1/3]}));
25 | ```
26 | 
27 | Let say we observe a sample to be `a`. Which hypothesis is more likely? The smaller one! In fact the smaller hypothesis is twice as likely as the first one.
28 | 
29 | ```javascript
30 | var fullData = ['a', 'b', 'a', 'b', 'b', 'a', 'b']
31 | ```
32 | 
33 | As we observe more data, how does our learning look like? Take a look here:
34 | 
35 | https://github.com/vinsis/math-and-ml-notes/blob/master/images/size_principle.svg
36 | 
37 | #### Example 2:
38 | 
39 | Now consider these two hypotheses:
40 | ```javascript
41 | Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.375, 0.375, 0.125, 0.125]}) :
42 | Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.25, 0.25, 0.25, 0.25]}))
43 | ```
44 | 
45 | And our observed data is:
46 | ```javascript
47 | var observedData = ['a', 'b', 'a', 'b', 'c', 'd', 'b', 'b']
48 | ```
49 | 
50 | > The Bayesian Occam’s razor says that all else being equal the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Because of the law of conservation of belief, assigning higher likelihood to the observed data requires assigning lower likelihood to other possible data.
51 | > 
52 | 
53 | Hence the observed data is much more likely to have come from the first hypothesis.
54 | 
55 | ```javascript
56 | var hypothesisToDist = function(hypothesis) {
57 |   return (hypothesis == 'A' ?
58 |           Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.375, 0.375, 0.125, 0.125]}) :
59 |           Categorical({vs: ['a', 'b', 'c', 'd'], ps: [0.25, 0.25, 0.25, 0.25]}))
60 | }
61 | 
62 | var observedData = ['a', 'b', 'a', 'b', 'c', 'd', 'b', 'b']
63 | 
64 | var posterior = Infer({method: 'enumerate'}, function(){
65 |   var hypothesis = flip() ? 'A' : 'B'
66 |   mapData({data: observedData}, function(d){observe(hypothesisToDist(hypothesis),d)})
67 |   return hypothesis
68 | })
69 | 
70 | viz(posterior)
71 | ```
72 | 
73 | ---
74 | 
75 | ### Example: The Rectangle Game
76 | 
77 | Given a set of points `[(x,y)]` uniformly sampled from a rectangle, which rectangle is most likely?
78 | 
79 | By the same argument as above, the tightest fitting rectangle.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 13_ Learning (deep) continuous functions.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 13: Learning (deep) continuous functions
  2 | 
  3 | It's the same idea of updating prior beliefs, just applied to neural networks.
  4 | 
  5 | ### Key idea
  6 | The parameters of a neural network comes from one or more Gaussian distributions. Given some data we can update the priors to come up with neural nets that fit the data better.
  7 | 
  8 | ```javascript
  9 | var dm = 10 //size of hidden layer
 10 | 
 11 | var makeFn = function(M1,M2,B1){
 12 |   return function(x){
 13 |     return T.toScalars(
 14 |       // M2 * sigm(x * M1 + B1):
 15 |       T.dot(M2,T.sigmoid(T.add(T.mul(M1,x),B1)))
 16 |     )[0]}
 17 | }
 18 | 
 19 | var observedData = [{"x":-4,"y":69.76636938284166},{"x":-3,"y":36.63586217969598},{"x":-2,"y":19.95244368751754},{"x":-1,"y":4.819485497724985},{"x":0,"y":4.027631414787425},{"x":1,"y":3.755022418210824},{"x":2,"y":6.557548104903805},{"x":3,"y":23.922485493795072},{"x":4,"y":50.69924692420815}]
 20 | 
 21 | var inferOptions = {method: 'optimize', samples: 100, steps: 3000, optMethod: {adam: {stepSize: 0.1}}}
 22 | 
 23 | var post = Infer(inferOptions,
 24 |   function() {  
 25 |     var M1 = sample(DiagCovGaussian({mu: zeros([dm, 1]), sigma: ones([dm,1])}))
 26 |     var B1 = sample(DiagCovGaussian({mu: zeros([dm, 1]), sigma: ones([dm,1])}))
 27 |     var M2 = sample(DiagCovGaussian({mu: zeros([1, dm]), sigma: ones([1,dm])}))
 28 |     
 29 |     var f = makeFn(M1,M2,B1)
 30 |     
 31 |     var obsFn = function(datum){
 32 |       observe(Gaussian({mu: f(datum.x), sigma: 0.1}), datum.y)
 33 |     }
 34 |     mapData({data: observedData}, obsFn)
 35 | 
 36 |     return {M1: M1, M2: M2, B1: B1}
 37 |   }
 38 | )
 39 | 
 40 | print("observed data:")
 41 | viz.scatter(observedData)
 42 | 
 43 | var postFnSample = function(){
 44 |   var p = sample(post)
 45 |   return makeFn(p.M1,p.M2,p.B1) 
 46 | }
 47 | ```
 48 | 
 49 | Notice two things here:
 50 | 1. How the parameters for the network are sampled from a `DiagCovGaussian` (multivariate Gaussian distribution).
 51 | 2. How we `observe` the data: `observe(Gaussian({mu: f(datum.x), sigma: 0.1}), datum.y)`
 52 | 
 53 | The second step is key to updating the parameters. __A non-Bayesian way of updating parameters requires a loss function to backpropagate on. In a Bayesian way, we are trying to increase the likelihood of getting `y` from a Gaussian centered at the output of the neural net.__
 54 | 
 55 | 
 56 | __Note__: As the width of hidden layer goes to infinity, the network approaches a Gaussian process.
 57 | 
 58 | > Infinitely “wide” neural nets yield a model where `f(x)` is Gaussian distributed for each `x`, and further (it turns out) the covariance among different `x`s is also Gaussian.
 59 | 
 60 | ---
 61 | 
 62 | ### Deep generative models
 63 | 
 64 | > Many interesting problems are unsupervised: we get a bunch of examples and want to understand them by capturing their distribution.
 65 | > 
 66 | Notice how this works:
 67 | 
 68 | ```javascript
 69 | var hd = 10
 70 | var ld = 2
 71 | var outSig = Vector([0.1, 0.1])
 72 | 
 73 | var post = Infer(inferOptions,
 74 |   function() {  
 75 |     var M1 = sample(DiagCovGaussian({mu: zeros([hd,ld]), sigma: ones([hd,ld])}), {
 76 |       guide: function() {return Delta({v: param({dims: [hd, ld]})})}})
 77 |     var B1 = sample(DiagCovGaussian({mu: zeros([hd, 1]), sigma: ones([hd,1])}), {
 78 |       guide: function() {return Delta({v: param({dims: [hd, 1]})})}})
 79 |     var M2 = sample(DiagCovGaussian({mu: zeros([2,hd]), sigma: ones([2,hd])}), {
 80 |       guide: function() {return Delta({v: param({dims: [2,hd]})})}})
 81 |     
 82 |     var f = makeFn(M1,M2,B1)
 83 |     var sampleXY = function(){return f(sample(DiagCovGaussian({mu: zeros([ld, 1]), sigma: ones([ld,1])})))}
 84 | 
 85 |     var means = repeat(observedData.length, sampleXY)
 86 |     var obsFn = function(datum,i){
 87 |       observe(DiagCovGaussian({mu: means[i], sigma: outSig}), Vector([datum.x, datum.y]))
 88 |     }
 89 |     mapData({data: observedData}, obsFn)
 90 | 
 91 |     return {means: means, 
 92 |             pp: repeat(100, sampleXY)}
 93 |   }
 94 | )
 95 | ```
 96 | 
 97 | Note:
 98 | 1. The output is not a scalar anymore; it is a vector of length two.
 99 | 2. We use `sampleXY` defined as `var sampleXY = function(){return f(sample(DiagCovGaussian({mu: zeros([ld, 1]), sigma: ones([ld,1])})))}` to sample `means`.
100 | 3. We observe `means` to be as close to our observed data as possible: `observe(DiagCovGaussian({mu: means[i], sigma: outSig}), Vector([datum.x, datum.y]))`
101 | 
102 | ---
103 | 
104 | ### Minibatches and amortized inference
105 | 
106 | > Minibatches: the idea is that randomly sub-sampling the data on each step can give us a good enough approximation to the whole data set.
107 | > 
108 | But if split the data into smaller batches, we need to be sure the latent variable used to sample `means` improves over time. 
109 | 
110 | ### ToDo: Read about [Amortized Inference in Probabilistic Reasoning](https://web.stanford.edu/~ngoodman/papers/amortized_inference.pdf)


--------------------------------------------------------------------------------
/notes/probmods/Chapter 14_ Mixture models.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 14: Mixture models
  2 | 
  3 | Let's take a look at marbles problem earlier:
  4 | 
  5 | Bag1 → `Unknown distribution` → Sample1 → Color1
  6 | Bag2 → `Unknown distribution` → Sample2 → Color2
  7 | ...
  8 | BagN → `Unknown distribution` → SampleN → ColorN
  9 | 
 10 | Here we know deterministically that Color1 came from Bag1 and so on. What if we remove this information?
 11 | 
 12 | [Bag1, Bag2, ..., BagN] → `Sample bag1` → `Sample Color1` 
 13 | [Bag1, Bag2, ..., BagN] → `Sample bag2` → `Sample Color2` 
 14 | ...
 15 | [Bag1, Bag2, ..., BagN] → `Sample bag n` → `Sample Color n`
 16 | 
 17 | Here we need to learn two things given some observed data:
 18 | - how are the bags distributed?
 19 | - how are colors in each bag distributed?
 20 | 
 21 | ```javascript
 22 | var colors = ['blue', 'green', 'red']
 23 | 
 24 | var observedData = [{name: 'obs1', draw: 'red'},
 25 |                     {name: 'obs2', draw: 'red'},
 26 |                     {name: 'obs3', draw: 'blue'},
 27 |                     {name: 'obs4', draw: 'blue'},
 28 |                     {name: 'obs5', draw: 'red'},
 29 |                     {name: 'obs6', draw: 'blue'}]
 30 | 
 31 | var predictives = Infer({method: 'MCMC', samples: 30000}, function(){
 32 | 
 33 |   var phi = dirichlet(ones([3, 1]))
 34 |   var alpha = 1000.1
 35 |   var prototype = T.mul(phi, alpha)
 36 | 
 37 |   var makeBag = mem(function(bag){
 38 |     var colorProbs = dirichlet(prototype)
 39 |     return Categorical({vs: colors, ps: colorProbs})
 40 |   })
 41 | 
 42 |   // each observation (which is named for convenience) comes from one of three bags:
 43 |   var obsToBag = mem(function(obsName) {return uniformDraw(['bag1', 'bag2', 'bag3'])})
 44 | 
 45 |   var obsFn = function(datum){
 46 |     observe(makeBag(obsToBag(datum.name)), datum.draw)
 47 |   }
 48 |   mapData({data: observedData}, obsFn)
 49 | 
 50 |   return {sameBag1and2: obsToBag(observedData[0].name) === obsToBag(observedData[1].name),
 51 |           sameBag1and3: obsToBag(observedData[0].name) === obsToBag(observedData[2].name)}
 52 | })
 53 | ```
 54 | 
 55 | Notice how the bag is selected uniformly at random:
 56 | ```javascript
 57 | var obsToBag = mem(function(obsName) {return uniformDraw(['bag1', 'bag2', 'bag3'])})
 58 | ```
 59 | 
 60 | > Instead of assuming that a marble is equally likely to come from each bag, we could instead learn a distribution over bags where each bag has a different probability. 
 61 | > 
 62 | ```javascript
 63 | var bagMixture = dirichlet(ones([3, 1]))
 64 |   var obsToBag = mem(function(obsName) {
 65 |     return categorical({vs: ['bag1', 'bag2', 'bag3'], ps: bagMixture});
 66 |   })
 67 | ```
 68 | 
 69 | ---
 70 | 
 71 | ### Example: Topic Models
 72 | 
 73 | Problem: Given a document (assumed to be a bag of words) and a set of topics, classify the document as one of the topics.
 74 | 
 75 | Approach:
 76 | 
 77 | > Each topic is associated with a distribution over words, and this distribution is drawn from a Dirichlet prior.
 78 | 
 79 | [Vocabulary] → [Topic1]
 80 | [Vocabulary] → [Topic2]
 81 | ...
 82 | [Vocabulary] → [Topic k]
 83 | 
 84 | > For each document, mixture weights over a set of `k` topics are drawn from a Dirichlet prior.
 85 | 
 86 | [Dirichlet prior] → [Topic1, Topic2, ... Topic k] for a document
 87 | 
 88 | > For each of the N topics (where N = length of the doc) drawn for the document, a word is sampled from the corresponding multinomial distribution.
 89 | > 
 90 | 
 91 | For each word:<br>
 92 | [Topic1, Topic2, ... Topic k] → [Topic] → [Word] → Observe similarity with word in a document
 93 | 
 94 | ```javascript
 95 | var expectationOver = function(topicID, results) {
 96 |   return function(i) {
 97 |     return expectation(results, function(v) {return T.get(v[topicID], i)})
 98 |   }
 99 | }
100 | 
101 | var vocabulary = ['DNA', 'evolution', 'parsing', 'phonology'];
102 | var eta = ones([vocabulary.length, 1])
103 | 
104 | var numTopics = 2
105 | var alpha = ones([numTopics, 1])
106 | 
107 | var corpus = [
108 |   'DNA evolution DNA evolution DNA evolution DNA evolution DNA evolution'.split(' '),
109 |   'DNA evolution DNA evolution DNA evolution DNA evolution DNA evolution'.split(' '),
110 |   'DNA evolution DNA evolution DNA evolution DNA evolution DNA evolution'.split(' '),
111 |   'parsing phonology parsing phonology parsing phonology parsing phonology parsing phonology'.split(' '),
112 |   'parsing phonology parsing phonology parsing phonology parsing phonology parsing phonology'.split(' '),
113 |   'parsing phonology parsing phonology parsing phonology parsing phonology parsing phonology'.split(' ')
114 | ]
115 | 
116 | var model = function() {
117 | 
118 |   var topics = repeat(numTopics, function() {
119 |     return dirichlet({alpha: eta})
120 |   })
121 | 
122 |   mapData({data: corpus}, function(doc) {
123 |     var topicDist = dirichlet({alpha: alpha})
124 |     mapData({data: doc}, function(word) {
125 |       var z = sample(Discrete({ps: topicDist}))
126 |       var topic = topics[z]
127 |       observe(Categorical({ps: topic, vs: vocabulary}), word)
128 |     })
129 |   })
130 | 
131 |   return topics
132 | }
133 | 
134 | var results = Infer({method: 'MCMC', samples: 20000}, model)
135 | 
136 | //plot expected probability of each word, for each topic:
137 | var vocabRange = _.range(vocabulary.length)
138 | print('topic 0 distribution')
139 | viz.bar(vocabulary, map(expectationOver(0, results), vocabRange))
140 | print('topic 1 distribution')
141 | viz.bar(vocabulary, map(expectationOver(1, results), vocabRange))
142 | ```
143 | 
144 | Intuitively what's happening here is:
145 | - Each time we see a document, we sample a distribution over topics
146 | - For each word, we sample a topic which should have a high probability of showing the corresponding word
147 | 
148 | Image source: [Latent Dirichlet Allocation (LDA) \| NLP-guidance](https://moj-analytical-services.github.io/NLP-guidance/LDA.html)
149 | ![](https://moj-analytical-services.github.io/NLP-guidance/LDAresults.png)
150 | 
151 | #### Plate notation (from wiki)
152 | ![251px-Smoothed_LDA.png](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Smoothed_LDA.png/251px-Smoothed_LDA.png)
153 | 
154 | > M denotes the number of documents
155 | > N is number of words in a given document (document i has N<sub>i</sub> words)
156 | > α is the parameter of the Dirichlet prior on the per-document topic distributions
157 | > β is the parameter of the Dirichlet prior on the per-topic word distribution
158 | > θ<sub>i</sub> is the topic distribution for document i
159 | > z<sub>ij</sub> is the topic for the j-th word in document i
160 | 
161 | ---
162 | 
163 | ### Example: Categorical Perception of Speech Sounds
164 | 
165 | > Human perception is often skewed by our expectations. A common example of this is called categorical perception – when we perceive objects as being more similar to the category prototype than they really are. In phonology this is been particularly important and is called the perceptual magnet effect: Hearers regularize a speech sound into the category that they think it corresponds to. Of course this category isn’t known a priori, so a hearer must be doing a simultaneous inference of what category the speech sound corresponded to, and what the sound must have been. 
166 | > 
167 | 
168 | The code is simple but the implications are deep.
169 | 
170 | ```javascript
171 | var prototype1 = 0
172 | var prototype2 = 5
173 | var stimuli = _.range(prototype1, prototype2, 0.2)
174 | 
175 | var perceivedValue = function(stim){
176 |   return expectation(Infer({method: 'MCMC', samples: 10000}, function(){
177 |     var vowel1 = Gaussian({mu: prototype1, sigma: 1})
178 |     var vowel2 = Gaussian({mu: prototype2, sigma: 1})
179 | 
180 |     var category = flip()
181 |     var value = category ? sample(vowel1) : sample(vowel2)
182 | 
183 |     observe(Gaussian({mu: value, sigma: 1}),stim)
184 | 
185 |     return value
186 |   }))
187 | }  
188 | 
189 | var stimulusVsPerceivedValues = map(function(s){return {x:s, y:perceivedValue(s)}}, stimuli)
190 | 
191 | viz.scatter(stimulusVsPerceivedValues)
192 | ```
193 | 
194 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/value_vs_perceived_value.jpg)
195 | 
196 | Notice that the perceived value is the expectation of all values that are observed for a given `value`. This means __on average most values for a given `value` tend to be on left or right side of `value`__. This is where the skewness comes in.
197 | 
198 | ---
199 | 
200 | ### Unknown Numbers of Categories
201 | 
202 | In the previous examples the number of categories was fixed. But that can be problematic. 
203 | 
204 | > The simplest way to address this problem, which we call unbounded models, is to simply place uncertainty on the number of categories in the form of a hierarchical prior.
205 | > __Example__: Inferring whether one or two coins were responsible for a set of outcomes (i.e. imagine a friend is shouting each outcome from the next room–“heads, heads, tails…”–is she using a fair coin, or two biased coins?).
206 | > 
207 | 
208 | ```javascript
209 | // var observedData = [true, true, true, true, false, false, false, false]
210 | var observedData = [true, true, true, true, true, true, true, true]
211 | 
212 | var results = Infer({method: 'rejection', samples: 100}, function(){
213 |   var coins = flip() ? ['c1'] : ['c1', 'c2'];
214 |   var coinToWeight = mem(function(c) {return uniform(0,1)})
215 |   mapData({data: observedData},
216 |           function(d){observe(Bernoulli({p: coinToWeight(uniformDraw(coins))}),d)})
217 |   return {numCoins: coins.length}
218 | })
219 | 
220 | viz(results)
221 | ```
222 | 
223 | We can extend the idea to a higher number of bags by using a `Poisson distrbution`:
224 | 
225 | ```javascript
226 | var colors = ['blue', 'green', 'red']
227 | var observedMarbles = ['red', 'red', 'blue', 'blue', 'red', 'blue']
228 | var results = Infer({method: 'rejection', samples: 100}, function() {
229 |   var phi = dirichlet(ones([3,1]));
230 |   var alpha = 0.1;
231 |   var prototype = T.mul(phi, alpha);
232 | 
233 |   var makeBag = mem(function(bag){
234 |     var colorProbs = dirichlet(prototype);
235 |     return Categorical({vs: colors, ps: colorProbs});
236 |   })
237 | 
238 |   // unknown number of categories (created with placeholder names):
239 |   var numBags = (1 + poisson(1));
240 |   var bags = map(function(i) {return 'bag' + i;}, _.range(numBags))
241 | 
242 |   mapData({data: observedMarbles},
243 |           function(d){observe(makeBag(uniformDraw(bags)), d)})
244 | 
245 |   return {numBags: numBags}
246 | })
247 | 
248 | viz(results)
249 | ```
250 | 
251 | But is the number of categories infinite here? No! 
252 | 
253 | > In an unbounded model, there are a finite number of categories whose number is drawn from an unbounded prior distribution
254 | > 
255 | 
256 | An alternative is to use _infinite_ mixture models.
257 | 
258 | ---
259 | 
260 | ### Infinite mixture models aka Dirichlet Process
261 | 
262 | Consider the discrete probability distribution:
263 | 
264 | `[a,b,c,d]` where `a+b+c+d=1`
265 | 
266 | It can be interpreted as:
267 | - Probability of stopping at a = `a / (a+b+c+d)`
268 | - Probability of stopping at b = `b / (b+c+d)`
269 | - Probability of stopping at c = `c / (c+d)`
270 | - Probability of stopping at d = `d / (d)`
271 | 
272 | Note that the last number is always 1 and all other numbers are always between 0 and 1.
273 | 
274 | Conversely, we can convert any list with all but last entries between 0 and 1 and last entry as 1 like so:
275 | 
276 | `[p,q,r,1]` means
277 | - Probability of stopping at first index = p
278 | - Probability of stopping at second index = q
279 | - Probability of stopping at second index = r
280 | - Probability of stopping at second index = 1
281 | 
282 | Thus it could be converted into a discrete probability distribution of length 4.
283 | 
284 | But what if we never decide to stop. In other words what if all the numbers `[p,q,r,s,...]` are between 0 and 1. 
285 | 
286 | Whenever we sample from this distribution, it will stop at a random index between 1 and infinity.
287 | 
288 | This can be modeled as:
289 | ```javascript
290 | // resid = [p,q,r,s,...] 
291 | var mySampleDiscrete = function(resid,i) {
292 |     return flip(resid(i)) ? i : mySampleDiscrete(resid, i+1)
293 |   }
294 | ```
295 | 
296 | But how do we create this infinitely long array of probabilities? In other words how does one set a prior over an infinite set of bags?
297 | 
298 | ```javascript
299 |   //a prior over an infinite set of bags:
300 |   var residuals = mem(function(i){uniform(0,1)}) // could also be beta(1,1) instead of uniform(0,1)
301 |   var mySampleDiscrete = function(resid,i) {
302 |     return flip(resid(i)) ? i : mySampleDiscrete(resid, i+1)
303 |   }
304 |   var getBag = mem(function(obs){
305 |     return mySampleDiscrete(residuals,0)
306 |   })
307 | ```
308 | 
309 | Thus we can construct an infinite mixture model like so:
310 | ```javascript
311 | var colors = ['blue', 'green', 'red']
312 | var observedMarbles = [{name:'obs1', draw: 'red'},
313 | {name:'obs2', draw: 'blue'},
314 | {name:'obs3', draw: 'red'},
315 | {name:'obs4', draw: 'blue'},
316 | {name:'obs5', draw: 'red'},
317 | {name:'obs6', draw: 'blue'}]
318 | var results = Infer({method: 'MCMC', samples: 200, lag: 100}, function() {
319 |   var phi = dirichlet(ones([3,1]));
320 |   var alpha = 0.1
321 |   var prototype = T.mul(phi, alpha);
322 |   var makeBag = mem(function(bag){
323 |     var colorProbs = dirichlet(prototype);
324 |     return Categorical({vs: colors, ps: colorProbs});
325 |   })
326 | 
327 |   //a prior over an infinite set of bags:
328 |   var residuals = mem(function(i){uniform(0,1)})
329 |   var mySampleDiscrete = function(resid,i) {
330 |     return flip(resid(i)) ? i : mySampleDiscrete(resid, i+1)
331 |   }
332 |   var getBag = mem(function(obs){
333 |     return mySampleDiscrete(residuals,0)
334 |   })
335 | 
336 |   mapData({data: observedMarbles},
337 |           function(d){observe(makeBag(getBag(d.name)), d.draw)})
338 | 
339 |   return {samebag12: getBag('obs1')==getBag('obs2'),
340 |           samebag13: getBag('obs1')==getBag('obs3')}
341 | })
342 | 
343 | ```


--------------------------------------------------------------------------------
/notes/probmods/Chapter 15_ Social Cognition.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 15: Social Cognition
  2 | 
  3 | ### Prelude: Two ways to look at the same problem
  4 | 
  5 | > Imagine a factory where the widget-maker makes a stream of widgets, and the widget-tester removes the faulty ones. You don’t know what tolerance the widget tester is set to, and wish to infer it.
  6 | > 
  7 | 
  8 | #### Way 1 to create `n` widgets:
  9 | Sample a widget from a distribution of widgets. Condition on the widget passing the test.
 10 | - If it passes the test, create `n-1` widgets recursively
 11 | - If it fails, create `n` widgets recursively
 12 | 
 13 | #### Way 2 to create `n` widgets:
 14 | Sample `n` widgets at the same time from a distribution of widgets. Condition on _all_ widgets passing the test.
 15 | 
 16 | Way 1:
 17 | ```javascript
 18 | var makeWidgetSeq = function(numWidgets, threshold) {
 19 |   if(numWidgets == 0) {
 20 |     return [];
 21 |   } else {
 22 |     var widget = sample(widgetMachine);
 23 |     return (widget > threshold ? 
 24 |             [widget].concat(makeWidgetSeq(numWidgets - 1, threshold)) : 
 25 |             makeWidgetSeq(numWidgets, threshold));
 26 |   }
 27 | }
 28 | 
 29 | var widgetDist = Infer({method: 'rejection', samples: 300}, function() {
 30 |   var threshold = sample(thresholdPrior);
 31 |   var goodWidgetSeq = makeWidgetSeq(3, threshold);
 32 |   condition(_.isEqual([.6, .7, .8], goodWidgetSeq))
 33 |   return [threshold].join("");
 34 | })
 35 | ```
 36 | 
 37 | Way 2:
 38 | ```javascript
 39 | var makeGoodWidgetSeq = function(numWidgets, threshold) {
 40 |   return Infer({method: 'enumerate'}, function() {
 41 |     var widgets = repeat(numWidgets, function() {return sample(widgetMachine)});
 42 |     condition(all(function(widget) {return widget > threshold}, widgets));
 43 |     return widgets;
 44 |   })
 45 | }
 46 | ```
 47 | 
 48 | > #### Rather than thinking about the details inside the widget tester, we are now abstracting to represent that the machine correctly chooses a good widget
 49 | > 
 50 | 
 51 | We don't know the behavior of widget testing machine. So we think of testing at an abstract level and infer to maximize what we want.
 52 | 
 53 | ---
 54 | 
 55 | ### Social Cognition
 56 | 
 57 | > An agent tends to choose actions that she expects to lead to outcomes that satisfy her goals.
 58 | > 
 59 | 
 60 | Let's say Sally wants to buy a cookie from a __deterministic__ vending machine. This is how it works:
 61 | 
 62 | ```javascript
 63 | var vendingMachine = function(state, action) {
 64 |   return (action == 'a' ? 'bagel' :
 65 |           action == 'b' ? 'cookie' :
 66 |           'nothing');
 67 | }
 68 | ```
 69 | 
 70 | Here is how actions are chosen:
 71 | ```javascript
 72 | var chooseAction = function(goalSatisfied, transition, state) {
 73 |   return Infer(..., function() {
 74 |     var action = sample(actionPrior)
 75 |     condition(goalSatisfied(transition(state, action)))
 76 |     return action;
 77 |   })
 78 | }
 79 | ```
 80 | 
 81 | where `transition` is the output of taking `action` in `state`.
 82 | 
 83 | She is clearly always press `b` to get the cookie. Now let's say the vending machine is not realistic:
 84 | 
 85 | ```javascript
 86 | var vendingMachine = function(state, action) {
 87 |   return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: [.9, .1]}) :
 88 |           action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: [.1, .9]}) :
 89 |           'nothing');
 90 | }
 91 | ```
 92 | 
 93 | We see Sally still presses `b` most of the time (but not every time).
 94 | 
 95 | > Technically, this method of making a choices is not optimal, but rather it is soft-max optimal (also known as following the “Boltzmann policy”).
 96 | > 
 97 | 
 98 | Here is how we would represent the whole thing:
 99 | 
100 | ```javascript
101 | ///fold:
102 | var actionPrior = Categorical({vs: ['a', 'b'], ps: [.5, .5]})
103 | var haveCookie = function(obj) {return obj == 'cookie'};
104 | ///
105 | var vendingMachine = function(_, action) {
106 |   return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: [.9, .1]}) :
107 |           action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: [.1, .9]}) :
108 |           'nothing');
109 | }
110 | 
111 | var chooseAction = function(goalSatisfied, transition, state) {
112 |   return Infer({method: 'enumerate'}, function() {
113 |     var action = sample(actionPrior)
114 |     condition(goalSatisfied(transition(_, action)))
115 |     return action;
116 |   })
117 | }
118 | 
119 | viz.auto(chooseAction(haveCookie, vendingMachine, '_'));
120 | ```
121 | 
122 | ---
123 | 
124 | ### Goal Inference
125 | Let's say we don't know what Sally wants but we observe her pressing `b`. How can we infer what she wants?
126 | 
127 | Here we don't know what the goal is. So we our `goalSatisfied` becomes probabilistic instead of deterministic:
128 | 
129 | ```javascript
130 | var goal = categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]})
131 | var goalSatisfied = function(outcome) {return outcome == goal};
132 | ```
133 | 
134 | We randomly sample `goalSatisfied` and then for that goal, infer the best action. 
135 | 
136 | We draw an inference on `goal` by observing that the chosen action was `b`:
137 | 
138 | ```javascript
139 | var goalPosterior = Infer({method: 'enumerate'}, function() {
140 |   var goal = categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]})
141 |   var goalSatisfied = function(outcome) {return outcome == goal};
142 |   var actionDist = chooseAction(goalSatisfied, vendingMachine, 'state')
143 |   factor(actionDist.score('b'));
144 |   return goal;
145 | })
146 | ```
147 | 
148 | Note how we are doing inference inside an inference here.
149 | 
150 | Now let's say the button `b` gives any of the two options equally probably:
151 | 
152 | ```javascript
153 | var vendingMachine = function(state, action) {
154 |   return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: [.9, .1]}) :
155 |           action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]}) :
156 |           'nothing');
157 | }
158 | ```
159 | 
160 | > Despite the fact that button b is equally likely to result in either bagel or cookie, we have inferred that Sally probably wants a cookie. This is a result of the inference implicitly taking into account the counterfactual alternatives: if Sally had wanted a bagel, she would have likely pressed button a. 
161 | > 
162 | 
163 | ---
164 | 
165 | ### Preferences
166 | 
167 | Let's say we observe Sally pressing `b` several times. We don't know what she wants but we do know she has some preference. In this case this is how we define the `goal`:
168 | 
169 | ```javascript
170 | var preference = uniform(0, 1);
171 | var goalPrior = function() {return flip(preference) ? 'bagel' : 'cookie'};
172 | var makeGoal = function(food) {return function(outcome) {return outcome == food}};
173 | ```
174 | 
175 | ... and this is how we condition:
176 | 
177 | ```javascript
178 | var goalPosterior = Infer({method: 'MCMC', samples: 20000}, function() {
179 |   var preference = uniform(0, 1);
180 |   var goalPrior = function() {return flip(preference) ? 'bagel' : 'cookie'};
181 |   var makeGoal = function(food) {return function(outcome) {return outcome == food}};
182 |   condition((sample(chooseAction(makeGoal(goalPrior()), vendingMachine, 'state')) == 'b') &&
183 |             (sample(chooseAction(makeGoal(goalPrior()), vendingMachine, 'state')) == 'b') &&
184 |             (sample(chooseAction(makeGoal(goalPrior()), vendingMachine, 'state')) == 'b'));
185 |   return goalPrior();
186 | })
187 | ```
188 | 
189 | ---
190 | 
191 | 
192 | ### Epistemic States
193 | 
194 | When we defined the vending machine like so:
195 | ```javascript
196 | var vendingMachine = function(state, action) {
197 |   return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: [.9, .1]}) :
198 |           action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]}) :
199 |           'nothing');
200 | }
201 | ```
202 | 
203 | we made the assumption that we knew how the vending machine worked. What if don't know how it works? Then we can replace `ps: [.9, .1]` and `ps: [.5, .5]` by a distribution:
204 | 
205 | ```javascript
206 | var makeVendingMachine = function(aEffects, bEffects) {
207 |   return function(state, action) {
208 |     return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: aEffects}) :
209 |             action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: bEffects}) :
210 |             'nothing');
211 |   }
212 | };
213 | 
214 | var aEffects = dirichlet(Vector([1,1]))
215 | var bEffects = dirichlet(Vector([1,1]))
216 | var vendingMachine = makeVendingMachine(aEffects, bEffects);
217 | ```
218 | 
219 | Now if we assume that Sally knows how it works, she does not need to `Infer` it. Thus
220 | 
221 | > We can capture this by placing uncertainty on the vending machine, inside the overall query but “outside” of Sally’s inference:
222 | > 
223 | ```javascript
224 | ///fold:
225 | var actionPrior = Categorical({vs: ['a', 'b'], ps: [.5, .5]})
226 | 
227 | var chooseAction = function(goalSatisfied, transition, state) {
228 |   return Infer({method: 'enumerate'}, function() {
229 |     var action = sample(actionPrior)
230 |     condition(goalSatisfied(transition(state, action)))
231 |     return action;
232 |   })
233 | }
234 | ///
235 | var makeVendingMachine = function(aEffects, bEffects) {
236 |   return function(state, action) {
237 |     return (action == 'a' ? categorical({vs: ['bagel', 'cookie'], ps: aEffects}) :
238 |             action == 'b' ? categorical({vs: ['bagel', 'cookie'], ps: bEffects}) :
239 |             'nothing');
240 |   }
241 | };
242 | 
243 | var goalPosterior = Infer({method: 'MCMC', samples: 50000}, function() {
244 |   var aEffects = dirichlet(Vector([1,1]))
245 |   var bEffects = dirichlet(Vector([1,1]));
246 | 
247 |   var vendingMachine = makeVendingMachine(aEffects, bEffects);
248 |   
249 |   var goal = categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]})
250 |   var goalSatisfied = function(outcome) {return outcome == goal};
251 |   
252 |   condition(goal == 'cookie' &&
253 |             sample(chooseAction(goalSatisfied, vendingMachine, 'state')) == 'b');
254 |   return T.get(bEffects, 1);
255 | })
256 | 
257 | print("probability of action 'b' giving cookie")
258 | viz.auto(goalPosterior);
259 | ```
260 | 
261 | Observe the below line:
262 | 
263 | ```javascript
264 | condition(goal == 'cookie' &&
265 |             sample(chooseAction(goalSatisfied, vendingMachine, 'state')) == 'b');
266 |   return T.get(bEffects, 1);
267 | ```
268 | 
269 | We are basically asking:
270 | > Assuming Sally knows how the machine works and she wants a cookie and she is seen pressing `b`,  what is the probability of action 'b' giving cookie?
271 | > 
272 | 
273 | > Now imagine a vending machine that has only one button, but it can be pressed many times. We don’t know what the machine will do in response to a given button sequence. We do know that pressing more buttons is less a priori likely.
274 | > 
275 | ```javascript
276 | var actionPrior = function() {
277 |   return categorical({vs: ['a', 'aa', 'aaa'], ps:[0.7, 0.2, 0.1] })
278 | }
279 | ```
280 | 
281 | The vending machine is now defined as:
282 | ```javascript
283 |   var buttonsToOutcomeProbs = {'a': T.toScalars(dirichlet(ones([2,1]))),
284 |                                'aa': T.toScalars(dirichlet(ones([2,1]))),
285 |                                'aaa': T.toScalars(dirichlet(ones([2,1])))}
286 |   var vendingMachine = function(state, action) {
287 |     return categorical({vs: ['bagel', 'cookie'], ps: buttonsToOutcomeProbs[action]})
288 |   }
289 | ```
290 | 
291 | We first condition on Sally pressing `a` to get a `cookie`:
292 | ```javascript
293 | condition(goal == 'cookie' && chosenAction == 'a')
294 | ```
295 | 
296 | Then we compare it with Sally pressing `aa` to get a `cookie`:
297 | ```javascript
298 | condition(goal == 'cookie' && chosenAction == 'aa')
299 | ```
300 | 
301 | >  Why can we draw much stronger inferences about the machine when Sally chooses to press the button twice? When Sally does press the button twice, she could have done the “easier” (or rather, a priori more likely) action of pressing the button just once. Since she doesn’t, a single press must have been unlikely to result in a cookie. This is an example of the principle of efficiency—all other things being equal, an agent will take the actions that require least effort (and hence, when an agent expends more effort all other things must not be equal).
302 | >  
303 | 
304 | > In these examples we have seen two important assumptions combining to allow us to infer something about the world from the indirect evidence of an agents actions. The first assumption is the principle of rational action, the second is an assumption of knowledgeability—we assumed that Sally knows how the machine works, though we don’t. Thus inference about inference, can be a powerful way to learn what others already know, by observing their actions.
305 | > 
306 | 
307 | ### Joint inference about beliefs and desires
308 | > Suppose we condition on two observations: that Sally presses the button twice, and that this results in a cookie. Then, assuming that she knows how the machine works, we jointly infer that she wanted a cookie, that pressing the button twice is likely to give a cookie, and that pressing the button once is unlikely to give a cookie.
309 | > 
310 | ```javascript
311 | var actionPrior = function() {
312 |   return categorical({vs: ['a', 'aa', 'aaa'], ps:[0.7, 0.2, 0.1] })
313 | }
314 | 
315 | var goalPosterior = Infer({method: 'rejection', samples: 5000}, function() {
316 |   var buttonsToOutcomeProbs = {'a': T.toScalars(dirichlet(ones([2,1]))),
317 |                                'aa': T.toScalars(dirichlet(ones([2,1]))),
318 |                                'aaa': T.toScalars(dirichlet(ones([2,1])))}
319 |   
320 |   var vendingMachine = function(state, action) {
321 |     return categorical({vs: ['bagel', 'cookie'], ps: buttonsToOutcomeProbs[action]})
322 |   }
323 |   
324 |   var goal = categorical({vs: ['bagel', 'cookie'], ps: [.5, .5]})
325 |   var goalSatisfied = function(outcome) {return outcome == goal}
326 |   var chosenAction = sample(chooseAction(goalSatisfied, vendingMachine, 'state'))
327 |   // we saw Sally press `aa`
328 |   condition(chosenAction == 'aa')
329 |   // we saw a cookie came out when Sally pressed `aa`
330 |   condition(vendingMachine('state', 'aa') == 'cookie')
331 | 
332 |   return {goal: goal, 
333 |           once: buttonsToOutcomeProbs['a'][1],
334 |           twice: buttonsToOutcomeProbs['aa'][1]};
335 | })
336 | 
337 | ```
338 | 
339 | Probability of cookie given `a` was pushed:
340 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/cookie_given_a.jpg)
341 | 
342 | Probability of cookie given `a` was pushed:
343 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/cookie_given_aa.jpg)
344 | 
345 | > Notice the U-shaped distribution for the effect of pressing the button just once.
346 | > 
347 | How do we explain this?
348 | 
349 | Note that the probability that she wanted a cookie is 0.65. Thus there is a 0.35 chance that she did not want a cookie. But we saw her press `aa`. Thus it is likely that pressing `aa` gives a bagel. Thus pressing `a` gives a cookie.
350 | 
351 | > This very complex (and hard to describe!) inference comes naturally from joint inference of goals and knowledge.
352 | > 
353 | 
354 | ---
355 | 
356 | ### Communication and Language
357 | 
358 | __Key idea__
359 | We have two entities communicating to each other. First entity tries to infer how second entity thinks. Second entity tries to infer how first entity thinks. They keep on updating their beliefs. 
360 | 
361 | However this is an infinite loop. To prevent this we find a way to get out of it after a certain depth.
362 | 
363 | Say we have two dice with the probabilities as shown:
364 | 
365 | ```javascript
366 | var dieToProbs = function(die) {
367 |   return (die == 'A' ? [0, .2, .8] :
368 |           die == 'B' ? [.1, .3, .6] :
369 |           'uhoh')
370 | }
371 | ```
372 | 
373 | > On each round the “teacher” pulls a die from a bag of weighted dice, and has to communicate to the “learner” which die it is by showing them faces of the die. Both players are familiar with the dice and their weights.
374 | > 
375 | Teacher has a prior over sides. Student has a prior over dice.
376 | ```javascript
377 | var sidePrior = Categorical({vs: ['red', 'green', 'blue'], ps: [1/3, 1/3, 1/3]})
378 | var diePrior = Categorical({vs: ['A', 'B'], ps: [1/2, 1/2]})
379 | ```
380 | 
381 | A simple roll of a die:
382 | ```javascript
383 | var roll = function(die) {return categorical({vs: ['red', 'green', 'blue'], ps: dieToProbs(die)})}
384 | ```
385 | 
386 | This is how teacher and student communicate and infer about each other:
387 | ```javascript
388 | var teacher = function(die, depth) {
389 |   return Infer({method: 'enumerate'}, function() {
390 |     var side = sample(sidePrior);
391 |     condition(sample(learner(side, depth)) == die)
392 |     return side
393 |   })
394 | }
395 | 
396 | var learner = function(side, depth) {
397 |   return Infer({method: 'enumerate'}, function() {
398 |     var die = sample(diePrior);
399 |     condition(depth == 0 ? 
400 |               side == roll(die) :
401 |               side == sample(teacher(die, depth - 1)))
402 |     return die
403 |   })
404 | }
405 | ```
406 | 
407 | > assume that there are two dice, A and B, which each have three sides (red, green, blue) that have weights like so:
408 | > 
409 | ![](https://probmods.org/assets/img/pedagogy-pic.png)
410 | 
411 | Now let's say the learner is shown a `green` side.
412 | 
413 | - If the depth is 0, the learner will infer it came from `B` since it has a higher chance of showing green.
414 | - If the depth is increased, the learner will infer it came from `A`. Why? `because “if the teacher had meant to communicate B, they would have shown the red side because that can never come from A.”`
415 | 
416 | ### ToDo: Read [this paper](https://langcog.stanford.edu/papers/SGF-perspectives2012.pdf) to get some more examples of this kind of learning.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 3_ Conditioning.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 3: Conditioning
  2 | 
  3 | > Much of cognition can be understood in terms of conditional inference. In its most basic form, causal attribution is conditional inference: given some observed effects, what were the likely causes? Predictions are conditional inferences in the opposite direction: given that I have observed some cause, what are its likely effects?
  4 | 
  5 | Inference can be done in various ways. The most basic way is rejection sampling:
  6 | 
  7 | ```javascript
  8 | var model = function () {
  9 |     var A = flip()
 10 |     var B = flip()
 11 |     var C = flip()
 12 |     var D = A + B + C
 13 |     condition(D >= 2)
 14 |     return A
 15 | }
 16 | var dist = Infer({method: 'rejection', samples: 100}, model)
 17 | viz(dist)
 18 | ```
 19 | 
 20 | Another way is to enumerate all possibilities and use Bayes theorem.
 21 | 
 22 | > In the case of a WebPPL `Infer` statement with a `condition`, `A = a` will be the “event” that the return value is `a` while `B = b` will be the event that the value passed to condition is `true`. Because each of these is a regular (unconditional) probability, they and their ratio can often be computed exactly using the rules of probability. In WebPPL the inference method 'enumerate' attempts to do this calculation (by first enumerating all the possible executions of the model):
 23 | 
 24 | ```javascript
 25 | var model = function () {
 26 |     var A = flip()
 27 |     var B = flip()
 28 |     var C = flip()
 29 |     var D = A + B + C
 30 |     condition(D >= 2)
 31 |     return A
 32 | }
 33 | var dist = Infer({method: 'enumerate'}, model)
 34 | viz(dist)
 35 | ```
 36 | 
 37 | ### Other ways to implement `Infer`
 38 | > Much of the difficulty of implementing the WebPPL language (or probabilistic models in general) is in finding useful ways to do conditional inference—to implement `Infer`.
 39 | 
 40 | ### Conditions and observations
 41 | You can add complex propositions to `condition` without assigning a variable to them:
 42 | 
 43 | ```javascript
 44 | var dist = Infer(
 45 |   function () {
 46 |     var A = flip()
 47 |     var B = flip()
 48 |     var C = flip()
 49 |     condition(A + B + C >= 2)
 50 |     return A
 51 | });
 52 | viz(dist)
 53 | ```
 54 | 
 55 | > Using `condition` allows the flexibility to build complex random expressions like this as needed, making assumptions that are phrased as complex propositions, rather than simple observations. Hence the effective number of queries we can construct for most programs will not merely be a large number but countably infinite, much like the sentences in a natural language.
 56 | > 
 57 | 
 58 | This will run forever:
 59 | ```javascript
 60 | var model = function(){
 61 |   var trueX = sample(Gaussian({mu: 0, sigma: 1}))
 62 |   var obsX = sample(Gaussian({mu: trueX, sigma: 0.1}))
 63 |   condition(obsX == 0.2)
 64 |   return trueX
 65 | }
 66 | viz(Infer({method: 'rejection', samples:1000}, model))
 67 | ```
 68 | 
 69 | Instead of `condition`ing, it is better to `observe`:
 70 | ```javascript
 71 | var model = function(){
 72 |   var trueX = sample(Gaussian({mu: 0, sigma: 1}))
 73 |   observe(Gaussian({mu: trueX, sigma: 0.1}), 0.2)
 74 |   return trueX
 75 | }
 76 | viz(Infer({method: 'rejection', samples:1000, maxScore: 2}, model))
 77 | ```
 78 | 
 79 | `observe` does the same thing as shown below but in a much more efficient manner:
 80 | 
 81 | ```javascript
 82 | var x = sample(distribution);
 83 | condition(x === value);
 84 | return x;
 85 | ```
 86 | 
 87 | > In particular, it’s essential to use `observe` to condition on the value drawn from a _continuous_ distribution.
 88 | 
 89 | ### Factors
 90 | 
 91 | > In WebPPL, `condition` and `observe` are actually special cases of a more general operator: `factor`. Whereas `condition` is like making an assumption that must be `true`, then `factor` is like making a _soft_ assumption that is merely preferred to be `true`.
 92 | > 
 93 | 
 94 | For example if we use `observe`, `A` will always be `true`:
 95 | 
 96 | ```javascript
 97 | var dist = Infer(
 98 |   function () {
 99 |     var A = flip(0.001)
100 |     condition(A)
101 |     return A
102 | });
103 | viz(dist)
104 | ```
105 | 
106 | But if we use `factor`, we can _tweak_ how often we want `A` to be true:
107 | 
108 | ```javascript
109 | var dist = Infer(
110 |   function () {
111 |     var A = flip(0.01)
112 |     factor(A?10:0)
113 |     return A
114 | });
115 | viz(dist)
116 | ```
117 | 
118 | `factor(A?10:0)` gives a much higher preference to `true` than `factor(A?5:0)` for example.
119 | 
120 | `factor(A?x:y)` technically means:
121 | 
122 | `P(A=true) = e^x / (e^x + e^y)`
123 | 
124 | ### Reasoning about Tug of War
125 | 
126 | An interesting question posed here is:
127 | > For instance, how likely is it that Bob is strong, given that he’s been on a series of winning teams?
128 | > 
129 | 
130 | Some points to reflect on:
131 | * If team [Bob, randomly chosen player] almost always defeats [Tom, Y], is Bob strong or is [Tom, Y] weak? Do these beliefs change as the number of matches between them goes up?
132 | * Does the likelihood of Bob being strong go up if the team he is in defeats _different_ teams?
133 | 
134 | 
135 | 


--------------------------------------------------------------------------------
/notes/probmods/Chapter 4_ Causal and Statistical Dependence.md:
--------------------------------------------------------------------------------
 1 | ## Chapter 4: Causal and Statistical Dependence
 2 | 
 3 | ### Causal Dependence
 4 | > expression A depends on expression B if it is __ever__ necessary to evaluate B in order to evaluate A
 5 | > 
 6 | 
 7 | What about an expression like:
 8 | ```
 9 | A = C ? B + 2 : 5
10 | ```
11 | 
12 | Does `A` depend on `B`? Answer is `only in certain contexts`.
13 | 
14 | Note that `A`, `B` and `C` are evaluations of a function. This incorporates another level of subtlety:
15 | 
16 | `a specific evaluation of A might depend on a specific evaluation of B`
17 | 
18 | > However, note that if a specific evaluation of A depends on a specific evaluation of B, then any other specific evaluation of A will depend on some specific evaluation of B. Why?
19 | > 
20 | 
21 | My interpretation is that if `A=5` depends on `B=3`, `A != 5` will depend on some value `x` in `B=x`. This implies causation as a mapping from domain of `B` (the cause) to the domain of `A` (the effect).
22 | 
23 | ### Detecting Dependence Through Intervention
24 | 
25 | The idea is pretty straight-forward:
26 | > If we manipulate A, does B tend to change?
27 | > 
28 | 
29 | Note how `var A` is given a value directly.
30 | 
31 | >  If setting A to different values in this way changes the distribution of values of B, then B causally depends on A.
32 | 
33 | ```javascript
34 | var BdoA = function(Aval) {
35 |   return Infer({method: 'enumerate'}, function() {
36 |     var C = flip()
37 |     var A = Aval //we directly set A to the target value
38 |     var B = A ? flip(.1) : flip(.4)
39 |     return {B: B}
40 |   })
41 | }
42 | 
43 | viz(BdoA(true))
44 | viz(BdoA(false))
45 | ```
46 | 
47 | Another example:
48 | 
49 | ```javascript
50 | var cold = flip(0.02)
51 | 
52 | var cough = (cold && flip(0.5)) || (lungDisease && flip(0.5)) || flip(0.001)
53 | ```
54 | 
55 | You can set `cold = true (or false)` manually and see if it changes the distribution of `cough` (it does). But if you set `cough = true (or false)`, it does not change the distribution of `cold`.
56 | 
57 | > treating the symptoms of a disease directly doesn’t cure the disease (taking cough medicine doesn’t make your cold go away), but treating the disease does relieve the symptoms.
58 | > 
59 | 
60 | ### Statistical Dependence
61 | 
62 | It simply means
63 | > learning information about A tells us something about B, and vice versa.
64 | > 
65 | 
66 | > causal dependencies give rise to statistical dependencies
67 | > 
68 | 
69 | A simple example:
70 | ```javascript
71 | var BcondA = function(Aval) {
72 |   return Infer({method: 'enumerate'}, function() {
73 |     var C = flip()
74 |     var A = flip()
75 |     var B = A ? flip(.1) : flip(.4)
76 |     condition(A == Aval) //condition on new information about A
77 |     return {B: B}
78 |   })
79 | }
80 | 
81 | viz(BcondA(true))
82 | viz(BcondA(false))
83 | ```
84 | 
85 | > Because the two distributions on `B` (when we have different information about `A`) are different, we can conclude that B statistically depends on `A`.
86 | > 
87 | 
88 | Two variables can be statistically dependent even though they is no causal dependence between them. For example, if `A` and `B` are leaves of an inverted V graphical model (`Ʌ`), where the root is `C` then `A` and `B` are not causally related but statistically related.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 5_ Conditional dependence.md:
--------------------------------------------------------------------------------
 1 | ## Chapter 5: Conditional dependence
 2 | 
 3 | Two forms of dependence are explored in detail:
 4 | 
 5 | a) __Screening off__: The graphical model looks like `• ← • → •` or `• → • → •`. It is called so because if the variable(s) in the middle node are observed, the corner variables become independent. 
 6 | 
 7 | > Screening off is a purely statistical phenomenon. For example, consider the the causal chain model, where A directly causes C, which in turn directly causes B. Here, when we observe C – the event that mediates an indirect causal relation between A and B – A and B are still causally dependent in our model of the world: it is just our beliefs about the states of A and B that become uncorrelated. There is also an analogous causal phenomenon. If we can actually manipulate or intervene on the causal system, and set the value of C to some known value, then A and B become both statistically and causally independent (by intervening on C, we break the causal link between A and C).
 8 | 
 9 | b) __Explaning away__: The graphical model looks like `• → • ← •`. If the bottom variable is observed, previously independent variables (the two roots at the top) become dependent. 
10 | 
11 | > The most typical pattern of explaining away we see in causal reasoning is a kind of anti-correlation: the probabilities of two possible causes for the same effect increase when the effect is observed, but they are conditionally anti-correlated, so that observing additional evidence in favor of one cause should lower our degree of belief in the other cause. (This pattern is where the term explaining away comes from.)
12 | > 
13 | 
14 | ### Non-monotonic Reasoning
15 | 
16 | > In formal logic, a theory is said to be monotonic if adding an assumption (or formula) to the theory never reduces the set of conclusions that can be drawn. 
17 | > For instance, if I tell you that Tweety is a bird, you conclude that he can fly; if I now tell you that Tweety is an ostrich you retract the conclusion that he can fly.
18 | > 
19 | 
20 | > Another way to think about monotonicity is by considering the trajectory of our belief in a specific proposition, as we gain additional relevant information.
21 | > 
22 | 
23 | #### Traditional logic
24 | > there are only three states of belief: true, false, and unknown. As we learn more about the world, maintaining logical consistency requires that our belief in any proposition only move from unknown to true or false. __That is our “confidence” in any conclusion only increases.__
25 | > 
26 | 
27 | #### Probabilistic approach
28 | > We can think of confidence as a measure of how far our beliefs are from a uniform distribution. Our confidence in a proposition can both increase and decrease. 
29 | > 
30 | 
31 | ### Example: Trait Attribution
32 | 
33 | If a student fails in an example, it could be due to two reasons:
34 | 1. Non-personal reason: the exam was not fair. It was too difficult.
35 | 2. Person reason: the student did not do their homework.
36 | 
37 | ```javascript
38 | var examPosterior = Infer({method: 'enumerate'}, function() {
39 |   var examFair = flip(.8)
40 |   var doesHomework = flip(.8)
41 |   var pass = flip(examFair ?
42 |                   (doesHomework ? 0.9 : 0.4) :
43 |                   (doesHomework ? 0.6 : 0.2))
44 |   condition(!pass)
45 |   return {doesHomework: doesHomework, examFair: examFair}
46 | })
47 | 
48 | viz.marginals(examPosterior)
49 | viz(examPosterior)
50 | ```
51 | 
52 | > whether a student does homework has a greater influence on passing the test than whether the exam is fair. This in turns means that when inferring the cause of a failed exam, the model tends to attribute it to the person property (not doing homework) over the situation property (exam being unfair). This asymmetry is an example of the fundamental attribution bias ([Ross, 1977](https://scholar.google.com/scholar?q=%22The%20intuitive%20psychologist%20and%20his%20shortcomings%3A%20Distortions%20in%20the%20attribution%20process%22)): we tend to attribute outcomes to personal traits rather than situations.
53 | > 
54 | 
55 | The above example can be modified to gather evidence from several students and several exams:
56 | 
57 | ```javascript
58 | var examPosterior = Infer({method: 'enumerate'}, function() {
59 |   var examFair = mem(function(exam){return flip(0.8)})
60 |   var doesHomework = mem(function(student){return flip(0.8)})
61 | 
62 |   var pass = function(student, exam) {
63 |     return flip(examFair(exam) ?
64 |                 (doesHomework(student) ? 0.9 : 0.4) :
65 |                 (doesHomework(student) ? 0.6 : 0.2))
66 |   };
67 | 
68 |   condition(!pass('bill', 'exam1'))
69 | 
70 |   return {doesHomework: doesHomework('bill'), examFair: examFair('exam1')}
71 | })
72 | 
73 | viz.marginals(examPosterior)
74 | viz(examPosterior)
75 | ```
76 | 
77 | ### Visual Perception of Surface Color
78 | 
79 | Explaining away can be used to explain an optical illusion: `the checker shadow illusion`. In the figure below, squares A and B are the same shade of gray.
80 | 
81 | ![Source: https://probmods.org](https://probmods.org/assets/img/Checkershadow_illusion_small.png)
82 | 
83 | The key idea is that the expression:
84 | `var luminance = reflectance * illumination`
85 | induces a graphical model like so:
86 | `reflectance → luminance ← illumination`.
87 | 
88 | > The visual system has to determine what proportion of the `luminance` is due to `reflectance` and what proportion is due to the `illumination` of the scene.
89 | 
90 | When the cylinder is present, we perceive the `illumination` to be low. Hence the observed `luminance` is `explained away` by a higher perceived `reflectance`.
91 | 
92 | > The presence of the cylinder is providing evidence that the illumination of square B is actually less than that of square A (because it is expected to cast a shadow). Thus we perceive square B as having higher reflectance since its luminance is identical to square A and we believe there is less light hitting it.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 6_ Bayesian data analysis.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 6: Bayesian data analysis
  2 | 
  3 | Bayesian cognitive model and Bayesian data analysis are the same thing under the hood; they just have different contexts.
  4 | 
  5 | > If the generative model is a hypothesis about a person’s model of the world, then we have a _Bayesian cognitive model_ – the main topic of this book. If the generative model is instead the scientist’s model of how the data are generated, then we have _Bayesian data analysis_.
  6 | > 
  7 | 
  8 | ### Two competing hypotheses
  9 | Consider an example of a spinning coin (as opposed to a flipping coin).
 10 | 
 11 | - Scientists believe the probability of heads up is uniform [0,1]
 12 | - People _might_ believe it is the same as flipping an unbiased coin i.e. 0.5
 13 | 
 14 | Which hypothesis is correct? How do we design an experiment? A Bayesian way of answering this question is: `given some data, what is the probability that it was generated by hypothesis one?`
 15 | 
 16 | Imagine we ran the “predict the next 10” experiment with 20 participants, and observed the following responses:
 17 | 
 18 | `var experimentalData = [9,8,7,5,4,5,6,7,9,4,8,7,8,3,9,6,5,7,8,5]`
 19 | 
 20 | ```javascript
 21 | var opts = {method: "rejection", samples: 5000}
 22 | 
 23 | // hypothesis one
 24 | var observerModel =  Infer(opts, function(){
 25 |   var p = uniform(0, 1)
 26 |   var coinSpinner = Binomial({n:20, p:p})
 27 |   observe(coinSpinner, 15)
 28 |   return binomial(p, 10)
 29 | })
 30 | viz(observerModel)
 31 | 
 32 | // hypothesis two
 33 | var skepticalModel =  Infer(opts, function(){
 34 |   var sameAsFlipping = flip(0.5)
 35 |   var p = sameAsFlipping ? 0.5 : uniform(0, 1)
 36 |   var coinSpinner = Binomial({n:20, p:p})
 37 |   observe(coinSpinner, 15)
 38 |   return binomial(p, 10)
 39 | })
 40 | ///
 41 | viz(skepticalModel)
 42 | 
 43 | var experimentalData = [9,8,7,5,4,5,6,7,9,4,8,7,8,3,9,6,5,7,8,5]
 44 | 
 45 | // package the models up in an Object (for ease of reference)
 46 | var modelObject = {observerModel: observerModel, skepticalModel: skepticalModel};
 47 | 
 48 | var scientistModel = function(){
 49 |   var theBetterModel_name = flip(0.5) ? "observerModel" : "skepticalModel"
 50 |   var theBetterModel = modelObject[theBetterModel_name]
 51 |   map(function(d){ observe(theBetterModel, d) }, experimentalData)
 52 |   return {betterModel: theBetterModel_name}
 53 | }
 54 | 
 55 | var modelPosterior = Infer({method: "enumerate"}, scientistModel)
 56 | 
 57 | viz(modelPosterior)
 58 | ```
 59 | 
 60 | Note that in this case the observed data came from people who tend to support hypothesis two. Hence in the above program's result, hypothesis two will be supported more.
 61 | 
 62 | ---
 63 | 
 64 | For a given Bayesian model (together with data), there are four conceptually distinct distributions of interest:
 65 | 1. The __prior distribution__ over parameters captures our initial state of knowledge (or, our beliefs) about the values that the latent parameters could have.
 66 | 2. The __posterior distribution__ over parameters captures what we know about the latent parameters having updated our beliefs with the evidence provided by data.
 67 | 3. The __prior predictive distribution__ tells us what data to expect, given our model and our initial beliefs about the parameters. The prior predictive is a distribution over data, and gives the relative probability of different observable outcomes before we have seen any data.
 68 | 4. The __posterior predictive distribution__ tells us what data to expect, given the same model we started with, but with beliefs that have been updated by the observed data. The posterior predictive is a distribution over data, and gives the relative probability of different observable outcomes, after some data has been seen.
 69 | 
 70 | ### Posterior prediction and model checking
 71 | 
 72 | > The posterior predictive distribution describes what data you should expect to see, given the model you’ve assumed and the data you’ve collected so far. If the model is a good description of the data you’ve collected, then the model shouldn’t be surprised if you got the same data by running the experiment again.
 73 | > 
 74 | > It’s natural then to use the posterior predictive distribution to examine the descriptive adequacy of a model. If these predictions do not match the data _already seen_ (i.e., the data used to arrive at the posterior distribution over parameters), the model is descriptively inadequate.
 75 | > 
 76 | 
 77 | __Question: we arrive at the posterior distribution using the data _already seen_. Then under what circumstances can the posterior predictive distribution different from the distribution of data _already seen_?__
 78 | 
 79 | Here is an example of when that can happen:
 80 | 
 81 | You collect data from ten students each from two schools whether or not they are likely to help other students.
 82 | 
 83 | * 10/10 students from school one will help other students
 84 | * 0/10 students from school two wilh help other students
 85 | 
 86 | - Prior belief: The number of students willing to help other students is uniformly distributed.
 87 | 
 88 | The posterior distribution of parameter will peak around 0.5. But the posterior predictive distribution generated from this posterior will look nothing like the observed data. Thus, the model is **descriptively inadequate**.
 89 | 
 90 | ---
 91 | 
 92 | ### Simple vs Complex Hypotheses
 93 | 
 94 | Let's say you toss a coin 20 times and get 7 heads. You want to know if the coin is unbiased (perfectly random) or not. Let's concretely define the terms biased and unbiased.
 95 | 
 96 | - Unbiased coin: p(head) = 0.5
 97 | - Biased coin: p(head) = Uniform(0,1)
 98 | 
 99 | `p(head) = 0.5` is a simple model. `p(head) = Uniform(0,1)` is more complex. It means **each time** a coin is tossed, it will turn up heads with any probability between 0 and 1. All probabilities are equally likely. Note that `p(head) = 0.5` is just a subset of this complex model.
100 | 
101 | Which model do you think is more likely to generate the observed data (7/20 heads)?
102 | 
103 | One might think it is the biased coin but it is actually the unbiased coin.
104 | 
105 | ```javascript
106 | var k = 7, n = 20;
107 | 
108 | var compareModels = function() {
109 | 
110 |   // binary decision variable for which hypothesis is better
111 |   var x = flip(0.5) ? "simple" : "complex";
112 |   var p = (x == "simple") ? 0.5 : uniform(0, 1);
113 | 
114 |   observe(Binomial({p: p, n: n}), k);
115 | 
116 |   return {model: x}
117 | }
118 | 
119 | var opts = {method: "rejection", samples: 2000};
120 | print("We observed " + k + " successes out of " + n + " attempts")
121 | var modelPosterior = Infer(opts, compareModels);
122 | viz(modelPosterior)
123 | ```
124 | 
125 | > Shouldn’t the more general model always be better? If we’re at a track, and you bet on horse A, and I bet on horse A and B, aren’t I strictly in a better position than you? The answer is no, and the reason has to do with our metric for winning. Intuitively, we don’t care whether your horse won or not, but how much money you win. How much money you win depends on how much money you bet, and the rule is, when we go to track, we have the same amount of money.
126 | > 
127 | > In probabilistic models, our money is probabilities. Each model must allocate its probability so that it sums to 1. So my act of betting on horse A and horse B actually requires me to split my money (say, betting 50 / 50 on each). On the other hand, you put all your money on horse A (100 on A, 0 on B). If A wins, you will gain more money because you put more money down.
128 | > 
129 | > This idea is called the principle of parsimony or Occam’s razor
130 | > 
131 | 
132 | ---
133 | 
134 | ### Bayes factor and Savage-Dickey method
135 | 
136 | Bayes factor is simple the ratio A / B where 
137 | A = Marginal likelihood of data under the hypothesis in consideration
138 | B = Marginal likelihood of data under all hypotheses
139 | 
140 | ```javascript
141 | var k = 7, n = 20;
142 | 
143 | var simpleLikelihood = Math.exp(Binomial({p: 0.5, n: n}).score(k))
144 | 
145 | var complexModel = Infer({method: "forward", samples: 10000}, function(){
146 |   var p = uniform(0, 1);
147 |   return binomial(p, n)
148 | })
149 | var complexLikelihood = Math.exp(complexModel.score(k))
150 | 
151 | var bayesFactor_01 = simpleLikelihood / complexLikelihood
152 | bayesFactor_01
153 | ```
154 | 
155 | As you can see, B can be hard to calculate.
156 | 
157 | Savage-Dickey method comes to the rescue:
158 | 
159 | > The Bayes factor can also be obtained by considering only the more complex hypothesis. What you do is look at the distribution over the parameter of interest (here, p) at the point of interest (here, p = 0.5). Dividing the probability density of the posterior by the density of the prior (of the parameter at the point of interest) also gives you the Bayes Factor!
160 | > 
161 | 
162 | ```javascript
163 | var k = 7, n = 20;
164 | 
165 | var complexModelPrior = Infer({method: "forward", samples: 10000}, function(){
166 |   var p = uniform(0, 1);
167 |   return p
168 | })
169 | 
170 | var complexModelPosterior = Infer({method: "rejection", samples: 10000}, function(){
171 |   var p = uniform(0, 1);
172 |   observe(Binomial({p: p, n: n}), k);
173 |   return p
174 | })
175 | 
176 | var savageDickeyDenomenator = expectation(complexModelPrior, function(x){return Math.abs(x-0.5)<0.05})
177 | var savageDickeyNumerator = expectation(complexModelPosterior, function(x){return Math.abs(x-0.5)<0.05})
178 | var savageDickeyRatio = savageDickeyNumerator / savageDickeyDenomenator
179 | print( savageDickeyRatio )
180 | ```
181 | 
182 | > Note that we have approximated the densities by looking at the expectation that p is within 0.05 of the target value p=0.5.
183 | > 
184 | 
185 | Here we only consider the complex model (which includes the simple model). We only look at the behavior of complex model before and after the data is observed at our point of interest (which was represented earlier by our simple model).
186 | 
187 | ---
188 | 
189 | ### Single Regression
190 | 
191 | The concepts and ideas explained above are used in another example: Regression using Bayesian inference.
192 | 
193 | Given the data of a tournament of tug of war which contains wins and strengths of players, we build a model like so:
194 | 
195 | strength = β<sub>0</sub> + β<sub>1</sub> * n<sub>wins</sub>
196 | 
197 | The observed data of strength is generated from `strength` like so:
198 | 
199 | observed strength = N(strength, σ)
200 | 
201 | Here β<sub>0</sub>, β<sub>1</sub> and σ are unknowns. We define a prior over these parameters, observe the data and get a posterior.
202 | 
203 | ```javascript
204 | var uniformKernel = function(prevVal) {
205 |   return Uniform({a: prevVal - 0.2, b: prevVal + 0.2});
206 | };
207 | 
208 | var singleRegression = function(){
209 |   // parameters of a simple linear regression
210 |   var b0 = sample(Uniform({a: -1, b: 1}), {driftKernel: uniformKernel})
211 |   var b1 = sample(Uniform({a: -1, b: 1}), {driftKernel: uniformKernel})
212 |   var sigma = sample(Uniform({a: 0, b: 2}), {driftKernel: uniformKernel})
213 | 
214 |   map(function(d){
215 | 
216 |     // linear regression formula
217 |     var predicted_y = b0 + d.nWins*b1
218 | 
219 |     observe(Gaussian({mu: predicted_y, sigma: sigma}), d.ratingZ)
220 | 
221 |   }, towData)
222 | 
223 |   return {b0: b0, b1: b1, sigma: sigma}
224 | }
225 | 
226 | var nSamples = 2500
227 | var opts = { method: "MCMC", callbacks: [editor.MCMCProgress()],
228 |              samples: nSamples, burn: nSamples/2 }
229 | 
230 | var posterior = Infer(opts, singleRegression)
231 | ```
232 | 
233 | As before, we want to find out whether this model is descriptively adequate. This is done by generating samples from this model and comparing them with actual data.
234 | 
235 | It is seen that the generated data has 91% correlation with the observed data.


--------------------------------------------------------------------------------
/notes/probmods/Chapter 7_ Algorithms for inference.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 7: Algorithms for inference
  2 | 
  3 | ### Markov Chain Monte Carlo (MCMC)
  4 | The idea is to find a Markov Chain whose stationary distribution is the same as the conditional distrbution we want to estimate. Eg. we want to estimate a geometric distribution:
  5 | 
  6 | ```javascript
  7 | var p = .7
  8 | 
  9 | var geometric = function(p){
 10 | 	return ((flip(p) == true) ? 1 : (1 + geometric(p)))
 11 | }
 12 | 
 13 | var post = Infer({method: 'MCMC', samples: 25000, lag: 10, model: function(){
 14 | 	var mygeom = geometric(p);
 15 | 	condition(mygeom>2)
 16 | 	return(mygeom)
 17 | 	}
 18 | })
 19 | 
 20 | viz.table(post)
 21 | ```
 22 | 
 23 | The distribution of the above variable is the same as the stationary distribution of the Markov Chain shown here:
 24 | 
 25 | ```javascript
 26 | var p = 0.7
 27 | 
 28 | var transition = function(state){
 29 | 	return (state == 3 ? sample(Categorical({vs: [3, 4], ps: [(1 - 0.5 * (1 - p)), (0.5 * (1 - p))]})) :
 30 | 						    sample(Categorical({vs: [(state - 1), state, (state + 1)], ps: [0.5, (0.5 - 0.5 * (1 - p)), (0.5 * (1 - p))]})))
 31 | }
 32 | 
 33 | var chain = function(state, n){
 34 | 	return (n == 0 ? state : chain(transition(state), n-1))
 35 | }
 36 | 
 37 | var samples = repeat(5000, function() {chain(3, 250)})
 38 | viz.table(samples)
 39 | ```
 40 | 
 41 | > As we have already seen, each successive sample from a Markov chain is highly correlated with the prior state.
 42 | > 
 43 | 
 44 | This can take long if the initial state was not bad. However in the long run these local correlations disappear. But we will get a large collection of samples. In order to get samples which are not strongly connected to each other, we can every `n`th sample.
 45 | 
 46 | > WebPPL provides an option for MCMC called `'lag'`.
 47 | > 
 48 | > Fortunately, it turns out that for any given (condition) distribution we might want to sample from, there is at least one Markov chain with a matching stationary distribution.
 49 | > 
 50 | 
 51 | ---
 52 | 
 53 | ### Metropolis-Hastings
 54 | > To create the necessary transition function, we first create a proposal distribution, `q(x→x′)`. A common option for continuous state spaces is to sample a new state from a multivariate Gaussian centered on the current state.
 55 | > 
 56 | 
 57 | Once you have defined `q` and `p(x)`, the target distribution, calculate the ratio `A / B` where:
 58 | - `A = p(x′) * q(x′ → x)` &
 59 | - `B = p(x) * q(x → x′)`
 60 | 
 61 | If `A/B` is bigger than 1, round it down to 1. Flip a coin with a probability of `A/B` of showing heads.
 62 | 
 63 | If heads shows up, transition from `x` to `x′`. Otherwise, stay in the current state `x`.
 64 | 
 65 | #### Balance condition and detailed balance condition
 66 | Balance condition is achieved when a Markov chain reaches a stationary state:
 67 | > p(x′) = ∑<sub>x</sub>p(x) π(x → x′) where π is the transition distribution.
 68 | >
 69 |  
 70 | A stronger condition is the detailed balance condition:
 71 | > p(x) π(x → x′) = p(x′) π(x′ → x)
 72 | > 
 73 | It is stronger in the sense that detailed balance condition implies balance condition.
 74 | 
 75 | It can be shown that MH algorithm gives a transition distribution π(x → x′) that satisfies detailed balance equation.
 76 | 
 77 | Let's look at an example:
 78 | 
 79 | ```javascript
 80 | var p = 0.7
 81 | 
 82 | //the target distribution (not normalized):
 83 | //prob = 0 if x condition is violated, otherwise proportional to geometric distribution
 84 | 
 85 | // we want to sample from this distribution
 86 | // it is not easy to see what this distribution looks like just by looking at the formula below
 87 | // so how do we sample from it? This is where MH algorithm helps us
 88 | var target_dist = function(x){
 89 |   return (x < 3 ? 0 : (p * Math.pow((1-p),(x-1))))
 90 | }
 91 | 
 92 | // the proposal function and distribution,
 93 | // here we're equally likely to propose x+1 or x-1.
 94 | 
 95 | // this function decides where to go in the next state
 96 | var proposal_fn = function(x){
 97 |   return (flip() ? x - 1 : x + 1)
 98 | }
 99 | var proposal_dist = function (x1, x2){
100 |   return 0.5
101 | }
102 | 
103 | // the MH recipe:
104 | var accept = function (x1, x2){
105 |   let p = Math.min(1, (target_dist(x2) * proposal_dist(x2, x1)) / (target_dist(x1) * proposal_dist(x1,x2)))
106 |   return flip(p)
107 | }
108 | var transition = function(x){
109 |   // decide where to go in the next step
110 |   let proposed_x = proposal_fn(x)
111 |   // decide whether to go or not
112 |   return (accept(x, proposed_x) ? proposed_x : x)
113 | }
114 | 
115 | //the MCMC loop:
116 | var mcmc = function(state, iterations){
117 |   return ((iterations == 1) ? [state] : mcmc(transition(state), iterations-1).concat(state))
118 | }
119 | 
120 | var chain = mcmc(3, 10000) // mcmc for conditioned geometric
121 | ```
122 | 
123 | ---
124 | 
125 | ### Hamiltonian Monte Carlo
126 | 
127 | Sometimes the MHMC will get stuck if it cannot find states which have a high probability (in other words the ratio `A/B` is really small.) For example if we want to find ten numbers randomly sampled from `uniform(0,1)` and we use a Gaussian distribution for transition probabilities it will likely get stuck:
128 | 
129 | ```javascript
130 | var constrainedSumModel = function() {
131 |   var xs = repeat(10, function() {
132 |     return uniform(0, 1);
133 |   });
134 |   observe(Gaussian({mu: 5, sigma: 0.005}), sum(xs));
135 |   return map(bin, xs);
136 | };
137 | ```
138 | 
139 | The acceptance ratio in this case is 1-2%.Hamiltonian Monte Carlo solves this problem by calcuating `the gradient of the posterior with respect to the random choices made by the program`. The book does not explain very well how it works.
140 | 
141 | It also does not go into detail how particle filter works.
142 | 
143 | ---
144 | 
145 | ### Variational Inference
146 | 
147 | In contrast to non-parametric methods mentioned above, VI is a parametric method. __Mean-field variational inference__ tries to find an optimized set of parameters by `approximating the posterior with a product of independent distributions (one for each random choice in the program).`
148 | 
149 | ```javascript
150 | var trueMu = 3.5
151 | var trueSigma = 0.8
152 | 
153 | var data = repeat(100, function() { return gaussian(trueMu, trueSigma)})
154 | 
155 | var gaussianModel = function() {
156 |   var mu = gaussian(0, 20)
157 |   var sigma = Math.exp(gaussian(0, 1)) // ensure sigma > 0
158 |   map(function(d) {
159 |     observe(Gaussian({mu: mu, sigma: sigma}), d)
160 |   }, data)
161 |   return {mu: mu, sigma: sigma}
162 | };
163 | ```
164 | 
165 | > By default, it takes the given arguments of random choices in the program (in this case, the arguments `(0, 20)` and `(0, 1)` to the two gaussian random choices used as priors) and replaces with them with free parameters which it then optimizes to bring the resulting distribution as close as possible to the true posterior... __the mean-field approximation necessarily fails to capture correlation between variables.__
166 | > 
167 | 
168 | For example it fails to capture the correlation between `reflectance` and `illumination`.
169 | 
170 | ```javascript
171 | var observedLuminance = 3;
172 |                             
173 | var model = function() {
174 |   var reflectance = gaussian({mu: 1, sigma: 1})
175 |   var illumination = gaussian({mu: 3, sigma: 1})
176 |   var luminance = reflectance * illumination
177 |   observe(Gaussian({mu: luminance, sigma: 1}), observedLuminance)
178 |   return {reflectance: reflectance, illumination: illumination}
179 | }
180 | ```
181 | 
182 | 


--------------------------------------------------------------------------------
/notes/probmods/Chapter 9_ Learning as conditional inference.md:
--------------------------------------------------------------------------------
  1 | ## Chapter 9: Learning as conditional inference
  2 | 
  3 | ### Learning and the rate of learning
  4 | 
  5 | Let's say you see a series of heads when a coin is tossed. Your beliefs about the bias of the coin depend on two items:
  6 | 1. How likely it is to see a biased coin?
  7 | 2. How much data have you seen?
  8 | 
  9 | One can measure the rate of learning (when the inferred belief of a learner comes close to the actual fact that the coin is biased).
 10 | 
 11 | ```javascript
 12 | var fairnessPosterior = function(observedData) {
 13 |   return Infer({method: 'enumerate'}, function() {
 14 |     var fair = flip(0.999)
 15 |     var coin = Bernoulli({p: fair ? 0.5 : 0.95})
 16 |     var obsFn = function(datum){observe(coin, datum == 'h')}
 17 |     mapData({data: observedData}, obsFn)
 18 |     return fair
 19 |   })
 20 | }
 21 | ```
 22 | 
 23 | > If we set `fairPrior` to be 0.5, equal for the two alternative hypotheses, just 5 heads in a row are sufficient to favor the trick coin by a large margin. If `fairPrior` is 99 in 100, 10 heads in a row are sufficient. We have to increase `fairPrior` quite a lot, however, before 15 heads in a row is no longer sufficient evidence for a trick coin: even at `fairPrior` = 0.9999, 15 heads without a single tail still weighs in favor of the trick coin. This is because the evidence in favor of a trick coin accumulates exponentially as the data set increases in size; each successive h flip increases the evidence by nearly a factor of 2.
 24 | > 
 25 | 
 26 | ---
 27 | 
 28 | ### Independent and Exchangeable Sequences
 29 | Coin flips are i.i.d. This can be seen by conditioning the next flip result on the previous result.
 30 | 
 31 | Similarly the below program samples i.i.d:
 32 | 
 33 | ```javascript
 34 | var words = ['chef', 'omelet', 'soup', 'eat', 'work', 'bake', 'stop']
 35 | var probs = [0.0032, 0.4863, 0.0789, 0.0675, 0.1974, 0.1387, 0.0277]
 36 | var thunk = function() {return categorical({ps: probs, vs: words})};
 37 | ```
 38 | 
 39 | However the below function does not:
 40 | ```javascript
 41 | var words = ['chef', 'omelet', 'soup', 'eat', 'work', 'bake', 'stop']
 42 | var probs = (flip() ?
 43 |              [0.0032, 0.4863, 0.0789, 0.0675, 0.1974, 0.1387, 0.0277] :
 44 |              [0.3699, 0.1296, 0.0278, 0.4131, 0.0239, 0.0159, 0.0194])
 45 | var thunk = function() {return categorical({ps: probs, vs: words})};
 46 | ```
 47 | 
 48 | This is because `learning about the first word tells us something about the probs, which in turn tells us about the second word.`
 49 | 
 50 | The samples are not i.i.d but are `exchangeable`: `the probability of a sequence of values remains the same if permuted into any order`. 
 51 | 
 52 | `de Finetti’s theorem` says that, under certain technical conditions, any exchangeable sequence can be represented as follows, for some latentPrior distribution and observation function `f`:
 53 | 
 54 | ```javascript
 55 | var latent = sample(latentPrior)
 56 | var thunk = function() {return f(latent)}
 57 | var sequence = repeat(2,thunk)
 58 | ```
 59 | 
 60 | ### Polya's urn
 61 | 
 62 | > Imagine an urn that contains some number of white and black balls. On each step we draw a random ball from the urn, note its color, and return it to the urn along with another ball of that color.
 63 | > 
 64 | 
 65 | It can be shown that the distribution of samples is exchangeable: `bbw`, `bwb`, `wbb` have the same probability; `bww`, `wbw`, `wwb` as well.
 66 | 
 67 | > Because the distribution is exchangeable, we know that there must be an alterative representation in terms of a latent quantity followed by independent samples. The de Finetti representation of this model is:
 68 | > 
 69 | 
 70 | ```javascript
 71 | var urn_deFinetti = function(urn, numsamples) {
 72 |   var numWhite = sum(map(function(b){return b=='w'},urn))
 73 |   var numBlack = urn.length - numWhite
 74 |   var latentPrior = Beta({a: numWhite, b: numBlack})
 75 |   var latent = sample(latentPrior)
 76 |   return repeat(numsamples, function() {return flip(latent) ? 'b' : 'w'}).join("")
 77 | }
 78 | 
 79 | var urnDist = Infer({method: 'forward', samples: 10000},
 80 |                     function(){return urn_deFinetti(['b', 'w'],3)})
 81 | 
 82 | viz(urnDist)
 83 | ```
 84 | 
 85 | > We sample a shared latent parameter – in this case, a sample from a Beta distribution – generating the sequence samples independently given this parameter.
 86 | > 
 87 | 
 88 | ---
 89 | 
 90 | ### Ideal learners
 91 | 
 92 | A common pattern often used in building models is:
 93 | ```javascript
 94 | Infer({...}, function() {
 95 |   var hypothesis = sample(prior)
 96 |   // obsFn will usually contain an observe function
 97 |   var obsFn = function(datum){...uses hypothesis...}
 98 |   mapData({data: observedData}, obsFn)
 99 |   return hypothesis
100 | });
101 | ```
102 | 
103 | ---
104 | 
105 | ### Learning a continuous parameter
106 | 
107 | Here we see how a proper definition of priors is important to mimic how humans learn. 
108 | 
109 | - When a coin shows 7/10 heads, most humans will still believe it to be fair. But if we used `uniform(0,1)` as prior, we get 0.7 as MLE. Not the same as humans.
110 | - We can replace `uniform(0,1)` with `beta(10,10)`. But then even when the coin shows 100/100 heads, most humans will believe the coin always shows heads. But the model shows the bias to be 0.9 (instead of 1).
111 | - This can be remedied by having a small bias for a fair coin in the prior:
112 | 
113 | ```javascript
114 | var weightPosterior = function(observedData){
115 |   return Infer({method: 'MCMC', burn:1000, samples: 10000}, function() {
116 |     var isFair = flip(0.999)
117 |     var realWeight = isFair ? 0.5 : uniform({a:0, b:1})
118 |     var coin = Bernoulli({p: realWeight})
119 |     var obsFn = function(datum){observe(coin, datum=='h')}
120 |     mapData({data: observedData}, obsFn)
121 |     return realWeight
122 |   })
123 | }
124 | ```
125 | 
126 | > This model stubbornly believes the coin is fair until around 10 successive heads have been observed. After that, it rapidly concludes that the coin can only come up heads. The shape of this learning trajectory is much closer to what we would expect for humans.
127 | > 
128 | 
129 | ### Another example: estimating causal power
130 | 
131 | An effect E can occur due to a cause C or a background effect. We want to find out, `from observed evidence about the co-occurrence of events, attempt to infer the causal structure relating them.`
132 | 
133 | ```javascript
134 | var observedData = [{C:true, E:true}, {C:true, E:true}, {C:false, E:false}, {C:true, E:true}]
135 | 
136 | var causalPowerPost = Infer({method: 'MCMC', samples: 10000}, function() {
137 |   // Causal power of C to cause E
138 |   var cp = uniform(0, 1)
139 | 
140 |   // Background probability of E
141 |   var b = uniform(0, 1)
142 | 
143 |   var obsFn = function(datum) {
144 |     // The noisy causal relation to get E given C
145 |     var E = (datum.C && flip(cp)) || flip(b)
146 |     condition( E == datum.E)
147 |   }
148 | 
149 |   mapData({data: observedData}, obsFn)
150 | 
151 |   return {causal_power: cp}
152 | });
153 | 
154 | viz(causalPowerPost);
155 | ```


--------------------------------------------------------------------------------
/notes/understanding_betavae.md:
--------------------------------------------------------------------------------
 1 | ## Understanding disentangling in β-VAE
 2 | 
 3 | ### Overview
 4 | This paper puts forth explanations about why β-VAE learns disentangled representations. It looks closely at how the constraint of latent posterior being as close as possible to unit Gaussian affects learning of latent representations.
 5 | 
 6 | #### Points close in data space are forced to get closer in latent space as β is increased
 7 | Recall that KL(p(x;μ<sub>1</sub>, σ<sub>1</sub>), q(x;μ<sub>2</sub>, σ<sub>2</sub>)) = log(σ<sub>2</sub>/σ<sub>1</sub>) + (σ<sub>1</sub><sup>2</sup> + (μ<sub>1</sub> - μ<sub>2</sub>)<sup>2</sup> / (2σ<sub>2</sub><sup>2</sup>)) - 1/2
 8 | 
 9 | In order to decrease KL(p,q), we can either bring μ<sub>2</sub> close to μ<sub>1</sub> or increase the variance σ<sub>2</sub>. Increasing the variance means increasing the overlap between two distributions as shown in this figure:
10 | 
11 | ![](https://github.com/vinsis/math-and-ml-notes/raw/master/images/understanding_betavae.png)
12 | 
13 | > However, a greater degree of overlap between posterior distributions will tend to result in a cost in terms of log likelihood due to their reduced average discriminability. A sample drawn from the posterior given one data point may have a higher probability under the posterior of a different data point, an increasingly frequent occurrence as overlap between the distributions is increased.
14 | 
15 | > Nonetheless, under a constraint of maximizing such overlap, the smallest cost in the log likelihood can be achieved by arranging nearby points in data space close together in the latent space. By doings o, when samples from a given posterior `q(z2|x2)` are more likely under another data point such as `x1`, the log likelihood E<sub>q(z2|x2)</sub>[logp(x2|z2)] cost will be smaller if `x1` is close to `x2` in data space.
16 | 
17 | #### Forcing independence between latent dimensions forces the model to align dimensions with components that make different contributions to reconstruction
18 | 
19 | Let us define β >> 1.
20 | 
21 | > The optimal thing to do in this scenario is to onlyencode information about the data points which can yield the most significant improvement in data log-likelihood E<sub>q(z2|x2)</sub>[logp(x2|z2)].
22 | 
23 | For example, the dSprites dataset consists of generating factors `position`, `rotation`, `scale` and `shape`. The model can increase the likelihood the most by choosing `position` over other factors under such a constraint.
24 | 
25 | >  Intuitively, when optimizing a pixel-wise decoder log likelihood, information about position will result in the most gains compared to information about any of the other factors of variation in the data, since the likelihood will vanish if reconstructed position is off by just a few pixels. Continuing this intuitive picture, we can imagine that if the capacity of the information bottleneck were gradually increased, the model would continue to utilize those extra bits for an increasingly precise encoding of position, until some point of diminishing returns is reached for position information, where a larger improvement can be obtained by encoding and reconstructing another factor of variation in the dataset, such as sprite scale.
26 | 
27 | > A smooth representation of the new factor will allow an optimal packing of the posteriors in the new latent dimension, without affecting the other latent dimensions. We note that this pressure alone would not discourage the representational axes from rotating relative to the factors.  However, given the differing contributions each factor makes to the reconstruction log-likelihood, the model will try to allocate appropriately differing average capacities to the encoding axes of each factor (e.g. by optimizing the posterior variances).But, the diagonal covariance of the posterior distribution restricts the model to doing this in different latent dimensions, giving us the second pressure, encouraging the latent dimensions to align with the factors.
28 | 
29 | ### Improving disentangling in β-VAE with controlled capacity increase
30 | 
31 | With the above information available, we can start training aiming to keep `KL(q(z|x), p(z))` close  to `C=0` at the beginning. Then we gradually increase `C` so as to increase the capacity of the model to learn a more expressive representation. We stop increasing `C` when the output images are of high quality.
32 | 
33 | Thus our objective then looks something like:
34 | 
35 | ### E<sub>qφ(z|x)</sub>[log<sub>pθ</sub>(x|z)]−γ*|D<sub>KL</sub>(q<sub>φ</sub>(z|x)‖p(z))−C|
36 | 
37 | Note how β has been replaced with γ.
38 | 
39 | The code looks like ([source](https://github.com/1Konny/Beta-VAE/blob/master/solver.py#L164-L168)):
40 | 
41 | ```python
42 | if self.objective == 'H':
43 |     beta_vae_loss = recon_loss + self.beta*total_kld
44 | elif self.objective == 'B':
45 |     C = torch.clamp(self.C_max/self.C_stop_iter*self.global_iter, 0, self.C_max.data[0])
46 |     beta_vae_loss = recon_loss + self.gamma*(total_kld-C).abs()
47 | ```
48 | 


--------------------------------------------------------------------------------
/notes/unsupervised_disentanglement.md:
--------------------------------------------------------------------------------
 1 | ### Evaluating the Unsupervised Learning of Disentangled Representations
 2 | 
 3 | Disentangled representations are representations where `models capture the independent features of a given scene in such a way that if one feature changes, the others remain unaffected.`
 4 | 
 5 | ### Key points
 6 | * They present a theorem which states that
 7 | > Unsupervised learning of disentangled representations is impossible without inductive biases on both the data set and the models (i.e., one has to make assumptions about the data set and incorporate those assumptions into the model)
 8 | 
 9 | * For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, e.g., that with disentangled representations it is possible to learn with fewer labeled observations.
10 | 
11 | They also released a [library](https://github.com/google-research/disentanglement_lib) and released [lots of pretrained models](https://github.com/google-research/disentanglement_lib#pretrained-disentanglement_lib-modules).
12 | 


--------------------------------------------------------------------------------
/pdfs/1502.05767.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/pdfs/1502.05767.pdf


--------------------------------------------------------------------------------
/pdfs/autodiff.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vinsis/math-and-ml-notes/332a6105393f4995b0d3cf8e771c7a8003a9b6cb/pdfs/autodiff.pdf


--------------------------------------------------------------------------------