# Deep Learning Parameters Cheatsheet

Essential, to-the-point cheatsheet and reference for neural network parameters
and hyperparameters, including common architecture and component blueprints.

## Input to Network

* **CSV data**
  * Multilayer perceptron
* **Image**
  * `CNN` (Convolutional neural network)
* **Sequential**
  * `RNN LSTM` (Recurrent neural network, long short-term memory)
* **Audio**
  * `RNN LSTM`
* **Video**
  * `CNN` + `RNN` hybrid network

## Initialization

* Biases
  * `0`
  * Biases can generally be zero.
* Weights
  * `XAVIER` (aka `Glorot`)
    * Generic, not `RELU`
  * `RELU` (aka `He`)
    * `RELU` activation
    * `Leaky RELU` activation

## Activation Functions

* `Linear`
  * Regression (output)
* `Sigmoid`
  * Binary classification (output)
* `Tanh`
  * Continuous data, output range [-1, 1] (wider than sigmoid)
  * LSTM layers
* `Softmax`
  * Multiclass classification (output)

## Loss Functions

* **Reconstruction entropy** (`RBM` (Restricted Boltzmann Machine), `autoencoder`)
  * Feature engineering
* **Squared loss** (output)
  * Regression
* **Cross entropy** (output)
  * Binary classification
* **Multiclass cross entropy** (aka MCXE) (output)
  * Multiclass classification
* **Root MSE** (Mean squared error) (`RBM`, `autoencoder`, output)
  * Feature engineering
  * Regression
* **Hinge loss** (output)
  * Classification
* **Negative log likelihood** (output)
  * Classification

## Learning Rates

* Strict values
  * Start with [0.1, 0.01, 0.001], 0.001 being the most popular.
* Methods (try in the below order)
  * **Adam**
  * **Nesterov** (momentum)
    * Momentum values: [0.5, 0.9, 0.95, 0.99], start with 0.9

## Optimizers

Match the optimizer to the network type:

* `SGD` (Stochastic gradient descent)
  * CNN (+ dropout)
  * DBN (Deep belief network)
  * RNN
* `Hessian-free`
  * RNN

Properties:

* `SGD`
  * Fast to converge (+)
  * Low cost (+)
  * Not as robust (-)
* `L-BFGS` (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
  * Finds better local minima (+)
  * High compute and memory cost (-)
* `CG` (Conjugate gradient)
  * High compute and memory cost (-)
* `Hessian-free`
  * Automatic next step size (+)
  * Can't be used on all architectures (-)
  * High compute and memory cost (-)

## Batch Sizes

Larger batch sizes improve training efficiency because they ship more data to
the computation units (e.g. GPU) at a time.

* Batch size
  * 32 to 1024 on GPUs. Pick numbers that are powers of two.
  * Increasing the batch size by a factor of N requires increasing the number
    of epochs by a factor of N to maintain the same number of updates.
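To make the initialization, learning-rate, optimizer, and batch-size defaults
above concrete, here is a minimal sketch. PyTorch, the layer sizes, the
20-feature input, the 3-class output, learning rate 0.001, momentum 0.9, and
batch size 128 are all illustrative assumptions; the cheatsheet does not
prescribe a framework or these exact values.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Example MLP for CSV-style data; sizes are arbitrary illustration values.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),          # logits for a 3-class problem
)

# Weight init: He (Kaiming) for ReLU layers, biases to 0.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)
# For non-ReLU activations, Xavier/Glorot would be used instead:
#   nn.init.xavier_uniform_(layer.weight)

# Learning rate / optimizer: try Adam first, then SGD with Nesterov momentum.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
#                             momentum=0.9, nesterov=True)

# Batch size: a power of two in the 32-1024 range.
X, y = torch.randn(1024, 20), torch.randint(0, 3, (1024,))
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)

# Multiclass cross entropy; applies log-softmax internally, so it takes logits.
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```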
## Regularization

Prevents overfitting and parameters from becoming too large.

* `L2`
  * Dense models
  * More heavily penalizes large weights, but doesn't drive small weights to 0.
* `L1`
  * Sparse models
  * Has less of a penalty for large weights, but leads to many weights being
    driven to 0 (or very close to 0), meaning that the resulting weight vector
    can be sparse.
* `Max-norm`
  * Alternative to `L2`, good with large learning rates
  * Use with `AdaGrad`, `SGD`
* `Dropout` (see the sketch at the end of this README)
  * Temporarily sets activations to 0
  * Works with all NN types
  * Avoid using on the first layer; it risks losing information.
  * Increases training times x2 to x3, so not a good fit for millions of
    training records.
  * Use with `SGD`
  * Influences choice of momentum: 0.95 or 0.99
  * Values (per layer type)
    * Input: [0.5, 1.0)
    * Hidden: 0.5
    * Output: don't use.

## References

* [Deep Learning: A Practitioner's Approach](https://www.amazon.com/Deep-Learning-Practitioners-Josh-Patterson/dp/1491914254)

### Thanks

To all
[Contributors](https://github.com/jondot/deep-learning-parameters-cheatsheet/graphs/contributors) -
you make this happen, thanks!
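### Appendix: Dropout Sketch

A small follow-up to the dropout guidance in the Regularization section. Again
this is only an illustration: PyTorch, the layer sizes, and the exact learning
rate and weight decay are assumptions, not part of the cheatsheet. It applies
dropout of 0.5 to hidden layers only, skips the first and output layers, and
pairs dropout with SGD using one of the higher momentum values listed above.

```python
import torch
import torch.nn as nn

# Hypothetical MLP: dropout (p = 0.5) on hidden layers only,
# nothing on the first (input-facing) layer or the output layer.
model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(),   # first layer: no dropout
    nn.Dropout(p=0.5),               # hidden dropout: 0.5
    nn.Linear(128, 128), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 3),               # output layer: no dropout
)

# Dropout pairs with SGD; with dropout, the higher momentum values
# (0.95 or 0.99) are preferred. weight_decay adds L2 on top (example value).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.95, nesterov=True,
                            weight_decay=1e-4)

# Dropout is only active in training mode; call model.eval() for inference.
model.train()
```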