# Deep Learning Parameters Cheatsheet

Essential, to-the-point cheatsheet and reference for neural network parameters
and hyperparameters, including common architecture and component blueprints.

## Input to Network

* **CSV data**
  * Multilayer perceptron
* **Image**
  * `CNN` (Convolutional neural network)
* **Sequential**
  * `RNN LSTM` (Recurrent neural network, long short-term memory)
* **Audio**
  * `RNN LSTM`
* **Video**
  * `CNN` + `RNN` hybrid network

## Initialization

* Biases
  * `0`
  * Biases can generally be zero.
* Weights
  * `XAVIER` (aka `Glorot`)
    * Generic, not `RELU`
  * `RELU` (aka `He`)
    * `RELU` activation
    * `Leaky RELU` activation

## Activation Functions

* `Linear`
  * Regression (output)
* `Sigmoid`
  * Binary classification (output)
* `Tanh`
  * Continuous data, output range [-1, 1] (wider than sigmoid)
  * LSTM layers
* `Softmax`
  * Multiclass classification (output)

## Loss Functions

* **Reconstruction entropy** (`RBM` (Restricted Boltzmann Machine), `autoencoder`)
  * Feature engineering
* **Squared loss** (output)
  * Regression
* **Cross entropy** (output)
  * Binary classification
* **Multiclass cross entropy** (aka MCXE) (output)
  * Multiclass classification
* **Root MSE** (Mean squared error) (`RBM`, `autoencoder`, output)
  * Feature engineering
  * Regression
* **Hinge loss** (output)
  * Classification
* **Negative log likelihood** (output)
  * Classification

## Learning Rates

* Strict values
  * Start with [0.1, 0.01, 0.001], 0.001 being the most popular.
* Methods (try in the below order)
  * **Adam**
  * **Nesterov** (momentum)
    * Momentum values: [0.5, 0.9, 0.95, 0.99], start with 0.9

## Optimizers

Match the optimizer to the network type:

* `SGD` (Stochastic gradient descent)
  * CNN (+ dropout)
  * DBN (Deep belief network)
  * RNN
* `Hessian-free`
  * RNN

Properties:

* `SGD`
  * Fast to converge (+)
  * Low cost (+)
  * Not as robust (-)
* `L-BFGS` (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
  * Finds better local minima (+)
  * High compute and memory cost (-)
* `CG` (Conjugate gradient)
  * High compute and memory cost (-)
* `Hessian-free`
  * Automatic next step size (+)
  * Can't be used on all architectures (-)
  * High compute and memory cost (-)

## Batch Sizes

Larger batch sizes improve training efficiency because they ship more data to
the computation units (e.g. GPU) at a time.

* Batch size
  * 32 to 1024 on GPUs. Pick numbers that are powers of two.
  * Increasing the batch size by a factor of N requires increasing the number
    of epochs by a factor of N to maintain the same number of updates.
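To make the initialization, learning-rate, optimizer, and batch-size defaults
above concrete, here is a minimal sketch. PyTorch, the layer sizes, the
20-feature input, the 3-class output, learning rate 0.001, momentum 0.9, and
batch size 128 are all illustrative assumptions; the cheatsheet does not
prescribe a framework or these exact values.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Example MLP for CSV-style data; sizes are arbitrary illustration values.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),          # logits for a 3-class problem
)

# Weight init: He (Kaiming) for ReLU layers, biases to 0.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)
# For non-ReLU activations, Xavier/Glorot would be used instead:
#   nn.init.xavier_uniform_(layer.weight)

# Learning rate / optimizer: try Adam first, then SGD with Nesterov momentum.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
#                             momentum=0.9, nesterov=True)

# Batch size: a power of two in the 32-1024 range.
X, y = torch.randn(1024, 20), torch.randint(0, 3, (1024,))
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)

# Multiclass cross entropy; applies log-softmax internally, so it takes logits.
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```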
## Regularization

Prevents overfitting and parameters from becoming too large.

* `L2`
  * Dense models
  * More heavily penalizes large weights, but doesn't drive small weights to 0.
* `L1`
  * Sparse models
  * Has less of a penalty for large weights, but leads to many weights being
    driven to 0 (or very close to 0), meaning that the resulting weight vector
    can be sparse.
* `Max-norm`
  * Alternative to `L2`, good with large learning rates
  * Use with `AdaGrad`, `SGD`
* `Dropout` (see the sketch at the end of this README)
  * Temporarily sets activations to 0
  * Works with all NN types
  * Avoid using on the first layer; it risks losing information.
  * Increases training times x2 to x3, so not a good fit for millions of
    training records.
  * Use with `SGD`
  * Influences choice of momentum: 0.95 or 0.99
  * Values (per layer type)
    * Input: [0.5, 1.0)
    * Hidden: 0.5
    * Output: don't use.

## References

* [Deep Learning: A Practitioner's Approach](https://www.amazon.com/Deep-Learning-Practitioners-Josh-Patterson/dp/1491914254)

### Thanks

To all
[Contributors](https://github.com/jondot/deep-learning-parameters-cheatsheet/graphs/contributors) -
you make this happen, thanks!
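### Appendix: Dropout Sketch

A small follow-up to the dropout guidance in the Regularization section. Again
this is only an illustration: PyTorch, the layer sizes, and the exact learning
rate and weight decay are assumptions, not part of the cheatsheet. It applies
dropout of 0.5 to hidden layers only, skips the first and output layers, and
pairs dropout with SGD using one of the higher momentum values listed above.

```python
import torch
import torch.nn as nn

# Hypothetical MLP: dropout (p = 0.5) on hidden layers only,
# nothing on the first (input-facing) layer or the output layer.
model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(),   # first layer: no dropout
    nn.Dropout(p=0.5),               # hidden dropout: 0.5
    nn.Linear(128, 128), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 3),               # output layer: no dropout
)

# Dropout pairs with SGD; with dropout, the higher momentum values
# (0.95 or 0.99) are preferred. weight_decay adds L2 on top (example value).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.95, nesterov=True,
                            weight_decay=1e-4)

# Dropout is only active in training mode; call model.eval() for inference.
model.train()
```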