├── README.md
├── cnn4Images.md
├── img
│   ├── activation.png
│   ├── cross_entropy.png
│   ├── cv.png
│   ├── descent.png
│   ├── dropout.png
│   ├── early_term.png
│   ├── gd.jpg
│   ├── lr.jpg
│   ├── lstm.png
│   ├── nn.png
│   ├── over_under.png
│   └── rnn.png
├── machineLearning.md
├── neuralNets.md
└── rnn.md

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
### My Notes on Machine Learning

Machine Learning is a sub-field of artificial intelligence that uses data to train predictive models.

#### 3 Types of machine learning
1. Supervised learning - Minimize an error function using labeled training data.
2. Unsupervised learning - Find patterns using unlabeled training data. Examples include principal component analysis and clustering.
3. Reinforcement learning - Maximize a reward. An agent interacts with an environment and learns to take actions that maximize a cumulative reward.

#### 2 Types of machine learning problems
1. regression - predicting a continuous value attribute (Example: house prices)
2. classification - predicting a discrete value. (Example: pass or fail, hot dog/not hot dog)

#### Input features
[Features](https://en.wikipedia.org/wiki/Feature_(machine_learning)) - the inputs to a machine learning model. They are the measurable properties being observed. Examples of features are pixel brightness in computer vision tasks or the square footage of a house in home price prediction.

Feature selection - The process of choosing the features. It is important to pick features that correlate with the output.

Dimensionality reduction - Reducing the number of features while preserving the important information. A simple example is using the area of a house as a feature rather than using width and length separately. Other examples include singular value decomposition, variational auto-encoders, t-SNE (for visualizations), and max pooling layers in CNNs.

#### Data
In supervised learning, data is typically split into training, validation, and test data.

An example is a single instance from your dataset.

### Machine learning models and applications

[Neural Networks](https://github.com/andrewt3000/MachineLearning/blob/master/neuralNets.md) - Neural networks are a suitable model for fixed-size input features.

[Convolutional Neural Networks](https://github.com/andrewt3000/MachineLearning/blob/master/cnn4Images.md) - CNNs are suitable models for computer vision problems.

Transformers - are designed to handle sequential data, like text, in a manner that allows for much more parallelization than previous models like [recurrent neural networks](https://github.com/andrewt3000/MachineLearning/blob/master/rnn.md). See the 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).
The transformer is the architecture behind [ChatGPT](https://chat.openai.com/). Example: [nanoGPT](https://github.com/karpathy/nanoGPT)
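
The key operation behind that parallelization is scaled dot-product attention: every position in the sequence is compared with every other position in a single matrix product. Below is a minimal NumPy sketch of the operation from the paper (the function name and toy shapes are illustrative, not from this repo).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# toy self-attention: a sequence of 4 tokens with 8-dimensional embeddings
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
```
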
### Computer Vision
These are common computer vision tasks and state of the art methods for solving them.

- Image classification: [ResNet](https://arxiv.org/abs/1512.03385), [Inception v4](https://arxiv.org/abs/1602.07261), [DenseNet](https://arxiv.org/abs/1608.06993)
- Object detection (with bounding boxes): [YOLO v4](https://arxiv.org/abs/2004.10934) (realtime object detection)
- Instance segmentation: [Mask R-CNN](https://arxiv.org/abs/1703.06870)
- Semantic segmentation: [U-Net](https://arxiv.org/abs/1505.04597)

### NLP Natural Language Processing
[Deep Learning for NLP](https://github.com/andrewt3000/DL4NLP/blob/master/README.md) - Deep learning models and NLP applications such as sentiment analysis, translation, and dialog generation.

### Transfer learning
[Transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) - storing knowledge gained while solving one problem and applying it to a different but related problem.

### Explainability
Feature visualization - In computer vision, generating images representative of what neural networks are looking for.
[TensorFlow Lucid](https://github.com/tensorflow/lucid/), [Activation Atlas](https://distill.pub/2019/activation-atlas/)

Feature attribution - In computer vision, determining and representing which pixels contribute to a classification. Examples: [Saliency maps](https://arxiv.org/pdf/1312.6034.pdf), [Deconvolution](https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf), [CAM](https://arxiv.org/pdf/1512.04150.pdf), [Grad-CAM](https://arxiv.org/abs/1610.02391)
[tf-explain](https://github.com/sicara/tf-explain) - TensorFlow visualization library.
[fast.ai heatmap](https://docs.fast.ai/vision.learner.html#_cl_int_plot_top_losses) - uses Grad-CAM

[Lime](https://github.com/marcotcr/lime)

--------------------------------------------------------------------------------
/cnn4Images.md:
--------------------------------------------------------------------------------
# Convolutional Neural Networks for image processing
[Convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNNs) are variations of neural networks that contain convolutional layers. They are typically used for grid-shaped input such as images.

### CNN HyperParameters

Types of hidden layers in CNNs: convolutional layers, max pooling layers, and fully connected layers.

1. Convolutional layers - layers that have filters (aka kernels) that are convolved across the image, followed by a ReLU activation.
CNN parameters (all of these can vary by layer):
   1. Number of filters (aka kernels) (K) - typically a power of 2, e.g. 32, 64, 128. Typically increases at deeper layers.

   2. Size of filter (F) - typically small odd squares; typical values are 3x3 and 5x5, up to 11x11.

   3. Stride (S) - how far to shift the filter at each step.

   4. Padding (P) - zero padding on the edges of the input volume. Padding keeps the output volume the same size as the input volume when padding = (F - 1)/2. Examples: 3x3 filters / padding 1, 5x5 filters / padding 2, 7x7 filters / padding 3.

2. Max pooling layer: max pooling takes a section of the input and keeps only the largest value in the pool.
Hyperparameter: the size of the section to subsample, e.g. 2x2.

3. Fully connected layers are the same as hidden layers in a vanilla neural network.

#### Calculate output volume of convolutional layer
The depth is the number of filters.
The width (the height is the same, assuming the image is square) is W2 = (W1 - F + 2P)/S + 1, where W1 is the original width of the image, F is the filter width, P is the padding, and S is the stride. Example: F=3, S=1, P=1.
[Source](http://cs231n.github.io/convolutional-networks/)
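
As a quick check of the formula above, here is a small sketch (a hypothetical helper, not from this repo) that computes the output width of a convolutional layer:

```python
def conv_output_width(W, F, P, S):
    """Output width of a conv layer: W2 = (W1 - F + 2P)/S + 1."""
    out = (W - F + 2 * P) / S + 1
    assert out.is_integer(), "filter, stride, and padding do not tile the input evenly"
    return int(out)

# AlexNet's first layer: 227x227 input, 11x11 filters, no padding, stride 4
print(conv_output_width(227, F=11, P=0, S=4))  # 55
```
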
### Computer Vision tasks
**Object classification** - identifying an object in an image.
**Classification + Localization** - drawing a bounding box around an object, for one or a fixed number of objects. Train a CNN to classify, and also train it to predict the box coordinates (x, y, w, h) as a regression problem.
**Semantic segmentation** - label every pixel with a class. Doesn't distinguish between different instances of a class. Use a CNN that downsamples and then upsamples for efficiency.
**Object detection** - drawing a bounding box around an unlimited number of objects. Get regions of interest and run algorithms such as R-CNN, Fast R-CNN, and Faster R-CNN. There are also newer single-pass models such as [YOLO](https://pjreddie.com/darknet/yolo/), [V2](https://pjreddie.com/darknet/yolov2/), [V3](https://pjreddie.com/media/files/papers/YOLOv3.pdf)
and [SSD](https://arxiv.org/abs/1512.02325).
**Instance segmentation** - segment between different instances of objects. State of the art is Mask R-CNN.
**Image captioning** - describing the objects in an image.


### Object Classification: Case Studies
[Case Studies](http://cs231n.github.io/convolutional-networks/#case)

#### LeNet-5
[Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
LeCun et al., 1998. Input: 32x32 images.
Architecture: 6 5x5 filters, stride 1. Pooling layers 2x2, stride 2. (conv pool conv pool conv fc)

[ImageNet](http://www.image-net.org/) [[paper](http://www.image-net.org/papers/imagenet_cvpr09.pdf)] is a dataset of 1.2 million images with 1,000 labels.

#### AlexNet
[ImageNet Classification with Deep Convolutional Neural Networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
Krizhevsky et al. (2012), the breakthrough paper for convolutional neural networks. First use of ReLU (rectified linear units).
Top-1 and top-5 error rates of 37.5% and 17.0% on ImageNet.
Architecture: First conv layer: 96 11x11 filters, stride 4. 5 convolutional layers, 3 fully connected layers.

#### ZFNet
[Visualizing and Understanding Convolutional Networks](https://arxiv.org/pdf/1311.2901v3.pdf)
Zeiler and Fergus, 2013.
14.8% top-5 error on ImageNet (eventually improved to about 11% by the company Clarifai).
Architecture: First conv layer: 7x7 filters, stride 2.

#### VGGNet
[Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)
Simonyan et al., 2014.
7.3% top-5 error on ImageNet.
Architecture: 16-19 layers. 3x3 convolutional filters, stride 1, pad 1. 2x2 max pool, stride 2.
Smaller filters, deeper layers. Typically there are more filters as you go deeper.

#### Inception
[Going deeper with convolutions](https://arxiv.org/pdf/1409.4842v1.pdf)
Szegedy et al., 2014. GoogLeNet / Inception.
Architecture: 22 layers.

#### ResNet
[Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385v1.pdf)
He et al., 2015. Uses skip connections (residual connections), which diminish the vanishing/exploding gradient problems that occur in deep nets.
3.57% top-5 error on ImageNet.
Architecture: 152 layers. [ResNet model code](https://github.com/KaimingHe/deep-residual-networks).
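
To make the skip-connection idea concrete, here is a minimal NumPy sketch of a residual block (illustrative only; the real ResNet blocks use convolutions and batch normalization as described in the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = F(x) + x: the layers only have to learn the residual F(x)."""
    out = relu(x @ W1)        # first transformation
    out = out @ W2            # second transformation, no activation yet
    return relu(out + x)      # the skip connection adds the input back before the final activation

x = np.random.randn(4, 64)                                 # a batch of 4 feature vectors
W1, W2 = np.random.randn(64, 64), np.random.randn(64, 64)
y = residual_block(x, W1, W2)
```
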
#### Inception v4
[Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf)
Szegedy et al., 2016.
Updated Inception to use residual connections.
3.08% top-5 error on ImageNet. [blog post](https://research.googleblog.com/2016/08/improving-inception-and-image.html)
Architecture: 75 layers.

#### DenseNet
[Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
Huang et al., 2018
Breakthrough: "For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers."

#### EfficientNet
[EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946)
[blog post](https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html)

#### Additional Resources

[CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/)
[2017 videos](https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)

[A Beginner's Guide To Understanding Convolutional Neural Networks](https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/)
[An Intuitive Explanation of Convolutional Neural Networks](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)

--------------------------------------------------------------------------------
/img/activation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/activation.png
--------------------------------------------------------------------------------
/img/cross_entropy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/cross_entropy.png
--------------------------------------------------------------------------------
/img/cv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/cv.png
--------------------------------------------------------------------------------
/img/descent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/descent.png
--------------------------------------------------------------------------------
/img/dropout.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/dropout.png
--------------------------------------------------------------------------------
/img/early_term.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/early_term.png
--------------------------------------------------------------------------------
/img/gd.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/gd.jpg
--------------------------------------------------------------------------------
/img/lr.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/lr.jpg
--------------------------------------------------------------------------------
/img/lstm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/lstm.png
--------------------------------------------------------------------------------
/img/nn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/nn.png
--------------------------------------------------------------------------------
/img/over_under.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/over_under.png
--------------------------------------------------------------------------------
/img/rnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewt3000/MachineLearning/d1e83e8025b8dbbaa93803a54c7aadb0a52779b1/img/rnn.png
--------------------------------------------------------------------------------
/machineLearning.md:
--------------------------------------------------------------------------------
Notes for [Coursera: Machine Learning](https://www.coursera.org/learn/machine-learning/home/welcome?module=tN10A)

#### 3 Types of machine learning
1. Supervised learning - learns using labeled training data. Training data has input and output.
2. Unsupervised learning - learns using unlabeled training data. Training data has input but no output; the algorithm finds structure. Clustering is an example of unsupervised learning. Examples: social network analysis, market segmentation, astronomical data analysis.
3. Reinforcement learning - learns to take action by maximizing a cumulative reward. [deep mind learns breakout](https://www.youtube.com/watch?v=V1eYniJ0Rnk)

#### 2 Types of machine learning problems
1. regression - predicting a continuous value attribute (Example: house prices)
2. classification - predicting a discrete value. (Example: pass or fail)

#### Types of classification
Binary classification - classifying elements into one of two groups. Examples: benign/malignant tumor, fraudulent or legitimate financial transaction, spam or non-spam email.

[Multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification)/multinomial classification - classify instances into more than 2 classes. Example: MNIST evaluates handwritten numeric characters and classifies them into 10 classes, 0-9.

### Week 1: Linear Regression with one variable

m = number of training examples
x = input features
y = output variable / target variable

(x, y) = refers to one training example
x(i), y(i) refer to the specific training example at index i. The (i) is an index, not an exponent.

h = hypothesis: the function that maps x to y.
Predict that y is a linear function of x:
y = h(x) = θ0 + θ1x

#### Cost Function
J is the cost function.
θi are the parameters, the coefficients of the function, similar to weights in a neural net.
Find the values of θ0 and θ1 that minimize the cost (error) function.
h(x) - y is the difference between the prediction and the actual value; we want to minimize this.
J is also known as the squared error cost function:
J(θ0, θ1) = 1/(2m) Σ from i = 1 to m of (h(x(i)) - y(i))²
Find θ0 and θ1 that minimize the error.

Square the error because:
1) Squaring gets rid of negative numbers that would otherwise cancel each other out (although you could use the magnitude).
2) For many applications small errors are not important, but big errors are very important. Example: a self-driving car. A half-foot steering error is no big deal; a 5-foot error is a fatal problem. So it's not 10x more important, it is 100x more important.
3) The convex nature of the quadratic avoids local minima.

#### Gradient Descent
:= (assignment operator, not equality)
α = learning rate. The learning rate controls the size of the adjustments made during the training process.

Repeat until convergence:
θ0 := θ0 - α (1/m) Σ from i = 1 to m of (hθ(x(i)) - y(i))
θ1 := θ1 - α (1/m) Σ from i = 1 to m of (hθ(x(i)) - y(i)) • x(i)
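
Here is a minimal NumPy sketch of these update rules, fitting θ0 and θ1 by batch gradient descent (the variable names and toy data are mine, not from the course):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with the squared error cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y             # h(x(i)) - y(i) for every example
        theta0 -= alpha * (1 / m) * error.sum()
        theta1 -= alpha * (1 / m) * (error * x).sum()
    return theta0, theta1

# toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradient_descent(x, y))  # approximately (1.0, 2.0)
```
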
### Week 2 MultiVariate Linear Regression
n = the number of features
Feature scaling - get features into the range -1 to 1:
xi = (xi - average) / range (max - min)

### Week 3 Logistic regression
Logistic regression is a confusing term because it is a classification algorithm, not a regression algorithm.
"Logistic regression" is named after the logistic function (aka sigmoid function).
The sigmoid function is the hypothesis for logistic regression: h(x) = 1 / (1 + e^(-θᵀx))
The cost function for logistic regression is:
for y = 1, -log(hθ(x))
for y = 0, -log(1 - hθ(x))
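
A small NumPy sketch of the sigmoid hypothesis and the per-example cost above (the names and numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def hypothesis(theta, x):
    """h(x) = 1 / (1 + e^(-theta^T x)): the predicted probability that y = 1."""
    return sigmoid(np.dot(theta, x))

def cost(h, y):
    """-log(h) when y = 1, -log(1 - h) when y = 0."""
    return -np.log(h) if y == 1 else -np.log(1 - h)

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])   # the first element is the bias feature x0 = 1
print(cost(hypothesis(theta, x), y=1))
```
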
##### Multiclass classification
Multiclass classification is solved using "One versus All." There are K output classes. Solve for K binary logistic regression classifiers and choose the one with the highest probability.

#### Regularization
Underfitting (high bias) - output doesn't fit the training data well.
Overfitting (high variance) - output fits the training data well, but doesn't work well on test data. Fails to generalize.

Regularization factor (λ) - variable to control overfitting. If the model is underfitting, you need a lower λ. If the model is overfitting, you need a higher λ.

The intuition of λ is that you add the regularization term to the cost function, and as you minimize the cost function, you minimize the magnitude of theta (the weights). If theta is smaller, especially for higher order polynomials, the hypothesis is simpler.

The regularization term is added to the cost function.
The regularization term for linear regression is: λ times the sum from j=1 to n of θj²
The regularization term for logistic regression is: λ/(2m) times the sum from j=1 to n of θj²
Notice they sum from j=1, not j=0, i.e. the bias term is not regularized.

### Week 5 Neural Networks
L is the number of layers.
K is the number of output units.


#### How to train a neural network:
1. Randomize the initial weights.
2. Implement forward propagation: hθ(x(i))
3. Implement the cost function J(θ)
4. Implement backprop to compute the partial derivatives of the cost function with respect to theta,
   for 1 through m (each training example).
5. Do gradient checking.
6. Use gradient descent.

Use gradient checking to verify that backprop is working correctly with no bugs. Don't use gradient checking in production.

### Week 8 Unsupervised learning and dimensionality reduction
K-means clustering. Example using 2 clusters: randomly initialize 2 cluster centroids. K-means is an iterative algorithm.
Step 1: assign each data point to the nearest cluster centroid. Step 2: move each cluster centroid to the mean of its assigned data points. (A short sketch of these two steps follows below.)

K = number of clusters to find.
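
A compact NumPy sketch of those two alternating steps (illustrative, not from the course materials; it assumes every cluster keeps at least one point):

```python
import numpy as np

def kmeans(X, K, iterations=10):
    """Alternate between assigning points to centroids and moving centroids to the mean."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), K, replace=False)]        # random initialization
    for _ in range(iterations):
        # Step 1: assign each data point to the nearest cluster centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned data points
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, K=2)
```
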
--------------------------------------------------------------------------------
/neuralNets.md:
--------------------------------------------------------------------------------
### Neural Networks
[Neural Network](https://en.wikipedia.org/wiki/Neural_network) - Neural networks are machine learning models and [universal approximators](https://en.wikipedia.org/wiki/Universal_approximation_theorem). This document explains their architecture and how to [train a neural network](#training-a-neural-network).

### Neural network Architecture
Neural nets have input layers, hidden layers, and output layers.
Layers are connected by weighted synapses (the lines with arrows) that multiply their input by the weight.
A hidden layer consists of neurons (the circles) that sum their inputs from the synapses and execute an activation function on the sum.
The layers are of fixed size. Neural networks also typically have a single bias input node with a constant value. It's similar to the constant in a linear function: biases ensure that even when all input features are zero, a neuron can still output a non-zero value. The weights and biases are often referred to as parameters.

Aka artificial neural networks, feedforward neural networks, vanilla neural networks.


### Hyperparameters
Hyperparameters - the model's settings, chosen before training, such as the architecture, learning rate, and regularization factor.

Architecture - The structure of a neural network, i.e. the number of hidden layers and the number of nodes.

Number of hidden layers - the higher the number of layers, the more layers of abstraction the network can represent. With too many layers, the network suffers from the vanishing or exploding gradient problem.

### Activation Functions
[Activation function](https://en.wikipedia.org/wiki/Activation_function) - the "neuron" in the neural network executes an activation function on the sum of the weighted inputs. In the neuron metaphor, you can think of the neuron as "firing" as the value approaches 1. ReLU is a popular modern activation function.

#### Sigmoid
The sigmoid activation function outputs a value between 0 and 1. It is a smoothed-out step function. Sigmoid is not zero centered and it suffers from activation saturation issues. Historically popular, but not currently popular.

#### Tanh
The tanh activation function outputs a value between -1 and 1. Tanh is a rescaled sigmoid function. Tanh is zero centered but still suffers from activation saturation issues similar to sigmoid. Historically popular, but not currently popular.

#### ReLU
ReLU activation is currently (2018) the most popular activation function. ReLU stands for rectified linear unit. It returns 0 for negative values and the input unchanged for positive values. ReLU can suffer from "dead" ReLUs: units that get stuck outputting 0, so their gradient is 0 and they stop learning.

```python
def relu(x):
    if x < 0:
        return 0
    return x  # the identity for x >= 0
```



#### More activations
There are newer, experimental variants of the ReLU: Leaky ReLU (addresses the dead ReLU issue), ELU (exponential linear unit), and Maxout.

#### Softmax
The [softmax function](https://en.wikipedia.org/wiki/Softmax_function) is often used as the model's final output activation function for classification. Softmax is used for modeling probability distributions for classification where the outputs are mutually exclusive (MNIST is an example).
Softmax is a "soft" maximum function. Its properties are:
Output values are in the range [0, 1].
The sum of the output nodes is 1.

The softmax function as applied to each output node is the exponent of that node's output divided by the sum of the exponents of all the outputs. So, for instance, if there are 3 nodes, the output of the 1st node y1 is:
e^(y1) / (e^(y1) + e^(y2) + e^(y3))

```python
import numpy as np

def softmax(X):
    exps = np.exp(X - np.max(X))  # subtracting the max improves numerical stability
    return exps / np.sum(exps)
```

[Keras activations](https://keras.io/activations/)
[Pytorch activations](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity)

### Training a neural network
Training a neural network means minimizing a cost function. Use backpropagation and gradient descent to adjust the weights and make the model more accurate.

Steps to training a network:
- [Prepare the data](#prepare-the-data)
- [Initialize weights and biases](#initialization)
- [Implement forward propagation](#forward-propagation)
- [Implement the cost (aka error, loss) function](#loss-function)
- [Implement backpropagation](#backpropagation)
- [Run an optimization algorithm](#optimization-algorithms) such as gradient descent, adjusting the weights using the learning rate.

You train until the model (hopefully) converges on an acceptably low error level. An epoch means the network has been trained on every example once.

One tip is to begin by overfitting a small portion of your data with regularization removed, to see if you have a reasonable architecture. If you can't get the loss to zero, reconsider your architecture.

Start with small regularization and find a learning rate that makes the loss go down.

### Prepare the data
[Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) - scale each feature to be in a common range, typically -1 to 1, where 0 is the mean value. For instance, in image processing you could subtract the mean pixel intensity, for the whole image or by channel, before scaling it, to zero center the features.

[Data augmentation](https://en.wikipedia.org/wiki/Data_augmentation) - An example of data augmentation for images is mirroring, flipping, rotating, and translating your images to create new examples.

Training data - The data used in machine learning models to make the model more accurate. In the case of supervised learning it consists of input features and output labels that serve as examples. Data is typically split into training data, cross validation data, and test data. A typical mix is 60%-80% training, 10%-20% validation, and 10%-20% testing data. Validation data is evaluated while the model is training and indicates whether the model is generalizing. Test data is evaluated after you have stopped training the model and indicates the model's accuracy.
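
A small sketch of the data preparation steps above: standardizing the features and splitting into train/validation/test sets (the 80/10/10 split and the function name are just an example):

```python
import numpy as np

def prepare(X, y, train=0.8, val=0.1, seed=0):
    """Zero-center and scale the features, then split into train/validation/test sets."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)            # feature scaling: mean 0, unit variance
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(train * len(X)), int(val * len(X))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X, y = np.random.randn(1000, 5), np.random.randint(0, 2, 1000)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = prepare(X, y)
```
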
### Initialization
The weights and biases are historically initialized with small random numbers centered on zero. If the weights are all the same (say all 0s), they will remain the same throughout training; making the weights random breaks this symmetry.

As your neural networks get deeper, initialization becomes more important. If the initial weights are too small, you get a vanishing gradient. If the initial weights are too large, you get an exploding gradient. Consider more advanced initializations such as Xavier and He initialization.

In Keras, you can specify kernel and bias initialization on each Dense layer. See all available [keras initializations](https://keras.io/initializers/). Glorot (aka Xavier) initialization is the default.

### Forward Propagation
If X is the input matrix and W1 is the weight matrix for the first hidden layer, we take the dot product to get the values passed to the activation functions. Then we apply the activation function to each element in the matrix. Repeat for each layer.

```python
z2 = np.dot(X, W1)     # weighted sums for the hidden layer
a2 = activation(z2)    # apply the activation function element-wise
```

[Example of Forward propagation in numpy](https://github.com/stephencwelch/Neural-Networks-Demystified/blob/master/.ipynb_checkpoints/Part%202%20Forward%20Propagation-checkpoint.ipynb)

### Loss Function
The loss (aka cost/error/objective) function measures how inaccurate a model is. Training a model minimizes the loss function. Sum of squared errors is a common loss function for regression. Cross entropy (aka log loss, negative log probability) is a common loss function for a softmax output.

The cross entropy function is suitable for classification where the output is a value between 0 and 1. The loss is 0 if the output value is 1 for the correct classification. Conversely, the loss approaches infinity as the output approaches 0 for the correct classification.



```python
from math import log

def logloss(true_label, predicted_prob):
    if true_label == 1:
        return -log(predicted_prob)
    else:
        return -log(1 - predicted_prob)
```


[Keras loss functions](https://keras.io/losses/)
[pytorch loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions)

### Backpropagation
The backpropagation algorithm applies the chain rule recursively to compute the gradient for each weight. The gradient is calculated by taking the partial derivative of the loss function with respect to the weights at each layer of the network, moving backwards (output to input) through the network. Backprop indicates how to adjust the weights to minimize the loss function. If the gradient (i.e. partial derivative/slope) is positive, the loss gets higher as the weight increases. If the derivative is 0, the weight is at a stationary point, typically a minimum of the loss. The gradient indicates the magnitude and direction of the adjustments to our weights that will reduce the loss.
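
One way to sanity check the gradients that backprop produces is to compare them against a finite difference estimate, the "gradient checking" step mentioned earlier. A minimal sketch, assuming `loss` is a function of a single weight matrix:

```python
import numpy as np

def numerical_gradient(loss, W, eps=1e-5):
    """Estimate dJ/dW by nudging each weight and measuring the change in the loss."""
    grad = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        old = W[i]
        W[i] = old + eps
        loss_plus = loss(W)
        W[i] = old - eps
        loss_minus = loss(W)
        W[i] = old                                       # restore the weight
        grad[i] = (loss_plus - loss_minus) / (2 * eps)   # centered difference
    return grad

# example: J = sum(W^2) has gradient 2W
W = np.random.randn(2, 3)
print(np.allclose(numerical_gradient(lambda w: np.sum(w ** 2), W), 2 * W))  # True
```
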
We can combine all these functions for cost and forward propagation into one function. For instance, the cost function for a NN with one hidden ReLU layer and a softmax output is (where W1 and W2 are the weights for the 1st and 2nd layers, and X1 is the input feature nodes):

```
J = Cost(Softmax(DotProduct(Relu(DotProduct(X1,W1)), W2)))
```

We then calculate the partial derivative of the loss function with respect to each weight (dJ/dW). We use the [chain rule](https://en.wikipedia.org/wiki/Chain_rule): the derivative of f(g(x)) is f'(g(x)) * g'(x).

The result is a gradient for each set of weights, dJ/dW1 and dJ/dW2, which are the same size as W1 and W2.

The derivative of the softmax cost function is the probability for the incorrect labels and the probability - 1 for the correct label.

The derivative of the ReLU function is:

```
def reluprime(x):
    if x > 0:  # the derivative dy/dx of y = x is dy/dx = 1, from the power rule in calculus
        return 1
    else:      # the derivative of a constant is zero: if x < 0, y = 0, so dy/dx = 0
        return 0
```

Here is an [example of backprop in numpy](https://github.com/stephencwelch/Neural-Networks-Demystified/blob/master/.ipynb_checkpoints/Part%204%20Backpropagation-checkpoint.ipynb) for a regression problem that uses sum of squared errors as a cost function and sigmoid activations.

Packages such as Keras handle this part of the training process for you.

#### Learning Rate
Learning rate (α) - controls the size of the adjustments made during the training process. Typical values are .1, .01, .001. Consider that these values are relative to your input features, which are typically scaled to ranges such as 0 to 1 or -1 to +1.
If α is too low, convergence is slow.
If α is too high, there is no convergence, because it overshoots the local minimum.
The learning rate is often reduced to a smaller number over time. This is often called annealing or decay (examples: step decay, exponential decay).



### Optimization algorithms
[Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) is an iterative optimization algorithm that, in the context of neural networks, adjusts the weights by the learning rate times the negative of the gradient (calculated by backpropagation) to minimize the loss function.



Batch gradient descent - The term batch refers to the fact that it uses the entire dataset to make one gradient step. Batch works well for small datasets that have convex loss functions. The loss function needs to be convex or it may find a local minimum.

[Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) is a variation of gradient descent that uses a single randomly chosen example to make an update to the weights. SGD is more scalable than batch gradient descent and is used more often in practice for large scale deep learning. Its random nature makes it unlikely to get stuck in a local minimum.

Mini-batch gradient descent: stochastic gradient descent that considers more than one randomly chosen example before making an update. Batch size is a hyperparameter that determines how many training examples you consider before making a weight update. Typical values are powers of 2, such as 32 or 128, and are usually in the range of 32-512. Larger batches are faster to train, but can cause overfitting and require more memory. Lower batch sizes are the opposite: slower to train, more regularized, and require less memory. A mini-batch training loop is sketched below.
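
Here is a minimal sketch of that training loop, using a linear model with a squared error loss so the example stays self-contained (the model, data, and names are illustrative):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=32, epochs=10):
    """Mini-batch SGD: shuffle each epoch, then update the weights once per mini-batch."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                          # one epoch = one pass over every example
        order = rng.permutation(len(X))              # shuffle the examples each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            grad = X[batch].T @ error / len(batch)   # gradient averaged over the mini-batch
            w -= lr * grad                           # step by the learning rate times the negative gradient
    return w

X = np.random.randn(1000, 3)
true_w = np.array([1.0, -2.0, 0.5])
print(minibatch_sgd(X, X @ true_w))  # approximately [1.0, -2.0, 0.5]
```
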
#### Gradient Descent Optimization
Momentum SGD is a variation that accelerates SGD, dampens oscillations, and helps skip over local minima and saddle points. It collects data from each update in a velocity vector that is used when calculating the gradient step; the velocity vector represents the momentum. Rho is a hyperparameter that represents friction. Rho is in the range of 0 to 1, and typical values are 0.9 and 0.99. Nesterov accelerated gradient descent is a variation that builds on momentum and adds a look-ahead step.

Other optimization algorithms include: AdaGrad, AdaDelta, Adam, Adamax, NAdam, RMSProp, and AMSGrad.
See [Keras Optimizers](https://keras.io/optimizers/)
See [Pytorch optimizers](https://pytorch.org/docs/stable/optim.html#algorithms)

#### Regularization
Underfitting - output doesn't fit the training data well.
[Overfitting](https://en.wikipedia.org/wiki/Overfitting) - output fits the training data well, but doesn't work well on validation or test data.



Regularization - a technique to minimize overfitting.

L1 regularization uses the sum of the absolute values of the weights. L1 works best when sparse weights are desired.
L2 regularization uses the sum of the squared weights. L2 does not yield sparse weights.

[Dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) - a form of regularization. "The key idea is to randomly drop units (along with their connections) from the neural network during training." A typical hyperparameter value is .5 (50%). As the dropout value approaches zero, dropout has less effect; as it approaches 1, more connections are zeroed out. The remaining active connections are scaled up to compensate for the zeroed-out connections. Dropout is typically applied to units in the hidden layers during training and is not present at inference.



Early termination (aka early stopping) - Stop training when the training error is still getting lower but the validation error is increasing, which indicates overfitting. A patience-based sketch follows below.
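
A minimal sketch of early stopping with a patience counter (the `train_one_epoch` and `validation_loss` callables are assumptions, stand-ins for a real training loop):

```python
def train_with_early_stopping(train_one_epoch, validation_loss, patience=3, max_epochs=100):
    """Stop when the validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0        # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break

# toy run: the validation loss improves for 10 epochs, then starts rising (overfitting)
losses = iter([1 / (e + 1) if e < 10 else 0.1 + 0.01 * e for e in range(100)])
train_with_early_stopping(lambda: None, lambda: next(losses))
```
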
### Other resources
[Neural Networks demystified video](https://www.youtube.com/watch?v=bxe2T-V8XRs) - videos explaining neural networks. Includes [notes](https://github.com/stephencwelch/Neural-Networks-Demystified).

[Stanford CS231n](http://cs231n.github.io/neural-networks-1/)

[TensorFlow Neural Network Playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.28720&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false) - This demo lets you run a neural network in your browser and see the results graphically. I wrote about the lessons on [intuition about deep learning](https://medium.com/@andrewt3000/understanding-tensorflow-playground-c20cdb7a250b).

[Practical tips for deep learning, Ilya Sutskever](http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html)

[A Recipe for Training Neural Networks - Andrej Karpathy](http://karpathy.github.io/2019/04/25/recipe/)

--------------------------------------------------------------------------------
/rnn.md:
--------------------------------------------------------------------------------
## Recurrent Neural Networks
[Recurrent Neural Network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNN) - RNNs are an extension of neural networks in which each neuron's hidden state is also passed, via a weighted connection, as an input to the neurons in the same layer at the next step of the sequence. RNNs are trained by backpropagation through time. RNNs are used for input sequences such as text, audio, or video.



This feedback architecture allows the network to have memory of previous inputs. The memory is limited by the vanishing/exploding gradient problem. Exploding gradients can be mitigated by gradient clipping. A common hyperparameter is the number of steps to go back, or "unroll," during training.

There are variations such as bi-directional and recursive RNNs.

### GRU
GRU - Gated Recurrent Unit - Introduced by Cho et al. (2014). An RNN variant similar to, but simpler than, the LSTM. It has an update gate and a reset gate, and it combines the hidden state and memory cells, among other differences.

### LSTM
LSTM - Long Short-Term Memory - Introduced by Hochreiter and Schmidhuber in 1997, a specialized RNN that is capable of learning long-term dependencies and mitigates the vanishing gradient problem. It contains memory cells and gate units. The number of memory cells is a hyperparameter. Memory cells pass memory information forward. The gates decide what information is stored in the memory cells. A vanilla LSTM has forget, input, and output gates. There are many variations of the LSTM.



[Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) Blog post by Chris Olah.

--------------------------------------------------------------------------------