├── .gitattributes ├── Week 1 PA 1 ├── Initialization.zip └── README.md ├── Week 1 PA 2 ├── Regularization.zip ├── Regularization │ ├── output_3_0.png │ ├── output_9_1.png │ ├── output_11_0.png │ ├── output_22_1.png │ ├── output_24_0.png │ ├── output_35_3.png │ ├── output_37_0.png │ └── Regularization.md └── README.md ├── Week 2 PA 1 ├── Optimization+methods.zip └── README.md ├── Week 3 PA 1 ├── Tensorflow+Tutorial.zip ├── Tensorflow+Tutorial │ ├── output_35_1.png │ ├── output_62_1.png │ └── Tensorflow Tutorial.md ├── README.md └── Tensorflow+Tutorial.py ├── Lecture Slides ├── Week 2 Optimization algorithms.pdf ├── Week 1 Practical aspects of Deep Learning.pdf ├── Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf └── README.md ├── LICENSE ├── Week 1 PA 3 ├── README.md ├── Gradient+Checking.md └── Gradient+Checking.ipynb └── README.md /.gitattributes: -------------------------------------------------------------------------------- 1 | * linguist-vendored 2 | *.py linguist-vendored=false 3 | -------------------------------------------------------------------------------- /Week 1 PA 1/Initialization.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 1/Initialization.zip -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization.zip -------------------------------------------------------------------------------- /Week 2 PA 1/Optimization+methods.zip: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 2 PA 1/Optimization+methods.zip -------------------------------------------------------------------------------- /Week 3 PA 1/Tensorflow+Tutorial.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial.zip -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_3_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_3_0.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_9_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_9_1.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_11_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_11_0.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_22_1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_22_1.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_24_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_24_0.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_35_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_35_3.png -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/output_37_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_37_0.png -------------------------------------------------------------------------------- /Week 3 PA 1/Tensorflow+Tutorial/output_35_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial/output_35_1.png -------------------------------------------------------------------------------- /Week 3 PA 1/Tensorflow+Tutorial/output_62_1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial/output_62_1.png -------------------------------------------------------------------------------- /Lecture Slides/Week 2 Optimization algorithms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 2 Optimization algorithms.pdf -------------------------------------------------------------------------------- /Lecture Slides/Week 1 Practical aspects of Deep Learning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 1 Practical aspects of Deep Learning.pdf -------------------------------------------------------------------------------- /Lecture Slides/Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Sheng-Qi Shen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to 
use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Week 1 PA 1/README.md: -------------------------------------------------------------------------------- 1 | # Initialization 2 | 3 | # Goal 4 | - See how different initializations lead to different results. 5 | - Understand that different regularization methods could help your model. 6 | 7 | # File Description 8 | - `.ipynb` file is the solution of Week 1 programming assignment 1 9 | - Initialization.ipynb 10 | - `.html` file is the HTML version of the `.ipynb` file. 11 | - Initialization.html 12 | - `.md` file 13 | - Initialization.md 14 | 15 | # Snapshot 16 | - Computer view: open the `.html` file in a browser for a quick look. 17 | - **Recommended** browser view: Initialization.md 18 | 19 | 20 | # Implementation 21 | - Zeros initialization -- setting initialization = "zeros" in the input argument. 22 | - Random initialization -- setting initialization = "random" in the input argument. This initializes the weights to large random values. 23 | - He initialization -- setting initialization = "he" in the input argument.
This initializes the weights to random values scaled according to a paper by He et al., 2015. 24 | 25 | # What you should remember from this notebook: 26 | - Different initializations lead to different results 27 | - Random initialization is used to break symmetry and make sure different hidden units can learn different things 28 | - Don't initialize to values that are too large 29 | - He initialization works well for networks with ReLU activations. 30 | -------------------------------------------------------------------------------- /Week 1 PA 3/README.md: -------------------------------------------------------------------------------- 1 | # Gradient Checking 2 | 3 | # Goal 4 | - Implement gradient checking from scratch. 5 | 6 | - Understand how to use the difference formula to check your backpropagation implementation. 7 | 8 | - Recognize that your backpropagation algorithm should give you results similar to the ones you got by computing the difference formula. 9 | 10 | - Learn how to identify which parameter's gradient was computed incorrectly. 11 | 12 | # File Description 13 | - `.ipynb` file is the solution of Week 1 programming assignment 3 14 | - Gradient+Checking.ipynb 15 | - `.html` file is the HTML version of the `.ipynb` file. 16 | - Gradient+Checking.html 17 | - `.md` file 18 | - Gradient+Checking.md 19 | 20 | # Snapshot 21 | - Computer view: open the `.html` file in a browser for a quick look. 22 | - **Recommended** browser view: Gradient+Checking.md 23 | 24 | 25 | # Implementation 26 | - 1-dimensional gradient checking 27 | - N-dimensional gradient checking 28 | 29 | # What you should remember from this notebook: 30 | - Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation). 31 | - Gradient checking is slow, so we don't run it in every iteration of training.
You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. 32 | 33 | -------------------------------------------------------------------------------- /Lecture Slides/README.md: -------------------------------------------------------------------------------- 1 | # Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization 2 | 3 | Week 1 Practical aspects of Deep Learning 4 | - Setting up your Machine Learning Application 5 | - Train / Dev / Test sets 6 | - Bias / Variance 7 | - Basic Recipe for Machine Learning 8 | - Regularizing your neural network 9 | - Regularization 10 | - Why regularization reduces overfitting? 11 | - Dropout Regularization 12 | - Understanding Dropout 13 | - Other regularization methods 14 | - Setting up your optimization problem 15 | - Normalizing inputs 16 | - Vanishing / Exploding gradients 17 | - Weight Initialization for Deep Networks 18 | - Numerical approximation of gradients 19 | - Gradient checking 20 | - Gradient Checking Implementation Notes 21 | 22 | Week 2 Optimization algorithms 23 | - Optimization algorithms 24 | - Mini-batch gradient descent 25 | - Understanding mini-batch gradient descent 26 | - Exponentially weighted averages 27 | - Understanding exponentially weighted averages 28 | - Bias correction in exponentially weighted averages 29 | - Gradient descent with momentum 30 | - RMSprop 31 | - Adam optimization algorithm 32 | - Learning rate decay 33 | - The problem of local optima 34 | - Heroes of Deep Learning (Optional) 35 | - Yuanqing Lin interview 36 | 37 | Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks 38 | - Hyperparameter tuning 39 | - Tuning process 40 | - Using an appropriate scale to pick hyperparameters 41 | - Hyperparameters tuning in practice: Pandas vs. 
Caviar 42 | - Batch Normalization 43 | - Normalizing activations in a network 44 | - Fitting Batch Norm into a neural network 45 | - Why does Batch Norm work? 46 | - Batch Norm at test time 47 | - Multi-class classification 48 | - Softmax Regression 49 | - Training a softmax classifier 50 | - Introduction to programming frameworks 51 | - Deep learning frameworks 52 | - TensorFlow 53 | -------------------------------------------------------------------------------- /Week 2 PA 1/README.md: -------------------------------------------------------------------------------- 1 | # Optimization 2 | 3 | # Goal 4 | - Understand the intuition behind Adam and RMSprop 5 | 6 | - Recognize the importance of mini-batch gradient descent 7 | 8 | - Learn the effects of momentum on the overall performance of your model 9 | 10 | # File Description 11 | - `.ipynb` file is the solution of Week 2 programming assignment 1 12 | - Optimization+methods.ipynb 13 | - `.html` file is the HTML version of the `.ipynb` file. 14 | - Optimization+methods.html 15 | - `.py` file 16 | - Optimization+methods.py 17 | - `.md` file 18 | - Optimization+methods.md 19 | 20 | # Snapshot 21 | - Computer view: open the `.html` file in a browser for a quick look. 22 | - **Recommended** browser view: Optimization+methods.md 23 | 24 | 25 | # Implementation 26 | - Gradient Descent 27 | - Mini-Batch Gradient descent 28 | - Momentum 29 | - Adam 30 | - Model with different optimization algorithms 31 | - Mini-batch Gradient descent 32 | - Mini-batch gradient descent with momentum 33 | - Mini-batch with Adam mode 34 | 35 | # What you should remember from this notebook: 36 | ## Gradient Descent 37 | - The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step. 38 | - You have to tune a learning rate hyperparameter α.
39 | - With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large). 40 | ## Mini-Batch Gradient descent 41 | - Shuffling and Partitioning are the two steps required to build mini-batches 42 | - Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128. 43 | ## Momentum 44 | - Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent. 45 | - You have to tune a momentum hyperparameter β and a learning rate α. 46 | # Conclusion 47 | optimization method | accuracy | cost shape 48 | --|--|-- 49 | Gradient descent | 79.7% | oscillations 50 | Momentum | 79.7% | oscillations 51 | Adam | 94% | smoother 52 | 53 | # Reference 54 | Adam paper: https://arxiv.org/pdf/1412.6980.pdf 55 | 56 | -------------------------------------------------------------------------------- /Week 1 PA 2/README.md: -------------------------------------------------------------------------------- 1 | # Regularization 2 | 3 | # Goal 4 | - Understand that different regularization methods could help your model. 5 | 6 | - Implement dropout and see it work on data. 7 | 8 | - Recognize that a model without regularization gives you a better accuracy on the training set but not necessarily on the test set. 9 | 10 | - Understand that you could use both dropout and regularization on your model. 11 | 12 | # File Description 13 | - `.ipynb` file is the solution of Week 1 programming assignment 2 14 | - Regularization.ipynb 15 | - `.html` file is the HTML version of the `.ipynb` file. 16 | - Regularization.html 17 | - `.md` file 18 | - Regularization.md 19 | 20 | # Snapshot 21 | - Computer view: open the `.html` file in a browser for a quick look. 22 | - **Recommended** browser view:
Regularization.md 23 | 24 | 25 | # Implementation 26 | - Non-regularized model 27 | - L2 Regularization 28 | - Dropout 29 | - Forward propagation with dropout 30 | - Backward propagation with dropout 31 | 32 | # Conclusion 33 | ## What you should remember -- the implications of L2-regularization on: 34 | - The cost computation: 35 | - A regularization term is added to the cost 36 | - The backpropagation function: 37 | - There are extra terms in the gradients with respect to weight matrices 38 | - Weights end up smaller ("weight decay"): 39 | - Weights are pushed to smaller values. 40 | 41 | ## What you should remember about dropout: 42 | - Dropout is a regularization technique. 43 | - You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time. 44 | - Apply dropout both during forward and backward propagation. 45 | - During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5. 46 | 47 | ## What we want you to remember from this notebook: 48 | - Regularization will help you reduce overfitting. 49 | - Regularization will drive your weights to lower values. 50 | - L2 regularization and Dropout are two very effective regularization techniques. 
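
The keep_prob scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the notebook's graded code: the function names `dropout_forward` and `dropout_backward`, the all-ones activations, and `keep_prob = 0.8` are hypothetical choices made only for the demonstration.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on activations A (hypothetical helper, not from the notebook)."""
    mask = rng.random(A.shape) < keep_prob   # True where a unit is kept
    A_dropped = A * mask / keep_prob         # zero dropped units, rescale the kept ones
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and the same scaling to the upstream gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 100000))                     # assumed activations, all equal to 1
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA_dropped = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)

# The mean activation should stay close to the original mean of 1.0.
print("mean before:", A.mean(), "mean after:", A_dropped.mean())
```

Dividing by keep_prob at training time is what lets the test-time forward pass run with no dropout and no extra scaling.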
51 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization 2 | 3 | Course can be found in [Coursera](https://www.coursera.org/learn/deep-neural-network) 4 | 5 | Quiz and answers are collected for quick search in my blog [SSQ](https://ssq.github.io/2017/08/28/Coursera%20Ng%20Deep%20Learning%20Specialization%20Notebook/#Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization) 6 | 7 | - Week 1 Practical aspects of Deep Learning 8 | - Recall that different types of initializations lead to different results 9 | - Recognize the importance of initialization in complex neural networks. 10 | - Recognize the difference between train/dev/test sets 11 | - Diagnose the bias and variance issues in your model 12 | - Learn when and how to use regularization methods such as dropout or L2 regularization. 
13 | - Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them 14 | - Use gradient checking to verify the correctness of your backpropagation implementation 15 | - [x] [Initialization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%201) 16 | - [x] [Regularization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%202) 17 | - [x] [Gradient Checking](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%203) 18 | - Week 2 Optimization algorithms 19 | - Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam 20 | - Use random minibatches to accelerate the convergence and improve the optimization 21 | - Know the benefits of learning rate decay and apply it to your optimization 22 | - [x] [Optimization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%202%20PA%201) 23 | - Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks 24 | - Master the process of hyperparameter tuning 25 | - Master the process of batch Normalization 26 | - [x] [Tensorflow](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%203%20PA%201) 27 | -------------------------------------------------------------------------------- /Week 3 PA 1/README.md: -------------------------------------------------------------------------------- 1 | # Tensorflow 2 | 3 | # Goal 4 | - Initialize variables 5 | - Start your own session 6 | - Train algorithms 7 | - Implement a Neural Network 8 | # File Description 9 | - `.ipynb` file 
is the solution of Week 3 programming assignment 1 10 | - Tensorflow+Tutorial.ipynb 11 | - `.html` file is the HTML version of the `.ipynb` file. 12 | - Tensorflow+Tutorial.html 13 | - `.py` file 14 | - Tensorflow+Tutorial.py 15 | - `.md` file 16 | - Tensorflow+Tutorial.md 17 | 18 | # Snapshot 19 | - Computer view: open the `.html` file in a browser for a quick look. 20 | - **Recommended** browser view: Tensorflow+Tutorial.md 21 | 22 | 23 | # Implementation 24 | - Exploring the Tensorflow Library 25 | - Linear function 26 | - Computing the sigmoid 27 | - Computing the Cost 28 | - Using One Hot encodings 29 | - Initialize with zeros and ones 30 | - Building your first neural network in tensorflow 31 | - Problem statement: SIGNS Dataset 32 | - Create placeholders 33 | - Initializing the parameters 34 | - Forward propagation in tensorflow 35 | - Compute cost 36 | - Backward propagation & parameter updates 37 | - Building the model 38 | 39 | 40 | # What you should remember from this notebook: 41 | ## Writing and running programs in TensorFlow has the following steps: 42 | 1. Create Tensors (variables) that are not yet executed/evaluated. 43 | 1. Write operations between those Tensors. 44 | 1. Initialize your Tensors. 45 | 1. Create a Session. 46 | 1. Run the Session. This will run the operations you'd written above. 47 | ## To summarize, you now know how to: 48 | 1. Create placeholders 49 | 1. Specify the computation graph corresponding to operations you want to compute 50 | 1. Create the session 51 | 1. Run the session, using a feed dictionary if necessary to specify placeholder variables' values. 52 | ## What you should remember from this notebook: 53 | - Tensorflow is a programming framework used in deep learning 54 | - The two main object classes in tensorflow are Tensors and Operators. 55 | - When you code in tensorflow you have to take the following steps: 56 | - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
57 | - Create a session 58 | - Initialize the session 59 | - Run the session to execute the graph 60 | - You can execute the graph multiple times as you've seen in model() 61 | - The backpropagation and optimization is automatically done when running the session on the "optimizer" object. 62 | 63 | -------------------------------------------------------------------------------- /Week 1 PA 3/Gradient+Checking.md: -------------------------------------------------------------------------------- 1 | 2 | # Gradient Checking 3 | 4 | Welcome to the final assignment for this week! In this assignment you will learn to implement and use gradient checking. 5 | 6 | You are part of a team working to make mobile payments available globally, and are asked to build a deep learning model to detect fraud--whenever someone makes a payment, you want to see if the payment might be fraudulent, such as if the user's account has been taken over by a hacker. 7 | 8 | But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a mission-critical application, your company's CEO wants to be really certain that your implementation of backpropagation is correct. Your CEO says, "Give me a proof that your backpropagation is actually working!" To give this reassurance, you are going to use "gradient checking". 9 | 10 | Let's do it! 11 | 12 | 13 | ```python 14 | # Packages 15 | import numpy as np 16 | from testCases import * 17 | from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector 18 | ``` 19 | 20 | ## 1) How does gradient checking work? 21 | 22 | Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function. 
23 | 24 | Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\frac{\partial J}{\partial \theta}$. 25 | 26 | Let's look back at the definition of a derivative (or gradient): 27 | $$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$ 28 | 29 | If you're not familiar with the "$\displaystyle \lim_{\varepsilon \to 0}$" notation, it's just a way of saying "when $\varepsilon$ is really really small." 30 | 31 | We know the following: 32 | 33 | - $\frac{\partial J}{\partial \theta}$ is what you want to make sure you're computing correctly. 34 | - You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you're confident your implementation of $J$ is correct. 35 | 36 | Let's use equation (1) and a small value for $\varepsilon$ to convince your CEO that your code for computing $\frac{\partial J}{\partial \theta}$ is correct! 37 | 38 | ## 2) 1-dimensional gradient checking 39 | 40 | Consider a 1D linear function $J(\theta) = \theta x$. The model contains only a single real-valued parameter $\theta$, and takes $x$ as input. 41 | 42 | You will implement code to compute $J(.)$ and its derivative $\frac{\partial J}{\partial \theta}$. You will then use gradient checking to make sure your derivative computation for $J$ is correct. 43 | 44 | 45 |
**Figure 1** : **1D linear model**
46 | 47 | The diagram above shows the key computation steps: First start with $x$, then evaluate the function $J(x)$ ("forward propagation"). Then compute the derivative $\frac{\partial J}{\partial \theta}$ ("backward propagation"). 48 | 49 | **Exercise**: implement "forward propagation" and "backward propagation" for this simple function. I.e., compute both $J(.)$ ("forward propagation") and its derivative with respect to $\theta$ ("backward propagation"), in two separate functions. 50 | 51 | 52 | ```python 53 | # GRADED FUNCTION: forward_propagation 54 | 55 | def forward_propagation(x, theta): 56 | """ 57 | Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x) 58 | 59 | Arguments: 60 | x -- a real-valued input 61 | theta -- our parameter, a real number as well 62 | 63 | Returns: 64 | J -- the value of function J, computed using the formula J(theta) = theta * x 65 | """ 66 | 67 | ### START CODE HERE ### (approx. 1 line) 68 | J = theta * x 69 | ### END CODE HERE ### 70 | 71 | return J 72 | ``` 73 | 74 | 75 | ```python 76 | x, theta = 2, 4 77 | J = forward_propagation(x, theta) 78 | print ("J = " + str(J)) 79 | ``` 80 | 81 | J = 8 82 | 83 | 84 | **Expected Output**: 85 | 86 | 87 | 88 | 89 | 90 | 91 |
**J** = 8
92 | 93 | **Exercise**: Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of $J(\theta) = \theta x$ with respect to $\theta$. To save you from doing the calculus, you should get $dtheta = \frac { \partial J }{ \partial \theta} = x$. 94 | 95 | 96 | ```python 97 | # GRADED FUNCTION: backward_propagation 98 | 99 | def backward_propagation(x, theta): 100 | """ 101 | Computes the derivative of J with respect to theta (see Figure 1). 102 | 103 | Arguments: 104 | x -- a real-valued input 105 | theta -- our parameter, a real number as well 106 | 107 | Returns: 108 | dtheta -- the gradient of the cost with respect to theta 109 | """ 110 | 111 | ### START CODE HERE ### (approx. 1 line) 112 | dtheta = x 113 | ### END CODE HERE ### 114 | 115 | return dtheta 116 | ``` 117 | 118 | 119 | ```python 120 | x, theta = 2, 4 121 | dtheta = backward_propagation(x, theta) 122 | print ("dtheta = " + str(dtheta)) 123 | ``` 124 | 125 | dtheta = 2 126 | 127 | 128 | **Expected Output**: 129 | 130 | 131 | 132 | 133 | 134 | 135 |
**dtheta** = 2
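
Equation (1) uses a two-sided (centered) difference. For the linear $J(\theta) = \theta x$ any difference formula is exact, so the advantage is easier to see on a curved function. The sketch below is an illustration only: the function `cube` and the values `theta = 2.0` and `epsilon = 1e-4` are assumptions chosen for the demonstration and are not part of the assignment.

```python
def cube(theta):
    """Assumed toy function J(theta) = theta^3, used only for illustration."""
    return theta ** 3

def one_sided(f, theta, epsilon):
    # Forward difference: (f(theta + eps) - f(theta)) / eps
    return (f(theta + epsilon) - f(theta)) / epsilon

def two_sided(f, theta, epsilon):
    # Centered difference, the form used in equation (1)
    return (f(theta + epsilon) - f(theta - epsilon)) / (2.0 * epsilon)

theta, epsilon = 2.0, 1e-4
exact = 3.0 * theta ** 2                      # analytic derivative of theta^3

err_one = abs(one_sided(cube, theta, epsilon) - exact)
err_two = abs(two_sided(cube, theta, epsilon) - exact)

print("one-sided error:", err_one)
print("two-sided error:", err_two)
```

With these values the one-sided error shrinks roughly in proportion to $\varepsilon$, while the two-sided error shrinks roughly in proportion to $\varepsilon^2$, which is why the centered form is the usual choice for gradient checking.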
136 | 137 | **Exercise**: To show that the `backward_propagation()` function is correctly computing the gradient $\frac{\partial J}{\partial \theta}$, let's implement gradient checking. 138 | 139 | **Instructions**: 140 | - First compute "gradapprox" using the formula above (1) and a small value of $\varepsilon$. Here are the steps to follow: 141 | 1. $\theta^{+} = \theta + \varepsilon$ 142 | 2. $\theta^{-} = \theta - \varepsilon$ 143 | 3. $J^{+} = J(\theta^{+})$ 144 | 4. $J^{-} = J(\theta^{-})$ 145 | 5. $gradapprox = \frac{J^{+} - J^{-}}{2 \varepsilon}$ 146 | - Then compute the gradient using backward propagation, and store the result in a variable "grad" 147 | - Finally, compute the relative difference between "gradapprox" and "grad" using the following formula: 148 | $$ difference = \frac {\mid\mid grad - gradapprox \mid\mid_2}{\mid\mid grad \mid\mid_2 + \mid\mid gradapprox \mid\mid_2} \tag{2}$$ 149 | You will need 3 steps to compute this formula: 150 | - 1'. compute the numerator using np.linalg.norm(...) 151 | - 2'. compute the denominator. You will need to call np.linalg.norm(...) twice. 152 | - 3'. divide them. 153 | - If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation. 154 | 155 | 156 | 157 | ```python 158 | # GRADED FUNCTION: gradient_check 159 | 160 | def gradient_check(x, theta, epsilon = 1e-7): 161 | """ 162 | Implement gradient checking for the backward propagation presented in Figure 1. 163 | 164 | Arguments: 165 | x -- a real-valued input 166 | theta -- our parameter, a real number as well 167 | epsilon -- tiny shift to the input to compute approximated gradient with formula(1) 168 | 169 | Returns: 170 | difference -- difference (2) between the approximated gradient and the backward propagation gradient 171 | """ 172 | 173 | # Compute gradapprox using the right-hand side of formula (1).
epsilon is small enough, you don't need to worry about the limit. 174 | ### START CODE HERE ### (approx. 5 lines) 175 | thetaplus = theta + epsilon # Step 1 176 | thetaminus = theta - epsilon # Step 2 177 | J_plus = forward_propagation(x, thetaplus) # Step 3 178 | J_minus = forward_propagation(x, thetaminus) # Step 4 179 | gradapprox = (J_plus - J_minus) / (2. * epsilon) # Step 5 180 | ### END CODE HERE ### 181 | 182 | # Check if gradapprox is close enough to the output of backward_propagation() 183 | ### START CODE HERE ### (approx. 1 line) 184 | grad = backward_propagation(x, theta) 185 | ### END CODE HERE ### 186 | 187 | ### START CODE HERE ### (approx. 3 lines) 188 | numerator = np.linalg.norm(grad - gradapprox) # Step 1' 189 | denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) # Step 2' 190 | difference = numerator / denominator # Step 3' 191 | ### END CODE HERE ### 192 | 193 | if difference < 1e-7: 194 | print ("The gradient is correct!") 195 | else: 196 | print ("The gradient is wrong!") 197 | 198 | return difference 199 | ``` 200 | 201 | 202 | ```python 203 | x, theta = 2, 4 204 | difference = gradient_check(x, theta) 205 | print("difference = " + str(difference)) 206 | ``` 207 | 208 | The gradient is correct! 209 | difference = 2.91933588329e-10 210 | 211 | 212 | **Expected Output**: 213 | The gradient is correct! 214 | 215 | 216 | 217 | 218 | 219 |
**difference**: 2.9193358103083e-10
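
To see what the check buys you, here is a minimal, self-contained sketch of the same 1D procedure applied to both a correct derivative and a deliberately wrong one. The names `forward`, `backward_correct`, `backward_buggy` and `check` are hypothetical helpers written for this illustration; they are not part of the assignment code.

```python
import numpy as np

def forward(x, theta):
    """Toy cost J(theta) = theta * x, as in Figure 1."""
    return theta * x

def backward_correct(x, theta):
    """Analytic derivative dJ/dtheta = x."""
    return x

def backward_buggy(x, theta):
    """Deliberately wrong derivative (hypothetical bug: extra factor of 2)."""
    return 2 * x

def check(backward, x, theta, epsilon=1e-7):
    """Relative difference (2) between backward() and the centered difference (1)."""
    gradapprox = (forward(x, theta + epsilon) - forward(x, theta - epsilon)) / (2 * epsilon)
    grad = backward(x, theta)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

diff_ok = check(backward_correct, 2, 4)
diff_bad = check(backward_buggy, 2, 4)
print("correct derivative:", diff_ok)
print("buggy derivative:  ", diff_bad)
```

With the factor-of-2 bug, `grad` is 4 while `gradapprox` is about 2, so the relative difference is roughly $2 / 6 \approx 0.33$, many orders of magnitude above the $10^{-7}$ threshold.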
220 | 221 | Congrats, the difference is smaller than the $10^{-7}$ threshold. So you can have high confidence that you've correctly computed the gradient in `backward_propagation()`. 222 | 223 | Now, in the more general case, your cost function $J$ has more than a single 1D input. When you are training a neural network, $\theta$ actually consists of multiple matrices $W^{[l]}$ and biases $b^{[l]}$! It is important to know how to do a gradient check with higher-dimensional inputs. Let's do it! 224 | 225 | ## 3) N-dimensional gradient checking 226 | 227 | The following figure describes the forward and backward propagation of your fraud detection model. 228 | 229 | 230 |
**Figure 2** : **deep neural network**
*LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID*

Let's look at your implementations for forward propagation and backward propagation.

```python
def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 2.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)

    Returns:
    cost -- the logistic cost, averaged over the m examples
    cache -- tuple of intermediate values, consumed by backward_propagation_n()
    """

    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1./m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache
```

Now, run backward propagation.

```python
def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in Figure 2.

    Arguments:
    X -- input data, of shape (input size, m)
    Y -- true labels, of shape (1, m)
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    # Buggy line from the provided code (spurious factor of 2), kept for reference:
    # dW2 = 1./m * np.dot(dZ2, A1.T) * 2
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    # Buggy line from the provided code (4./m instead of 1./m), kept for reference:
    # db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
```

You obtained some results on the fraud detection test set but you are not 100% sure of your model. Nobody's perfect! Let's implement gradient checking to verify that your gradients are correct.

**How does gradient checking work?**

As in 1) and 2), you want to compare "gradapprox" to the gradient computed by backpropagation. The formula is still:

$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

However, $\theta$ is not a scalar anymore. It is a dictionary called "parameters". We implemented a function "`dictionary_to_vector()`" for you.
It converts the "parameters" dictionary into a vector called "values", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them. 333 | 334 | The inverse function is "`vector_to_dictionary`" which outputs back the "parameters" dictionary. 335 | 336 | 337 |
**Figure 3** : **dictionary_to_vector() and vector_to_dictionary()**
You will need these functions in gradient_check_n()
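
The flatten-and-restore idea can be sketched as follows. This is only an illustration under stated assumptions: `dict_to_vec` and `vec_to_dict` are hypothetical stand-ins, not the `gc_utils` implementations, and the parameter shapes are the ones listed in the `forward_propagation_n` docstring.

```python
import numpy as np

# Fixed key order so that flattening and restoring agree with each other.
KEYS = ["W1", "b1", "W2", "b2", "W3", "b3"]

def dict_to_vec(parameters):
    """Reshape every parameter into a column and stack the columns into one vector."""
    shapes = {k: parameters[k].shape for k in KEYS}
    vec = np.concatenate([parameters[k].reshape(-1, 1) for k in KEYS], axis=0)
    return vec, shapes

def vec_to_dict(vec, shapes):
    """Inverse of dict_to_vec: slice the vector and restore each original shape."""
    out, start = {}, 0
    for k in KEYS:
        size = int(np.prod(shapes[k]))
        out[k] = vec[start:start + size].reshape(shapes[k])
        start += size
    return out

rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((5, 4)), "b1": rng.standard_normal((5, 1)),
    "W2": rng.standard_normal((3, 5)), "b2": rng.standard_normal((3, 1)),
    "W3": rng.standard_normal((1, 3)), "b3": rng.standard_normal((1, 1)),
}

theta, shapes = dict_to_vec(params)
restored = vec_to_dict(theta, shapes)
print(theta.shape)  # 20 + 5 + 15 + 3 + 3 + 1 = 47 entries, as a (47, 1) column
```

The one requirement is that both directions use the same key order; otherwise perturbing `theta[i]` would change a different weight than the one whose gradient sits at position `i` of the gradient vector.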

We have also converted the "gradients" dictionary into a vector "grad" using `gradients_to_vector()`. You don't need to worry about that.

**Exercise**: Implement `gradient_check_n()`.

**Instructions**: Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:
- To compute `J_plus[i]`:
    1. Set $\theta^{+}$ to `np.copy(parameters_values)`
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using `forward_propagation_n(X, Y, vector_to_dictionary(`$\theta^{+}$ `))`.
- To compute `J_minus[i]`: do the same thing with $\theta^{-}$
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$

Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameters_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute:
$$ difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 } \tag{3}$$

```python
# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n correctly computes the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input data, of shape (input size, m)
    Y -- true labels, of shape (1, m)
    epsilon -- tiny shift to the input to compute approximated gradient with formula (1)

    Returns:
    difference -- difference (3) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n returns two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                          # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                     # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))     # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                         # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                   # Step 2
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))   # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2. * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 3 lines)
    numerator = np.linalg.norm(grad - gradapprox)                      # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)    # Step 2'
    difference = numerator / denominator                               # Step 3'
    ### END CODE HERE ###

    if difference > 1e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
```

```python
X, Y, parameters = gradient_check_n_test_case()

cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)
```

    There is a mistake in the backward propagation! difference = 1.18904178788e-07

**Expected output**:

**There is a mistake in the backward propagation!** difference = 0.285093156781

It seems that there were errors in the `backward_propagation_n` code we gave you! Good that you've implemented the gradient check. Go back to `backward_propagation_n` and try to find and correct the errors *(Hint: check dW2 and db1)*. Rerun the gradient check when you think you've fixed it. Remember you'll need to re-execute the cell defining `backward_propagation_n()` if you modify the code.

In the code above, the two faulty lines are already corrected (the originals are kept as comments), which is why the run reports a difference of about $1.19 \times 10^{-7}$ rather than the expected 0.285. That value is still marginally above the strict $10^{-7}$ cutoff, so the warning message is printed even though the gradients now agree closely.

Can you get gradient check to declare your derivative computation correct? Even though this part of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient check until you're convinced backprop is now correctly implemented.

**Note**
- Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly, because it needs two forward passes per parameter. For this reason, we don't run gradient checking at every iteration during training, only a few times to check that the gradient is correct.
- Gradient checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

Congrats, you can be confident that your deep learning model for fraud detection is working correctly! You can even use this to convince your CEO. :)


**What you should remember from this notebook**:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
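
Both takeaways can be illustrated with a small, self-contained sketch on a toy cost $J(\theta) = \sum_i \theta_i^2$. Everything here (`toy_cost`, `grad_correct`, `grad_buggy`, `gradient_check_vec`) is a hypothetical helper written for this illustration, not the assignment's network code; the injected bug is an assumption chosen to mimic the kind of scaling error hinted at for `dW2` and `db1`.

```python
import numpy as np

def toy_cost(theta):
    """Toy cost J(theta) = sum(theta_i ** 2) for a column vector theta."""
    return float(np.sum(theta ** 2))

def grad_correct(theta):
    """Analytic gradient of the toy cost: 2 * theta."""
    return 2 * theta

def grad_buggy(theta):
    """Same gradient with one entry scaled by 4 (hypothetical bug)."""
    g = 2 * theta
    g[0, 0] *= 4
    return g

def gradient_check_vec(cost_fn, grad, theta, epsilon=1e-7):
    """Return the relative difference (3) and the number of cost evaluations used."""
    n = theta.shape[0]
    gradapprox = np.zeros((n, 1))
    evaluations = 0
    for i in range(n):
        plus = np.copy(theta)
        plus[i, 0] += epsilon
        minus = np.copy(theta)
        minus[i, 0] -= epsilon
        gradapprox[i, 0] = (cost_fn(plus) - cost_fn(minus)) / (2 * epsilon)
        evaluations += 2  # one evaluation each for J_plus[i] and J_minus[i]
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator, evaluations

theta = np.arange(1.0, 6.0).reshape(-1, 1)  # 5 parameters: 1, 2, 3, 4, 5

diff_ok, n_evals = gradient_check_vec(toy_cost, grad_correct(theta), theta)
diff_bad, _ = gradient_check_vec(toy_cost, grad_buggy(theta), theta)
print("cost evaluations:", n_evals)
print("correct gradient:", diff_ok)
print("buggy gradient:  ", diff_bad)
```

The check uses `2 * num_parameters` cost evaluations, which is why it is too expensive to run at every training iteration, and a single mis-scaled entry is enough to push the relative difference far above the threshold.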
456 | 457 | 458 | ```python 459 | 460 | ``` 461 | -------------------------------------------------------------------------------- /Week 1 PA 3/Gradient+Checking.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Gradient Checking\n", 8 | "\n", 9 | "Welcome to the final assignment for this week! In this assignment you will learn to implement and use gradient checking. \n", 10 | "\n", 11 | "You are part of a team working to make mobile payments available globally, and are asked to build a deep learning model to detect fraud--whenever someone makes a payment, you want to see if the payment might be fraudulent, such as if the user's account has been taken over by a hacker. \n", 12 | "\n", 13 | "But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a mission-critical application, your company's CEO wants to be really certain that your implementation of backpropagation is correct. Your CEO says, \"Give me a proof that your backpropagation is actually working!\" To give this reassurance, you are going to use \"gradient checking\".\n", 14 | "\n", 15 | "Let's do it!" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 5, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "# Packages\n", 27 | "import numpy as np\n", 28 | "from testCases import *\n", 29 | "from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## 1) How does gradient checking work?\n", 37 | "\n", 38 | "Backpropagation computes the gradients $\\frac{\\partial J}{\\partial \\theta}$, where $\\theta$ denotes the parameters of the model. 
$J$ is computed using forward propagation and your loss function.\n", 39 | "\n", 40 | "Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\\frac{\\partial J}{\\partial \\theta}$. \n", 41 | "\n", 42 | "Let's look back at the definition of a derivative (or gradient):\n", 43 | "$$ \\frac{\\partial J}{\\partial \\theta} = \\lim_{\\varepsilon \\to 0} \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon} \\tag{1}$$\n", 44 | "\n", 45 | "If you're not familiar with the \"$\\displaystyle \\lim_{\\varepsilon \\to 0}$\" notation, it's just a way of saying \"when $\\varepsilon$ is really really small.\"\n", 46 | "\n", 47 | "We know the following:\n", 48 | "\n", 49 | "- $\\frac{\\partial J}{\\partial \\theta}$ is what you want to make sure you're computing correctly. \n", 50 | "- You can compute $J(\\theta + \\varepsilon)$ and $J(\\theta - \\varepsilon)$ (in the case that $\\theta$ is a real number), since you're confident your implementation for $J$ is correct. \n", 51 | "\n", 52 | "Lets use equation (1) and a small value for $\\varepsilon$ to convince your CEO that your code for computing $\\frac{\\partial J}{\\partial \\theta}$ is correct!" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## 2) 1-dimensional gradient checking\n", 60 | "\n", 61 | "Consider a 1D linear function $J(\\theta) = \\theta x$. The model contains only a single real-valued parameter $\\theta$, and takes $x$ as input.\n", 62 | "\n", 63 | "You will implement code to compute $J(.)$ and its derivative $\\frac{\\partial J}{\\partial \\theta}$. You will then use gradient checking to make sure your derivative computation for $J$ is correct. \n", 64 | "\n", 65 | "\n", 66 | "
**Figure 1** : **1D linear model**
\n", 67 | "\n", 68 | "The diagram above shows the key computation steps: First start with $x$, then evaluate the function $J(x)$ (\"forward propagation\"). Then compute the derivative $\\frac{\\partial J}{\\partial \\theta}$ (\"backward propagation\"). \n", 69 | "\n", 70 | "**Exercise**: implement \"forward propagation\" and \"backward propagation\" for this simple function. I.e., compute both $J(.)$ (\"forward propagation\") and its derivative with respect to $\\theta$ (\"backward propagation\"), in two separate functions. " 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 6, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "# GRADED FUNCTION: forward_propagation\n", 82 | "\n", 83 | "def forward_propagation(x, theta):\n", 84 | " \"\"\"\n", 85 | " Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)\n", 86 | " \n", 87 | " Arguments:\n", 88 | " x -- a real-valued input\n", 89 | " theta -- our parameter, a real number as well\n", 90 | " \n", 91 | " Returns:\n", 92 | " J -- the value of function J, computed using the formula J(theta) = theta * x\n", 93 | " \"\"\"\n", 94 | " \n", 95 | " ### START CODE HERE ### (approx. 1 line)\n", 96 | " J = theta * x\n", 97 | " ### END CODE HERE ###\n", 98 | " \n", 99 | " return J" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 7, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "J = 8\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "x, theta = 2, 4\n", 117 | "J = forward_propagation(x, theta)\n", 118 | "print (\"J = \" + str(J))" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "**Expected Output**:\n", 126 | "\n", 127 | "\n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | "
** J ** 8
" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "**Exercise**: Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of $J(\\theta) = \\theta x$ with respect to $\\theta$. To save you from doing the calculus, you should get $dtheta = \\frac { \\partial J }{ \\partial \\theta} = x$." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 8, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "# GRADED FUNCTION: backward_propagation\n", 151 | "\n", 152 | "def backward_propagation(x, theta):\n", 153 | " \"\"\"\n", 154 | " Computes the derivative of J with respect to theta (see Figure 1).\n", 155 | " \n", 156 | " Arguments:\n", 157 | " x -- a real-valued input\n", 158 | " theta -- our parameter, a real number as well\n", 159 | " \n", 160 | " Returns:\n", 161 | " dtheta -- the gradient of the cost with respect to theta\n", 162 | " \"\"\"\n", 163 | " \n", 164 | " ### START CODE HERE ### (approx. 1 line)\n", 165 | " dtheta = x\n", 166 | " ### END CODE HERE ###\n", 167 | " \n", 168 | " return dtheta" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 9, 174 | "metadata": { 175 | "scrolled": true 176 | }, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "dtheta = 2\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "x, theta = 2, 4\n", 188 | "dtheta = backward_propagation(x, theta)\n", 189 | "print (\"dtheta = \" + str(dtheta))" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "**Expected Output**:\n", 197 | "\n", 198 | "\n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | "
** dtheta ** 2
" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "**Exercise**: To show that the `backward_propagation()` function is correctly computing the gradient $\\frac{\\partial J}{\\partial \\theta}$, let's implement gradient checking.\n", 211 | "\n", 212 | "**Instructions**:\n", 213 | "- First compute \"gradapprox\" using the formula above (1) and a small value of $\\varepsilon$. Here are the Steps to follow:\n", 214 | " 1. $\\theta^{+} = \\theta + \\varepsilon$\n", 215 | " 2. $\\theta^{-} = \\theta - \\varepsilon$\n", 216 | " 3. $J^{+} = J(\\theta^{+})$\n", 217 | " 4. $J^{-} = J(\\theta^{-})$\n", 218 | " 5. $gradapprox = \\frac{J^{+} - J^{-}}{2 \\varepsilon}$\n", 219 | "- Then compute the gradient using backward propagation, and store the result in a variable \"grad\"\n", 220 | "- Finally, compute the relative difference between \"gradapprox\" and the \"grad\" using the following formula:\n", 221 | "$$ difference = \\frac {\\mid\\mid grad - gradapprox \\mid\\mid_2}{\\mid\\mid grad \\mid\\mid_2 + \\mid\\mid gradapprox \\mid\\mid_2} \\tag{2}$$\n", 222 | "You will need 3 Steps to compute this formula:\n", 223 | " - 1'. compute the numerator using np.linalg.norm(...)\n", 224 | " - 2'. compute the denominator. You will need to call np.linalg.norm(...) twice.\n", 225 | " - 3'. divide them.\n", 226 | "- If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation. 
\n" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 10, 232 | "metadata": { 233 | "collapsed": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "# GRADED FUNCTION: gradient_check\n", 238 | "\n", 239 | "def gradient_check(x, theta, epsilon = 1e-7):\n", 240 | " \"\"\"\n", 241 | " Implement the backward propagation presented in Figure 1.\n", 242 | " \n", 243 | " Arguments:\n", 244 | " x -- a real-valued input\n", 245 | " theta -- our parameter, a real number as well\n", 246 | " epsilon -- tiny shift to the input to compute approximated gradient with formula(1)\n", 247 | " \n", 248 | " Returns:\n", 249 | " difference -- difference (2) between the approximated gradient and the backward propagation gradient\n", 250 | " \"\"\"\n", 251 | " \n", 252 | " # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.\n", 253 | " ### START CODE HERE ### (approx. 5 lines)\n", 254 | " thetaplus = theta + epsilon # Step 1\n", 255 | " thetaminus = theta - epsilon # Step 2\n", 256 | " J_plus = thetaplus * x # Step 3\n", 257 | " J_minus = thetaminus * x # Step 4\n", 258 | " gradapprox = (J_plus - J_minus) / (2. * epsilon) # Step 5\n", 259 | " ### END CODE HERE ###\n", 260 | " \n", 261 | " # Check if gradapprox is close enough to the output of backward_propagation()\n", 262 | " ### START CODE HERE ### (approx. 1 line)\n", 263 | " grad = x\n", 264 | " ### END CODE HERE ###\n", 265 | " \n", 266 | " ### START CODE HERE ### (approx. 
1 line)\n", 267 | " numerator = np.linalg.norm(grad - gradapprox) # Step 1'\n", 268 | " denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) # Step 2'\n", 269 | " difference = numerator / denominator # Step 3'\n", 270 | " ### END CODE HERE ###\n", 271 | " \n", 272 | " if difference < 1e-7:\n", 273 | " print (\"The gradient is correct!\")\n", 274 | " else:\n", 275 | " print (\"The gradient is wrong!\")\n", 276 | " \n", 277 | " return difference" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 11, 283 | "metadata": { 284 | "scrolled": true 285 | }, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "The gradient is correct!\n", 292 | "difference = 2.91933588329e-10\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "x, theta = 2, 4\n", 298 | "difference = gradient_check(x, theta)\n", 299 | "print(\"difference = \" + str(difference))" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "**Expected Output**:\n", 307 | "The gradient is correct!\n", 308 | "\n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "
** difference ** 2.9193358103083e-10
" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Congrats, the difference is smaller than the $10^{-7}$ threshold. So you can have high confidence that you've correctly computed the gradient in `backward_propagation()`. \n", 321 | "\n", 322 | "Now, in the more general case, your cost function $J$ has more than a single 1D input. When you are training a neural network, $\\theta$ actually consists of multiple matrices $W^{[l]}$ and biases $b^{[l]}$! It is important to know how to do a gradient check with higher-dimensional inputs. Let's do it!" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "## 3) N-dimensional gradient checking" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": { 335 | "collapsed": true 336 | }, 337 | "source": [ 338 | "The following figure describes the forward and backward propagation of your fraud detection model.\n", 339 | "\n", 340 | "\n", 341 | "
**Figure 2** : **deep neural network**
*LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID*
\n", 342 | "\n", 343 | "Let's look at your implementations for forward propagation and backward propagation. " 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 12, 349 | "metadata": { 350 | "collapsed": true 351 | }, 352 | "outputs": [], 353 | "source": [ 354 | "def forward_propagation_n(X, Y, parameters):\n", 355 | " \"\"\"\n", 356 | " Implements the forward propagation (and computes the cost) presented in Figure 3.\n", 357 | " \n", 358 | " Arguments:\n", 359 | " X -- training set for m examples\n", 360 | " Y -- labels for m examples \n", 361 | " parameters -- python dictionary containing your parameters \"W1\", \"b1\", \"W2\", \"b2\", \"W3\", \"b3\":\n", 362 | " W1 -- weight matrix of shape (5, 4)\n", 363 | " b1 -- bias vector of shape (5, 1)\n", 364 | " W2 -- weight matrix of shape (3, 5)\n", 365 | " b2 -- bias vector of shape (3, 1)\n", 366 | " W3 -- weight matrix of shape (1, 3)\n", 367 | " b3 -- bias vector of shape (1, 1)\n", 368 | " \n", 369 | " Returns:\n", 370 | " cost -- the cost function (logistic cost for one example)\n", 371 | " \"\"\"\n", 372 | " \n", 373 | " # retrieve parameters\n", 374 | " m = X.shape[1]\n", 375 | " W1 = parameters[\"W1\"]\n", 376 | " b1 = parameters[\"b1\"]\n", 377 | " W2 = parameters[\"W2\"]\n", 378 | " b2 = parameters[\"b2\"]\n", 379 | " W3 = parameters[\"W3\"]\n", 380 | " b3 = parameters[\"b3\"]\n", 381 | "\n", 382 | " # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID\n", 383 | " Z1 = np.dot(W1, X) + b1\n", 384 | " A1 = relu(Z1)\n", 385 | " Z2 = np.dot(W2, A1) + b2\n", 386 | " A2 = relu(Z2)\n", 387 | " Z3 = np.dot(W3, A2) + b3\n", 388 | " A3 = sigmoid(Z3)\n", 389 | "\n", 390 | " # Cost\n", 391 | " logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)\n", 392 | " cost = 1./m * np.sum(logprobs)\n", 393 | " \n", 394 | " cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)\n", 395 | " \n", 396 | " return cost, cache" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 
401 | "metadata": {}, 402 | "source": [ 403 | "Now, run backward propagation." 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 16, 409 | "metadata": { 410 | "collapsed": true 411 | }, 412 | "outputs": [], 413 | "source": [ 414 | "def backward_propagation_n(X, Y, cache):\n", 415 | " \"\"\"\n", 416 | " Implement the backward propagation presented in figure 2.\n", 417 | " \n", 418 | " Arguments:\n", 419 | " X -- input datapoint, of shape (input size, 1)\n", 420 | " Y -- true \"label\"\n", 421 | " cache -- cache output from forward_propagation_n()\n", 422 | " \n", 423 | " Returns:\n", 424 | " gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.\n", 425 | " \"\"\"\n", 426 | " \n", 427 | " m = X.shape[1]\n", 428 | " (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache\n", 429 | " \n", 430 | " dZ3 = A3 - Y\n", 431 | " dW3 = 1./m * np.dot(dZ3, A2.T)\n", 432 | " db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)\n", 433 | " \n", 434 | " dA2 = np.dot(W3.T, dZ3)\n", 435 | " dZ2 = np.multiply(dA2, np.int64(A2 > 0))\n", 436 | " # dW2 = 1./m * np.dot(dZ2, A1.T) * 2\n", 437 | " dW2 = 1./m * np.dot(dZ2, A1.T) \n", 438 | " db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)\n", 439 | " \n", 440 | " dA1 = np.dot(W2.T, dZ2)\n", 441 | " dZ1 = np.multiply(dA1, np.int64(A1 > 0))\n", 442 | " dW1 = 1./m * np.dot(dZ1, X.T)\n", 443 | " # db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)\n", 444 | " db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)\n", 445 | " \n", 446 | " gradients = {\"dZ3\": dZ3, \"dW3\": dW3, \"db3\": db3,\n", 447 | " \"dA2\": dA2, \"dZ2\": dZ2, \"dW2\": dW2, \"db2\": db2,\n", 448 | " \"dA1\": dA1, \"dZ1\": dZ1, \"dW1\": dW1, \"db1\": db1}\n", 449 | " \n", 450 | " return gradients" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": { 456 | "collapsed": true 457 | }, 458 | "source": [ 459 | "You obtained some results on the fraud detection test 
set but you are not 100% sure of your model. Nobody's perfect! Let's implement gradient checking to verify if your gradients are correct." 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "**How does gradient checking work?**.\n", 467 | "\n", 468 | "As in 1) and 2), you want to compare \"gradapprox\" to the gradient computed by backpropagation. The formula is still:\n", 469 | "\n", 470 | "$$ \\frac{\\partial J}{\\partial \\theta} = \\lim_{\\varepsilon \\to 0} \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon} \\tag{1}$$\n", 471 | "\n", 472 | "However, $\\theta$ is not a scalar anymore. It is a dictionary called \"parameters\". We implemented a function \"`dictionary_to_vector()`\" for you. It converts the \"parameters\" dictionary into a vector called \"values\", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them.\n", 473 | "\n", 474 | "The inverse function is \"`vector_to_dictionary`\" which outputs back the \"parameters\" dictionary.\n", 475 | "\n", 476 | "\n", 477 | "
**Figure 2** : **dictionary_to_vector() and vector_to_dictionary()**
You will need these functions in gradient_check_n()
\n", 478 | "\n", 479 | "We have also converted the \"gradients\" dictionary into a vector \"grad\" using gradients_to_vector(). You don't need to worry about that.\n", 480 | "\n", 481 | "**Exercise**: Implement gradient_check_n().\n", 482 | "\n", 483 | "**Instructions**: Here is pseudo-code that will help you implement the gradient check.\n", 484 | "\n", 485 | "For each i in num_parameters:\n", 486 | "- To compute `J_plus[i]`:\n", 487 | " 1. Set $\\theta^{+}$ to `np.copy(parameters_values)`\n", 488 | " 2. Set $\\theta^{+}_i$ to $\\theta^{+}_i + \\varepsilon$\n", 489 | " 3. Calculate $J^{+}_i$ using to `forward_propagation_n(x, y, vector_to_dictionary(`$\\theta^{+}$ `))`. \n", 490 | "- To compute `J_minus[i]`: do the same thing with $\\theta^{-}$\n", 491 | "- Compute $gradapprox[i] = \\frac{J^{+}_i - J^{-}_i}{2 \\varepsilon}$\n", 492 | "\n", 493 | "Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameter_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute: \n", 494 | "$$ difference = \\frac {\\| grad - gradapprox \\|_2}{\\| grad \\|_2 + \\| gradapprox \\|_2 } \\tag{3}$$" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 14, 500 | "metadata": { 501 | "collapsed": true 502 | }, 503 | "outputs": [], 504 | "source": [ 505 | "# GRADED FUNCTION: gradient_check_n\n", 506 | "\n", 507 | "def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):\n", 508 | " \"\"\"\n", 509 | " Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n\n", 510 | " \n", 511 | " Arguments:\n", 512 | " parameters -- python dictionary containing your parameters \"W1\", \"b1\", \"W2\", \"b2\", \"W3\", \"b3\":\n", 513 | " grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
\n", 514 | " x -- input datapoint, of shape (input size, 1)\n", 515 | " y -- true \"label\"\n", 516 | " epsilon -- tiny shift to the input to compute approximated gradient with formula(1)\n", 517 | " \n", 518 | " Returns:\n", 519 | " difference -- difference (2) between the approximated gradient and the backward propagation gradient\n", 520 | " \"\"\"\n", 521 | " \n", 522 | " # Set-up variables\n", 523 | " parameters_values, _ = dictionary_to_vector(parameters)\n", 524 | " grad = gradients_to_vector(gradients)\n", 525 | " num_parameters = parameters_values.shape[0]\n", 526 | " J_plus = np.zeros((num_parameters, 1))\n", 527 | " J_minus = np.zeros((num_parameters, 1))\n", 528 | " gradapprox = np.zeros((num_parameters, 1))\n", 529 | " \n", 530 | " # Compute gradapprox\n", 531 | " for i in range(num_parameters):\n", 532 | " \n", 533 | " # Compute J_plus[i]. Inputs: \"parameters_values, epsilon\". Output = \"J_plus[i]\".\n", 534 | " # \"_\" is used because the function you have to outputs two parameters but we only care about the first one\n", 535 | " ### START CODE HERE ### (approx. 3 lines)\n", 536 | " thetaplus = np.copy(parameters_values) # Step 1\n", 537 | " thetaplus[i][0] = thetaplus[i][0] + epsilon # Step 2\n", 538 | " J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus)) # Step 3\n", 539 | " ### END CODE HERE ###\n", 540 | " \n", 541 | " # Compute J_minus[i]. Inputs: \"parameters_values, epsilon\". Output = \"J_minus[i]\".\n", 542 | " ### START CODE HERE ### (approx. 3 lines)\n", 543 | " thetaminus = np.copy(parameters_values) # Step 1\n", 544 | " thetaminus[i][0] = thetaminus[i][0] - epsilon # Step 2 \n", 545 | " J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus)) # Step 3\n", 546 | " ### END CODE HERE ###\n", 547 | " \n", 548 | " # Compute gradapprox[i]\n", 549 | " ### START CODE HERE ### (approx. 1 line)\n", 550 | " gradapprox[i] = (J_plus[i] - J_minus[i]) / (2. 
* epsilon)\n", 551 | " ### END CODE HERE ###\n", 552 | " \n", 553 | " # Compare gradapprox to backward propagation gradients by computing difference.\n", 554 | " ### START CODE HERE ### (approx. 1 line)\n", 555 | " numerator = np.linalg.norm(grad - gradapprox) # Step 1'\n", 556 | " denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) # Step 2'\n", 557 | " difference = numerator / denominator # Step 3'\n", 558 | " ### END CODE HERE ###\n", 559 | "\n", 560 | " if difference > 1e-7:\n", 561 | " print (\"\\033[93m\" + \"There is a mistake in the backward propagation! difference = \" + str(difference) + \"\\033[0m\")\n", 562 | " else:\n", 563 | " print (\"\\033[92m\" + \"Your backward propagation works perfectly fine! difference = \" + str(difference) + \"\\033[0m\")\n", 564 | " \n", 565 | " return difference" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 17, 571 | "metadata": { 572 | "scrolled": false 573 | }, 574 | "outputs": [ 575 | { 576 | "name": "stdout", 577 | "output_type": "stream", 578 | "text": [ 579 | "\u001b[93mThere is a mistake in the backward propagation! difference = 1.18904178788e-07\u001b[0m\n" 580 | ] 581 | } 582 | ], 583 | "source": [ 584 | "X, Y, parameters = gradient_check_n_test_case()\n", 585 | "\n", 586 | "cost, cache = forward_propagation_n(X, Y, parameters)\n", 587 | "gradients = backward_propagation_n(X, Y, cache)\n", 588 | "difference = gradient_check_n(parameters, gradients, X, Y)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "**Expected output**:\n", 596 | "\n", 597 | "\n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | "
** There is a mistake in the backward propagation!** difference = 0.285093156781
" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": {}, 608 | "source": [ 609 | "It seems that there were errors in the `backward_propagation_n` code we gave you! Good that you've implemented the gradient check. Go back to `backward_propagation` and try to find/correct the errors *(Hint: check dW2 and db1)*. Rerun the gradient check when you think you've fixed it. Remember you'll need to re-execute the cell defining `backward_propagation_n()` if you modify the code. \n", 610 | "\n", 611 | "Can you get gradient check to declare your derivative computation correct? Even though this part of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient check until you're convinced backprop is now correctly implemented. \n", 612 | "\n", 613 | "**Note** \n", 614 | "- Gradient Checking is slow! Approximating the gradient with $\\frac{\\partial J}{\\partial \\theta} \\approx \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon}$ is computationally costly. For this reason, we don't run gradient checking at every iteration during training. Just a few times to check if the gradient is correct. \n", 615 | "- Gradient Checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout. \n", 616 | "\n", 617 | "Congrats, you can be confident that your deep learning model for fraud detection is working correctly! You can even use this to convince your CEO. :) \n", 618 | "\n", 619 | "\n", 620 | "**What you should remember from this notebook**:\n", 621 | "- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).\n", 622 | "- Gradient checking is slow, so we don't run it in every iteration of training. 
You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. " 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": { 629 | "collapsed": true 630 | }, 631 | "outputs": [], 632 | "source": [] 633 | } 634 | ], 635 | "metadata": { 636 | "coursera": { 637 | "course_slug": "deep-neural-network", 638 | "graded_item_id": "n6NBD", 639 | "launcher_item_id": "yfOsE" 640 | }, 641 | "kernelspec": { 642 | "display_name": "Python 3", 643 | "language": "python", 644 | "name": "python3" 645 | }, 646 | "language_info": { 647 | "codemirror_mode": { 648 | "name": "ipython", 649 | "version": 3 650 | }, 651 | "file_extension": ".py", 652 | "mimetype": "text/x-python", 653 | "name": "python", 654 | "nbconvert_exporter": "python", 655 | "pygments_lexer": "ipython3", 656 | "version": "3.6.0" 657 | } 658 | }, 659 | "nbformat": 4, 660 | "nbformat_minor": 1 661 | } 662 | -------------------------------------------------------------------------------- /Week 1 PA 2/Regularization/Regularization.md: -------------------------------------------------------------------------------- 1 | 2 | # Regularization 3 | 4 | Welcome to the second assignment of this week. Deep Learning models have so much flexibility and capacity that **overfitting can be a serious problem**, if the training dataset is not big enough. Sure it does well on the training set, but the learned network **doesn't generalize to new examples** that it has never seen! 5 | 6 | **You will learn to:** Use regularization in your deep learning models. 7 | 8 | Let's first import the packages you are going to use. 
9 | 10 | 11 | ```python 12 | # import packages 13 | import numpy as np 14 | import matplotlib.pyplot as plt 15 | from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec 16 | from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters 17 | import sklearn 18 | import sklearn.datasets 19 | import scipy.io 20 | from testCases import * 21 | 22 | %matplotlib inline 23 | plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots 24 | plt.rcParams['image.interpolation'] = 'nearest' 25 | plt.rcParams['image.cmap'] = 'gray' 26 | ``` 27 | 28 | /home/jovyan/work/week5/Regularization/reg_utils.py:85: SyntaxWarning: assertion is always true, perhaps remove parentheses? 29 | assert(parameters['W' + str(l)].shape == layer_dims[l], layer_dims[l-1]) 30 | /home/jovyan/work/week5/Regularization/reg_utils.py:86: SyntaxWarning: assertion is always true, perhaps remove parentheses? 31 | assert(parameters['W' + str(l)].shape == layer_dims[l], 1) 32 | 33 | 34 | **Problem Statement**: You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goal keeper should kick the ball so that the French team's players can then hit it with their head. 35 | 36 | 37 |
**Figure 1** : **Football field**
The goalkeeper kicks the ball into the air, and the players of each team fight to hit it with their heads
38 | 39 | 40 | They give you the following 2D dataset from France's past 10 games. 41 | 42 | 43 | ```python 44 | train_X, train_Y, test_X, test_Y = load_2D_dataset() 45 | ``` 46 | 47 | 48 | ![png](output_3_0.png) 49 | 50 | 51 | Each dot corresponds to a position on the football field where a football player has hit the ball with his/her head after the French goal keeper has shot the ball from the left side of the football field. 52 | - If the dot is blue, it means the French player managed to hit the ball with his/her head 53 | - If the dot is red, it means the other team's player hit the ball with their head 54 | 55 | **Your goal**: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball. 56 | 57 | **Analysis of the dataset**: This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue) from the lower right half (red) would work well. 58 | 59 | You will first try a non-regularized model. Then you'll learn how to regularize it and decide which model you will choose to solve the French Football Corporation's problem. 60 | 61 | ## 1 - Non-regularized model 62 | 63 | You will use the following neural network (already implemented for you below). This model can be used: 64 | - in *regularization mode* -- by setting the `lambd` input to a non-zero value. We use "`lambd`" instead of "`lambda`" because "`lambda`" is a reserved keyword in Python. 65 | - in *dropout mode* -- by setting the `keep_prob` to a value less than one 66 | 67 | You will first try the model without any regularization. Then, you will implement: 68 | - *L2 regularization* -- functions: "`compute_cost_with_regularization()`" and "`backward_propagation_with_regularization()`" 69 | - *Dropout* -- functions: "`forward_propagation_with_dropout()`" and "`backward_propagation_with_dropout()`" 70 | 71 | In each part, you will run this model with the correct inputs so that it calls the functions you've implemented. 
Take a look at the code below to familiarize yourself with the model. 72 | 73 | 74 | ```python 75 | def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1): 76 | """ 77 | Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID. 78 | 79 | Arguments: 80 | X -- input data, of shape (input size, number of examples) 81 | Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples) 82 | learning_rate -- learning rate of the optimization 83 | num_iterations -- number of iterations of the optimization loop 84 | print_cost -- If True, print the cost every 10000 iterations 85 | lambd -- regularization hyperparameter, scalar 86 | keep_prob - probability of keeping a neuron active during drop-out, scalar. 87 | 88 | Returns: 89 | parameters -- parameters learned by the model. They can then be used to predict. 90 | """ 91 | 92 | grads = {} 93 | costs = [] # to keep track of the cost 94 | m = X.shape[1] # number of examples 95 | layers_dims = [X.shape[0], 20, 3, 1] 96 | 97 | # Initialize parameters dictionary. 98 | parameters = initialize_parameters(layers_dims) 99 | 100 | # Loop (gradient descent) 101 | 102 | for i in range(0, num_iterations): 103 | 104 | # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID. 105 | if keep_prob == 1: 106 | a3, cache = forward_propagation(X, parameters) 107 | elif keep_prob < 1: 108 | a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob) 109 | 110 | # Cost function 111 | if lambd == 0: 112 | cost = compute_cost(a3, Y) 113 | else: 114 | cost = compute_cost_with_regularization(a3, Y, parameters, lambd) 115 | 116 | # Backward propagation. 
117 | assert(lambd==0 or keep_prob==1) # it is possible to use both L2 regularization and dropout, 118 | # but this assignment will only explore one at a time 119 | if lambd == 0 and keep_prob == 1: 120 | grads = backward_propagation(X, Y, cache) 121 | elif lambd != 0: 122 | grads = backward_propagation_with_regularization(X, Y, cache, lambd) 123 | elif keep_prob < 1: 124 | grads = backward_propagation_with_dropout(X, Y, cache, keep_prob) 125 | 126 | # Update parameters. 127 | parameters = update_parameters(parameters, grads, learning_rate) 128 | 129 | # Print the loss every 10000 iterations 130 | if print_cost and i % 10000 == 0: 131 | print("Cost after iteration {}: {}".format(i, cost)) 132 | if print_cost and i % 1000 == 0: 133 | costs.append(cost) 134 | 135 | # plot the cost 136 | plt.plot(costs) 137 | plt.ylabel('cost') 138 | plt.xlabel('iterations (x1,000)') 139 | plt.title("Learning rate =" + str(learning_rate)) 140 | plt.show() 141 | 142 | return parameters 143 | ``` 144 | 145 | Let's train the model without any regularization, and observe the accuracy on the train/test sets. 146 | 147 | 148 | ```python 149 | parameters = model(train_X, train_Y) 150 | print ("On the training set:") 151 | predictions_train = predict(train_X, train_Y, parameters) 152 | print ("On the test set:") 153 | predictions_test = predict(test_X, test_Y, parameters) 154 | ``` 155 | 156 | Cost after iteration 0: 0.6557412523481002 157 | Cost after iteration 10000: 0.16329987525724216 158 | Cost after iteration 20000: 0.13851642423255986 159 | 160 | 161 | 162 | ![png](output_9_1.png) 163 | 164 | 165 | On the training set: 166 | Accuracy: 0.947867298578 167 | On the test set: 168 | Accuracy: 0.915 169 | 170 | 171 | The train accuracy is 94.8% while the test accuracy is 91.5%. This is the **baseline model** (you will observe the impact of regularization on this model). Run the following code to plot the decision boundary of your model. 


```python
plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```


![png](output_11_0.png)


The non-regularized model is clearly overfitting the training set: it is fitting the noisy points! Let's now look at two techniques to reduce overfitting.

## 2 - L2 Regularization

The standard way to avoid overfitting is called **L2 regularization**. It consists of appropriately modifying your cost function, from:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
To:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

Let's modify your cost and observe the consequences.

**Exercise**: Implement `compute_cost_with_regularization()` which computes the cost given by formula (2). To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$, use:
```python
np.sum(np.square(Wl))
```
Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then sum the three terms and multiply by $ \frac{1}{m} \frac{\lambda}{2} $.


```python
# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    cost -- value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (m * 2.)
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost
```


```python
A3, Y_assess, parameters = compute_cost_with_regularization_test_case()

print("cost = " + str(compute_cost_with_regularization(A3, Y_assess, parameters, lambd = 0.1)))
```

    cost = 1.78648594516


**Expected Output**: 

<table>
    <tr>
        <td>**cost**</td>
        <td>1.78648594516</td>
    </tr>
</table>
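
The same penalty extends to networks of any depth. The sketch below is an illustration only and is not part of the graded code: `l2_penalty` is a hypothetical helper, and it assumes the weight matrices are stored under keys starting with `"W"` (and biases under keys starting with `"b"`), as in the `parameters` dictionary used in this assignment. The toy values are made up.

```python
import numpy as np

def l2_penalty(parameters, lambd, m):
    """Return (lambd / (2 * m)) * sum of squared entries of every weight matrix.

    Hypothetical helper: assumes weights live under keys "W1", "W2", ...
    Biases (keys "b1", "b2", ...) are deliberately left out of the penalty.
    """
    squared_sum = sum(np.sum(np.square(value))
                      for key, value in parameters.items()
                      if key.startswith("W"))
    return lambd / (2.0 * m) * squared_sum

# Toy parameters, chosen only so the arithmetic is easy to follow.
toy_parameters = {
    "W1": np.array([[1.0, -2.0], [0.5, 0.0]]),  # squared sum = 5.25
    "b1": np.array([[10.0], [10.0]]),           # ignored by the penalty
    "W2": np.array([[3.0, 1.0]]),               # squared sum = 10.0
}

penalty = l2_penalty(toy_parameters, lambd=0.1, m=5)
print(penalty)  # approximately 0.1525 = 0.1 / (2 * 5) * 15.25
```

Looping over the dictionary keeps the penalty in step with the network architecture, at the cost of relying on the key-naming convention.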
259 | 260 | Of course, because you changed the cost, you have to change backward propagation as well! All the gradients have to be computed with respect to this new cost. 261 | 262 | **Exercise**: Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's gradient ($\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m} W^2) = \frac{\lambda}{m} W$). 263 | 264 | 265 | ```python 266 | # GRADED FUNCTION: backward_propagation_with_regularization 267 | 268 | def backward_propagation_with_regularization(X, Y, cache, lambd): 269 | """ 270 | Implements the backward propagation of our baseline model to which we added an L2 regularization. 271 | 272 | Arguments: 273 | X -- input dataset, of shape (input size, number of examples) 274 | Y -- "true" labels vector, of shape (output size, number of examples) 275 | cache -- cache output from forward_propagation() 276 | lambd -- regularization hyperparameter, scalar 277 | 278 | Returns: 279 | gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables 280 | """ 281 | 282 | m = X.shape[1] 283 | (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache 284 | 285 | dZ3 = A3 - Y 286 | 287 | ### START CODE HERE ### (approx. 1 line) 288 | dW3 = 1./m * np.dot(dZ3, A2.T) + float(lambd) / m * W3 289 | ### END CODE HERE ### 290 | db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True) 291 | 292 | dA2 = np.dot(W3.T, dZ3) 293 | dZ2 = np.multiply(dA2, np.int64(A2 > 0)) 294 | ### START CODE HERE ### (approx. 1 line) 295 | dW2 = 1./m * np.dot(dZ2, A1.T) + float(lambd) / m * W2 296 | ### END CODE HERE ### 297 | db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True) 298 | 299 | dA1 = np.dot(W2.T, dZ2) 300 | dZ1 = np.multiply(dA1, np.int64(A1 > 0)) 301 | ### START CODE HERE ### (approx. 
1 line) 302 | dW1 = 1./m * np.dot(dZ1, X.T) + float(lambd) / m * W1 303 | ### END CODE HERE ### 304 | db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True) 305 | 306 | gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2, 307 | "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 308 | "dZ1": dZ1, "dW1": dW1, "db1": db1} 309 | 310 | return gradients 311 | ``` 312 | 313 | 314 | ```python 315 | X_assess, Y_assess, cache = backward_propagation_with_regularization_test_case() 316 | 317 | grads = backward_propagation_with_regularization(X_assess, Y_assess, cache, lambd = 0.7) 318 | print ("dW1 = "+ str(grads["dW1"])) 319 | print ("dW2 = "+ str(grads["dW2"])) 320 | print ("dW3 = "+ str(grads["dW3"])) 321 | ``` 322 | 323 | dW1 = [[-0.25604646 0.12298827 -0.28297129] 324 | [-0.17706303 0.34536094 -0.4410571 ]] 325 | dW2 = [[ 0.79276486 0.85133918] 326 | [-0.0957219 -0.01720463] 327 | [-0.13100772 -0.03750433]] 328 | dW3 = [[-1.77691347 -0.11832879 -0.09397446]] 329 | 330 | 331 | **Expected Output**: 332 | 333 | 334 | 335 | 338 | 342 | 343 | 344 | 347 | 352 | 353 | 354 | 357 | 360 | 361 |
336 | **dW1** 337 | 339 | [[-0.25604646 0.12298827 -0.28297129] 340 | [-0.17706303 0.34536094 -0.4410571 ]] 341 |
345 | **dW2** 346 | 348 | [[ 0.79276486 0.85133918] 349 | [-0.0957219 -0.01720463] 350 | [-0.13100772 -0.03750433]] 351 |
355 | **dW3** 356 | 358 | [[-1.77691347 -0.11832879 -0.09397446]] 359 |
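
The extra term $\frac{\lambda}{m} W$ added to `dW1`, `dW2` and `dW3` can be sanity-checked with centered differences, the same idea as gradient checking. The following is a minimal sketch under stated assumptions: `l2_term` and `numerical_gradient` are hypothetical helpers written for this illustration, and the matrix `W` is random toy data rather than a weight from the assignment.

```python
import numpy as np

def l2_term(W, lambd, m):
    """L2 regularization cost contributed by a single weight matrix."""
    return lambd / (2.0 * m) * np.sum(np.square(W))

def numerical_gradient(f, W, eps=1e-6):
    """Centered-difference approximation of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus = W.copy()
        W_plus[idx] += eps
        W_minus = W.copy()
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

rng = np.random.RandomState(0)
W = rng.randn(3, 2)        # toy weight matrix
lambd, m = 0.7, 5          # illustrative values

analytic = lambd / m * W   # the term added to dW in backward propagation
numeric = numerical_gradient(lambda V: l2_term(V, lambd, m), W)

max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)       # expected to be tiny (floating-point round-off)
```

Because the penalty is quadratic in `W`, the centered difference matches the analytic gradient up to round-off, so any visible gap would point to a mistake in the formula.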
362 | 363 | Let's now run the model with L2 regularization $(\lambda = 0.7)$. The `model()` function will call: 364 | - `compute_cost_with_regularization` instead of `compute_cost` 365 | - `backward_propagation_with_regularization` instead of `backward_propagation` 366 | 367 | 368 | ```python 369 | parameters = model(train_X, train_Y, lambd = 0.7) 370 | print ("On the train set:") 371 | predictions_train = predict(train_X, train_Y, parameters) 372 | print ("On the test set:") 373 | predictions_test = predict(test_X, test_Y, parameters) 374 | ``` 375 | 376 | Cost after iteration 0: 0.6974484493131264 377 | Cost after iteration 10000: 0.2684918873282239 378 | Cost after iteration 20000: 0.2680916337127301 379 | 380 | 381 | 382 | ![png](output_22_1.png) 383 | 384 | 385 | On the train set: 386 | Accuracy: 0.938388625592 387 | On the test set: 388 | Accuracy: 0.93 389 | 390 | 391 | Congrats, the test set accuracy increased to 93%. You have saved the French football team! 392 | 393 | You are not overfitting the training data anymore. Let's plot the decision boundary. 394 | 395 | 396 | ```python 397 | plt.title("Model with L2-regularization") 398 | axes = plt.gca() 399 | axes.set_xlim([-0.75,0.40]) 400 | axes.set_ylim([-0.75,0.65]) 401 | plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y) 402 | ``` 403 | 404 | 405 | ![png](output_24_0.png) 406 | 407 | 408 | **Observations**: 409 | - The value of $\lambda$ is a hyperparameter that you can tune using a dev set. 410 | - L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias. 411 | 412 | **What is L2-regularization actually doing?**: 413 | 414 | L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. 
It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes. 415 | 416 | 417 | **What you should remember** -- the implications of L2-regularization on: 418 | - The cost computation: 419 | - A regularization term is added to the cost 420 | - The backpropagation function: 421 | - There are extra terms in the gradients with respect to weight matrices 422 | - Weights end up smaller ("weight decay"): 423 | - Weights are pushed to smaller values. 424 | 425 | ## 3 - Dropout 426 | 427 | Finally, **dropout** is a widely used regularization technique that is specific to deep learning. 428 | **It randomly shuts down some neurons in each iteration.** Watch these two videos to see what this means! 429 | 430 | 437 | 438 | 439 |
440 | 442 |
443 |
444 |
Figure 2 : Drop-out on the second hidden layer.
At each iteration, you shut down (= set to zero) each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$ (50% here). The dropped neurons contribute to neither the forward nor the backward propagation of that iteration.
445 | 446 |
447 | 449 |
450 | 451 |
Figure 3 : Drop-out on the first and third hidden layers.
$1^{st}$ layer: we shut down on average 40% of the neurons. $3^{rd}$ layer: we shut down on average 20% of the neurons.


When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

### 3.1 - Forward propagation with dropout

**Exercise**: Implement the forward propagation with dropout. You are using a 3 layer neural network, and will add dropout to the first and second hidden layers. We will not apply dropout to the input layer or output layer.

**Instructions**:
You would like to shut down some neurons in the first and second layers. To do that, you are going to carry out 4 Steps:
1. In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$ using `np.random.rand()` to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}] $ of the same dimension as $A^{[1]}$.
2. Set each entry of $D^{[1]}$ to be 0 with probability (`1-keep_prob`) or 1 with probability (`keep_prob`), by thresholding values in $D^{[1]}$ appropriately. Hint: to set each entry of a matrix X to 1 (if the entry is less than 0.5) or 0 (otherwise), you would do: `X = (X < 0.5)`. Note that 0 and 1 are respectively equivalent to False and True.
3. Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. (You are shutting down some neurons). You can think of $D^{[1]}$ as a mask, so that when it is multiplied with another matrix, it shuts down some of the values.
4. Divide $A^{[1]}$ by `keep_prob`. By doing this you ensure that the cost still has the same expected value as without drop-out. (This technique is also called inverted dropout.)
466 | 467 | 468 | ```python 469 | # GRADED FUNCTION: forward_propagation_with_dropout 470 | 471 | def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5): 472 | """ 473 | Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID. 474 | 475 | Arguments: 476 | X -- input dataset, of shape (2, number of examples) 477 | parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3": 478 | W1 -- weight matrix of shape (20, 2) 479 | b1 -- bias vector of shape (20, 1) 480 | W2 -- weight matrix of shape (3, 20) 481 | b2 -- bias vector of shape (3, 1) 482 | W3 -- weight matrix of shape (1, 3) 483 | b3 -- bias vector of shape (1, 1) 484 | keep_prob - probability of keeping a neuron active during drop-out, scalar 485 | 486 | Returns: 487 | A3 -- last activation value, output of the forward propagation, of shape (1,1) 488 | cache -- tuple, information stored for computing the backward propagation 489 | """ 490 | 491 | np.random.seed(1) 492 | 493 | # retrieve parameters 494 | W1 = parameters["W1"] 495 | b1 = parameters["b1"] 496 | W2 = parameters["W2"] 497 | b2 = parameters["b2"] 498 | W3 = parameters["W3"] 499 | b3 = parameters["b3"] 500 | 501 | # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID 502 | Z1 = np.dot(W1, X) + b1 503 | A1 = relu(Z1) 504 | ### START CODE HERE ### (approx. 4 lines) # Steps 1-4 below correspond to the Steps 1-4 described above. 505 | D1 = np.random.rand(A1.shape[0], A1.shape[1]) # Step 1: initialize matrix D1 = np.random.rand(..., ...) 506 | D1 = D1 < keep_prob # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold) 507 | A1 = A1 * D1 # Step 3: shut down some neurons of A1 508 | A1 = A1 / keep_prob # Step 4: scale the value of neurons that haven't been shut down 509 | ### END CODE HERE ### 510 | Z2 = np.dot(W2, A1) + b2 511 | A2 = relu(Z2) 512 | ### START CODE HERE ### (approx. 
4 lines) 513 | D2 = np.random.rand(A2.shape[0], A2.shape[1]) # Step 1: initialize matrix D2 = np.random.rand(..., ...) 514 | D2 = D2 < keep_prob # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold) 515 | A2 = A2 * D2 # Step 3: shut down some neurons of A2 516 | A2 = A2 / keep_prob # Step 4: scale the value of neurons that haven't been shut down 517 | ### END CODE HERE ### 518 | Z3 = np.dot(W3, A2) + b3 519 | A3 = sigmoid(Z3) 520 | 521 | cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) 522 | 523 | return A3, cache 524 | ``` 525 | 526 | 527 | ```python 528 | X_assess, parameters = forward_propagation_with_dropout_test_case() 529 | 530 | A3, cache = forward_propagation_with_dropout(X_assess, parameters, keep_prob = 0.7) 531 | print ("A3 = " + str(A3)) 532 | ``` 533 | 534 | A3 = [[ 0.36974721 0.00305176 0.04565099 0.49683389 0.36974721]] 535 | 536 | 537 | **Expected Output**: 538 | 539 | 540 | 541 | 544 | 547 | 548 | 549 | 550 |
542 | **A3** 543 | 545 | [[ 0.36974721 0.00305176 0.04565099 0.49683389 0.36974721]] 546 |
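
To see why Step 4 (dividing by `keep_prob`) matters, the four steps can be tried on a toy activation matrix. This is a sketch for illustration only: the all-ones matrix, its shape and the seed are assumptions chosen so that the averages are easy to read, and `keep_prob = 0.8` is an arbitrary value.

```python
import numpy as np

rng = np.random.RandomState(0)
keep_prob = 0.8
A = np.ones((50, 2000))               # toy activations, all equal to 1

D = rng.rand(*A.shape) < keep_prob    # Steps 1-2: random mask, True with probability keep_prob
A_dropped = A * D                     # Step 3: shut down the masked-out neurons
A_scaled = A_dropped / keep_prob      # Step 4: inverted-dropout rescaling

kept_fraction = D.mean()                  # close to keep_prob
mean_without_scaling = A_dropped.mean()   # close to keep_prob, i.e. shrunk
mean_with_scaling = A_scaled.mean()       # close to 1, the original mean

print(kept_fraction, mean_without_scaling, mean_with_scaling)
```

Without the rescaling, the average activation shrinks by roughly a factor of `keep_prob`. With it, the average stays near its original value, which is why nothing needs to be rescaled at test time.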
551 | 552 | ### 3.2 - Backward propagation with dropout 553 | 554 | **Exercise**: Implement the backward propagation with dropout. As before, you are training a 3 layer network. Add dropout to the first and second hidden layers, using the masks $D^{[1]}$ and $D^{[2]}$ stored in the cache. 555 | 556 | **Instruction**: 557 | Backpropagation with dropout is actually quite easy. You will have to carry out 2 Steps: 558 | 1. You had previously shut down some neurons during forward propagation, by applying a mask $D^{[1]}$ to `A1`. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[1]}$ to `dA1`. 559 | 2. During forward propagation, you had divided `A1` by `keep_prob`. In backpropagation, you'll therefore have to divide `dA1` by `keep_prob` again (the calculus interpretation is that if $A^{[1]}$ is scaled by `keep_prob`, then its derivative $dA^{[1]}$ is also scaled by the same `keep_prob`). 560 | 561 | 562 | 563 | ```python 564 | # GRADED FUNCTION: backward_propagation_with_dropout 565 | 566 | def backward_propagation_with_dropout(X, Y, cache, keep_prob): 567 | """ 568 | Implements the backward propagation of our baseline model to which we added dropout. 
569 | 570 | Arguments: 571 | X -- input dataset, of shape (2, number of examples) 572 | Y -- "true" labels vector, of shape (output size, number of examples) 573 | cache -- cache output from forward_propagation_with_dropout() 574 | keep_prob - probability of keeping a neuron active during drop-out, scalar 575 | 576 | Returns: 577 | gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables 578 | """ 579 | 580 | m = X.shape[1] 581 | (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache 582 | 583 | dZ3 = A3 - Y 584 | dW3 = 1./m * np.dot(dZ3, A2.T) 585 | db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True) 586 | dA2 = np.dot(W3.T, dZ3) 587 | ### START CODE HERE ### (≈ 2 lines of code) 588 | dA2 = dA2 * D2 # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation 589 | dA2 = dA2 / keep_prob # Step 2: Scale the value of neurons that haven't been shut down 590 | ### END CODE HERE ### 591 | dZ2 = np.multiply(dA2, np.int64(A2 > 0)) 592 | dW2 = 1./m * np.dot(dZ2, A1.T) 593 | db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True) 594 | 595 | dA1 = np.dot(W2.T, dZ2) 596 | ### START CODE HERE ### (≈ 2 lines of code) 597 | dA1 = dA1 * D1 # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation 598 | dA1 = dA1 / keep_prob # Step 2: Scale the value of neurons that haven't been shut down 599 | ### END CODE HERE ### 600 | dZ1 = np.multiply(dA1, np.int64(A1 > 0)) 601 | dW1 = 1./m * np.dot(dZ1, X.T) 602 | db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True) 603 | 604 | gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2, 605 | "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 606 | "dZ1": dZ1, "dW1": dW1, "db1": db1} 607 | 608 | return gradients 609 | ``` 610 | 611 | 612 | ```python 613 | X_assess, Y_assess, cache = backward_propagation_with_dropout_test_case() 614 | 615 | gradients = backward_propagation_with_dropout(X_assess, Y_assess, cache, keep_prob = 0.8) 
616 | 617 | print ("dA1 = " + str(gradients["dA1"])) 618 | print ("dA2 = " + str(gradients["dA2"])) 619 | ``` 620 | 621 | dA1 = [[ 0.36544439 0. -0.00188233 0. -0.17408748] 622 | [ 0.65515713 0. -0.00337459 0. -0. ]] 623 | dA2 = [[ 0.58180856 0. -0.00299679 0. -0.27715731] 624 | [ 0. 0.53159854 -0. 0.53159854 -0.34089673] 625 | [ 0. 0. -0.00292733 0. -0. ]] 626 | 627 | 628 | **Expected Output**: 629 | 630 | 631 | 632 | 635 | 639 | 640 | 641 | 642 | 645 | 650 | 651 | 652 |
633 | **dA1** 634 | 636 | [[ 0.36544439 0. -0.00188233 0. -0.17408748] 637 | [ 0.65515713 0. -0.00337459 0. -0. ]] 638 |
643 | **dA2** 644 | 646 | [[ 0.58180856 0. -0.00299679 0. -0.27715731] 647 | [ 0. 0.53159854 -0. 0.53159854 -0.34089673] 648 | [ 0. 0. -0.00292733 0. -0. ]] 649 |
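
The two backward steps (reapply the mask, then divide by `keep_prob`) can be checked against a numerical gradient on a toy problem. The sketch below is an assumption-laden illustration and not the assignment's network: `toy_loss` is a hypothetical stand-in for everything downstream of the dropout layer, and `A`, `C` and `D` are random toy arrays.

```python
import numpy as np

rng = np.random.RandomState(1)
keep_prob = 0.8
A = rng.randn(3, 4)                 # toy activations before dropout
C = rng.randn(3, 4)                 # fixed coefficients of a toy downstream loss
D = rng.rand(3, 4) < keep_prob      # mask sampled once in the forward pass

def toy_loss(A_in):
    """Toy loss: weighted sum of the dropped-and-rescaled activations."""
    return np.sum(C * (A_in * D / keep_prob))

# Backward pass: the upstream gradient is C; apply the same mask, then rescale.
dA = C * D          # Step 1: reuse the forward mask
dA = dA / keep_prob # Step 2: same scaling as in the forward pass

# Centered-difference estimate of d(toy_loss)/dA for comparison.
eps = 1e-6
numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    A_plus = A.copy()
    A_plus[idx] += eps
    A_minus = A.copy()
    A_minus[idx] -= eps
    numeric[idx] = (toy_loss(A_plus) - toy_loss(A_minus)) / (2 * eps)

max_abs_error = np.max(np.abs(dA - numeric))
print(max_abs_error)  # expected to be tiny (floating-point round-off)
```

The check only works because the mask is held fixed between the forward and backward passes. Resampling `D` inside the loss would make the numerical gradient meaningless, which is the reason gradient checking is normally run with dropout turned off.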


Let's now run the model with dropout (`keep_prob = 0.86`). This means that at every iteration you shut down each neuron of layers 1 and 2 with 14% probability. The function `model()` will now call:
- `forward_propagation_with_dropout` instead of `forward_propagation`.
- `backward_propagation_with_dropout` instead of `backward_propagation`.


```python
parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

    Cost after iteration 0: 0.6543912405149825


    /home/jovyan/work/week5/Regularization/reg_utils.py:236: RuntimeWarning: divide by zero encountered in log
      logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
    /home/jovyan/work/week5/Regularization/reg_utils.py:236: RuntimeWarning: invalid value encountered in multiply
      logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)


    Cost after iteration 10000: 0.06101698657490559
    Cost after iteration 20000: 0.060582435798513114



![png](output_35_3.png)


    On the train set:
    Accuracy: 0.928909952607
    On the test set:
    Accuracy: 0.95


Dropout works great! The test accuracy has increased again (to 95%)! Your model is not overfitting the training set and does a great job on the test set. The French football team will be forever grateful to you!

Run the code below to plot the decision boundary.
694 | 695 | 696 | ```python 697 | plt.title("Model with dropout") 698 | axes = plt.gca() 699 | axes.set_xlim([-0.75,0.40]) 700 | axes.set_ylim([-0.75,0.65]) 701 | plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y) 702 | ``` 703 | 704 | 705 | ![png](output_37_0.png) 706 | 707 | 708 | **Note**: 709 | - A **common mistake** when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training. 710 | - Deep learning frameworks like [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](http://doc.paddlepaddle.org/release_doc/0.9.0/doc/ui/api/trainer_config_helpers/attrs.html), [keras](https://keras.io/layers/core/#dropout) or [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks. 711 | 712 | 713 | **What you should remember about dropout:** 714 | - Dropout is a regularization technique. 715 | - You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time. 716 | - Apply dropout both during forward and backward propagation. 717 | - During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5. 718 | 719 | ## 4 - Conclusions 720 | 721 | **Here are the results of our three models**: 722 | 723 | 724 | 725 | 728 | 731 | 734 | 735 | 736 | 739 | 742 | 745 | 746 | 749 | 752 | 755 | 756 | 757 | 760 | 763 | 766 | 767 |
726 | **model** 727 | 729 | **train accuracy** 730 | 732 | **test accuracy** 733 |
737 | 3-layer NN without regularization 738 | 740 | 95% 741 | 743 | 91.5% 744 |
747 | 3-layer NN with L2-regularization 748 | 750 | 94% 751 | 753 | 93% 754 |
758 | 3-layer NN with dropout 759 | 761 | 93% 762 | 764 | 95% 765 |
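The dropout rule summarized earlier (divide by `keep_prob` during training so the expected activation is unchanged) can be checked numerically. The sketch below is an illustration only, not the assignment's `forward_propagation_with_dropout`: the helper name `inverted_dropout`, the toy activations, and the seed are assumptions made for this example.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Apply inverted dropout to a list of activations (training time only)."""
    out = []
    for a in activations:
        keep = rng.random() < keep_prob              # keep this unit with probability keep_prob
        out.append(a / keep_prob if keep else 0.0)   # rescale the kept units, zero the rest
    return out

rng = random.Random(0)          # fixed seed, chosen only for repeatability
keep_prob = 0.86                # same value as the dropout run in this notebook
activations = [1.0] * 100000    # toy activations, all equal to 1

dropped = inverted_dropout(activations, keep_prob, rng)
mean_after = sum(dropped) / len(dropped)
fraction_shut_down = dropped.count(0.0) / len(dropped)

print("mean after dropout:", round(mean_after, 3))          # close to the original mean of 1.0
print("fraction shut down:", round(fraction_shut_down, 3))  # close to 1 - keep_prob = 0.14
```

With `keep_prob = 0.86`, roughly 14% of the units are zeroed on each pass, and the rescaling keeps the mean activation near its original value.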
768 | 769 | Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. 770 | 771 | Congratulations on finishing this assignment! And also for revolutionizing French football. :-) 772 | 773 | 774 | **What we want you to remember from this notebook**: 775 | - Regularization will help you reduce overfitting. 776 | - Regularization will drive your weights to lower values. 777 | - L2 regularization and Dropout are two very effective regularization techniques. 778 | -------------------------------------------------------------------------------- /Week 3 PA 1/Tensorflow+Tutorial.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # # TensorFlow Tutorial 5 | # 6 | # Welcome to this week's programming assignment. Until now, you've always used numpy to build neural networks. Now we will step you through a deep learning framework that will allow you to build neural networks more easily. Machine learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can speed up your machine learning development significantly. All of these frameworks also have a lot of documentation, which you should feel free to read. In this assignment, you will learn to do the following in TensorFlow: 7 | # 8 | # - Initialize variables 9 | # - Start your own session 10 | # - Train algorithms 11 | # - Implement a Neural Network 12 | # 13 | # Programming frameworks can not only shorten your coding time, but sometimes also perform optimizations that speed up your code.
14 | # 15 | # ## 1 - Exploring the Tensorflow Library 16 | # 17 | # To start, you will import the library: 18 | # 19 | 20 | # In[1]: 21 | 22 | import math 23 | import numpy as np 24 | import h5py 25 | import matplotlib.pyplot as plt 26 | import tensorflow as tf 27 | from tensorflow.python.framework import ops 28 | from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict 29 | 30 | get_ipython().magic('matplotlib inline') 31 | np.random.seed(1) 32 | 33 | 34 | # Now that you have imported the library, we will walk you through its different applications. You will start with an example, where we compute for you the loss of one training example. 35 | # $$loss = \mathcal{L}(\hat{y}, y) = (\hat y^{(i)} - y^{(i)})^2 \tag{1}$$ 36 | 37 | # In[2]: 38 | 39 | y_hat = tf.constant(36, name='y_hat') # Define y_hat constant. Set to 36. 40 | y = tf.constant(39, name='y') # Define y. Set to 39 41 | 42 | loss = tf.Variable((y - y_hat)**2, name='loss') # Create a variable for the loss 43 | 44 | init = tf.global_variables_initializer() # When init is run later (session.run(init)), 45 | # the loss variable will be initialized and ready to be computed 46 | with tf.Session() as session: # Create a session and print the output 47 | session.run(init) # Initializes the variables 48 | print(session.run(loss)) # Prints the loss 49 | 50 | 51 | # Writing and running programs in TensorFlow has the following steps: 52 | # 53 | # 1. Create Tensors (variables) that are not yet executed/evaluated. 54 | # 2. Write operations between those Tensors. 55 | # 3. Initialize your Tensors. 56 | # 4. Create a Session. 57 | # 5. Run the Session. This will run the operations you'd written above. 58 | # 59 | # Therefore, when we created a variable for the loss, we simply defined the loss as a function of other quantities, but did not evaluate its value. To evaluate it, we had to run `init=tf.global_variables_initializer()`. 
That initialized the loss variable, and in the last line we were finally able to evaluate the value of `loss` and print its value. 60 | # 61 | # Now let us look at an easy example. Run the cell below: 62 | 63 | # In[3]: 64 | 65 | a = tf.constant(2) 66 | b = tf.constant(10) 67 | c = tf.multiply(a,b) 68 | print(c) 69 | 70 | 71 | # As expected, you will not see 20! You got a tensor saying that the result is a tensor that does not have the shape attribute, and is of type "int32". All you did was put in the 'computation graph', but you have not run this computation yet. In order to actually multiply the two numbers, you will have to create a session and run it. 72 | 73 | # In[4]: 74 | 75 | sess = tf.Session() 76 | print(sess.run(c)) 77 | 78 | 79 | # Great! To summarize, **remember to initialize your variables, create a session and run the operations inside the session**. 80 | # 81 | # Next, you'll also have to know about placeholders. A placeholder is an object whose value you can specify only later. 82 | # To specify values for a placeholder, you can pass in values by using a "feed dictionary" (`feed_dict` variable). Below, we created a placeholder for x. This allows us to pass in a number later when we run the session. 83 | 84 | # In[5]: 85 | 86 | # Change the value of x in the feed_dict 87 | 88 | x = tf.placeholder(tf.int64, name = 'x') 89 | print(sess.run(2 * x, feed_dict = {x: 3})) 90 | sess.close() 91 | 92 | 93 | # When you first defined `x` you did not have to specify a value for it. A placeholder is simply a variable that you will assign data to only later, when running the session. We say that you **feed data** to these placeholders when running the session. 94 | # 95 | # Here's what's happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. 
Finally, when you run the session, you are telling TensorFlow to execute the computation graph. 96 | 97 | # ### 1.1 - Linear function 98 | # 99 | # Lets start this programming exercise by computing the following equation: $Y = WX + b$, where $W$ and $X$ are random matrices and b is a random vector. 100 | # 101 | # **Exercise**: Compute $WX + b$ where $W, X$, and $b$ are drawn from a random normal distribution. W is of shape (4, 3), X is (3,1) and b is (4,1). As an example, here is how you would define a constant X that has shape (3,1): 102 | # ```python 103 | # X = tf.constant(np.random.randn(3,1), name = "X") 104 | # 105 | # ``` 106 | # You might find the following functions helpful: 107 | # - tf.matmul(..., ...) to do a matrix multiplication 108 | # - tf.add(..., ...) to do an addition 109 | # - np.random.randn(...) to initialize randomly 110 | # 111 | 112 | # In[6]: 113 | 114 | # GRADED FUNCTION: linear_function 115 | 116 | def linear_function(): 117 | """ 118 | Implements a linear function: 119 | Initializes W to be a random tensor of shape (4,3) 120 | Initializes X to be a random tensor of shape (3,1) 121 | Initializes b to be a random tensor of shape (4,1) 122 | Returns: 123 | result -- runs the session for Y = WX + b 124 | """ 125 | 126 | np.random.seed(1) 127 | 128 | ### START CODE HERE ### (4 lines of code) 129 | X = tf.constant(np.random.randn(3,1), name = "X") 130 | W = tf.constant(np.random.randn(4,3), name = "W") 131 | b = tf.constant(np.random.randn(4,1), name = "b") 132 | Y = tf.add(tf.matmul(W, X), b) 133 | ### END CODE HERE ### 134 | 135 | # Create the session using tf.Session() and run it with sess.run(...) 
on the variable you want to calculate 136 | 137 | ### START CODE HERE ### 138 | sess = tf.Session() 139 | result = sess.run(Y) 140 | ### END CODE HERE ### 141 | 142 | # close the session 143 | sess.close() 144 | 145 | return result 146 | 147 | 148 | # In[7]: 149 | 150 | print( "result = " + str(linear_function())) 151 | 152 | 153 | # *** Expected Output ***: 154 | # 155 | # 156 | # 157 | # 160 | # 166 | # 167 | # 168 | #
158 | # **result** 159 | # 161 | # [[-2.15657382] 162 | # [ 2.95891446] 163 | # [-1.08926781] 164 | # [-0.84538042]] 165 | #
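For comparison with the graded `linear_function`, here is a framework-free sketch of the same computation, $Y = WX + b$, with the shapes from the exercise. It is only an illustration: the helpers `matmul`, `add`, and `randn_matrix` are hypothetical names introduced here, and the values will not match the expected output above because Python's `random` module does not reproduce `np.random.seed(1)`.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two matrices with identical shapes."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def randn_matrix(rows, cols, rng):
    """Matrix of standard-normal samples (stand-in for np.random.randn)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(1)          # seed chosen for repeatability only
W = randn_matrix(4, 3, rng)     # shape (4, 3)
X = randn_matrix(3, 1, rng)     # shape (3, 1)
b = randn_matrix(4, 1, rng)     # shape (4, 1)

Y = add(matmul(W, X), b)        # shape (4, 1), same as tf.add(tf.matmul(W, X), b)
print(len(Y), len(Y[0]))        # 4 1
```

The point of the sketch is the shape bookkeeping: a (4, 3) matrix times a (3, 1) matrix gives (4, 1), which is why `b` must also be (4, 1).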
169 | 170 | # ### 1.2 - Computing the sigmoid 171 | # Great! You just implemented a linear function. Tensorflow offers a variety of commonly used neural network functions like `tf.sigmoid` and `tf.softmax`. For this exercise lets compute the sigmoid function of an input. 172 | # 173 | # You will do this exercise using a placeholder variable `x`. When running the session, you should use the feed dictionary to pass in the input `z`. In this exercise, you will have to (i) create a placeholder `x`, (ii) define the operations needed to compute the sigmoid using `tf.sigmoid`, and then (iii) run the session. 174 | # 175 | # ** Exercise **: Implement the sigmoid function below. You should use the following: 176 | # 177 | # - `tf.placeholder(tf.float32, name = "...")` 178 | # - `tf.sigmoid(...)` 179 | # - `sess.run(..., feed_dict = {x: z})` 180 | # 181 | # 182 | # Note that there are two typical ways to create and use sessions in tensorflow: 183 | # 184 | # **Method 1:** 185 | # ```python 186 | # sess = tf.Session() 187 | # # Run the variables initialization (if needed), run the operations 188 | # result = sess.run(..., feed_dict = {...}) 189 | # sess.close() # Close the session 190 | # ``` 191 | # **Method 2:** 192 | # ```python 193 | # with tf.Session() as sess: 194 | # # run the variables initialization (if needed), run the operations 195 | # result = sess.run(..., feed_dict = {...}) 196 | # # This takes care of closing the session for you :) 197 | # ``` 198 | # 199 | 200 | # In[8]: 201 | 202 | # GRADED FUNCTION: sigmoid 203 | 204 | def sigmoid(z): 205 | """ 206 | Computes the sigmoid of z 207 | 208 | Arguments: 209 | z -- input value, scalar or vector 210 | 211 | Returns: 212 | results -- the sigmoid of z 213 | """ 214 | 215 | ### START CODE HERE ### ( approx. 4 lines of code) 216 | # Create a placeholder for x. Name it 'x'. 
217 | x = tf.placeholder(tf.float32, name = "x") 218 | 219 | # compute sigmoid(x) 220 | sigmoid = tf.sigmoid(x) 221 | 222 | # Create a session, and run it. Please use the method 2 explained above. 223 | # You should use a feed_dict to pass z's value to x. 224 | with tf.Session() as sess: 225 | # Run session and call the output "result" 226 | result = sess.run(sigmoid, feed_dict = {x: z}) 227 | 228 | ### END CODE HERE ### 229 | 230 | return result 231 | 232 | 233 | # In[9]: 234 | 235 | print ("sigmoid(0) = " + str(sigmoid(0))) 236 | print ("sigmoid(12) = " + str(sigmoid(12))) 237 | 238 | 239 | # *** Expected Output ***: 240 | # 241 | # 242 | # 243 | # 246 | # 249 | # 250 | # 251 | # 254 | # 257 | # 258 | # 259 | #
244 | # **sigmoid(0)** 245 | # 247 | # 0.5 248 | #
252 | # **sigmoid(12)** 253 | # 255 | # 0.999994 256 | #
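The two expected values above can be cross-checked without TensorFlow. The sketch below is an assumption-labelled reference, not part of the graded code: `sigmoid_reference` is a hypothetical helper name, and it uses a branch on the sign of `z` so that `math.exp` never overflows for large negative inputs.

```python
import math

def sigmoid_reference(z):
    """Numerically stable logistic sigmoid for a scalar z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)             # z < 0, so exp(z) is at most 1 and cannot overflow
    return e / (1.0 + e)

print(sigmoid_reference(0))             # 0.5
print(round(sigmoid_reference(12), 6))  # 0.999994
```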
260 | 261 | # 262 | # **To summarize, you now know how to**: 263 | # 1. Create placeholders 264 | # 2. Specify the computation graph corresponding to operations you want to compute 265 | # 3. Create the session 266 | # 4. Run the session, using a feed dictionary if necessary to specify placeholder variables' values. 267 | 268 | # ### 1.3 - Computing the Cost 269 | # 270 | # You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of $a^{[2](i)}$ and $y^{(i)}$ for i=1...m: 271 | # $$ J = - \frac{1}{m} \sum_{i = 1}^m \large ( \small y^{(i)} \log a^{ [2] (i)} + (1-y^{(i)})\log (1-a^{ [2] (i)} )\large )\small\tag{2}$$ 272 | # 273 | # you can do it in one line of code in tensorflow! 274 | # 275 | # **Exercise**: Implement the cross entropy loss. The function you will use is: 276 | # 277 | # 278 | # - `tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)` 279 | # 280 | # Your code should input `z`, compute the sigmoid (to get `a`) and then compute the cross entropy cost $J$. All this can be done using one call to `tf.nn.sigmoid_cross_entropy_with_logits`, which computes, for each example $i$, the term inside the sum of formula (2) (one loss value per example, not yet averaged over the $m$ examples): 281 | # 282 | # $$- \large ( \small y^{(i)} \log \sigma(z^{[2](i)}) + (1-y^{(i)})\log (1-\sigma(z^{[2](i)}))\large )\small$$ 283 | # 284 | # 285 | 286 | # In[10]: 287 | 288 | # GRADED FUNCTION: cost 289 | 290 | def cost(logits, labels): 291 | """ 292 |     Computes the cost using the sigmoid cross entropy 293 |     294 |     Arguments: 295 |     logits -- vector containing z, output of the last linear unit (before the final sigmoid activation) 296 |     labels -- vector of labels y (1 or 0) 297 | 298 | Note: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" 299 | in the TensorFlow documentation. So logits will feed into z, and labels into y.
300 |      301 |     Returns: 302 |     cost -- runs the session of the cost (formula (2)) 303 | """ 304 | 305 | ### START CODE HERE ### 306 | 307 | # Create the placeholders for "logits" (z) and "labels" (y) (approx. 2 lines) 308 | z = tf.placeholder(tf.float32, name = "logits") 309 | y = tf.placeholder(tf.float32, name = "labels") 310 | 311 | # Use the loss function (approx. 1 line) 312 | cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = z, labels = y) 313 | 314 | # Create a session (approx. 1 line). See method 1 above. 315 | sess = tf.Session() 316 | 317 | # Run the session (approx. 1 line). 318 | cost = sess.run(cost, feed_dict = {z: logits, y: labels}) 319 | 320 | # Close the session (approx. 1 line). See method 1 above. 321 | sess.close() 322 | 323 | ### END CODE HERE ### 324 | 325 | return cost 326 | 327 | 328 | # In[11]: 329 | 330 | logits = sigmoid(np.array([0.2,0.4,0.7,0.9])) 331 | cost = cost(logits, np.array([0,0,1,1])) 332 | print ("cost = " + str(cost)) 333 | 334 | 335 | # ** Expected Output** : 336 | # 337 | # 338 | # 339 | # 342 | # 345 | # 346 | # 347 | #
340 | # **cost** 341 | # 343 | # [ 1.00538719 1.03664088 0.41385433 0.39956614] 344 | #
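The expected output above is a vector with one loss per example, not a single averaged number. A small standard-library sketch can reproduce those four values; it assumes the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))` given in the TensorFlow documentation for this op. The helper names `sigmoid_reference` and `sigmoid_cross_entropy` are hypothetical and exist only for this illustration.

```python
import math

def sigmoid_reference(z):
    """Logistic sigmoid of a scalar z."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_cross_entropy(z, y):
    """Per-example loss: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Same inputs as the test cell: the "logits" are themselves sigmoid outputs.
logits = [sigmoid_reference(v) for v in [0.2, 0.4, 0.7, 0.9]]
labels = [0, 0, 1, 1]

losses = [sigmoid_cross_entropy(z, y) for z, y in zip(logits, labels)]
mean_cost = sum(losses) / len(losses)   # averaging gives the scalar J of formula (2)

print([round(loss, 4) for loss in losses])  # [1.0054, 1.0366, 0.4139, 0.3996]
```

Averaging the per-example values, as `mean_cost` does, is the extra step needed to obtain the scalar cost $J$.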
348 | 349 | # ### 1.4 - Using One Hot encodings 350 | # 351 | # Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows: 352 | # 353 | # 354 | # 355 | # 356 | # This is called a "one hot" encoding, because in the converted representation exactly one element of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code: 357 | # 358 | # - tf.one_hot(labels, depth, axis) 359 | # 360 | # **Exercise:** Implement the function below to take one vector of labels and the total number of classes $C$, and return the one hot encoding. Use `tf.one_hot()` to do this. 361 | 362 | # In[17]: 363 | 364 | # GRADED FUNCTION: one_hot_matrix 365 | 366 | def one_hot_matrix(labels, C): 367 | """ 368 | Creates a matrix where the i-th row corresponds to the ith class number and the jth column 369 | corresponds to the jth training example. So if example j had a label i. Then entry (i,j) 370 | will be 1. 371 | 372 | Arguments: 373 | labels -- vector containing the labels 374 | C -- number of classes, the depth of the one hot dimension 375 | 376 | Returns: 377 | one_hot -- one hot matrix 378 | """ 379 | 380 | ### START CODE HERE ### 381 | 382 | # Create a tf.constant equal to C (depth), name it 'C'. (approx. 1 line) 383 | depth = tf.constant(C, name = "C") 384 | 385 | # Use tf.one_hot, be careful with the axis (approx. 1 line) 386 | one_hot_matrix = tf.one_hot(labels, depth, axis = 0) 387 | 388 | # Create the session (approx. 1 line) 389 | sess = tf.Session() 390 | 391 | # Run the session (approx. 1 line) 392 | one_hot = sess.run(one_hot_matrix) 393 | 394 | # Close the session (approx. 1 line). See method 1 above. 
395 | sess.close() 396 | 397 | ### END CODE HERE ### 398 | 399 | return one_hot 400 | 401 | 402 | # In[18]: 403 | 404 | labels = np.array([1,2,3,0,2,1]) 405 | one_hot = one_hot_matrix(labels, C = 4) 406 | print ("one_hot = " + str(one_hot)) 407 | 408 | 409 | # **Expected Output**: 410 | # 411 | # 412 | # 413 | # 416 | # 422 | # 423 | # 424 | #
414 | # **one_hot** 415 | # 417 | # [[ 0. 0. 0. 1. 0. 0.] 418 | # [ 1. 0. 0. 0. 0. 1.] 419 | # [ 0. 1. 0. 0. 1. 0.] 420 | # [ 0. 0. 1. 0. 0. 0.]] 421 | #
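The layout produced with `axis = 0` (one row per class, one column per example) can be reproduced in a few lines of plain Python. This is only a sketch of the conversion; `one_hot_columns` is a hypothetical helper name, not a TensorFlow or course function.

```python
def one_hot_columns(labels, num_classes):
    """Return a num_classes x len(labels) matrix; entry (i, j) is 1.0 when labels[j] == i."""
    return [[1.0 if label == i else 0.0 for label in labels]
            for i in range(num_classes)]

labels = [1, 2, 3, 0, 2, 1]
one_hot_reference = one_hot_columns(labels, 4)

for row in one_hot_reference:
    print(row)
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
# [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
# [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Each column contains exactly one 1, in the row whose index equals that example's label.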
425 | # 426 | 427 | # ### 1.5 - Initialize with zeros and ones 428 | # 429 | # Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is `tf.ones()`. To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array of dimension shape full of zeros and ones respectively. 430 | # 431 | # **Exercise:** Implement the function below to take in a shape and to return an array (of the shape's dimension of ones). 432 | # 433 | # - tf.ones(shape) 434 | # 435 | 436 | # In[19]: 437 | 438 | # GRADED FUNCTION: ones 439 | 440 | def ones(shape): 441 | """ 442 | Creates an array of ones of dimension shape 443 | 444 | Arguments: 445 | shape -- shape of the array you want to create 446 | 447 | Returns: 448 | ones -- array containing only ones 449 | """ 450 | 451 | ### START CODE HERE ### 452 | 453 | # Create "ones" tensor using tf.ones(...). (approx. 1 line) 454 | ones = tf.ones(shape) 455 | 456 | # Create the session (approx. 1 line) 457 | sess = tf.Session() 458 | 459 | # Run the session to compute 'ones' (approx. 1 line) 460 | ones = sess.run(ones) 461 | 462 | # Close the session (approx. 1 line). See method 1 above. 463 | sess.close() 464 | 465 | ### END CODE HERE ### 466 | return ones 467 | 468 | 469 | # In[20]: 470 | 471 | print ("ones = " + str(ones([3]))) 472 | 473 | 474 | # **Expected Output:** 475 | # 476 | # 477 | # 478 | # 481 | # 484 | # 485 | # 486 | #
479 | # **ones** 480 | # 482 | # [ 1. 1. 1.] 483 | #
487 | 488 | # # 2 - Building your first neural network in tensorflow 489 | # 490 | # In this part of the assignment you will build a neural network using tensorflow. Remember that there are two parts to implementing a tensorflow model: 491 | # 492 | # - Create the computation graph 493 | # - Run the graph 494 | # 495 | # Let's delve into the problem you'd like to solve! 496 | # 497 | # ### 2.0 - Problem statement: SIGNS Dataset 498 | # 499 | # One afternoon, with some friends we decided to teach our computers to decipher sign language. We spent a few hours taking pictures in front of a white wall and came up with the following dataset. It's now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn't understand sign language. 500 | # 501 | # - **Training set**: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number). 502 | # - **Test set**: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number). 503 | # 504 | # Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs. 505 | # 506 | # Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels. 507 | #
**Figure 1**: SIGNS dataset
508 | # 509 | # 510 | # Run the following code to load the dataset. 511 | 512 | # In[21]: 513 | 514 | # Loading the dataset 515 | X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset() 516 | 517 | 518 | # Change the index below and run the cell to visualize some examples in the dataset. 519 | 520 | # In[22]: 521 | 522 | # Example of a picture 523 | index = 0 524 | plt.imshow(X_train_orig[index]) 525 | print ("y = " + str(np.squeeze(Y_train_orig[:, index]))) 526 | 527 | 528 | # As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so. 529 | 530 | # In[23]: 531 | 532 | # Flatten the training and test images 533 | X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T 534 | X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T 535 | # Normalize image vectors 536 | X_train = X_train_flatten/255. 537 | X_test = X_test_flatten/255. 538 | # Convert training and test labels to one hot matrices 539 | Y_train = convert_to_one_hot(Y_train_orig, 6) 540 | Y_test = convert_to_one_hot(Y_test_orig, 6) 541 | 542 | print ("number of training examples = " + str(X_train.shape[1])) 543 | print ("number of test examples = " + str(X_test.shape[1])) 544 | print ("X_train shape: " + str(X_train.shape)) 545 | print ("Y_train shape: " + str(Y_train.shape)) 546 | print ("X_test shape: " + str(X_test.shape)) 547 | print ("Y_test shape: " + str(Y_test.shape)) 548 | 549 | 550 | # **Note** that 12288 comes from $64 \times 64 \times 3$. Each image is square, 64 by 64 pixels, and 3 is for the RGB colors. Please make sure all these shapes make sense to you before continuing. 551 | 552 | # **Your goal** is to build an algorithm capable of recognizing a sign with high accuracy. 
To do so, you are going to build a tensorflow model that is almost the same as one you have previously built in numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your numpy implementation to the tensorflow one. 553 | # 554 | # **The model** is *LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX*. The SIGMOID output layer has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are more than two classes. 555 | 556 | # ### 2.1 - Create placeholders 557 | # 558 | # Your first task is to create placeholders for `X` and `Y`. This will allow you to later pass your training data in when you run your session. 559 | # 560 | # **Exercise:** Implement the function below to create the placeholders in tensorflow. 561 | 562 | # In[24]: 563 | 564 | # GRADED FUNCTION: create_placeholders 565 | 566 | def create_placeholders(n_x, n_y): 567 | """ 568 | Creates the placeholders for the tensorflow session. 569 | 570 | Arguments: 571 | n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288) 572 | n_y -- scalar, number of classes (from 0 to 5, so -> 6) 573 | 574 | Returns: 575 | X -- placeholder for the data input, of shape [n_x, None] and dtype "float" 576 | Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float" 577 | 578 | Tips: 579 | - You will use None because it lets us be flexible about the number of examples fed to the placeholders. 580 | In fact, the number of examples during test/train is different. 581 | """ 582 | 583 | ### START CODE HERE ### (approx.
2 lines) 584 | X = tf.placeholder(tf.float32, [n_x, None], name = "X") 585 | Y = tf.placeholder(tf.float32, [n_y, None], name = "Y") 586 | ### END CODE HERE ### 587 | 588 | return X, Y 589 | 590 | 591 | # In[25]: 592 | 593 | X, Y = create_placeholders(12288, 6) 594 | print ("X = " + str(X)) 595 | print ("Y = " + str(Y)) 596 | 597 | 598 | # **Expected Output**: 599 | # 600 | # 601 | # 602 | # 605 | # 608 | # 609 | # 610 | # 613 | # 616 | # 617 | # 618 | #
603 | # **X** 604 | # 606 | # Tensor("Placeholder_1:0", shape=(12288, ?), dtype=float32) (not necessarily Placeholder_1) 607 | #
611 | # **Y** 612 | # 614 | # Tensor("Placeholder_2:0", shape=(6, ?), dtype=float32) (not necessarily Placeholder_2) 615 | #
619 | 620 | # ### 2.2 - Initializing the parameters 621 | # 622 | # Your second task is to initialize the parameters in tensorflow. 623 | # 624 | # **Exercise:** Implement the function below to initialize the parameters in tensorflow. You are going to use Xavier Initialization for weights and Zero Initialization for biases. The shapes are given below. As an example, to help you, for W1 and b1 you could use: 625 | # 626 | # ```python 627 | # W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1)) 628 | # b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer()) 629 | # ``` 630 | # Please use `seed = 1` to make sure your results match ours. 631 | 632 | # In[26]: 633 | 634 | # GRADED FUNCTION: initialize_parameters 635 | 636 | def initialize_parameters(): 637 | """ 638 | Initializes parameters to build a neural network with tensorflow. The shapes are: 639 | W1 : [25, 12288] 640 | b1 : [25, 1] 641 | W2 : [12, 25] 642 | b2 : [12, 1] 643 | W3 : [6, 12] 644 | b3 : [6, 1] 645 | 646 | Returns: 647 | parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3 648 | """ 649 | 650 | tf.set_random_seed(1) # so that your "random" numbers match ours 651 | 652 | ### START CODE HERE ### (approx.
6 lines of code) 653 | W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1)) 654 | b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer()) 655 | W2 = tf.get_variable("W2", [12, 25], initializer = tf.contrib.layers.xavier_initializer(seed = 1)) 656 | b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer()) 657 | W3 = tf.get_variable("W3", [6, 12], initializer = tf.contrib.layers.xavier_initializer(seed = 1)) 658 | b3 = tf.get_variable("b3", [6,1], initializer = tf.zeros_initializer()) 659 | ### END CODE HERE ### 660 | 661 | parameters = {"W1": W1, 662 | "b1": b1, 663 | "W2": W2, 664 | "b2": b2, 665 | "W3": W3, 666 | "b3": b3} 667 | 668 | return parameters 669 | 670 | 671 | # In[27]: 672 | 673 | tf.reset_default_graph() 674 | with tf.Session() as sess: 675 | parameters = initialize_parameters() 676 | print("W1 = " + str(parameters["W1"])) 677 | print("b1 = " + str(parameters["b1"])) 678 | print("W2 = " + str(parameters["W2"])) 679 | print("b2 = " + str(parameters["b2"])) 680 | 681 | 682 | # **Expected Output**: 683 | # 684 | # 685 | # 686 | # 689 | # 692 | # 693 | # 694 | # 697 | # 700 | # 701 | # 702 | # 705 | # 708 | # 709 | # 710 | # 713 | # 716 | # 717 | # 718 | #
687 | # **W1** 688 | # 690 | # < tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref > 691 | #
695 | # **b1** 696 | # 698 | # < tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref > 699 | #
703 | # **W2** 704 | # 706 | # < tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref > 707 | #
711 | # **b2** 712 | # 714 | # < tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref > 715 | #
719 | 720 | # As expected, the parameters haven't been evaluated yet. 721 | 722 | # ### 2.3 - Forward propagation in tensorflow 723 | # 724 | # You will now implement the forward propagation module in tensorflow. The function will take in a dictionary of parameters and it will complete the forward pass. The functions you will be using are: 725 | # 726 | # - `tf.add(...,...)` to do an addition 727 | # - `tf.matmul(...,...)` to do a matrix multiplication 728 | # - `tf.nn.relu(...)` to apply the ReLU activation 729 | # 730 | # **Question:** Implement the forward pass of the neural network. We commented for you the numpy equivalents so that you can compare the tensorflow implementation to numpy. It is important to note that the forward propagation stops at `z3`. The reason is that in tensorflow the last linear layer output is given as input to the function computing the loss. Therefore, you don't need `a3`! 731 | # 732 | # 733 | 734 | # In[28]: 735 | 736 | # GRADED FUNCTION: forward_propagation 737 | 738 | def forward_propagation(X, parameters): 739 | """ 740 | Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX 741 | 742 | Arguments: 743 | X -- input dataset placeholder, of shape (input size, number of examples) 744 | parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3" 745 | the shapes are given in initialize_parameters 746 | 747 | Returns: 748 | Z3 -- the output of the last LINEAR unit 749 | """ 750 | 751 | # Retrieve the parameters from the dictionary "parameters" 752 | W1 = parameters['W1'] 753 | b1 = parameters['b1'] 754 | W2 = parameters['W2'] 755 | b2 = parameters['b2'] 756 | W3 = parameters['W3'] 757 | b3 = parameters['b3'] 758 | 759 | ### START CODE HERE ### (approx. 
5 lines) # Numpy Equivalents: 760 | Z1 = tf.add(tf.matmul(W1, X), b1) # Z1 = np.dot(W1, X) + b1 761 | A1 = tf.nn.relu(Z1) # A1 = relu(Z1) 762 | Z2 = tf.add(tf.matmul(W2, A1), b2) # Z2 = np.dot(W2, A1) + b2 763 | A2 = tf.nn.relu(Z2) # A2 = relu(Z2) 764 | Z3 = tf.add(tf.matmul(W3, A2), b3) # Z3 = np.dot(W3, A2) + b3 765 | ### END CODE HERE ### 766 | 767 | return Z3 768 | 769 | 770 | # In[29]: 771 | 772 | tf.reset_default_graph() 773 | 774 | with tf.Session() as sess: 775 | X, Y = create_placeholders(12288, 6) 776 | parameters = initialize_parameters() 777 | Z3 = forward_propagation(X, parameters) 778 | print("Z3 = " + str(Z3)) 779 | 780 | 781 | # **Expected Output**: 782 | # 783 | # 784 | # 785 | # 788 | # 791 | # 792 | # 793 | #
786 | # **Z3** 787 | # 789 | # Tensor("Add_2:0", shape=(6, ?), dtype=float32) 790 | #
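To see concrete numbers flow through the LINEAR -> RELU -> LINEAR -> RELU -> LINEAR stack, here is a framework-free sketch on a tiny toy network. Everything in it is an assumption for illustration: the helper names (`matmul`, `add_bias`, `relu`, `forward_reference`), the toy shapes (3 inputs, 2 hidden units per layer, 3 outputs), and the hand-picked integer weights. Like the graded function, it stops at `Z3` and applies no softmax.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_bias(Z, b):
    """Add a column vector b of shape (n, 1) to every column of Z (broadcasting)."""
    return [[z + b_row[0] for z in z_row] for z_row, b_row in zip(Z, b)]

def relu(Z):
    """Element-wise max(0, z)."""
    return [[max(0, z) for z in row] for row in Z]

def forward_reference(X, p):
    """LINEAR -> RELU -> LINEAR -> RELU -> LINEAR; returns Z3 (no softmax)."""
    Z1 = add_bias(matmul(p["W1"], X), p["b1"])
    A1 = relu(Z1)
    Z2 = add_bias(matmul(p["W2"], A1), p["b2"])
    A2 = relu(Z2)
    Z3 = add_bias(matmul(p["W3"], A2), p["b3"])
    return Z3

# Hand-picked toy parameters (not the assignment's shapes or values).
toy_parameters = {
    "W1": [[1, 0, -1], [0, 1, 0]], "b1": [[0], [0]],
    "W2": [[1, 1], [0, -1]],       "b2": [[1], [0]],
    "W3": [[1, 0], [0, 1], [1, 1]], "b3": [[0], [0], [0]],
}
X_toy = [[1, 2], [3, 4], [5, 6]]   # 3 features, 2 examples (one example per column)

Z3_toy = forward_reference(X_toy, toy_parameters)
print(Z3_toy)  # [[4, 5], [0, 0], [4, 5]]
```

The output has one row per output unit and one column per example, matching the `(6, ?)` layout of `Z3` in the assignment.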
794 | 795 | # You may have noticed that the forward propagation doesn't output any cache. You will understand why below, when we get to backpropagation. 796 | 797 | # ### 2.4 - Compute cost 798 | # 799 | # As seen before, it is very easy to compute the cost using: 800 | # ```python 801 | # tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...)) 802 | # ``` 803 | # **Question**: Implement the cost function below. 804 | # - It is important to know that the "`logits`" and "`labels`" inputs of `tf.nn.softmax_cross_entropy_with_logits` are expected to be of shape (number of examples, num_classes). We have thus transposed Z3 and Y for you. 805 | # - Besides, `tf.reduce_mean` averages the per-example losses over the examples. 806 | 807 | # In[30]: 808 | 809 | # GRADED FUNCTION: compute_cost 810 | 811 | def compute_cost(Z3, Y): 812 | """ 813 | Computes the cost 814 | 815 | Arguments: 816 | Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples) 817 | Y -- "true" labels vector placeholder, same shape as Z3 818 | 819 | Returns: 820 | cost -- Tensor of the cost function 821 | """ 822 | 823 | # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...) 824 | logits = tf.transpose(Z3) 825 | labels = tf.transpose(Y) 826 | 827 | ### START CODE HERE ### (1 line of code) 828 | cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels)) 829 | ### END CODE HERE ### 830 | 831 | return cost 832 | 833 | 834 | # In[31]: 835 | 836 | tf.reset_default_graph() 837 | 838 | with tf.Session() as sess: 839 | X, Y = create_placeholders(12288, 6) 840 | parameters = initialize_parameters() 841 | Z3 = forward_propagation(X, parameters) 842 | cost = compute_cost(Z3, Y) 843 | print("cost = " + str(cost)) 844 | 845 | 846 | # **Expected Output**: 847 | # 848 | # 849 | # 850 | # 853 | # 856 | # 857 | # 858 | #
851 | # **cost** 852 | # 854 | # Tensor("Mean:0", shape=(), dtype=float32) 855 | #
859 | 860 | # ### 2.5 - Backward propagation & parameter updates 861 | # 862 | # This is where you become grateful to programming frameworks. All the backpropagation and the parameters update is taken care of in 1 line of code. It is very easy to incorporate this line in the model. 863 | # 864 | # After you compute the cost function. You will create an "`optimizer`" object. You have to call this object along with the cost when running the tf.session. When called, it will perform an optimization on the given cost with the chosen method and learning rate. 865 | # 866 | # For instance, for gradient descent the optimizer would be: 867 | # ```python 868 | # optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost) 869 | # ``` 870 | # 871 | # To make the optimization you would do: 872 | # ```python 873 | # _ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y}) 874 | # ``` 875 | # 876 | # This computes the backpropagation by passing through the tensorflow graph in the reverse order. From cost to inputs. 877 | # 878 | # **Note** When coding, we often use `_` as a "throwaway" variable to store values that we won't need to use later. Here, `_` takes on the evaluated value of `optimizer`, which we don't need (and `c` takes the value of the `cost` variable). 879 | 880 | # ### 2.6 - Building the model 881 | # 882 | # Now, you will bring it all together! 883 | # 884 | # **Exercise:** Implement the model. You will be calling the functions you had previously implemented. 885 | 886 | # In[32]: 887 | 888 | def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, 889 | num_epochs = 1500, minibatch_size = 32, print_cost = True): 890 | """ 891 | Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX. 
892 | 893 | Arguments: 894 | X_train -- training set, of shape (input size = 12288, number of training examples = 1080) 895 | Y_train -- training labels, of shape (output size = 6, number of training examples = 1080) 896 | X_test -- test set, of shape (input size = 12288, number of test examples = 120) 897 | Y_test -- test labels, of shape (output size = 6, number of test examples = 120) 898 | learning_rate -- learning rate of the optimization 899 | num_epochs -- number of epochs of the optimization loop 900 | minibatch_size -- size of a minibatch 901 | print_cost -- True to print the cost every 100 epochs 902 | 903 | Returns: 904 | parameters -- parameters learnt by the model. They can then be used to predict. 905 | """ 906 | 907 | ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables 908 | tf.set_random_seed(1) # to keep consistent results 909 | seed = 3 # to keep consistent results 910 | (n_x, m) = X_train.shape # (n_x: input size, m : number of examples in the train set) 911 | n_y = Y_train.shape[0] # n_y : output size 912 | costs = [] # To keep track of the cost 913 | 914 | # Create Placeholders of shape (n_x, n_y) 915 | ### START CODE HERE ### (1 line) 916 | X, Y = create_placeholders(n_x, n_y) 917 | ### END CODE HERE ### 918 | 919 | # Initialize parameters 920 | ### START CODE HERE ### (1 line) 921 | parameters = initialize_parameters() 922 | ### END CODE HERE ### 923 | 924 | # Forward propagation: Build the forward propagation in the tensorflow graph 925 | ### START CODE HERE ### (1 line) 926 | Z3 = forward_propagation(X, parameters) 927 | ### END CODE HERE ### 928 | 929 | # Cost function: Add cost function to tensorflow graph 930 | ### START CODE HERE ### (1 line) 931 | cost = compute_cost(Z3, Y) 932 | ### END CODE HERE ### 933 | 934 | # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
935 | ### START CODE HERE ### (1 line) 936 | optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost) 937 | ### END CODE HERE ### 938 | 939 | # Initialize all the variables 940 | init = tf.global_variables_initializer() 941 | 942 | # Start the session to compute the tensorflow graph 943 | with tf.Session() as sess: 944 | 945 | # Run the initialization 946 | sess.run(init) 947 | 948 | # Do the training loop 949 | for epoch in range(num_epochs): 950 | 951 | epoch_cost = 0. # Defines a cost related to an epoch 952 | num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set 953 | seed = seed + 1 954 | minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed) 955 | 956 | for minibatch in minibatches: 957 | 958 | # Select a minibatch 959 | (minibatch_X, minibatch_Y) = minibatch 960 | 961 | # IMPORTANT: The line that runs the graph on a minibatch. 962 | # Run the session to execute the "optimizer" and the "cost", the feedict should contain a minibatch for (X,Y). 
963 | ### START CODE HERE ### (1 line) 964 | _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y}) 965 | ### END CODE HERE ### 966 | 967 | epoch_cost += minibatch_cost / num_minibatches 968 | 969 | # Print the cost every epoch 970 | if print_cost == True and epoch % 100 == 0: 971 | print ("Cost after epoch %i: %f" % (epoch, epoch_cost)) 972 | if print_cost == True and epoch % 5 == 0: 973 | costs.append(epoch_cost) 974 | 975 | # plot the cost 976 | plt.plot(np.squeeze(costs)) 977 | plt.ylabel('cost') 978 | plt.xlabel('iterations (per tens)') 979 | plt.title("Learning rate =" + str(learning_rate)) 980 | plt.show() 981 | 982 | # lets save the parameters in a variable 983 | parameters = sess.run(parameters) 984 | print ("Parameters have been trained!") 985 | 986 | # Calculate the correct predictions 987 | correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y)) 988 | 989 | # Calculate accuracy on the test set 990 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) 991 | 992 | print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train})) 993 | print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test})) 994 | 995 | return parameters 996 | 997 | 998 | # Run the following cell to train your model! On our machine it takes about 5 minutes. Your "Cost after epoch 100" should be 1.016458. If it's not, don't waste time; interrupt the training by clicking on the square (⬛) in the upper bar of the notebook, and try to correct your code. If it is the correct cost, take a break and come back in 5 minutes! 999 | 1000 | # In[33]: 1001 | 1002 | parameters = model(X_train, Y_train, X_test, Y_test) 1003 | 1004 | 1005 | # **Expected Output**: 1006 | # 1007 | # 1008 | # 1009 | # 1012 | # 1015 | # 1016 | # 1017 | # 1020 | # 1023 | # 1024 | # 1025 | #
1010 | # **Train Accuracy** 1011 | # 1013 | # 0.999074 1014 | #
1018 | # **Test Accuracy** 1019 | # 1021 | # 0.716667 1022 | #
1026 | # 1027 | # Amazing, your algorithm can recognize a sign representing a figure between 0 and 5 with 71.7% accuracy. 1028 | # 1029 | # **Insights**: 1030 | # - Your model seems big enough to fit the training set well. However, given the difference between train and test accuracy, you could try to add L2 or dropout regularization to reduce overfitting. 1031 | # - Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters. 1032 | 1033 | # ### 2.7 - Test with your own image (optional / ungraded exercise) 1034 | # 1035 | # Congratulations on finishing this assignment. You can now take a picture of your hand and see the output of your model. To do that: 1036 | # 1. Click on "File" in the upper bar of this notebook, then click "Open" to go on your Coursera Hub. 1037 | # 2. Add your image to this Jupyter Notebook's directory, in the "images" folder 1038 | # 3. Write your image's name in the following code 1039 | # 4. Run the code and check if the algorithm is right! 1040 | 1041 | # In[ ]: 1042 | 1043 | import scipy 1044 | from PIL import Image 1045 | from scipy import ndimage 1046 | 1047 | ## START CODE HERE ## (PUT YOUR IMAGE NAME) 1048 | my_image = "thumbs_up.jpg" 1049 | ## END CODE HERE ## 1050 | 1051 | # We preprocess your image to fit your algorithm. 1052 | fname = "images/" + my_image 1053 | image = np.array(ndimage.imread(fname, flatten=False)) 1054 | my_image = scipy.misc.imresize(image, size=(64,64)).reshape((1, 64*64*3)).T 1055 | my_image_prediction = predict(my_image, parameters) 1056 | 1057 | plt.imshow(image) 1058 | print("Your algorithm predicts: y = " + str(np.squeeze(my_image_prediction))) 1059 | 1060 | 1061 | # You indeed deserved a "thumbs-up" although as you can see the algorithm seems to classify it incorrectly. 
The reason is that the training set doesn't contain any "thumbs-up", so the model doesn't know how to deal with it! We call that a "mismatched data distribution" and it is one of the topics of the next course on "Structuring Machine Learning Projects". 1062 | 1063 | # 1064 | # **What you should remember**: 1065 | # - Tensorflow is a programming framework used in deep learning 1066 | # - The two main object classes in tensorflow are Tensors and Operations. 1067 | # - When you code in tensorflow you have to take the following steps: 1068 | # - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...) 1069 | # - Create a session 1070 | # - Initialize the session 1071 | # - Run the session to execute the graph 1072 | # - You can execute the graph multiple times as you've seen in model() 1073 | # - The backpropagation and optimization are done automatically when running the session on the "optimizer" object. 1074 | -------------------------------------------------------------------------------- /Week 3 PA 1/Tensorflow+Tutorial/Tensorflow Tutorial.md: -------------------------------------------------------------------------------- 1 | 2 | # TensorFlow Tutorial 3 | 4 | Welcome to this week's programming assignment. Until now, you've always used numpy to build neural networks. Now we will step you through a deep learning framework that will allow you to build neural networks more easily. Machine learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can speed up your machine learning development significantly. All of these frameworks also have a lot of documentation, which you should feel free to read.
In this assignment, you will learn to do the following in TensorFlow: 5 | 6 | - Initialize variables 7 | - Start your own session 8 | - Train algorithms 9 | - Implement a Neural Network 10 | 11 | Programming frameworks can not only shorten your coding time, but sometimes also perform optimizations that speed up your code. 12 | 13 | ## 1 - Exploring the Tensorflow Library 14 | 15 | To start, you will import the library: 16 | 17 | 18 | 19 | ```python 20 | import math 21 | import numpy as np 22 | import h5py 23 | import matplotlib.pyplot as plt 24 | import tensorflow as tf 25 | from tensorflow.python.framework import ops 26 | from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict 27 | 28 | %matplotlib inline 29 | np.random.seed(1) 30 | ``` 31 | 32 | Now that you have imported the library, we will walk you through its different applications. You will start with an example, where we compute for you the loss of one training example. 33 | $$loss = \mathcal{L}(\hat{y}, y) = (\hat y^{(i)} - y^{(i)})^2 \tag{1}$$ 34 | 35 | 36 | ```python 37 | y_hat = tf.constant(36, name='y_hat') # Define y_hat constant. Set to 36. 38 | y = tf.constant(39, name='y') # Define y. Set to 39 39 | 40 | loss = tf.Variable((y - y_hat)**2, name='loss') # Create a variable for the loss 41 | 42 | init = tf.global_variables_initializer() # When init is run later (session.run(init)), 43 | # the loss variable will be initialized and ready to be computed 44 | with tf.Session() as session: # Create a session and print the output 45 | session.run(init) # Initializes the variables 46 | print(session.run(loss)) # Prints the loss 47 | ``` 48 | 49 | 9 50 | 51 | 52 | Writing and running programs in TensorFlow has the following steps: 53 | 54 | 1. Create Tensors (variables) that are not yet executed/evaluated. 55 | 2. Write operations between those Tensors. 56 | 3. Initialize your Tensors. 57 | 4. Create a Session. 58 | 5. Run the Session.
This will run the operations you'd written above. 59 | 60 | Therefore, when we created a variable for the loss, we simply defined the loss as a function of other quantities, but did not evaluate its value. To evaluate it, we had to run `init=tf.global_variables_initializer()`. That initialized the loss variable, and in the last line we were finally able to evaluate the value of `loss` and print its value. 61 | 62 | Now let us look at an easy example. Run the cell below: 63 | 64 | 65 | ```python 66 | a = tf.constant(2) 67 | b = tf.constant(10) 68 | c = tf.multiply(a,b) 69 | print(c) 70 | ``` 71 | 72 | Tensor("Mul:0", shape=(), dtype=int32) 73 | 74 | 75 | As expected, you will not see 20! You got a tensor saying that the result is a tensor that does not have the shape attribute, and is of type "int32". All you did was put in the 'computation graph', but you have not run this computation yet. In order to actually multiply the two numbers, you will have to create a session and run it. 76 | 77 | 78 | ```python 79 | sess = tf.Session() 80 | print(sess.run(c)) 81 | ``` 82 | 83 | 20 84 | 85 | 86 | Great! To summarize, **remember to initialize your variables, create a session and run the operations inside the session**. 87 | 88 | Next, you'll also have to know about placeholders. A placeholder is an object whose value you can specify only later. 89 | To specify values for a placeholder, you can pass in values by using a "feed dictionary" (`feed_dict` variable). Below, we created a placeholder for x. This allows us to pass in a number later when we run the session. 90 | 91 | 92 | ```python 93 | # Change the value of x in the feed_dict 94 | 95 | x = tf.placeholder(tf.int64, name = 'x') 96 | print(sess.run(2 * x, feed_dict = {x: 3})) 97 | sess.close() 98 | ``` 99 | 100 | 6 101 | 102 | 103 | When you first defined `x` you did not have to specify a value for it. A placeholder is simply a variable that you will assign data to only later, when running the session. 
We say that you **feed data** to these placeholders when running the session. 104 | 105 | Here's what's happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph. 106 | 107 | ### 1.1 - Linear function 108 | 109 | Lets start this programming exercise by computing the following equation: $Y = WX + b$, where $W$ and $X$ are random matrices and b is a random vector. 110 | 111 | **Exercise**: Compute $WX + b$ where $W, X$, and $b$ are drawn from a random normal distribution. W is of shape (4, 3), X is (3,1) and b is (4,1). As an example, here is how you would define a constant X that has shape (3,1): 112 | ```python 113 | X = tf.constant(np.random.randn(3,1), name = "X") 114 | 115 | ``` 116 | You might find the following functions helpful: 117 | - tf.matmul(..., ...) to do a matrix multiplication 118 | - tf.add(..., ...) to do an addition 119 | - np.random.randn(...) to initialize randomly 120 | 121 | 122 | 123 | ```python 124 | # GRADED FUNCTION: linear_function 125 | 126 | def linear_function(): 127 | """ 128 | Implements a linear function: 129 | Initializes W to be a random tensor of shape (4,3) 130 | Initializes X to be a random tensor of shape (3,1) 131 | Initializes b to be a random tensor of shape (4,1) 132 | Returns: 133 | result -- runs the session for Y = WX + b 134 | """ 135 | 136 | np.random.seed(1) 137 | 138 | ### START CODE HERE ### (4 lines of code) 139 | X = tf.constant(np.random.randn(3,1), name = "X") 140 | W = tf.constant(np.random.randn(4,3), name = "W") 141 | b = tf.constant(np.random.randn(4,1), name = "b") 142 | Y = tf.add(tf.matmul(W, X), b) 143 | ### END CODE HERE ### 144 | 145 | # Create the session using tf.Session() and run it with sess.run(...) 
on the variable you want to calculate 146 | 147 | ### START CODE HERE ### 148 | sess = tf.Session() 149 | result = sess.run(Y) 150 | ### END CODE HERE ### 151 | 152 | # close the session 153 | sess.close() 154 | 155 | return result 156 | ``` 157 | 158 | 159 | ```python 160 | print( "result = " + str(linear_function())) 161 | ``` 162 | 163 | result = [[-2.15657382] 164 | [ 2.95891446] 165 | [-1.08926781] 166 | [-0.84538042]] 167 | 168 | 169 | *** Expected Output ***: 170 | 171 | 172 | 173 | 176 | 182 | 183 | 184 |
174 | **result** 175 | 177 | [[-2.15657382] 178 | [ 2.95891446] 179 | [-1.08926781] 180 | [-0.84538042]] 181 |
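
As a sanity check on the expected output above, the same computation can be reproduced without a session. The sketch below uses plain NumPy instead of TensorFlow (a substitution made only for illustration) and assumes the same seed and the same sampling order as `linear_function` (X, then W, then b). The name `linear_function_np` is hypothetical and not part of the assignment.

```python
import numpy as np

def linear_function_np():
    """NumPy stand-in for linear_function(): evaluates Y = WX + b eagerly."""
    np.random.seed(1)            # same seed as in linear_function()
    X = np.random.randn(3, 1)    # sampled first, as in the graded function
    W = np.random.randn(4, 3)    # sampled second
    b = np.random.randn(4, 1)    # sampled third
    return W @ X + b             # matrix product plus bias, shape (4, 1)

result_np = linear_function_np()
print("result = " + str(result_np))
```

Because NumPy evaluates each line immediately, no graph or session is involved; the TensorFlow version only builds the graph and needs `sess.run(Y)` to produce numbers.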
185 | 186 | ### 1.2 - Computing the sigmoid 187 | Great! You just implemented a linear function. Tensorflow offers a variety of commonly used neural network functions like `tf.sigmoid` and `tf.softmax`. For this exercise lets compute the sigmoid function of an input. 188 | 189 | You will do this exercise using a placeholder variable `x`. When running the session, you should use the feed dictionary to pass in the input `z`. In this exercise, you will have to (i) create a placeholder `x`, (ii) define the operations needed to compute the sigmoid using `tf.sigmoid`, and then (iii) run the session. 190 | 191 | ** Exercise **: Implement the sigmoid function below. You should use the following: 192 | 193 | - `tf.placeholder(tf.float32, name = "...")` 194 | - `tf.sigmoid(...)` 195 | - `sess.run(..., feed_dict = {x: z})` 196 | 197 | 198 | Note that there are two typical ways to create and use sessions in tensorflow: 199 | 200 | **Method 1:** 201 | ```python 202 | sess = tf.Session() 203 | # Run the variables initialization (if needed), run the operations 204 | result = sess.run(..., feed_dict = {...}) 205 | sess.close() # Close the session 206 | ``` 207 | **Method 2:** 208 | ```python 209 | with tf.Session() as sess: 210 | # run the variables initialization (if needed), run the operations 211 | result = sess.run(..., feed_dict = {...}) 212 | # This takes care of closing the session for you :) 213 | ``` 214 | 215 | 216 | 217 | ```python 218 | # GRADED FUNCTION: sigmoid 219 | 220 | def sigmoid(z): 221 | """ 222 | Computes the sigmoid of z 223 | 224 | Arguments: 225 | z -- input value, scalar or vector 226 | 227 | Returns: 228 | results -- the sigmoid of z 229 | """ 230 | 231 | ### START CODE HERE ### ( approx. 4 lines of code) 232 | # Create a placeholder for x. Name it 'x'. 233 | x = tf.placeholder(tf.float32, name = "x") 234 | 235 | # compute sigmoid(x) 236 | sigmoid = tf.sigmoid(x) 237 | 238 | # Create a session, and run it. Please use the method 2 explained above. 
239 | # You should use a feed_dict to pass z's value to x. 240 | with tf.Session() as sess: 241 | # Run session and call the output "result" 242 | result = sess.run(sigmoid, feed_dict = {x: z}) 243 | 244 | ### END CODE HERE ### 245 | 246 | return result 247 | ``` 248 | 249 | 250 | ```python 251 | print ("sigmoid(0) = " + str(sigmoid(0))) 252 | print ("sigmoid(12) = " + str(sigmoid(12))) 253 | ``` 254 | 255 | sigmoid(0) = 0.5 256 | sigmoid(12) = 0.999994 257 | 258 | 259 | *** Expected Output ***: 260 | 261 | 262 | 263 | 266 | 269 | 270 | 271 | 274 | 277 | 278 | 279 |
264 | **sigmoid(0)** 265 | 267 | 0.5 268 |
272 | **sigmoid(12)** 273 | 275 | 0.999994 276 |
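
For comparison, the sigmoid can also be written directly from its definition, $\sigma(z) = 1 / (1 + e^{-z})$. The following is a minimal NumPy sketch, not the graded TensorFlow solution; `sigmoid_np` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np

def sigmoid_np(z):
    """Element-wise sigmoid computed directly from its definition."""
    z = np.asarray(z, dtype=np.float64)   # accept scalars, lists, or arrays
    return 1.0 / (1.0 + np.exp(-z))

print("sigmoid(0) = " + str(sigmoid_np(0)))
print("sigmoid(12) = " + str(sigmoid_np(12)))
```

The printed digits may differ slightly from the TensorFlow output because this sketch works in float64 while the placeholder above uses `tf.float32`.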
280 | 281 | 282 | **To summarize, you now know how to**: 283 | 1. Create placeholders 284 | 2. Specify the computation graph corresponding to operations you want to compute 285 | 3. Create the session 286 | 4. Run the session, using a feed dictionary if necessary to specify placeholder variables' values. 287 | 288 | ### 1.3 - Computing the Cost 289 | 290 | You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of $a^{[2](i)}$ and $y^{(i)}$ for i=1...m: 291 | $$ J = - \frac{1}{m} \sum_{i = 1}^m \large ( \small y^{(i)} \log a^{ [2] (i)} + (1-y^{(i)})\log (1-a^{ [2] (i)} )\large )\small\tag{2}$$ 292 | 293 | you can do it in one line of code in tensorflow! 294 | 295 | **Exercise**: Implement the cross entropy loss. The function you will use is: 296 | 297 | 298 | - `tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)` 299 | 300 | Your code should input `z`, compute the sigmoid (to get `a`) and then compute the cross entropy cost $J$. All this can be done using one call to `tf.nn.sigmoid_cross_entropy_with_logits`, which computes 301 | 302 | $$- \frac{1}{m} \sum_{i = 1}^m \large ( \small y^{(i)} \log \sigma(z^{[2](i)}) + (1-y^{(i)})\log (1-\sigma(z^{[2](i)}))\large )\small\tag{2}$$ 303 | 304 | 305 | 306 | 307 | ```python 308 | # GRADED FUNCTION: cost 309 | 310 | def cost(logits, labels): 311 | """ 312 | Computes the cost using the sigmoid cross entropy 313 | 314 | Arguments: 315 | logits -- vector containing z, output of the last linear unit (before the final sigmoid activation) 316 | labels -- vector of labels y (1 or 0) 317 | 318 | Note: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" 319 | in the TensorFlow documentation. So logits will feed into z, and labels into y.
320 |      321 |     Returns: 322 |     cost -- runs the session of the cost (formula (2)) 323 | """ 324 | 325 | ### START CODE HERE ### 326 | 327 | # Create the placeholders for "logits" (z) and "labels" (y) (approx. 2 lines) 328 | z = tf.placeholder(tf.float32, name = "logits") 329 | y = tf.placeholder(tf.float32, name = "labels") 330 | 331 | # Use the loss function (approx. 1 line) 332 | cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = z, labels = y) 333 | 334 | # Create a session (approx. 1 line). See method 1 above. 335 | sess = tf.Session() 336 | 337 | # Run the session (approx. 1 line). 338 | cost = sess.run(cost, feed_dict = {z: logits, y: labels}) 339 | 340 | # Close the session (approx. 1 line). See method 1 above. 341 | sess.close() 342 | 343 | ### END CODE HERE ### 344 | 345 | return cost 346 | ``` 347 | 348 | 349 | ```python 350 | logits = sigmoid(np.array([0.2,0.4,0.7,0.9])) 351 | cost = cost(logits, np.array([0,0,1,1])) 352 | print ("cost = " + str(cost)) 353 | ``` 354 | 355 | cost = [ 1.00538719 1.03664088 0.41385433 0.39956614] 356 | 357 | 358 | ** Expected Output** : 359 | 360 | 361 | 362 | 365 | 368 | 369 | 370 |
363 | **cost** 364 | 366 | [ 1.00538719 1.03664088 0.41385433 0.39956614] 367 |
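
Note that the output above has one entry per example, which suggests that `tf.nn.sigmoid_cross_entropy_with_logits` returns the element-wise losses and leaves the averaging in formula (2) to the caller. The NumPy sketch below illustrates this under that assumption, using the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`. The helper names `sigmoid_np` and `sigmoid_cross_entropy_np` are hypothetical.

```python
import numpy as np

def sigmoid_np(z):
    """Element-wise sigmoid, used only to build the same inputs as the cell above."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=np.float64)))

def sigmoid_cross_entropy_np(logits, labels):
    """Per-example sigmoid cross entropy in a numerically stable form."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    # Algebraically equal to -(y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)))
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

logits = sigmoid_np(np.array([0.2, 0.4, 0.7, 0.9]))
per_example = sigmoid_cross_entropy_np(logits, np.array([0, 0, 1, 1]))
print("per-example loss = " + str(per_example))
print("mean cost J = " + str(per_example.mean()))   # the 1/m sum of formula (2)
```

Taking the mean of the per-example values is what `tf.reduce_mean` does later in `compute_cost`.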
371 | 372 | ### 1.4 - Using One Hot encodings 373 | 374 | Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows: 375 | 376 | 377 | 378 | 379 | This is called a "one hot" encoding, because in the converted representation exactly one element of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code: 380 | 381 | - tf.one_hot(labels, depth, axis) 382 | 383 | **Exercise:** Implement the function below to take one vector of labels and the total number of classes $C$, and return the one hot encoding. Use `tf.one_hot()` to do this. 384 | 385 | 386 | ```python 387 | # GRADED FUNCTION: one_hot_matrix 388 | 389 | def one_hot_matrix(labels, C): 390 | """ 391 | Creates a matrix where the i-th row corresponds to the ith class number and the jth column 392 | corresponds to the jth training example. So if example j had a label i. Then entry (i,j) 393 | will be 1. 394 | 395 | Arguments: 396 | labels -- vector containing the labels 397 | C -- number of classes, the depth of the one hot dimension 398 | 399 | Returns: 400 | one_hot -- one hot matrix 401 | """ 402 | 403 | ### START CODE HERE ### 404 | 405 | # Create a tf.constant equal to C (depth), name it 'C'. (approx. 1 line) 406 | depth = tf.constant(C, name = "C") 407 | 408 | # Use tf.one_hot, be careful with the axis (approx. 1 line) 409 | one_hot_matrix = tf.one_hot(labels, depth, axis = 0) 410 | 411 | # Create the session (approx. 1 line) 412 | sess = tf.Session() 413 | 414 | # Run the session (approx. 1 line) 415 | one_hot = sess.run(one_hot_matrix) 416 | 417 | # Close the session (approx. 1 line). See method 1 above. 
418 | sess.close() 419 | 420 | ### END CODE HERE ### 421 | 422 | return one_hot 423 | ``` 424 | 425 | 426 | ```python 427 | labels = np.array([1,2,3,0,2,1]) 428 | one_hot = one_hot_matrix(labels, C = 4) 429 | print ("one_hot = " + str(one_hot)) 430 | ``` 431 | 432 | one_hot = [[ 0. 0. 0. 1. 0. 0.] 433 | [ 1. 0. 0. 0. 0. 1.] 434 | [ 0. 1. 0. 0. 1. 0.] 435 | [ 0. 0. 1. 0. 0. 0.]] 436 | 437 | 438 | **Expected Output**: 439 | 440 | 441 | 442 | 445 | 451 | 452 | 453 |
443 | **one_hot** 444 | 446 | [[ 0. 0. 0. 1. 0. 0.] 447 | [ 1. 0. 0. 0. 0. 1.] 448 | [ 0. 1. 0. 0. 1. 0.] 449 | [ 0. 0. 1. 0. 0. 0.]] 450 |
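
The text above mentions that the same conversion takes a few lines in numpy. One possible way to write it is sketched below, as an illustration rather than the course's reference code: index the rows of an identity matrix by label, then transpose so that classes run along the rows, matching `axis = 0` in `tf.one_hot`. The name `one_hot_matrix_np` is hypothetical.

```python
import numpy as np

def one_hot_matrix_np(labels, C):
    """Return a (C, m) matrix whose column j has a single 1 in row labels[j]."""
    labels = np.asarray(labels)
    # np.eye(C)[labels] picks one identity row per example -> shape (m, C);
    # the transpose puts classes on rows and examples on columns.
    return np.eye(C)[labels].T

labels = np.array([1, 2, 3, 0, 2, 1])
one_hot_np = one_hot_matrix_np(labels, C=4)
print("one_hot = " + str(one_hot_np))
```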
454 | 455 | 456 | ### 1.5 - Initialize with zeros and ones 457 | 458 | Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is `tf.ones()`. To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array of dimension shape full of zeros and ones respectively. 459 | 460 | **Exercise:** Implement the function below to take in a shape and to return an array (of the shape's dimension of ones). 461 | 462 | - tf.ones(shape) 463 | 464 | 465 | 466 | ```python 467 | # GRADED FUNCTION: ones 468 | 469 | def ones(shape): 470 | """ 471 | Creates an array of ones of dimension shape 472 | 473 | Arguments: 474 | shape -- shape of the array you want to create 475 | 476 | Returns: 477 | ones -- array containing only ones 478 | """ 479 | 480 | ### START CODE HERE ### 481 | 482 | # Create "ones" tensor using tf.ones(...). (approx. 1 line) 483 | ones = tf.ones(shape) 484 | 485 | # Create the session (approx. 1 line) 486 | sess = tf.Session() 487 | 488 | # Run the session to compute 'ones' (approx. 1 line) 489 | ones = sess.run(ones) 490 | 491 | # Close the session (approx. 1 line). See method 1 above. 492 | sess.close() 493 | 494 | ### END CODE HERE ### 495 | return ones 496 | ``` 497 | 498 | 499 | ```python 500 | print ("ones = " + str(ones([3]))) 501 | ``` 502 | 503 | ones = [ 1. 1. 1.] 504 | 505 | 506 | **Expected Output:** 507 | 508 | 509 | 510 | 513 | 516 | 517 | 518 |
511 | **ones** 512 | 514 | [ 1. 1. 1.] 515 |
519 | 520 | # 2 - Building your first neural network in tensorflow 521 | 522 | In this part of the assignment you will build a neural network using tensorflow. Remember that there are two parts to implement a tensorflow model: 523 | 524 | - Create the computation graph 525 | - Run the graph 526 | 527 | Let's delve into the problem you'd like to solve! 528 | 529 | ### 2.0 - Problem statement: SIGNS Dataset 530 | 531 | One afternoon, with some friends we decided to teach our computers to decipher sign language. We spent a few hours taking pictures in front of a white wall and came up with the following dataset. It's now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn't understand sign language. 532 | 533 | - **Training set**: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number). 534 | - **Test set**: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number). 535 | 536 | Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs. 537 | 538 | Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels. 539 |
**Figure 1**: SIGNS dataset
540 | 541 | 542 | Run the following code to load the dataset. 543 | 544 | 545 | ```python 546 | # Loading the dataset 547 | X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset() 548 | ``` 549 | 550 | Change the index below and run the cell to visualize some examples in the dataset. 551 | 552 | 553 | ```python 554 | # Example of a picture 555 | index = 0 556 | plt.imshow(X_train_orig[index]) 557 | print ("y = " + str(np.squeeze(Y_train_orig[:, index]))) 558 | ``` 559 | 560 | y = 5 561 | 562 | 563 | 564 | ![png](output_35_1.png) 565 | 566 | 567 | As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so. 568 | 569 | 570 | ```python 571 | # Flatten the training and test images 572 | X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T 573 | X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T 574 | # Normalize image vectors 575 | X_train = X_train_flatten/255. 576 | X_test = X_test_flatten/255. 577 | # Convert training and test labels to one hot matrices 578 | Y_train = convert_to_one_hot(Y_train_orig, 6) 579 | Y_test = convert_to_one_hot(Y_test_orig, 6) 580 | 581 | print ("number of training examples = " + str(X_train.shape[1])) 582 | print ("number of test examples = " + str(X_test.shape[1])) 583 | print ("X_train shape: " + str(X_train.shape)) 584 | print ("Y_train shape: " + str(Y_train.shape)) 585 | print ("X_test shape: " + str(X_test.shape)) 586 | print ("Y_test shape: " + str(Y_test.shape)) 587 | ``` 588 | 589 | number of training examples = 1080 590 | number of test examples = 120 591 | X_train shape: (12288, 1080) 592 | Y_train shape: (6, 1080) 593 | X_test shape: (12288, 120) 594 | Y_test shape: (6, 120) 595 | 596 | 597 | **Note** that 12288 comes from $64 \times 64 \times 3$. Each image is square, 64 by 64 pixels, and 3 is for the RGB colors. 
Please make sure all these shapes make sense to you before continuing. 598 | 599 | **Your goal** is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you are going to build a tensorflow model that is almost the same as one you have previously built in numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your numpy implementation to the tensorflow one. 600 | 601 | **The model** is *LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX*. The SIGMOID output layer has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are more than two classes. 602 | 603 | ### 2.1 - Create placeholders 604 | 605 | Your first task is to create placeholders for `X` and `Y`. This will allow you to later pass your training data in when you run your session. 606 | 607 | **Exercise:** Implement the function below to create the placeholders in tensorflow. 608 | 609 | 610 | ```python 611 | # GRADED FUNCTION: create_placeholders 612 | 613 | def create_placeholders(n_x, n_y): 614 | """ 615 | Creates the placeholders for the tensorflow session. 616 | 617 | Arguments: 618 | n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288) 619 | n_y -- scalar, number of classes (from 0 to 5, so -> 6) 620 | 621 | Returns: 622 | X -- placeholder for the data input, of shape [n_x, None] and dtype "float" 623 | Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float" 624 | 625 | Tips: 626 | - You will use None because it lets us be flexible on the number of examples fed to the placeholders. 627 | In fact, the number of examples during test/train is different. 628 | """ 629 | 630 | ### START CODE HERE ### (approx.
2 lines) 631 | X = tf.placeholder(tf.float32, [n_x, None], name = "X") 632 | Y = tf.placeholder(tf.float32, [n_y, None], name = "Y") 633 | ### END CODE HERE ### 634 | 635 | return X, Y 636 | ``` 637 | 638 | 639 | ```python 640 | X, Y = create_placeholders(12288, 6) 641 | print ("X = " + str(X)) 642 | print ("Y = " + str(Y)) 643 | ``` 644 | 645 | X = Tensor("X_1:0", shape=(12288, ?), dtype=float32) 646 | Y = Tensor("Y:0", shape=(6, ?), dtype=float32) 647 | 648 | 649 | **Expected Output**: 650 | 651 | 652 | 653 | 656 | 659 | 660 | 661 | 664 | 667 | 668 | 669 |
<table>
    <tr>
        <td>
            **X**
        </td>
        <td>
        Tensor("Placeholder_1:0", shape=(12288, ?), dtype=float32) (not necessarily Placeholder_1)
        </td>
    </tr>
    <tr>
        <td>
            **Y**
        </td>
        <td>
        Tensor("Placeholder_2:0", shape=(6, ?), dtype=float32) (not necessarily Placeholder_2)
        </td>
    </tr>
</table>
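
To make the role of `None` concrete, here is a small NumPy sketch of the shape rule a placeholder applies to the arrays fed into it. The helper `matches_placeholder_shape` is hypothetical and written for illustration only; it is not how TensorFlow implements the check.

```python
import numpy as np

def matches_placeholder_shape(array, shape):
    """Hypothetical helper: True if array.shape fits `shape`, where None matches any size."""
    if array.ndim != len(shape):
        return False
    return all(expected is None or expected == actual
               for expected, actual in zip(shape, array.shape))

n_x = 12288
train_batch = np.zeros((n_x, 32), dtype=np.float32)   # a minibatch of 32 examples
test_set = np.zeros((n_x, 120), dtype=np.float32)     # the full test set of 120 examples
transposed = np.zeros((32, n_x), dtype=np.float32)    # examples along the wrong axis

print(matches_placeholder_shape(train_batch, [n_x, None]))  # True
print(matches_placeholder_shape(test_set, [n_x, None]))     # True
print(matches_placeholder_shape(transposed, [n_x, None]))   # False
```

The same placeholder therefore accepts a minibatch during training and the whole test set during evaluation, as long as the first dimension stays at `n_x`.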

### 2.2 - Initializing the parameters

Your second task is to initialize the parameters in tensorflow.

**Exercise:** Implement the function below to initialize the parameters in tensorflow. You are going to use Xavier Initialization for weights and Zero Initialization for biases. The shapes are given below. As an example, to help you, for W1 and b1 you could use:

```python
W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
```
Please use `seed = 1` to make sure your results match ours.


```python
# GRADED FUNCTION: initialize_parameters

def initialize_parameters():
    """
    Initializes parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [25, 12288]
                        b1 : [25, 1]
                        W2 : [12, 25]
                        b2 : [12, 1]
                        W3 : [6, 12]
                        b3 : [6, 1]
    
    Returns:
    parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
    """
    
    tf.set_random_seed(1)                   # so that your "random" numbers match ours
        
    ### START CODE HERE ### (approx. 6 lines of code)
    W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [12, 25], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [6, 12], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b3 = tf.get_variable("b3", [6,1], initializer = tf.zeros_initializer())
    ### END CODE HERE ###

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    
    return parameters
```


```python
tf.reset_default_graph()
with tf.Session() as sess:
    parameters = initialize_parameters()
    print("W1 = " + str(parameters["W1"]))
    print("b1 = " + str(parameters["b1"]))
    print("W2 = " + str(parameters["W2"]))
    print("b2 = " + str(parameters["b2"]))
```

    W1 = <tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref>
    b1 = <tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref>
    W2 = <tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref>
    b2 = <tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref>


**Expected Output**:
<table>
    <tr>
        <td>
            **W1**
        </td>
        <td>
         < tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref >
        </td>
    </tr>
    <tr>
        <td>
            **b1**
        </td>
        <td>
        < tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref >
        </td>
    </tr>
    <tr>
        <td>
            **W2**
        </td>
        <td>
        < tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref >
        </td>
    </tr>
    <tr>
        <td>
            **b2**
        </td>
        <td>
        < tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref >
        </td>
    </tr>
</table>
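
As a side note on the initializers used in this exercise, the following NumPy sketch shows what Xavier (Glorot) initialization computes for a weight matrix, assuming the uniform variant with bound `sqrt(6 / (fan_in + fan_out))`. The function name `xavier_uniform` is ours, and the sampled values will not match the numbers TensorFlow produces with `seed = 1`; only the scaling rule is illustrated.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Sample a 2-D weight matrix uniformly in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    fan_sum = shape[0] + shape[1]          # fan_in + fan_out for a 2-D weight matrix
    limit = np.sqrt(6.0 / fan_sum)
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.default_rng(1)             # assumption: any fixed seed, for reproducibility only
W1 = xavier_uniform((25, 12288), rng)      # same shape as W1 in the exercise
b1 = np.zeros((25, 1))                     # biases start at zero

print(W1.shape, b1.shape)
print("largest |weight|:", np.abs(W1).max())
print("theoretical bound:", np.sqrt(6.0 / (25 + 12288)))
```

Because the bound shrinks as the layer gets wider, the first layer's weights start out much smaller than those of the narrow output layer.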

As the printed `tf.Variable` objects show, the parameters haven't been evaluated yet.

### 2.3 - Forward propagation in tensorflow 

You will now implement the forward propagation module in tensorflow. The function will take in a dictionary of parameters and it will complete the forward pass. The functions you will be using are: 

- `tf.add(...,...)` to do an addition
- `tf.matmul(...,...)` to do a matrix multiplication
- `tf.nn.relu(...)` to apply the ReLU activation

**Question:** Implement the forward pass of the neural network. We have added the numpy equivalents as comments so that you can compare the tensorflow implementation to numpy. It is important to note that the forward propagation stops at `z3`. The reason is that in tensorflow the last linear layer output is given as input to the function computing the loss. Therefore, you don't need `a3`!


```python
# GRADED FUNCTION: forward_propagation

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
    
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    
    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']
    
    ### START CODE HERE ### (approx. 5 lines)              # Numpy Equivalents:
    Z1 = tf.add(tf.matmul(W1, X), b1)                      # Z1 = np.dot(W1, X) + b1
    A1 = tf.nn.relu(Z1)                                    # A1 = relu(Z1)
    Z2 = tf.add(tf.matmul(W2, A1), b2)                     # Z2 = np.dot(W2, A1) + b2
    A2 = tf.nn.relu(Z2)                                    # A2 = relu(Z2)
    Z3 = tf.add(tf.matmul(W3, A2), b3)                     # Z3 = np.dot(W3, A2) + b3
    ### END CODE HERE ###
    
    return Z3
```


```python
tf.reset_default_graph()

with tf.Session() as sess:
    X, Y = create_placeholders(12288, 6)
    parameters = initialize_parameters()
    Z3 = forward_propagation(X, parameters)
    print("Z3 = " + str(Z3))
```

    Z3 = Tensor("Add_2:0", shape=(6, ?), dtype=float32)


**Expected Output**:
<table>
    <tr>
        <td>
            **Z3**
        </td>
        <td>
        Tensor("Add_2:0", shape=(6, ?), dtype=float32)
        </td>
    </tr>
</table>
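
For comparison with the tensorflow version, the forward pass can be sketched in plain NumPy as follows. This is a minimal illustration with randomly drawn parameters (the names `forward_propagation_np` and `relu` are ours, not part of the assignment); it only shows how the shapes flow through the three layers.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0)

def forward_propagation_np(X, parameters):
    """NumPy sketch of LINEAR -> RELU -> LINEAR -> RELU -> LINEAR; returns Z3 (no softmax)."""
    Z1 = np.dot(parameters["W1"], X) + parameters["b1"]
    A1 = relu(Z1)
    Z2 = np.dot(parameters["W2"], A1) + parameters["b2"]
    A2 = relu(Z2)
    Z3 = np.dot(parameters["W3"], A2) + parameters["b3"]
    return Z3

rng = np.random.default_rng(0)
layer_sizes = [12288, 25, 12, 6]           # same layer sizes as in initialize_parameters
parameters_np = {}
for l in range(1, len(layer_sizes)):
    parameters_np["W" + str(l)] = rng.normal(0, 0.01, (layer_sizes[l], layer_sizes[l - 1]))
    parameters_np["b" + str(l)] = np.zeros((layer_sizes[l], 1))

X_np = rng.random((12288, 5))              # 5 made-up examples
Z3_np = forward_propagation_np(X_np, parameters_np)
print(Z3_np.shape)  # (6, 5)
```

The bias columns of shape `(n, 1)` broadcast across the example axis, which is why the same code works for any number of examples.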

You may have noticed that the forward propagation doesn't output any cache. You will understand why below, when we get to backpropagation.

### 2.4 Compute cost

As seen before, it is very easy to compute the cost using:
```python
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))
```
**Question**: Implement the cost function below. 
- It is important to know that the "`logits`" and "`labels`" inputs of `tf.nn.softmax_cross_entropy_with_logits` are expected to be of shape (number of examples, num_classes). We have thus transposed Z3 and Y for you.
- Besides, `tf.reduce_mean` averages the per-example losses over the examples.


```python
# GRADED FUNCTION: compute_cost 

def compute_cost(Z3, Y):
    """
    Computes the cost
    
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    
    Returns:
    cost - Tensor of the cost function
    """
    
    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)
    
    ### START CODE HERE ### (1 line of code)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
    ### END CODE HERE ###
    
    return cost
```


```python
tf.reset_default_graph()

with tf.Session() as sess:
    X, Y = create_placeholders(12288, 6)
    parameters = initialize_parameters()
    Z3 = forward_propagation(X, parameters)
    cost = compute_cost(Z3, Y)
    print("cost = " + str(cost))
```

    cost = Tensor("Mean:0", shape=(), dtype=float32)


**Expected Output**:
<table>
    <tr>
        <td>
            **cost**
        </td>
        <td>
        Tensor("Mean:0", shape=(), dtype=float32)
        </td>
    </tr>
</table>
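
To see what the cost tensor represents numerically, here is a NumPy sketch of the same computation: transpose to (number of examples, num_classes), take the softmax cross-entropy per example, then average. The function `softmax_cross_entropy_mean` is a hypothetical stand-in written for this illustration, not TensorFlow's implementation.

```python
import numpy as np

def softmax_cross_entropy_mean(Z3, Y):
    """Mean softmax cross-entropy for logits Z3 and one-hot labels Y, both of shape (classes, m)."""
    logits = Z3.T                                              # (m, classes)
    labels = Y.T                                               # (m, classes)
    shifted = logits - logits.max(axis=1, keepdims=True)       # subtract the row max for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -(labels * log_probs).sum(axis=1)            # one loss value per example
    return per_example.mean()                                  # average over the examples

m = 4
Z3_uniform = np.zeros((6, m))                  # identical logits: every class gets probability 1/6
Y_onehot = np.eye(6)[:, [0, 3, 5, 2]]          # 4 made-up one-hot labels, shape (6, 4)

print(softmax_cross_entropy_mean(Z3_uniform, Y_onehot))
print(np.log(6))                               # the two printed values should agree
```

With uniform predictions over 6 classes the cost equals log(6), roughly 1.79, which is close to the cost the model reports at epoch 0 before it has learned anything.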

### 2.5 - Backward propagation & parameter updates

This is where you become grateful to programming frameworks. All the backpropagation and the parameter updates are taken care of in 1 line of code. It is very easy to incorporate this line in the model.

After you compute the cost function, you will create an "`optimizer`" object. You have to call this object along with the cost when running the tf.session. When called, it will perform an optimization on the given cost with the chosen method and learning rate.

For instance, for gradient descent the optimizer would be:
```python
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)
```

To make the optimization you would do:
```python
_ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
```

This computes the backpropagation by passing through the tensorflow graph in reverse order, from cost to inputs.

**Note** When coding, we often use `_` as a "throwaway" variable to store values that we won't need to use later. Here, `_` takes on the evaluated value of `optimizer`, which we don't need (and `c` takes the value of the `cost` variable). 

### 2.6 - Building the model

Now, you will bring it all together! 

**Exercise:** Implement the model. You will be calling the functions you had previously implemented.


```python
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
          num_epochs = 1500, minibatch_size = 32, print_cost = True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
    
    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
    X_test -- test set, of shape (input size = 12288, number of test examples = 120)
    Y_test -- test labels, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep consistent results
    seed = 3                                          # to keep consistent results
    (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0]                            # n_y : output size
    costs = []                                        # To keep track of the cost
    
    # Create Placeholders of shape (n_x, n_y)
    ### START CODE HERE ### (1 line)
    X, Y = create_placeholders(n_x, n_y)
    ### END CODE HERE ###

    # Initialize parameters
    ### START CODE HERE ### (1 line)
    parameters = initialize_parameters()
    ### END CODE HERE ###
    
    # Forward propagation: Build the forward propagation in the tensorflow graph
    ### START CODE HERE ### (1 line)
    Z3 = forward_propagation(X, parameters)
    ### END CODE HERE ###
    
    # Cost function: Add cost function to tensorflow graph
    ### START CODE HERE ### (1 line)
    cost = compute_cost(Z3, Y)
    ### END CODE HERE ###
    
    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
    ### START CODE HERE ### (1 line)
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
    ### END CODE HERE ###
    
    # Initialize all the variables
    init = tf.global_variables_initializer()

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:
        
        # Run the initialization
        sess.run(init)
        
        # Do the training loop
        for epoch in range(num_epochs):

            epoch_cost = 0.                       # Defines a cost related to an epoch
            num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                
                # IMPORTANT: The line that runs the graph on a minibatch.
                # Run the session to execute the "optimizer" and the "cost"; the feed_dict should contain a minibatch for (X,Y).
                ### START CODE HERE ### (1 line)
                _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
                ### END CODE HERE ###
                
                epoch_cost += minibatch_cost / num_minibatches

            # Print the cost every 100 epochs and record it every 5 epochs
            if print_cost == True and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost == True and epoch % 5 == 0:
                costs.append(epoch_cost)
                
        # plot the cost (one point every 5 epochs)
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs (per fives)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # let's save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")

        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

        # Calculate accuracy on the train and test sets
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
        
        return parameters
```

Run the following cell to train your model! On our machine it takes about 5 minutes. Your "Cost after epoch 100" should be 1.016458. If it's not, don't waste time; interrupt the training by clicking on the square (⬛) in the upper bar of the notebook, and try to correct your code. If it is the correct cost, take a break and come back in 5 minutes!


```python
parameters = model(X_train, Y_train, X_test, Y_test)
```

    Cost after epoch 0: 1.855702
    Cost after epoch 100: 1.016458
    Cost after epoch 200: 0.733102
    Cost after epoch 300: 0.572940
    Cost after epoch 400: 0.468774
    Cost after epoch 500: 0.381021
    Cost after epoch 600: 0.313822
    Cost after epoch 700: 0.254158
    Cost after epoch 800: 0.203829
    Cost after epoch 900: 0.166421
    Cost after epoch 1000: 0.141486
    Cost after epoch 1100: 0.107580
    Cost after epoch 1200: 0.086270
    Cost after epoch 1300: 0.059371
    Cost after epoch 1400: 0.052228



![png](output_62_1.png)


    Parameters have been trained!
    Train Accuracy: 0.999074
    Test Accuracy: 0.716667


**Expected Output**:
<table>
    <tr>
        <td>
            **Train Accuracy**
        </td>
        <td>
        0.999074
        </td>
    </tr>
    <tr>
        <td>
            **Test Accuracy**
        </td>
        <td>
        0.716667
        </td>
    </tr>
</table>
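
The accuracy figures come from comparing the predicted class (the row with the largest logit in each column of `Z3`) with the true class encoded in `Y`. A minimal NumPy sketch of that comparison, using made-up values and a hypothetical helper `accuracy_np`, looks like this:

```python
import numpy as np

def accuracy_np(Z3, Y):
    """Fraction of columns where the arg-max of the logits matches the arg-max of the one-hot label."""
    predicted = np.argmax(Z3, axis=0)      # class index per example (classes are along axis 0)
    actual = np.argmax(Y, axis=0)
    return float(np.mean(predicted == actual))

# 3 classes, 4 examples; the values are invented for illustration
Z3_demo = np.array([[2.0, 0.1, 0.3, 1.5],
                    [0.5, 1.7, 0.2, 0.1],
                    [0.1, 0.2, 2.2, 0.3]])
Y_demo = np.eye(3)[:, [0, 1, 2, 1]]        # true classes: 0, 1, 2, 1

print(accuracy_np(Z3_demo, Y_demo))  # 0.75
```

Note that no softmax is needed here: softmax preserves the ordering of the logits, so the arg-max of `Z3` is already the predicted class.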

Amazing, your algorithm can recognize a sign representing a figure between 0 and 5 with 71.7% accuracy.

**Insights**:
- Your model seems big enough to fit the training set well. However, given the difference between train and test accuracy, you could try to add L2 or dropout regularization to reduce overfitting. 
- Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.

### 2.7 - Test with your own image (optional / ungraded exercise)

Congratulations on finishing this assignment. You can now take a picture of your hand and see the output of your model. To do that:
1. Click on "File" in the upper bar of this notebook, then click "Open" to go to your Coursera Hub.
2. Add your image to this Jupyter Notebook's directory, in the "images" folder
3. Write your image's name in the following code
4. Run the code and check if the algorithm is right!


```python
import scipy
from PIL import Image
from scipy import ndimage

## START CODE HERE ## (PUT YOUR IMAGE NAME) 
my_image = "thumbs_up.jpg"
## END CODE HERE ##

# We preprocess your image to fit your algorithm.
# Note: ndimage.imread and scipy.misc.imresize were removed from recent SciPy releases;
# this cell assumes the older SciPy version of the course environment.
fname = "images/" + my_image
image = np.array(ndimage.imread(fname, flatten=False))
my_image = scipy.misc.imresize(image, size=(64,64)).reshape((1, 64*64*3)).T
my_image_prediction = predict(my_image, parameters)

plt.imshow(image)
print("Your algorithm predicts: y = " + str(np.squeeze(my_image_prediction)))
```

You indeed deserved a "thumbs-up", although as you can see the algorithm seems to classify it incorrectly. The reason is that the training set doesn't contain any "thumbs-up", so the model doesn't know how to deal with it! We call that a "mismatched data distribution" and it is one of the topics of the next course on "Structuring Machine Learning Projects".


**What you should remember**:
- Tensorflow is a programming framework used in deep learning
- The two main object classes in tensorflow are Tensors and Operations. 
- When you code in tensorflow you have to take the following steps:
    - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
    - Create a session
    - Initialize the session
    - Run the session to execute the graph
- You can execute the graph multiple times as you've seen in model()
- The backpropagation and optimization is automatically done when running the session on the "optimizer" object.
--------------------------------------------------------------------------------