├── .gitattributes
├── Week 1 PA 1
    ├── Initialization.zip
    └── README.md
├── Week 1 PA 2
    ├── Regularization.zip
    ├── Regularization
    │   ├── output_3_0.png
    │   ├── output_9_1.png
    │   ├── output_11_0.png
    │   ├── output_22_1.png
    │   ├── output_24_0.png
    │   ├── output_35_3.png
    │   ├── output_37_0.png
    │   └── Regularization.md
    └── README.md
├── Week 2 PA 1
    ├── Optimization+methods.zip
    └── README.md
├── Week 3 PA 1
    ├── Tensorflow+Tutorial.zip
    ├── Tensorflow+Tutorial
    │   ├── output_35_1.png
    │   ├── output_62_1.png
    │   └── Tensorflow Tutorial.md
    ├── README.md
    └── Tensorflow+Tutorial.py
├── Lecture Slides
    ├── Week 2 Optimization algorithms.pdf
    ├── Week 1 Practical aspects of Deep Learning.pdf
    ├── Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf
    └── README.md
├── LICENSE
├── Week 1 PA 3
    ├── README.md
    ├── Gradient+Checking.md
    └── Gradient+Checking.ipynb
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
1 | * linguist-vendored
2 | *.py linguist-vendored=false
3 | 


--------------------------------------------------------------------------------
/Week 1 PA 1/Initialization.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 1/Initialization.zip


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization.zip


--------------------------------------------------------------------------------
/Week 2 PA 1/Optimization+methods.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 2 PA 1/Optimization+methods.zip


--------------------------------------------------------------------------------
/Week 3 PA 1/Tensorflow+Tutorial.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial.zip


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_3_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_3_0.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_9_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_9_1.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_11_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_11_0.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_22_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_22_1.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_24_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_24_0.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_35_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_35_3.png


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/output_37_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 1 PA 2/Regularization/output_37_0.png


--------------------------------------------------------------------------------
/Week 3 PA 1/Tensorflow+Tutorial/output_35_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial/output_35_1.png


--------------------------------------------------------------------------------
/Week 3 PA 1/Tensorflow+Tutorial/output_62_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Week 3 PA 1/Tensorflow+Tutorial/output_62_1.png


--------------------------------------------------------------------------------
/Lecture Slides/Week 2 Optimization algorithms.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 2 Optimization algorithms.pdf


--------------------------------------------------------------------------------
/Lecture Slides/Week 1 Practical aspects of Deep Learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 1 Practical aspects of Deep Learning.pdf


--------------------------------------------------------------------------------
/Lecture Slides/Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/HEAD/Lecture Slides/Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks.pdf


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2017 Sheng-Qi Shen
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Week 1 PA 1/README.md:
--------------------------------------------------------------------------------
 1 | # Initialization
 2 | 
 3 | # Goal
 4 | - See how different initializations lead to different results.
 5 | - Understand that different regularization methods could help your model.
 6 | 
 7 | # File Description
 8 | - `.ipynb` file is the solution to Week 1 programming assignment 1
 9 |   - Initialization.ipynb
10 | - `.html` file is the HTML version of the `.ipynb` file.
11 |   - Initialization.html
12 | - `.md` file
13 |   - Initialization.md
14 |   
15 | # Snapshot
16 | - Computer view: open the `.html` file in a browser for a quick look.
17 | - **Recommended** browser view: Initialization.md
18 | 
19 | 
20 | # Implementation
21 | - Zeros initialization -- setting initialization = "zeros" in the input argument.
22 | - Random initialization -- setting initialization = "random" in the input argument. This initializes the weights to large random values.
23 | - He initialization -- setting initialization = "he" in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015.
24 | 
25 | # What you should remember from this notebook:
26 | - Different initializations lead to different results
27 | - Random initialization is used to break symmetry and make sure different hidden units can learn different things
28 | - Don't initialize to values that are too large
29 | - He initialization works well for networks with ReLU activations.
30 | 
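As a rough illustration of the He initialization bullet above, here is a minimal NumPy sketch. The function name `initialize_parameters_he`, the `layer_dims` argument, and the `seed` parameter are assumptions made for this example and are not taken from the notebook. Each weight matrix is drawn from a standard normal distribution and scaled by `sqrt(2 / fan_in)`, the scaling proposed by He et al., 2015.

```python
import numpy as np

def initialize_parameters_he(layer_dims, seed=3):
    """Hypothetical helper: He initialization for a fully connected network.

    layer_dims -- list of layer sizes, e.g. [n_x, n_h, n_y]
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # Random weights break symmetry; the sqrt(2 / fan_in) factor keeps them from being too large.
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        # Biases can safely start at zero.
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters_he([4, 3, 1])
print(params["W1"].shape, params["b1"].shape)
```

Replacing the scale factor with a large constant would correspond to the "large random values" case that the list above warns against.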


--------------------------------------------------------------------------------
/Week 1 PA 3/README.md:
--------------------------------------------------------------------------------
 1 | # Gradient Checking
 2 | 
 3 | # Goal
 4 | - Implement gradient checking from scratch.
 5 | 
 6 | - Understand how to use the difference formula to check your backpropagation implementation.
 7 | 
 8 | - Recognize that your backpropagation algorithm should give you similar results as the ones you got by computing the difference formula.
 9 | 
10 | - Learn how to identify which parameter's gradient was computed incorrectly.
11 | 
12 | # File Description
13 | - `.ipynb` file is the solution to Week 1 programming assignment 3
14 |   - Gradient+Checking.ipynb
15 | - `.html` file is the HTML version of the `.ipynb` file.
16 |   - Gradient+Checking.html
17 | - `.md` file
18 |   - Gradient+Checking.md
19 |   
20 | # Snapshot
21 | - Computer view: open the `.html` file in a browser for a quick look.
22 | - **Recommended** browser view: Gradient+Checking.md
23 | 
24 | 
25 | # Implementation
26 | - 1-dimensional gradient checking
27 | - N-dimensional gradient checking
28 | 
29 | # What you should remember from this notebook:
30 | - Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
31 | - Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
32 | 
33 | 


--------------------------------------------------------------------------------
/Lecture Slides/README.md:
--------------------------------------------------------------------------------
 1 | # Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
 2 | 
 3 | Week 1 Practical aspects of Deep Learning
 4 |   - Setting up your Machine Learning Application
 5 |     - Train / Dev / Test sets
 6 |     - Bias / Variance
 7 |     - Basic Recipe for Machine Learning
 8 |   - Regularizing your neural network    
 9 |     - Regularization
10 |     - Why regularization reduces overfitting?
11 |     - Dropout Regularization
12 |     - Understanding Dropout
13 |     - Other regularization methods
14 |   - Setting up your optimization problem    
15 |     - Normalizing inputs
16 |     - Vanishing / Exploding gradients
17 |     - Weight Initialization for Deep Networks
18 |     - Numerical approximation of gradients
19 |     - Gradient checking
20 |     - Gradient Checking Implementation Notes
21 | 
22 | Week 2 Optimization algorithms
23 |   - Optimization algorithms
24 |     - Mini-batch gradient descent
25 |     - Understanding mini-batch gradient descent
26 |     - Exponentially weighted averages
27 |     - Understanding exponentially weighted averages
28 |     - Bias correction in exponentially weighted averages
29 |     - Gradient descent with momentum
30 |     - RMSprop
31 |     - Adam optimization algorithm
32 |     - Learning rate decay
33 |     - The problem of local optima
34 |   - Heroes of Deep Learning (Optional)
35 |     - Yuanqing Lin interview  
36 |     
37 | Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks
38 |   - Hyperparameter tuning
39 |     - Tuning process
40 |     - Using an appropriate scale to pick hyperparameters
41 |     - Hyperparameters tuning in practice: Pandas vs. Caviar
42 |   - Batch Normalization
43 |     - Normalizing activations in a network
44 |     - Fitting Batch Norm into a neural network
45 |     - Why does Batch Norm work?
46 |     - Batch Norm at test time
47 |   - Multi-class classification
48 |     - Softmax Regression
49 |     - Training a softmax classifier
50 |   - Introduction to programming frameworks
51 |     - Deep learning frameworks
52 |     - TensorFlow
53 | 


--------------------------------------------------------------------------------
/Week 2 PA 1/README.md:
--------------------------------------------------------------------------------
 1 | # Optimization
 2 | 
 3 | # Goal
 4 | - Understand the intuition behind Adam and RMSprop
 5 | 
 6 | - Recognize the importance of mini-batch gradient descent
 7 | 
 8 | - Learn the effects of momentum on the overall performance of your model
 9 | 
10 | # File Description
11 | - `.ipynb` file is the solution to Week 2 programming assignment 1
12 |   - Optimization+methods.ipynb
13 | - `.html` file is the HTML version of the `.ipynb` file.
14 |   - Optimization+methods.html
15 | - `.py` file
16 |   - Optimization+methods.py
17 | - `.md` file
18 |   - Optimization+methods.md
19 |   
20 | # Snapshot
21 | - Computer view: open the `.html` file in a browser for a quick look.
22 | - **Recommended** browser view: Optimization+methods.md
23 | 
24 | 
25 | # Implementation
26 | - Gradient Descent
27 | - Mini-Batch Gradient descent
28 | - Momentum
29 | - Adam
30 | - Model with different optimization algorithms
31 |   - Mini-batch Gradient descent
32 |   - Mini-batch gradient descent with momentum
33 |   - Mini-batch with Adam mode
34 | 
35 | # What you should remember from this notebook:
36 | ## Gradient Descent
37 | - The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
38 | - You have to tune a learning rate hyperparameter α.
39 | - With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms both gradient descent and stochastic gradient descent (particularly when the training set is large).
40 | ## Mini-Batch Gradient descent
41 | - Shuffling and Partitioning are the two steps required to build mini-batches
42 | - Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
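The two steps named above, shuffling and partitioning, can be sketched roughly as follows. This is an illustrative version, not the notebook's graded function: the name `random_mini_batches`, its arguments, and the column-per-example layout of `X` and `Y` are assumptions for this example.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Hypothetical helper: split (X, Y) into shuffled mini-batches.

    X -- inputs of shape (n_x, m), one example per column (assumed layout)
    Y -- labels of shape (n_y, m)
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)

    # Step 1: shuffle X and Y with the same permutation so pairs stay aligned.
    permutation = rng.permutation(m)
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    # Step 2: partition into consecutive slices; the last one may be smaller.
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

X = np.arange(20).reshape(2, 10)
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([bx.shape[1] for bx, _ in batches])
```

With 10 examples and a mini-batch size of 4, the final mini-batch holds the 2 leftover examples.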
43 | ## Momentum
44 | - Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
45 | - You have to tune a momentum hyperparameter β and a learning rate α.
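To make the role of the two hyperparameters concrete, here is a minimal sketch of a single momentum update. The function name `momentum_step` and its default values are assumptions for this example. The velocity `v` is an exponentially weighted average of past gradients, and the parameters move along `v` instead of along the raw gradient.

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.01, beta=0.9):
    """Hypothetical helper: one gradient descent step with momentum."""
    # Exponentially weighted average of past gradients (the "velocity").
    v = beta * v + (1.0 - beta) * grad
    # Move the parameters along the smoothed direction.
    theta = theta - alpha * v
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
grad = np.array([0.5, 0.5])
theta, v = momentum_step(theta, grad, v, alpha=0.1, beta=0.9)
print(theta, v)
```

Setting `beta=0` reduces this to plain gradient descent, since `v` then equals the current gradient.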
46 | # Conclusion
47 | | Optimization method | Accuracy | Cost shape |
48 | |--|--|--|
49 | | Gradient descent | 79.7% | oscillations |
50 | | Momentum | 79.7% | oscillations |
51 | | Adam | 94% | smoother |
52 | 
53 | # Reference
54 | Adam paper: https://arxiv.org/pdf/1412.6980.pdf
55 | 
56 | 


--------------------------------------------------------------------------------
/Week 1 PA 2/README.md:
--------------------------------------------------------------------------------
 1 | # Regularization
 2 | 
 3 | # Goal
 4 | - Understand that different regularization methods could help your model.
 5 | 
 6 | - Implement dropout and see it work on data.
 7 | 
 8 | - Recognize that a model without regularization gives you a better accuracy on the training set but not necessarily on the test set.
 9 | 
10 | - Understand that you could use both dropout and regularization on your model.
11 | 
12 | # File Description
13 | - `.ipynb` file is the solution to Week 1 programming assignment 2
14 |   - Regularization.ipynb
15 | - `.html` file is the HTML version of the `.ipynb` file.
16 |   - Regularization.html
17 | - `.md` file
18 |   - Regularization.md
19 |   
20 | # Snapshot
21 | - Computer view: open the `.html` file in a browser for a quick look.
22 | - **Recommended** browser view: Regularization.md
23 | 
24 | 
25 | # Implementation
26 | - Non-regularized model
27 | - L2 Regularization
28 | - Dropout
29 |   - Forward propagation with dropout
30 |   - Backward propagation with dropout
31 |   
32 | # Conclusion
33 | ## What you should remember -- the implications of L2-regularization on:
34 | - The cost computation:
35 |   - A regularization term is added to the cost
36 | - The backpropagation function:
37 |   - There are extra terms in the gradients with respect to weight matrices
38 | - Weights end up smaller ("weight decay"):
39 |   - Weights are pushed to smaller values.
40 |   
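The three implications above can be sketched in a few lines. This is an illustrative example, not the notebook's code: the names `l2_cost_term` and `l2_gradient_term` are assumptions, and the formulas used are the common convention of adding `lambd / (2 * m)` times the sum of squared weights to the cost, which contributes `(lambd / m) * W` to each weight gradient.

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """Hypothetical helper: the extra term L2 regularization adds to the cost."""
    return (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient_term(W, lambd, m):
    """Hypothetical helper: the extra term added to dW during backpropagation."""
    return (lambd / m) * W

W1 = np.ones((2, 2))
m = 5          # number of training examples (assumed)
lambd = 0.7    # regularization strength (assumed)
print(l2_cost_term([W1], lambd, m))
print(l2_gradient_term(W1, lambd, m))
```

Because the extra gradient term is proportional to `W` itself, every update shrinks the weights slightly, which is where the name "weight decay" comes from.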
41 | ## What you should remember about dropout:
42 | - Dropout is a regularization technique.
43 | - You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
44 | - Apply dropout both during forward and backward propagation.
45 | - During training time, divide each dropout layer's activations by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5.
46 |   
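The scaling argument in the last bullet can be checked numerically. Below is a minimal sketch of a dropout forward step with the division by `keep_prob` (often called inverted dropout). The function name `dropout_forward` and the use of a NumPy `Generator` are assumptions for this example.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Hypothetical helper: apply dropout to activations A during training."""
    # Keep each unit with probability keep_prob.
    mask = rng.random(A.shape) < keep_prob
    # Zero out dropped units, then rescale so the expected value is unchanged.
    A_dropped = A * mask / keep_prob
    # The mask is returned so the backward pass can drop the same units.
    return A_dropped, mask

rng = np.random.default_rng(0)
A = np.ones((1000, 1000))
A_dropped, mask = dropout_forward(A, keep_prob=0.5, rng=rng)
print(A.mean(), A_dropped.mean())  # the two means should be close
```

At test time this function is simply not called, which matches the rule that dropout is only used during training.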
47 | ## What we want you to remember from this notebook:
48 | - Regularization will help you reduce overfitting.
49 | - Regularization will drive your weights to lower values.
50 | - L2 regularization and Dropout are two very effective regularization techniques.  
51 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
 2 | 
 3 | The course can be found on [Coursera](https://www.coursera.org/learn/deep-neural-network)
 4 | 
 5 | Quizzes and answers are collected for quick searching on my blog [SSQ](https://ssq.github.io/2017/08/28/Coursera%20Ng%20Deep%20Learning%20Specialization%20Notebook/#Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization)
 6 | 
 7 | - Week 1 Practical aspects of Deep Learning
 8 |   - Recall that different types of initializations lead to different results
 9 |   - Recognize the importance of initialization in complex neural networks.
10 |   - Recognize the difference between train/dev/test sets
11 |   - Diagnose the bias and variance issues in your model
12 |   - Learn when and how to use regularization methods such as dropout or L2 regularization.
13 |   - Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
14 |   - Use gradient checking to verify the correctness of your backpropagation implementation
15 |   - [x] [Initialization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%201)
16 |   - [x] [Regularization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%202)
17 |   - [x] [Gradient Checking](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%201%20PA%203)
18 | - Week 2 Optimization algorithms
19 |   - Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
20 |   - Use random minibatches to accelerate the convergence and improve the optimization
21 |   - Know the benefits of learning rate decay and apply it to your optimization
22 |   - [x] [Optimization](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%202%20PA%201)
23 | - Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks
24 |   - Master the process of hyperparameter tuning
25 |   - Master the process of Batch Normalization
26 |   - [x] [Tensorflow](https://github.com/SSQ/Coursera-Ng-Improving-Deep-Neural-Networks-Hyperparameter-tuning-Regularization-and-Optimization/tree/master/Week%203%20PA%201)
27 | 


--------------------------------------------------------------------------------
/Week 3 PA 1/README.md:
--------------------------------------------------------------------------------
 1 | # Tensorflow
 2 | 
 3 | # Goal
 4 | - Initialize variables
 5 | - Start your own session
 6 | - Train algorithms
 7 | - Implement a Neural Network
 8 | # File Description
 9 | - `.ipynb` file is the solution to Week 3 programming assignment 1
10 |   - Tensorflow+Tutorial.ipynb
11 | - `.html` file is the HTML version of the `.ipynb` file.
12 |   - Tensorflow+Tutorial.html
13 | - `.py` file
14 |   - Tensorflow+Tutorial.py
15 | - `.md` file
16 |   - Tensorflow+Tutorial.md
17 |   
18 | # Snapshot
19 | - Computer view: open the `.html` file in a browser for a quick look.
20 | - **Recommended** browser view: Tensorflow+Tutorial.md
21 | 
22 | 
23 | # Implementation
24 | - Exploring the Tensorflow Library
25 |   - Linear function
26 |   - Computing the sigmoid
27 |   - Computing the Cost
28 |   - Using One Hot encodings
29 |   - Initialize with zeros and ones
30 | - Building your first neural network in tensorflow
31 |   - Problem statement: SIGNS Dataset
32 |   - Create placeholders
33 |   - Initializing the parameters
34 |   - Forward propagation in tensorflow
35 |   - Compute cost
36 |   - Backward propagation & parameter updates
37 |   - Building the model
38 | 
39 | 
40 | # What you should remember from this notebook:
41 | ## Writing and running programs in TensorFlow has the following steps:
42 | 1. Create Tensors (variables) that are not yet executed/evaluated.
43 | 1. Write operations between those Tensors.
44 | 1. Initialize your Tensors.
45 | 1. Create a Session.
46 | 1. Run the Session. This will run the operations you'd written above.
47 | ## To summarize, you now know how to:
48 | 1. Create placeholders
49 | 1. Specify the computation graph corresponding to operations you want to compute
50 | 1. Create the session
51 | 1. Run the session, using a feed dictionary if necessary to specify placeholder variables' values.
52 | ## What you should remember from this notebook:
53 | - Tensorflow is a programming framework used in deep learning
54 | - The two main object classes in tensorflow are Tensors and Operators.
55 | - When you code in tensorflow you have to take the following steps:
56 |   - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
57 |   - Create a session
58 |   - Initialize the session
59 |   - Run the session to execute the graph
60 | - You can execute the graph multiple times as you've seen in model()
61 | - The backpropagation and optimization are done automatically when running the session on the "optimizer" object.
62 | 
63 | 


--------------------------------------------------------------------------------
/Week 1 PA 3/Gradient+Checking.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # Gradient Checking
  3 | 
  4 | Welcome to the final assignment for this week! In this assignment you will learn to implement and use gradient checking. 
  5 | 
  6 | You are part of a team working to make mobile payments available globally, and are asked to build a deep learning model to detect fraud--whenever someone makes a payment, you want to see if the payment might be fraudulent, such as if the user's account has been taken over by a hacker. 
  7 | 
  8 | But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a mission-critical application, your company's CEO wants to be really certain that your implementation of backpropagation is correct. Your CEO says, "Give me a proof that your backpropagation is actually working!" To give this reassurance, you are going to use "gradient checking".
  9 | 
 10 | Let's do it!
 11 | 
 12 | 
 13 | ```python
 14 | # Packages
 15 | import numpy as np
 16 | from testCases import *
 17 | from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector
 18 | ```
 19 | 
 20 | ## 1) How does gradient checking work?
 21 | 
 22 | Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.
 23 | 
 24 | Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\frac{\partial J}{\partial \theta}$. 
 25 | 
 26 | Let's look back at the definition of a derivative (or gradient):
 27 | $$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$
 28 | 
 29 | If you're not familiar with the "$\displaystyle \lim_{\varepsilon \to 0}$" notation, it's just a way of saying "when $\varepsilon$ is really really small."
 30 | 
 31 | We know the following:
 32 | 
 33 | - $\frac{\partial J}{\partial \theta}$ is what you want to make sure you're computing correctly. 
 34 | - You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you're confident your implementation for $J$ is correct. 
 35 | 
 36 | Let's use equation (1) and a small value for $\varepsilon$ to convince your CEO that your code for computing $\frac{\partial J}{\partial \theta}$ is correct!
 37 | 
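As a quick illustration of equation (1), the sketch below approximates a derivative with the two-sided difference and a small, fixed $\varepsilon$. The helper name `numerical_gradient` is an assumption for this example; it accepts any scalar function `J` of a single real parameter.

```python
def numerical_gradient(J, theta, epsilon=1e-7):
    """Hypothetical helper: two-sided difference approximation of dJ/dtheta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2.0 * epsilon)

x = 2.0
# For J(theta) = theta * x the exact derivative with respect to theta is x.
approx = numerical_gradient(lambda theta: theta * x, theta=4.0)
print(approx)  # should be very close to x
```

The approximation only needs evaluations of `J`, which is why a trusted forward pass is enough to check a gradient computed some other way.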
 38 | ## 2) 1-dimensional gradient checking
 39 | 
 40 | Consider a 1D linear function $J(\theta) = \theta x$. The model contains only a single real-valued parameter $\theta$, and takes $x$ as input.
 41 | 
 42 | You will implement code to compute $J(.)$ and its derivative $\frac{\partial J}{\partial \theta}$. You will then use gradient checking to make sure your derivative computation for $J$ is correct. 
 43 | 
 44 | <img src="images/1Dgrad_kiank.png" style="width:600px;height:250px;">
 45 | <caption><center> <u> **Figure 1** </u>: **1D linear model**<br> </center></caption>
 46 | 
 47 | The diagram above shows the key computation steps: First start with $x$, then evaluate the function $J(x)$ ("forward propagation"). Then compute the derivative $\frac{\partial J}{\partial \theta}$ ("backward propagation"). 
 48 | 
 49 | **Exercise**: implement "forward propagation" and "backward propagation" for this simple function. I.e., compute both $J(.)$ ("forward propagation") and its derivative with respect to $\theta$ ("backward propagation"), in two separate functions. 
 50 | 
 51 | 
 52 | ```python
 53 | # GRADED FUNCTION: forward_propagation
 54 | 
 55 | def forward_propagation(x, theta):
 56 |     """
 57 |     Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)
 58 |     
 59 |     Arguments:
 60 |     x -- a real-valued input
 61 |     theta -- our parameter, a real number as well
 62 |     
 63 |     Returns:
 64 |     J -- the value of function J, computed using the formula J(theta) = theta * x
 65 |     """
 66 |     
 67 |     ### START CODE HERE ### (approx. 1 line)
 68 |     J = theta * x
 69 |     ### END CODE HERE ###
 70 |     
 71 |     return J
 72 | ```
 73 | 
 74 | 
 75 | ```python
 76 | x, theta = 2, 4
 77 | J = forward_propagation(x, theta)
 78 | print ("J = " + str(J))
 79 | ```
 80 | 
 81 |     J = 8
 82 | 
 83 | 
 84 | **Expected Output**:
 85 | 
 86 | <table>
 87 |     <tr>
 88 |         <td>  ** J **  </td>
 89 |         <td> 8</td>
 90 |     </tr>
 91 | </table>
 92 | 
 93 | **Exercise**: Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of $J(\theta) = \theta x$ with respect to $\theta$. To save you from doing the calculus, you should get $dtheta = \frac { \partial J }{ \partial \theta} = x$.
 94 | 
 95 | 
 96 | ```python
 97 | # GRADED FUNCTION: backward_propagation
 98 | 
 99 | def backward_propagation(x, theta):
100 |     """
101 |     Computes the derivative of J with respect to theta (see Figure 1).
102 |     
103 |     Arguments:
104 |     x -- a real-valued input
105 |     theta -- our parameter, a real number as well
106 |     
107 |     Returns:
108 |     dtheta -- the gradient of the cost with respect to theta
109 |     """
110 |     
111 |     ### START CODE HERE ### (approx. 1 line)
112 |     dtheta = x
113 |     ### END CODE HERE ###
114 |     
115 |     return dtheta
116 | ```
117 | 
118 | 
119 | ```python
120 | x, theta = 2, 4
121 | dtheta = backward_propagation(x, theta)
122 | print ("dtheta = " + str(dtheta))
123 | ```
124 | 
125 |     dtheta = 2
126 | 
127 | 
128 | **Expected Output**:
129 | 
130 | <table>
131 |     <tr>
132 |         <td>  ** dtheta **  </td>
133 |         <td> 2 </td>
134 |     </tr>
135 | </table>
136 | 
137 | **Exercise**: To show that the `backward_propagation()` function is correctly computing the gradient $\frac{\partial J}{\partial \theta}$, let's implement gradient checking.
138 | 
139 | **Instructions**:
140 | - First compute "gradapprox" using the formula above (1) and a small value of $\varepsilon$. Here are the Steps to follow:
141 |     1. $\theta^{+} = \theta + \varepsilon$
142 |     2. $\theta^{-} = \theta - \varepsilon$
143 |     3. $J^{+} = J(\theta^{+})$
144 |     4. $J^{-} = J(\theta^{-})$
145 |     5. $gradapprox = \frac{J^{+} - J^{-}}{2  \varepsilon}$
146 | - Then compute the gradient using backward propagation, and store the result in a variable "grad"
147 | - Finally, compute the relative difference between "gradapprox" and the "grad" using the following formula:
148 | $$ difference = \frac {\mid\mid grad - gradapprox \mid\mid_2}{\mid\mid grad \mid\mid_2 + \mid\mid gradapprox \mid\mid_2} \tag{2}$$
149 | You will need 3 Steps to compute this formula:
150 |    - 1'. compute the numerator using np.linalg.norm(...)
151 |    - 2'. compute the denominator. You will need to call np.linalg.norm(...) twice.
152 |    - 3'. divide them.
153 | - If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation. 
154 | 
155 | 
156 | 
157 | ```python
158 | # GRADED FUNCTION: gradient_check
159 | 
160 | def gradient_check(x, theta, epsilon = 1e-7):
161 |     """
162 |     Implement gradient checking for the 1D model presented in Figure 1.
163 |     
164 |     Arguments:
165 |     x -- a real-valued input
166 |     theta -- our parameter, a real number as well
167 |     epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
168 |     
169 |     Returns:
170 |     difference -- difference (2) between the approximated gradient and the backward propagation gradient
171 |     """
172 |     
173 |     # Compute gradapprox using the right-hand side of formula (1). epsilon is small enough that you don't need to worry about the limit.
174 |     ### START CODE HERE ### (approx. 5 lines)
175 |     thetaplus = theta + epsilon                               # Step 1
176 |     thetaminus = theta - epsilon                              # Step 2
177 |     J_plus = forward_propagation(x, thetaplus)                # Step 3
178 |     J_minus = forward_propagation(x, thetaminus)              # Step 4
179 |     gradapprox = (J_plus - J_minus) / (2. * epsilon)          # Step 5
180 |     ### END CODE HERE ###
181 |     
182 |     # Check if gradapprox is close enough to the output of backward_propagation()
183 |     ### START CODE HERE ### (approx. 1 line)
184 |     grad = backward_propagation(x, theta)
185 |     ### END CODE HERE ###
186 |     
187 |     ### START CODE HERE ### (approx. 3 lines)
188 |     numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
189 |     denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)             # Step 2'
190 |     difference = numerator / denominator                                        # Step 3'
191 |     ### END CODE HERE ###
192 |     
193 |     if difference < 1e-7:
194 |         print ("The gradient is correct!")
195 |     else:
196 |         print ("The gradient is wrong!")
197 |     
198 |     return difference
199 | ```
200 | 
201 | 
202 | ```python
203 | x, theta = 2, 4
204 | difference = gradient_check(x, theta)
205 | print("difference = " + str(difference))
206 | ```
207 | 
208 |     The gradient is correct!
209 |     difference = 2.91933588329e-10
210 | 
211 | 
212 | **Expected Output**:
213 | The gradient is correct!
214 | <table>
215 |     <tr>
216 |         <td>  ** difference **  </td>
217 |         <td> 2.9193358103083e-10 </td>
218 |     </tr>
219 | </table>
220 | 
221 | Congrats, the difference is smaller than the $10^{-7}$ threshold. So you can have high confidence that you've correctly computed the gradient in `backward_propagation()`. 
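
For the linear model $J(\theta) = \theta x$, the centered difference is algebraically exact: $\frac{(\theta + \varepsilon)x - (\theta - \varepsilon)x}{2\varepsilon} = x$. The small nonzero difference reported above therefore comes from floating-point rounding rather than from the approximation itself. The sketch below repeats the check for several values of $\varepsilon$ so you can see this. The helper names `centered_difference` and `relative_difference` are illustrative and are not part of the assignment.

```python
import numpy as np

def centered_difference(J, theta, epsilon):
    """Two-sided numerical derivative of a scalar function J at theta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2.0 * epsilon)

def relative_difference(grad, gradapprox):
    """Formula (2): normalized distance between the two gradients."""
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

x, theta = 2.0, 4.0
J = lambda t: t * x      # same 1D model as forward_propagation
grad = x                 # analytic derivative dJ/dtheta = x

differences = {}
for epsilon in (1e-1, 1e-3, 1e-5, 1e-7):
    gradapprox = centered_difference(J, theta, epsilon)
    differences[epsilon] = relative_difference(grad, gradapprox)
    print("epsilon = %g, difference = %g" % (epsilon, differences[epsilon]))
```

The rounding error in $J^{+} - J^{-}$ is divided by $2\varepsilon$, so a smaller $\varepsilon$ tends to leave a larger residual. That is one reason not to make $\varepsilon$ arbitrarily small.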
222 | 
223 | Now, in the more general case, your cost function $J$ has more than a single 1D input. When you are training a neural network, $\theta$ actually consists of multiple matrices $W^{[l]}$ and biases $b^{[l]}$! It is important to know how to do a gradient check with higher-dimensional inputs. Let's do it!
224 | 
225 | ## 3) N-dimensional gradient checking
226 | 
227 | The following figure describes the forward and backward propagation of your fraud detection model.
228 | 
229 | <img src="images/NDgrad_kiank.png" style="width:600px;height:400px;">
230 | <caption><center> <u> **Figure 2** </u>: **deep neural network**<br>*LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID*</center></caption>
231 | 
232 | Let's look at your implementations for forward propagation and backward propagation. 
233 | 
234 | 
235 | ```python
236 | def forward_propagation_n(X, Y, parameters):
237 |     """
238 |     Implements the forward propagation (and computes the cost) presented in Figure 2.
239 |     
240 |     Arguments:
241 |     X -- training set for m examples
242 |     Y -- labels for m examples 
243 |     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
244 |                     W1 -- weight matrix of shape (5, 4)
245 |                     b1 -- bias vector of shape (5, 1)
246 |                     W2 -- weight matrix of shape (3, 5)
247 |                     b2 -- bias vector of shape (3, 1)
248 |                     W3 -- weight matrix of shape (1, 3)
249 |                     b3 -- bias vector of shape (1, 1)
250 |     
251 |     Returns:
252 |     cost -- the logistic cost averaged over the m examples (a cache of intermediate values is returned as well)
253 |     """
254 |     
255 |     # retrieve parameters
256 |     m = X.shape[1]
257 |     W1 = parameters["W1"]
258 |     b1 = parameters["b1"]
259 |     W2 = parameters["W2"]
260 |     b2 = parameters["b2"]
261 |     W3 = parameters["W3"]
262 |     b3 = parameters["b3"]
263 | 
264 |     # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
265 |     Z1 = np.dot(W1, X) + b1
266 |     A1 = relu(Z1)
267 |     Z2 = np.dot(W2, A1) + b2
268 |     A2 = relu(Z2)
269 |     Z3 = np.dot(W3, A2) + b3
270 |     A3 = sigmoid(Z3)
271 | 
272 |     # Cost
273 |     logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)
274 |     cost = 1./m * np.sum(logprobs)
275 |     
276 |     cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
277 |     
278 |     return cost, cache
279 | ```
280 | 
281 | Now, run backward propagation.
282 | 
283 | 
284 | ```python
285 | def backward_propagation_n(X, Y, cache):
286 |     """
287 |     Implement the backward propagation presented in Figure 2.
288 |     
289 |     Arguments:
290 |     X -- input data, of shape (input size, m)
291 |     Y -- true labels for the m examples
292 |     cache -- cache output from forward_propagation_n()
293 |     
294 |     Returns:
295 |     gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
296 |     """
297 |     
298 |     m = X.shape[1]
299 |     (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
300 |     
301 |     dZ3 = A3 - Y
302 |     dW3 = 1./m * np.dot(dZ3, A2.T)
303 |     db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
304 |     
305 |     dA2 = np.dot(W3.T, dZ3)
306 |     dZ2 = np.multiply(dA2, np.int64(A2 > 0))
307 |     # dW2 = 1./m * np.dot(dZ2, A1.T) * 2
308 |     dW2 = 1./m * np.dot(dZ2, A1.T) 
309 |     db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
310 |     
311 |     dA1 = np.dot(W2.T, dZ2)
312 |     dZ1 = np.multiply(dA1, np.int64(A1 > 0))
313 |     dW1 = 1./m * np.dot(dZ1, X.T)
314 |     # db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)
315 |     db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
316 |     
317 |     gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
318 |                  "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
319 |                  "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
320 |     
321 |     return gradients
322 | ```
323 | 
324 | You obtained some results on the fraud detection test set but you are not 100% sure of your model. Nobody's perfect! Let's implement gradient checking to verify if your gradients are correct.
325 | 
326 | **How does gradient checking work?**
327 | 
328 | As in 1) and 2), you want to compare "gradapprox" to the gradient computed by backpropagation. The formula is still:
329 | 
330 | $$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$
331 | 
332 | However, $\theta$ is not a scalar anymore. It is a dictionary called "parameters". We implemented a function "`dictionary_to_vector()`" for you. It converts the "parameters" dictionary into a vector called "values", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them.
333 | 
334 | The inverse function is "`vector_to_dictionary`" which outputs back the "parameters" dictionary.
335 | 
336 | <img src="images/dictionary_to_vector.png" style="width:600px;height:400px;">
337 | <caption><center> <u> **Figure 3** </u>: **dictionary_to_vector() and vector_to_dictionary()**<br> You will need these functions in gradient_check_n()</center></caption>
338 | 
339 | We have also converted the "gradients" dictionary into a vector "grad" using gradients_to_vector(). You don't need to worry about that.
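
The helpers in `gc_utils` are not shown in this notebook, so the following is only a sketch of how such a flatten/unflatten pair could work. It assumes the parameter shapes listed in the `forward_propagation_n` docstring. The names `PARAMETER_SHAPES`, `flatten_parameters` and `unflatten_parameters` are hypothetical, and the real `dictionary_to_vector()` / `vector_to_dictionary()` may differ in their details.

```python
import numpy as np

# Assumed order and shapes, taken from the forward_propagation_n docstring.
PARAMETER_SHAPES = [("W1", (5, 4)), ("b1", (5, 1)),
                    ("W2", (3, 5)), ("b2", (3, 1)),
                    ("W3", (1, 3)), ("b3", (1, 1))]

def flatten_parameters(parameters):
    """Reshape every parameter into a column and stack the columns."""
    columns = [parameters[name].reshape(-1, 1) for name, _ in PARAMETER_SHAPES]
    return np.concatenate(columns, axis=0)

def unflatten_parameters(theta):
    """Inverse of flatten_parameters: cut the column back into matrices."""
    parameters = {}
    start = 0
    for name, shape in PARAMETER_SHAPES:
        size = int(np.prod(shape))
        parameters[name] = theta[start:start + size].reshape(shape)
        start += size
    return parameters

# Round trip on random parameters.
rng = np.random.default_rng(0)
example = {name: rng.standard_normal(shape) for name, shape in PARAMETER_SHAPES}
theta = flatten_parameters(example)
restored = unflatten_parameters(theta)
print(theta.shape)
```

With these shapes the flattened vector has 20 + 5 + 15 + 3 + 3 + 1 = 47 entries, which is the number of iterations the gradient check loop has to run.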
340 | 
341 | **Exercise**: Implement gradient_check_n().
342 | 
343 | **Instructions**: Here is pseudo-code that will help you implement the gradient check.
344 | 
345 | For each i in num_parameters:
346 | - To compute `J_plus[i]`:
347 |     1. Set $\theta^{+}$ to `np.copy(parameters_values)`
348 |     2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
349 |     3. Calculate $J^{+}_i$ using `forward_propagation_n(x, y, vector_to_dictionary(`$\theta^{+}$ `))`.     
350 | - To compute `J_minus[i]`: do the same thing with $\theta^{-}$
351 | - Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$
352 | 
353 | Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameters_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute: 
354 | $$ difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 } \tag{3}$$
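
The pseudo-code above can be tried on a toy problem that does not need `gc_utils`. This is a minimal sketch, assuming a made-up cost $J(\theta) = \sum_i \sin(\theta_i)$ whose exact gradient is $\cos(\theta_i)$. The names `numerical_gradient` and `relative_difference` are illustrative and are not part of the assignment.

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-7):
    """Perturb one coordinate at a time and apply the centered difference."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        thetaplus = np.copy(theta)
        thetaplus[i, 0] += epsilon
        thetaminus = np.copy(theta)
        thetaminus[i, 0] -= epsilon
        gradapprox[i, 0] = (J(thetaplus) - J(thetaminus)) / (2.0 * epsilon)
    return gradapprox

def relative_difference(grad, gradapprox):
    """Formula (3)."""
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

J = lambda theta: np.sum(np.sin(theta))       # toy cost (an assumption)
theta = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)

correct_grad = np.cos(theta)                  # exact gradient of the toy cost
buggy_grad = np.copy(correct_grad)
buggy_grad[2, 0] *= 2.0                       # deliberately wrong in one coordinate

gradapprox = numerical_gradient(J, theta)
difference_correct = relative_difference(correct_grad, gradapprox)
difference_buggy = relative_difference(buggy_grad, gradapprox)
print("correct gradient:", difference_correct)
print("buggy gradient:  ", difference_buggy)
```

A single wrong coordinate is enough to move the difference many orders of magnitude away from the correct case, which is what makes the check useful.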
355 | 
356 | 
357 | ```python
358 | # GRADED FUNCTION: gradient_check_n
359 | 
360 | def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
361 |     """
362 |     Checks whether backward_propagation_n correctly computes the gradient of the cost output by forward_propagation_n
363 |     
364 |     Arguments:
365 |     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
366 |     gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
367 |     X -- input data, of shape (input size, m)
368 |     Y -- true labels for the m examples
369 |     epsilon -- tiny shift to the input to compute approximated gradient with formula (1)
370 |     
371 |     Returns:
372 |     difference -- difference (3) between the approximated gradient and the backward propagation gradient
373 |     """
374 |     
375 |     # Set-up variables
376 |     parameters_values, _ = dictionary_to_vector(parameters)
377 |     grad = gradients_to_vector(gradients)
378 |     num_parameters = parameters_values.shape[0]
379 |     J_plus = np.zeros((num_parameters, 1))
380 |     J_minus = np.zeros((num_parameters, 1))
381 |     gradapprox = np.zeros((num_parameters, 1))
382 |     
383 |     # Compute gradapprox
384 |     for i in range(num_parameters):
385 |         
386 |         # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
387 |         # "_" is used because forward_propagation_n returns two values (cost, cache) but we only need the first one
388 |         ### START CODE HERE ### (approx. 3 lines)
389 |         thetaplus = np.copy(parameters_values)                                      # Step 1
390 |         thetaplus[i][0] = thetaplus[i][0] + epsilon                                # Step 2
391 |         J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))                                   # Step 3
392 |         ### END CODE HERE ###
393 |         
394 |         # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
395 |         ### START CODE HERE ### (approx. 3 lines)
396 |         thetaminus = np.copy(parameters_values)                                     # Step 1
397 |         thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
398 |         J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))                                  # Step 3
399 |         ### END CODE HERE ###
400 |         
401 |         # Compute gradapprox[i]
402 |         ### START CODE HERE ### (approx. 1 line)
403 |         gradapprox[i] = (J_plus[i] - J_minus[i]) / (2. * epsilon)
404 |         ### END CODE HERE ###
405 |     
406 |     # Compare gradapprox to backward propagation gradients by computing difference.
407 |     ### START CODE HERE ### (approx. 3 lines)
408 |     numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
409 |     denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)             # Step 2'
410 |     difference = numerator / denominator                                        # Step 3'
411 |     ### END CODE HERE ###
412 | 
413 |     if difference > 1e-7:
414 |         print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
415 |     else:
416 |         print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
417 |     
418 |     return difference
419 | ```
420 | 
421 | 
422 | ```python
423 | X, Y, parameters = gradient_check_n_test_case()
424 | 
425 | cost, cache = forward_propagation_n(X, Y, parameters)
426 | gradients = backward_propagation_n(X, Y, cache)
427 | difference = gradient_check_n(parameters, gradients, X, Y)
428 | ```
429 | 
430 |     There is a mistake in the backward propagation! difference = 1.18904178788e-07
431 | 
432 | 
433 | **Expected output**:
434 | 
435 | <table>
436 |     <tr>
437 |         <td>  ** There is a mistake in the backward propagation!**  </td>
438 |         <td> difference = 0.285093156781 </td>
439 |     </tr>
440 | </table>
441 | 
442 | It seems that there were errors in the `backward_propagation_n` code we gave you! Good that you've implemented the gradient check. Go back to `backward_propagation_n` and try to find/correct the errors *(Hint: check dW2 and db1)*. Rerun the gradient check when you think you've fixed it. Remember you'll need to re-execute the cell defining `backward_propagation_n()` if you modify the code. 
443 | 
444 | Can you get gradient check to declare your derivative computation correct? Even though this part of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient check until you're convinced backprop is now correctly implemented. 
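
When the overall difference is large, comparing the two vectors block by block can help you find which gradient is wrong. The sketch below runs on synthetic vectors and assumes that the flattened gradient follows the order W1, b1, W2, b2, W3, b3 with the shapes from the `forward_propagation_n` docstring. That ordering is an assumption about `gradients_to_vector()`, and the names `parameter_slices` and `per_parameter_difference` are hypothetical.

```python
import numpy as np

# Assumed order and shapes of the flattened gradient (not confirmed by gc_utils).
PARAMETER_SHAPES = [("W1", (5, 4)), ("b1", (5, 1)),
                    ("W2", (3, 5)), ("b2", (3, 1)),
                    ("W3", (1, 3)), ("b3", (1, 1))]

def parameter_slices(shapes=PARAMETER_SHAPES):
    """Map each parameter name to its slice in the flattened vector."""
    slices, start = {}, 0
    for name, shape in shapes:
        size = int(np.prod(shape))
        slices[name] = slice(start, start + size)
        start += size
    return slices

def per_parameter_difference(grad, gradapprox):
    """Apply the relative-difference formula to each parameter block separately."""
    report = {}
    for name, block in parameter_slices().items():
        g, ga = grad[block], gradapprox[block]
        denominator = np.linalg.norm(g) + np.linalg.norm(ga)
        report["d" + name] = np.linalg.norm(g - ga) / denominator if denominator > 0 else 0.0
    return report

# Synthetic demo: pretend gradapprox is right and corrupt two blocks of grad,
# mimicking the scaling errors hinted at for dW2 and db1.
rng = np.random.default_rng(0)
gradapprox = rng.standard_normal((47, 1))
grad = gradapprox.copy()
grad[parameter_slices()["W2"]] *= 2.0
grad[parameter_slices()["b1"]] *= 4.0

report = per_parameter_difference(grad, gradapprox)
for name, value in report.items():
    print(name, value)
```

Blocks whose difference stays near zero can be ruled out, leaving only the suspicious gradients to inspect in `backward_propagation_n`.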
445 | 
446 | **Note** 
447 | - Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx  \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly, because every parameter requires two extra forward passes. For this reason, we don't run gradient checking at every iteration during training; we run it only a few times to check that the gradient is correct. 
448 | - Gradient Checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout. 
449 | 
450 | Congrats, you can be confident that your deep learning model for fraud detection is working correctly! You can even use this to convince your CEO. :) 
451 | 
452 | <font color='blue'>
453 | **What you should remember from this notebook**:
454 | - Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
455 | - Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. 
456 | 
457 | 
461 | 


--------------------------------------------------------------------------------
/Week 1 PA 3/Gradient+Checking.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Gradient Checking\n",
  8 |     "\n",
  9 |     "Welcome to the final assignment for this week! In this assignment you will learn to implement and use gradient checking. \n",
 10 |     "\n",
 11 |     "You are part of a team working to make mobile payments available globally, and are asked to build a deep learning model to detect fraud--whenever someone makes a payment, you want to see if the payment might be fraudulent, such as if the user's account has been taken over by a hacker. \n",
 12 |     "\n",
 13 |     "But backpropagation is quite challenging to implement, and sometimes has bugs. Because this is a mission-critical application, your company's CEO wants to be really certain that your implementation of backpropagation is correct. Your CEO says, \"Give me a proof that your backpropagation is actually working!\" To give this reassurance, you are going to use \"gradient checking\".\n",
 14 |     "\n",
 15 |     "Let's do it!"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": 5,
 21 |    "metadata": {
 22 |     "collapsed": true
 23 |    },
 24 |    "outputs": [],
 25 |    "source": [
 26 |     "# Packages\n",
 27 |     "import numpy as np\n",
 28 |     "from testCases import *\n",
 29 |     "from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "## 1) How does gradient checking work?\n",
 37 |     "\n",
 38 |     "Backpropagation computes the gradients $\\frac{\\partial J}{\\partial \\theta}$, where $\\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.\n",
 39 |     "\n",
 40 |     "Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost  100% sure that you're computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\\frac{\\partial J}{\\partial \\theta}$. \n",
 41 |     "\n",
 42 |     "Let's look back at the definition of a derivative (or gradient):\n",
 43 |     "$$ \\frac{\\partial J}{\\partial \\theta} = \\lim_{\\varepsilon \\to 0} \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon} \\tag{1}$$\n",
 44 |     "\n",
 45 |     "If you're not familiar with the \"$\\displaystyle \\lim_{\\varepsilon \\to 0}$\" notation, it's just a way of saying \"when $\\varepsilon$ is really really small.\"\n",
 46 |     "\n",
 47 |     "We know the following:\n",
 48 |     "\n",
 49 |     "- $\\frac{\\partial J}{\\partial \\theta}$ is what you want to make sure you're computing correctly. \n",
 50 |     "- You can compute $J(\\theta + \\varepsilon)$ and $J(\\theta - \\varepsilon)$ (in the case that $\\theta$ is a real number), since you're confident your implementation for $J$ is correct. \n",
 51 |     "\n",
 52 |     "Let's use equation (1) and a small value for $\\varepsilon$ to convince your CEO that your code for computing  $\\frac{\\partial J}{\\partial \\theta}$ is correct!"
 53 |    ]
 54 |   },
 55 |   {
 56 |    "cell_type": "markdown",
 57 |    "metadata": {},
 58 |    "source": [
 59 |     "## 2) 1-dimensional gradient checking\n",
 60 |     "\n",
 61 |     "Consider a 1D linear function $J(\\theta) = \\theta x$. The model contains only a single real-valued parameter $\\theta$, and takes $x$ as input.\n",
 62 |     "\n",
 63 |     "You will implement code to compute $J(.)$ and its derivative $\\frac{\\partial J}{\\partial \\theta}$. You will then use gradient checking to make sure your derivative computation for $J$ is correct. \n",
 64 |     "\n",
 65 |     "<img src=\"images/1Dgrad_kiank.png\" style=\"width:600px;height:250px;\">\n",
 66 |     "<caption><center> <u> **Figure 1** </u>: **1D linear model**<br> </center></caption>\n",
 67 |     "\n",
 68 |     "The diagram above shows the key computation steps: First start with $x$, then evaluate the function $J(x)$ (\"forward propagation\"). Then compute the derivative $\\frac{\\partial J}{\\partial \\theta}$ (\"backward propagation\"). \n",
 69 |     "\n",
 70 |     "**Exercise**: implement \"forward propagation\" and \"backward propagation\" for this simple function. I.e., compute both $J(.)$ (\"forward propagation\") and its derivative with respect to $\\theta$ (\"backward propagation\"), in two separate functions. "
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": 6,
 76 |    "metadata": {
 77 |     "collapsed": true
 78 |    },
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "# GRADED FUNCTION: forward_propagation\n",
 82 |     "\n",
 83 |     "def forward_propagation(x, theta):\n",
 84 |     "    \"\"\"\n",
 85 |     "    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)\n",
 86 |     "    \n",
 87 |     "    Arguments:\n",
 88 |     "    x -- a real-valued input\n",
 89 |     "    theta -- our parameter, a real number as well\n",
 90 |     "    \n",
 91 |     "    Returns:\n",
 92 |     "    J -- the value of function J, computed using the formula J(theta) = theta * x\n",
 93 |     "    \"\"\"\n",
 94 |     "    \n",
 95 |     "    ### START CODE HERE ### (approx. 1 line)\n",
 96 |     "    J = theta * x\n",
 97 |     "    ### END CODE HERE ###\n",
 98 |     "    \n",
 99 |     "    return J"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 7,
105 |    "metadata": {},
106 |    "outputs": [
107 |     {
108 |      "name": "stdout",
109 |      "output_type": "stream",
110 |      "text": [
111 |       "J = 8\n"
112 |      ]
113 |     }
114 |    ],
115 |    "source": [
116 |     "x, theta = 2, 4\n",
117 |     "J = forward_propagation(x, theta)\n",
118 |     "print (\"J = \" + str(J))"
119 |    ]
120 |   },
121 |   {
122 |    "cell_type": "markdown",
123 |    "metadata": {},
124 |    "source": [
125 |     "**Expected Output**:\n",
126 |     "\n",
127 |     "<table>\n",
128 |     "    <tr>\n",
129 |     "        <td>  ** J **  </td>\n",
130 |     "        <td> 8</td>\n",
131 |     "    </tr>\n",
132 |     "</table>"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "markdown",
137 |    "metadata": {},
138 |    "source": [
139 |     "**Exercise**: Now, implement the backward propagation step (derivative computation) of Figure 1. That is, compute the derivative of $J(\\theta) = \\theta x$ with respect to $\\theta$. To save you from doing the calculus, you should get $dtheta = \\frac { \\partial J }{ \\partial \\theta} = x$."
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": 8,
145 |    "metadata": {
146 |     "collapsed": true
147 |    },
148 |    "outputs": [],
149 |    "source": [
150 |     "# GRADED FUNCTION: backward_propagation\n",
151 |     "\n",
152 |     "def backward_propagation(x, theta):\n",
153 |     "    \"\"\"\n",
154 |     "    Computes the derivative of J with respect to theta (see Figure 1).\n",
155 |     "    \n",
156 |     "    Arguments:\n",
157 |     "    x -- a real-valued input\n",
158 |     "    theta -- our parameter, a real number as well\n",
159 |     "    \n",
160 |     "    Returns:\n",
161 |     "    dtheta -- the gradient of the cost with respect to theta\n",
162 |     "    \"\"\"\n",
163 |     "    \n",
164 |     "    ### START CODE HERE ### (approx. 1 line)\n",
165 |     "    dtheta = x\n",
166 |     "    ### END CODE HERE ###\n",
167 |     "    \n",
168 |     "    return dtheta"
169 |    ]
170 |   },
171 |   {
172 |    "cell_type": "code",
173 |    "execution_count": 9,
174 |    "metadata": {
175 |     "scrolled": true
176 |    },
177 |    "outputs": [
178 |     {
179 |      "name": "stdout",
180 |      "output_type": "stream",
181 |      "text": [
182 |       "dtheta = 2\n"
183 |      ]
184 |     }
185 |    ],
186 |    "source": [
187 |     "x, theta = 2, 4\n",
188 |     "dtheta = backward_propagation(x, theta)\n",
189 |     "print (\"dtheta = \" + str(dtheta))"
190 |    ]
191 |   },
192 |   {
193 |    "cell_type": "markdown",
194 |    "metadata": {},
195 |    "source": [
196 |     "**Expected Output**:\n",
197 |     "\n",
198 |     "<table>\n",
199 |     "    <tr>\n",
200 |     "        <td>  ** dtheta **  </td>\n",
201 |     "        <td> 2 </td>\n",
202 |     "    </tr>\n",
203 |     "</table>"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "markdown",
208 |    "metadata": {},
209 |    "source": [
210 |     "**Exercise**: To show that the `backward_propagation()` function is correctly computing the gradient $\\frac{\\partial J}{\\partial \\theta}$, let's implement gradient checking.\n",
211 |     "\n",
212 |     "**Instructions**:\n",
213 |     "- First compute \"gradapprox\" using the formula above (1) and a small value of $\\varepsilon$. Here are the Steps to follow:\n",
214 |     "    1. $\\theta^{+} = \\theta + \\varepsilon$\n",
215 |     "    2. $\\theta^{-} = \\theta - \\varepsilon$\n",
216 |     "    3. $J^{+} = J(\\theta^{+})$\n",
217 |     "    4. $J^{-} = J(\\theta^{-})$\n",
218 |     "    5. $gradapprox = \\frac{J^{+} - J^{-}}{2  \\varepsilon}$\n",
219 |     "- Then compute the gradient using backward propagation, and store the result in a variable \"grad\"\n",
220 |     "- Finally, compute the relative difference between \"gradapprox\" and the \"grad\" using the following formula:\n",
221 |     "$$ difference = \\frac {\\| grad - gradapprox \\|_2}{\\| grad \\|_2 + \\| gradapprox \\|_2} \\tag{2}$$\n",
222 |     "You will need 3 Steps to compute this formula:\n",
223 |     "   - 1'. compute the numerator using np.linalg.norm(...)\n",
224 |     "   - 2'. compute the denominator. You will need to call np.linalg.norm(...) twice.\n",
225 |     "   - 3'. divide them.\n",
226 |     "- If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation. \n"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": 10,
232 |    "metadata": {
233 |     "collapsed": true
234 |    },
235 |    "outputs": [],
236 |    "source": [
237 |     "# GRADED FUNCTION: gradient_check\n",
238 |     "\n",
239 |     "def gradient_check(x, theta, epsilon = 1e-7):\n",
240 |     "    \"\"\"\n",
241 |     "    Implement gradient checking for the 1D model presented in Figure 1.\n",
242 |     "    \n",
243 |     "    Arguments:\n",
244 |     "    x -- a real-valued input\n",
245 |     "    theta -- our parameter, a real number as well\n",
246 |     "    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)\n",
247 |     "    \n",
248 |     "    Returns:\n",
249 |     "    difference -- difference (2) between the approximated gradient and the backward propagation gradient\n",
250 |     "    \"\"\"\n",
251 |     "    \n",
252 |     "    # Compute gradapprox using the right-hand side of formula (1). epsilon is small enough that you don't need to worry about the limit.\n",
253 |     "    ### START CODE HERE ### (approx. 5 lines)\n",
254 |     "    thetaplus = theta + epsilon                               # Step 1\n",
255 |     "    thetaminus = theta - epsilon                              # Step 2\n",
256 |     "    J_plus = forward_propagation(x, thetaplus)                # Step 3\n",
257 |     "    J_minus = forward_propagation(x, thetaminus)              # Step 4\n",
258 |     "    gradapprox = (J_plus - J_minus) / (2. * epsilon)          # Step 5\n",
259 |     "    ### END CODE HERE ###\n",
260 |     "    \n",
261 |     "    # Check if gradapprox is close enough to the output of backward_propagation()\n",
262 |     "    ### START CODE HERE ### (approx. 1 line)\n",
263 |     "    grad = backward_propagation(x, theta)\n",
264 |     "    ### END CODE HERE ###\n",
265 |     "    \n",
266 |     "    ### START CODE HERE ### (approx. 3 lines)\n",
267 |     "    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'\n",
268 |     "    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)             # Step 2'\n",
269 |     "    difference = numerator / denominator                                        # Step 3'\n",
270 |     "    ### END CODE HERE ###\n",
271 |     "    \n",
272 |     "    if difference < 1e-7:\n",
273 |     "        print (\"The gradient is correct!\")\n",
274 |     "    else:\n",
275 |     "        print (\"The gradient is wrong!\")\n",
276 |     "    \n",
277 |     "    return difference"
278 |    ]
279 |   },
280 |   {
281 |    "cell_type": "code",
282 |    "execution_count": 11,
283 |    "metadata": {
284 |     "scrolled": true
285 |    },
286 |    "outputs": [
287 |     {
288 |      "name": "stdout",
289 |      "output_type": "stream",
290 |      "text": [
291 |       "The gradient is correct!\n",
292 |       "difference = 2.91933588329e-10\n"
293 |      ]
294 |     }
295 |    ],
296 |    "source": [
297 |     "x, theta = 2, 4\n",
298 |     "difference = gradient_check(x, theta)\n",
299 |     "print(\"difference = \" + str(difference))"
300 |    ]
301 |   },
302 |   {
303 |    "cell_type": "markdown",
304 |    "metadata": {},
305 |    "source": [
306 |     "**Expected Output**:\n",
307 |     "The gradient is correct!\n",
308 |     "<table>\n",
309 |     "    <tr>\n",
310 |     "        <td>  ** difference **  </td>\n",
311 |     "        <td> 2.9193358103083e-10 </td>\n",
312 |     "    </tr>\n",
313 |     "</table>"
314 |    ]
315 |   },
316 |   {
317 |    "cell_type": "markdown",
318 |    "metadata": {},
319 |    "source": [
320 |     "Congrats, the difference is smaller than the $10^{-7}$ threshold. So you can have high confidence that you've correctly computed the gradient in `backward_propagation()`. \n",
321 |     "\n",
322 |     "Now, in the more general case, your cost function $J$ has more than a single 1D input. When you are training a neural network, $\\theta$ actually consists of multiple matrices $W^{[l]}$ and biases $b^{[l]}$! It is important to know how to do a gradient check with higher-dimensional inputs. Let's do it!"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "markdown",
327 |    "metadata": {},
328 |    "source": [
329 |     "## 3) N-dimensional gradient checking"
330 |    ]
331 |   },
332 |   {
333 |    "cell_type": "markdown",
334 |    "metadata": {
335 |     "collapsed": true
336 |    },
337 |    "source": [
338 |     "The following figure describes the forward and backward propagation of your fraud detection model.\n",
339 |     "\n",
340 |     "<img src=\"images/NDgrad_kiank.png\" style=\"width:600px;height:400px;\">\n",
341 |     "<caption><center> <u> **Figure 2** </u>: **deep neural network**<br>*LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID*</center></caption>\n",
342 |     "\n",
343 |     "Let's look at your implementations for forward propagation and backward propagation. "
344 |    ]
345 |   },
346 |   {
347 |    "cell_type": "code",
348 |    "execution_count": 12,
349 |    "metadata": {
350 |     "collapsed": true
351 |    },
352 |    "outputs": [],
353 |    "source": [
354 |     "def forward_propagation_n(X, Y, parameters):\n",
355 |     "    \"\"\"\n",
356 |     "    Implements the forward propagation (and computes the cost) presented in Figure 2.\n",
357 |     "    \n",
358 |     "    Arguments:\n",
359 |     "    X -- training set for m examples\n",
360 |     "    Y -- labels for m examples \n",
361 |     "    parameters -- python dictionary containing your parameters \"W1\", \"b1\", \"W2\", \"b2\", \"W3\", \"b3\":\n",
362 |     "                    W1 -- weight matrix of shape (5, 4)\n",
363 |     "                    b1 -- bias vector of shape (5, 1)\n",
364 |     "                    W2 -- weight matrix of shape (3, 5)\n",
365 |     "                    b2 -- bias vector of shape (3, 1)\n",
366 |     "                    W3 -- weight matrix of shape (1, 3)\n",
367 |     "                    b3 -- bias vector of shape (1, 1)\n",
368 |     "    \n",
369 |     "    Returns:\n",
370 |     "    cost -- the logistic cost averaged over the m examples (returned together with the cache used by backward propagation)\n",
371 |     "    \"\"\"\n",
372 |     "    \n",
373 |     "    # retrieve parameters\n",
374 |     "    m = X.shape[1]\n",
375 |     "    W1 = parameters[\"W1\"]\n",
376 |     "    b1 = parameters[\"b1\"]\n",
377 |     "    W2 = parameters[\"W2\"]\n",
378 |     "    b2 = parameters[\"b2\"]\n",
379 |     "    W3 = parameters[\"W3\"]\n",
380 |     "    b3 = parameters[\"b3\"]\n",
381 |     "\n",
382 |     "    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID\n",
383 |     "    Z1 = np.dot(W1, X) + b1\n",
384 |     "    A1 = relu(Z1)\n",
385 |     "    Z2 = np.dot(W2, A1) + b2\n",
386 |     "    A2 = relu(Z2)\n",
387 |     "    Z3 = np.dot(W3, A2) + b3\n",
388 |     "    A3 = sigmoid(Z3)\n",
389 |     "\n",
390 |     "    # Cost\n",
391 |     "    logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)\n",
392 |     "    cost = 1./m * np.sum(logprobs)\n",
393 |     "    \n",
394 |     "    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)\n",
395 |     "    \n",
396 |     "    return cost, cache"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "markdown",
401 |    "metadata": {},
402 |    "source": [
403 |     "Now, run backward propagation."
404 |    ]
405 |   },
406 |   {
407 |    "cell_type": "code",
408 |    "execution_count": 16,
409 |    "metadata": {
410 |     "collapsed": true
411 |    },
412 |    "outputs": [],
413 |    "source": [
414 |     "def backward_propagation_n(X, Y, cache):\n",
415 |     "    \"\"\"\n",
416 |     "    Implement the backward propagation presented in figure 2.\n",
417 |     "    \n",
418 |     "    Arguments:\n",
419 |     "    X -- input datapoint, of shape (input size, 1)\n",
420 |     "    Y -- true \"label\"\n",
421 |     "    cache -- cache output from forward_propagation_n()\n",
422 |     "    \n",
423 |     "    Returns:\n",
424 |     "    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.\n",
425 |     "    \"\"\"\n",
426 |     "    \n",
427 |     "    m = X.shape[1]\n",
428 |     "    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache\n",
429 |     "    \n",
430 |     "    dZ3 = A3 - Y\n",
431 |     "    dW3 = 1./m * np.dot(dZ3, A2.T)\n",
432 |     "    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)\n",
433 |     "    \n",
434 |     "    dA2 = np.dot(W3.T, dZ3)\n",
435 |     "    dZ2 = np.multiply(dA2, np.int64(A2 > 0))\n",
436 |     "    # dW2 = 1./m * np.dot(dZ2, A1.T) * 2\n",
437 |     "    dW2 = 1./m * np.dot(dZ2, A1.T) \n",
438 |     "    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)\n",
439 |     "    \n",
440 |     "    dA1 = np.dot(W2.T, dZ2)\n",
441 |     "    dZ1 = np.multiply(dA1, np.int64(A1 > 0))\n",
442 |     "    dW1 = 1./m * np.dot(dZ1, X.T)\n",
443 |     "    # db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)\n",
444 |     "    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)\n",
445 |     "    \n",
446 |     "    gradients = {\"dZ3\": dZ3, \"dW3\": dW3, \"db3\": db3,\n",
447 |     "                 \"dA2\": dA2, \"dZ2\": dZ2, \"dW2\": dW2, \"db2\": db2,\n",
448 |     "                 \"dA1\": dA1, \"dZ1\": dZ1, \"dW1\": dW1, \"db1\": db1}\n",
449 |     "    \n",
450 |     "    return gradients"
451 |    ]
452 |   },
453 |   {
454 |    "cell_type": "markdown",
455 |    "metadata": {
456 |     "collapsed": true
457 |    },
458 |    "source": [
459 |     "You obtained some results on the fraud detection test set but you are not 100% sure of your model. Nobody's perfect! Let's implement gradient checking to verify if your gradients are correct."
460 |    ]
461 |   },
462 |   {
463 |    "cell_type": "markdown",
464 |    "metadata": {},
465 |    "source": [
466 |     "**How does gradient checking work?**\n",
467 |     "\n",
468 |     "As in 1) and 2), you want to compare \"gradapprox\" to the gradient computed by backpropagation. The formula is still:\n",
469 |     "\n",
470 |     "$$ \\frac{\\partial J}{\\partial \\theta} = \\lim_{\\varepsilon \\to 0} \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon} \\tag{1}$$\n",
471 |     "\n",
472 |     "However, $\\theta$ is not a scalar anymore. It is a dictionary called \"parameters\". We implemented a function \"`dictionary_to_vector()`\" for you. It converts the \"parameters\" dictionary into a vector called \"values\", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them.\n",
473 |     "\n",
474 |     "The inverse function is \"`vector_to_dictionary`\" which outputs back the \"parameters\" dictionary.\n",
475 |     "\n",
476 |     "<img src=\"images/dictionary_to_vector.png\" style=\"width:600px;height:400px;\">\n",
477 |     "<caption><center> <u> **Figure 3** </u>: **dictionary_to_vector() and vector_to_dictionary()**<br> You will need these functions in gradient_check_n()</center></caption>\n",
478 |     "\n",
479 |     "We have also converted the \"gradients\" dictionary into a vector \"grad\" using gradients_to_vector(). You don't need to worry about that.\n",
480 |     "\n",
481 |     "**Exercise**: Implement gradient_check_n().\n",
482 |     "\n",
483 |     "**Instructions**: Here is pseudo-code that will help you implement the gradient check.\n",
484 |     "\n",
485 |     "For each i in num_parameters:\n",
486 |     "- To compute `J_plus[i]`:\n",
487 |     "    1. Set $\\theta^{+}$ to `np.copy(parameters_values)`\n",
488 |     "    2. Set $\\theta^{+}_i$ to $\\theta^{+}_i + \\varepsilon$\n",
489 |     "    3. Calculate $J^{+}_i$ using `forward_propagation_n(X, Y, vector_to_dictionary(`$\\theta^{+}$ `))`.     \n",
490 |     "- To compute `J_minus[i]`: do the same thing with $\\theta^{-}$\n",
491 |     "- Compute $gradapprox[i] = \\frac{J^{+}_i - J^{-}_i}{2 \\varepsilon}$\n",
492 |     "\n",
493 |     "Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameters_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute: \n",
494 |     "$$ difference = \\frac {\\| grad - gradapprox \\|_2}{\\| grad \\|_2 + \\| gradapprox \\|_2 } \\tag{3}$$"
495 |    ]
496 |   },
497 |   {
498 |    "cell_type": "code",
499 |    "execution_count": 14,
500 |    "metadata": {
501 |     "collapsed": true
502 |    },
503 |    "outputs": [],
504 |    "source": [
505 |     "# GRADED FUNCTION: gradient_check_n\n",
506 |     "\n",
507 |     "def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):\n",
508 |     "    \"\"\"\n",
509 |     "    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n\n",
510 |     "    \n",
511 |     "    Arguments:\n",
512 |     "    parameters -- python dictionary containing your parameters \"W1\", \"b1\", \"W2\", \"b2\", \"W3\", \"b3\"\n",
513 |     "    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters\n",
514 |     "    X -- input data, of shape (input size, number of examples)\n",
515 |     "    Y -- true \"label\" vector\n",
516 |     "    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)\n",
517 |     "    \n",
518 |     "    Returns:\n",
519 |     "    difference -- difference (3) between the approximated gradient and the backward propagation gradient\n",
520 |     "    \"\"\"\n",
521 |     "    \n",
522 |     "    # Set-up variables\n",
523 |     "    parameters_values, _ = dictionary_to_vector(parameters)\n",
524 |     "    grad = gradients_to_vector(gradients)\n",
525 |     "    num_parameters = parameters_values.shape[0]\n",
526 |     "    J_plus = np.zeros((num_parameters, 1))\n",
527 |     "    J_minus = np.zeros((num_parameters, 1))\n",
528 |     "    gradapprox = np.zeros((num_parameters, 1))\n",
529 |     "    \n",
530 |     "    # Compute gradapprox\n",
531 |     "    for i in range(num_parameters):\n",
532 |     "        \n",
533 |     "        # Compute J_plus[i]. Inputs: \"parameters_values, epsilon\". Output = \"J_plus[i]\".\n",
534 |     "        # \"_\" is used because forward_propagation_n returns two values (cost, cache) but we only care about the first one\n",
535 |     "        ### START CODE HERE ### (approx. 3 lines)\n",
536 |     "        thetaplus = np.copy(parameters_values)                                      # Step 1\n",
537 |     "        thetaplus[i][0] = thetaplus[i][0] + epsilon                                # Step 2\n",
538 |     "        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))                                   # Step 3\n",
539 |     "        ### END CODE HERE ###\n",
540 |     "        \n",
541 |     "        # Compute J_minus[i]. Inputs: \"parameters_values, epsilon\". Output = \"J_minus[i]\".\n",
542 |     "        ### START CODE HERE ### (approx. 3 lines)\n",
543 |     "        thetaminus = np.copy(parameters_values)                                     # Step 1\n",
544 |     "        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        \n",
545 |     "        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))                                  # Step 3\n",
546 |     "        ### END CODE HERE ###\n",
547 |     "        \n",
548 |     "        # Compute gradapprox[i]\n",
549 |     "        ### START CODE HERE ### (approx. 1 line)\n",
550 |     "        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2. * epsilon)\n",
551 |     "        ### END CODE HERE ###\n",
552 |     "    \n",
553 |     "    # Compare gradapprox to backward propagation gradients by computing difference.\n",
554 |     "    ### START CODE HERE ### (approx. 1 line)\n",
555 |     "    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'\n",
556 |     "    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)             # Step 2'\n",
557 |     "    difference = numerator / denominator                                        # Step 3'\n",
558 |     "    ### END CODE HERE ###\n",
559 |     "\n",
560 |     "    if difference > 1e-7:\n",
561 |     "        print (\"\\033[93m\" + \"There is a mistake in the backward propagation! difference = \" + str(difference) + \"\\033[0m\")\n",
562 |     "    else:\n",
563 |     "        print (\"\\033[92m\" + \"Your backward propagation works perfectly fine! difference = \" + str(difference) + \"\\033[0m\")\n",
564 |     "    \n",
565 |     "    return difference"
566 |    ]
567 |   },
568 |   {
569 |    "cell_type": "code",
570 |    "execution_count": 17,
571 |    "metadata": {
572 |     "scrolled": false
573 |    },
574 |    "outputs": [
575 |     {
576 |      "name": "stdout",
577 |      "output_type": "stream",
578 |      "text": [
579 |       "\u001b[93mThere is a mistake in the backward propagation! difference = 1.18904178788e-07\u001b[0m\n"
580 |      ]
581 |     }
582 |    ],
583 |    "source": [
584 |     "X, Y, parameters = gradient_check_n_test_case()\n",
585 |     "\n",
586 |     "cost, cache = forward_propagation_n(X, Y, parameters)\n",
587 |     "gradients = backward_propagation_n(X, Y, cache)\n",
588 |     "difference = gradient_check_n(parameters, gradients, X, Y)"
589 |    ]
590 |   },
591 |   {
592 |    "cell_type": "markdown",
593 |    "metadata": {},
594 |    "source": [
595 |     "**Expected output**:\n",
596 |     "\n",
597 |     "<table>\n",
598 |     "    <tr>\n",
599 |     "        <td>  ** There is a mistake in the backward propagation!**  </td>\n",
600 |     "        <td> difference = 0.285093156781 </td>\n",
601 |     "    </tr>\n",
602 |     "</table>"
603 |    ]
604 |   },
605 |   {
606 |    "cell_type": "markdown",
607 |    "metadata": {},
608 |    "source": [
609 |     "It seems that there were errors in the `backward_propagation_n` code we gave you! Good that you've implemented the gradient check. Go back to `backward_propagation_n` and try to find/correct the errors *(Hint: check dW2 and db1)*. Rerun the gradient check when you think you've fixed it. Remember you'll need to re-execute the cell defining `backward_propagation_n()` if you modify the code. \n",
610 |     "\n",
611 |     "Can you get gradient check to declare your derivative computation correct? Even though this part of the assignment isn't graded, we strongly urge you to try to find the bug and re-run gradient check until you're convinced backprop is now correctly implemented. \n",
612 |     "\n",
613 |     "**Note** \n",
614 |     "- Gradient Checking is slow! Approximating the gradient with $\\frac{\\partial J}{\\partial \\theta} \\approx  \\frac{J(\\theta + \\varepsilon) - J(\\theta - \\varepsilon)}{2 \\varepsilon}$ is computationally costly, because it needs two forward passes per parameter. For this reason, we don't run gradient checking at every iteration during training; we run it only a few times to check that the gradient is correct. \n",
615 |     "- Gradient Checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout. \n",
616 |     "\n",
617 |     "Congrats, you can be confident that your deep learning model for fraud detection is working correctly! You can even use this to convince your CEO. :) \n",
618 |     "\n",
619 |     "<font color='blue'>\n",
620 |     "**What you should remember from this notebook**:\n",
621 |     "- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).\n",
622 |     "- Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. "
623 |    ]
624 |   },
625 |   {
626 |    "cell_type": "code",
627 |    "execution_count": null,
628 |    "metadata": {
629 |     "collapsed": true
630 |    },
631 |    "outputs": [],
632 |    "source": []
633 |   }
634 |  ],
635 |  "metadata": {
636 |   "coursera": {
637 |    "course_slug": "deep-neural-network",
638 |    "graded_item_id": "n6NBD",
639 |    "launcher_item_id": "yfOsE"
640 |   },
641 |   "kernelspec": {
642 |    "display_name": "Python 3",
643 |    "language": "python",
644 |    "name": "python3"
645 |   },
646 |   "language_info": {
647 |    "codemirror_mode": {
648 |     "name": "ipython",
649 |     "version": 3
650 |    },
651 |    "file_extension": ".py",
652 |    "mimetype": "text/x-python",
653 |    "name": "python",
654 |    "nbconvert_exporter": "python",
655 |    "pygments_lexer": "ipython3",
656 |    "version": "3.6.0"
657 |   }
658 |  },
659 |  "nbformat": 4,
660 |  "nbformat_minor": 1
661 | }
662 | 


--------------------------------------------------------------------------------
/Week 1 PA 2/Regularization/Regularization.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # Regularization
  3 | 
  4 | Welcome to the second assignment of this week. Deep Learning models have so much flexibility and capacity that **overfitting can be a serious problem** if the training dataset is not big enough. Sure, the model does well on the training set, but the learned network **doesn't generalize to new examples** that it has never seen!
  5 | 
  6 | **You will learn to:** Use regularization in your deep learning models.
  7 | 
  8 | Let's first import the packages you are going to use.
  9 | 
 10 | 
 11 | ```python
 12 | # import packages
 13 | import numpy as np
 14 | import matplotlib.pyplot as plt
 15 | from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
 16 | from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
 17 | import sklearn
 18 | import sklearn.datasets
 19 | import scipy.io
 20 | from testCases import *
 21 | 
 22 | %matplotlib inline
 23 | plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
 24 | plt.rcParams['image.interpolation'] = 'nearest'
 25 | plt.rcParams['image.cmap'] = 'gray'
 26 | ```
 27 | 
 28 |     /home/jovyan/work/week5/Regularization/reg_utils.py:85: SyntaxWarning: assertion is always true, perhaps remove parentheses?
 29 |       assert(parameters['W' + str(l)].shape == layer_dims[l], layer_dims[l-1])
 30 |     /home/jovyan/work/week5/Regularization/reg_utils.py:86: SyntaxWarning: assertion is always true, perhaps remove parentheses?
 31 |       assert(parameters['W' + str(l)].shape == layer_dims[l], 1)
 32 | 
 33 | 
 34 | **Problem Statement**: You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goalkeeper should kick the ball so that the French team's players can then hit it with their head. 
 35 | 
 36 | <img src="images/field_kiank.png" style="width:600px;height:350px;">
 37 | <caption><center> <u> **Figure 1** </u>: **Football field**<br> The goalkeeper kicks the ball in the air, and the players of each team fight to hit the ball with their head </center></caption>
 38 | 
 39 | 
 40 | They give you the following 2D dataset from France's past 10 games.
 41 | 
 42 | 
 43 | ```python
 44 | train_X, train_Y, test_X, test_Y = load_2D_dataset()
 45 | ```
 46 | 
 47 | 
 48 | ![png](output_3_0.png)
 49 | 
 50 | 
 51 | Each dot corresponds to a position on the football field where a football player has hit the ball with his/her head after the French goalkeeper has kicked the ball from the left side of the football field.
 52 | - If the dot is blue, it means the French player managed to hit the ball with his/her head
 53 | - If the dot is red, it means the other team's player hit the ball with their head
 54 | 
 55 | **Your goal**: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.
 56 | 
 57 | **Analysis of the dataset**: This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue) from the lower right half (red) would work well. 
 58 | 
 59 | You will first try a non-regularized model. Then you'll learn how to regularize it and decide which model you will choose to solve the French Football Corporation's problem. 
 60 | 
 61 | ## 1 - Non-regularized model
 62 | 
 63 | You will use the following neural network (already implemented for you below). This model can be used:
 64 | - in *regularization mode* -- by setting the `lambd` input to a non-zero value. We use "`lambd`" instead of "`lambda`" because "`lambda`" is a reserved keyword in Python. 
 65 | - in *dropout mode* -- by setting the `keep_prob` to a value less than one
 66 | 
 67 | You will first try the model without any regularization. Then, you will implement:
 68 | - *L2 regularization* -- functions: "`compute_cost_with_regularization()`" and "`backward_propagation_with_regularization()`"
 69 | - *Dropout* -- functions: "`forward_propagation_with_dropout()`" and "`backward_propagation_with_dropout()`"
 70 | 
 71 | In each part, you will run this model with the correct inputs so that it calls the functions you've implemented. Take a look at the code below to familiarize yourself with the model.
 72 | 
 73 | 
 74 | ```python
 75 | def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
 76 |     """
 77 |     Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
 78 |     
 79 |     Arguments:
 80 |     X -- input data, of shape (input size, number of examples)
 81 |     Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
 82 |     learning_rate -- learning rate of the optimization
 83 |     num_iterations -- number of iterations of the optimization loop
 84 |     print_cost -- If True, print the cost every 10000 iterations
 85 |     lambd -- regularization hyperparameter, scalar
 86 |     keep_prob - probability of keeping a neuron active during drop-out, scalar.
 87 |     
 88 |     Returns:
 89 |     parameters -- parameters learned by the model. They can then be used to predict.
 90 |     """
 91 |         
 92 |     grads = {}
 93 |     costs = []                            # to keep track of the cost
 94 |     m = X.shape[1]                        # number of examples
 95 |     layers_dims = [X.shape[0], 20, 3, 1]
 96 |     
 97 |     # Initialize parameters dictionary.
 98 |     parameters = initialize_parameters(layers_dims)
 99 | 
100 |     # Loop (gradient descent)
101 | 
102 |     for i in range(0, num_iterations):
103 | 
104 |         # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
105 |         if keep_prob == 1:
106 |             a3, cache = forward_propagation(X, parameters)
107 |         elif keep_prob < 1:
108 |             a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
109 |         
110 |         # Cost function
111 |         if lambd == 0:
112 |             cost = compute_cost(a3, Y)
113 |         else:
114 |             cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
115 |             
116 |         # Backward propagation.
117 |         assert(lambd==0 or keep_prob==1)    # it is possible to use both L2 regularization and dropout, 
118 |                                             # but this assignment will only explore one at a time
119 |         if lambd == 0 and keep_prob == 1:
120 |             grads = backward_propagation(X, Y, cache)
121 |         elif lambd != 0:
122 |             grads = backward_propagation_with_regularization(X, Y, cache, lambd)
123 |         elif keep_prob < 1:
124 |             grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
125 |         
126 |         # Update parameters.
127 |         parameters = update_parameters(parameters, grads, learning_rate)
128 |         
129 |         # Print the loss every 10000 iterations
130 |         if print_cost and i % 10000 == 0:
131 |             print("Cost after iteration {}: {}".format(i, cost))
132 |         if print_cost and i % 1000 == 0:
133 |             costs.append(cost)
134 |     
135 |     # plot the cost
136 |     plt.plot(costs)
137 |     plt.ylabel('cost')
138 |     plt.xlabel('iterations (x1,000)')
139 |     plt.title("Learning rate =" + str(learning_rate))
140 |     plt.show()
141 |     
142 |     return parameters
143 | ```
144 | 
145 | Let's train the model without any regularization, and observe the accuracy on the train/test sets.
146 | 
147 | 
148 | ```python
149 | parameters = model(train_X, train_Y)
150 | print ("On the training set:")
151 | predictions_train = predict(train_X, train_Y, parameters)
152 | print ("On the test set:")
153 | predictions_test = predict(test_X, test_Y, parameters)
154 | ```
155 | 
156 |     Cost after iteration 0: 0.6557412523481002
157 |     Cost after iteration 10000: 0.16329987525724216
158 |     Cost after iteration 20000: 0.13851642423255986
159 | 
160 | 
161 | 
162 | ![png](output_9_1.png)
163 | 
164 | 
165 |     On the training set:
166 |     Accuracy: 0.947867298578
167 |     On the test set:
168 |     Accuracy: 0.915
169 | 
170 | 
171 | The train accuracy is 94.8% while the test accuracy is 91.5%. This is the **baseline model** (you will observe the impact of regularization on this model). Run the following code to plot the decision boundary of your model.
172 | 
173 | 
174 | ```python
175 | plt.title("Model without regularization")
176 | axes = plt.gca()
177 | axes.set_xlim([-0.75,0.40])
178 | axes.set_ylim([-0.75,0.65])
179 | plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
180 | ```
181 | 
182 | 
183 | ![png](output_11_0.png)
184 | 
185 | 
186 | The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! Let's now look at two techniques to reduce overfitting.
187 | 
188 | ## 2 - L2 Regularization
189 | 
190 | The standard way to avoid overfitting is called **L2 regularization**. It consists of appropriately modifying your cost function, from:
191 | $$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
192 | To:
193 | $$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$
194 | 
195 | Let's modify your cost and observe the consequences.
196 | 
197 | **Exercise**: Implement `compute_cost_with_regularization()` which computes the cost given by formula (2). To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$  , use :
198 | ```python
199 | np.sum(np.square(Wl))
200 | ```
201 | Note that you have to do this for $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$, then sum the three terms and multiply by $ \frac{1}{m} \frac{\lambda}{2} $.
202 | 
203 | 
204 | ```python
205 | # GRADED FUNCTION: compute_cost_with_regularization
206 | 
207 | def compute_cost_with_regularization(A3, Y, parameters, lambd):
208 |     """
209 |     Implement the cost function with L2 regularization. See formula (2) above.
210 |     
211 |     Arguments:
212 |     A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
213 |     Y -- "true" labels vector, of shape (output size, number of examples)
214 |     parameters -- python dictionary containing parameters of the model
215 |     
216 |     Returns:
217 |     cost - value of the regularized loss function (formula (2))
218 |     """
219 |     m = Y.shape[1]
220 |     W1 = parameters["W1"]
221 |     W2 = parameters["W2"]
222 |     W3 = parameters["W3"]
223 |     
224 |     cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
225 |     
226 |     ### START CODE HERE ### (approx. 1 line)
227 |     L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (m * 2.)
228 |     ### END CODE HERE ###
229 |     
230 |     cost = cross_entropy_cost + L2_regularization_cost
231 |     
232 |     return cost
233 | ```
234 | 
235 | 
236 | ```python
237 | A3, Y_assess, parameters = compute_cost_with_regularization_test_case()
238 | 
239 | print("cost = " + str(compute_cost_with_regularization(A3, Y_assess, parameters, lambd = 0.1)))
240 | ```
241 | 
242 |     cost = 1.78648594516
243 | 
244 | 
245 | **Expected Output**: 
246 | 
247 | <table> 
248 |     <tr>
249 |     <td>
250 |     **cost**
251 |     </td>
252 |         <td>
253 |     1.78648594516
254 |     </td>
255 |     
256 |     </tr>
257 | 
258 | </table> 
259 | 
260 | Of course, because you changed the cost, you have to change backward propagation as well! All the gradients have to be computed with respect to this new cost. 
261 | 
262 | **Exercise**: Implement the changes needed in backward propagation to take into account regularization. The changes only concern dW1, dW2 and dW3. For each, you have to add the regularization term's gradient ($\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m}  W^2) = \frac{\lambda}{m} W$).
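
The gradient formula in this exercise can be sanity-checked numerically before wiring it into backpropagation. The sketch below is only an illustration and not part of the assignment: the helper names `l2_penalty`, `l2_penalty_grad` and `numerical_grad`, the random matrix `W`, and the values of `lambd` and `m` are all assumptions chosen for the example. It compares the analytic gradient $\frac{\lambda}{m} W$ against a centered finite difference of the penalty $\frac{1}{2}\frac{\lambda}{m} \sum W^2$ for a single weight matrix.

```python
import numpy as np

def l2_penalty(W, lambd, m):
    """L2 regularization cost of one weight matrix: (lambd / (2 m)) * sum(W ** 2)."""
    return lambd / (2.0 * m) * np.sum(np.square(W))

def l2_penalty_grad(W, lambd, m):
    """Analytic gradient of l2_penalty with respect to W: (lambd / m) * W."""
    return lambd / m * W

def numerical_grad(f, W, eps=1e-6):
    """Centered finite-difference gradient of the scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus = W.copy()
        W_plus[idx] += eps
        W_minus = W.copy()
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2.0 * eps)
    return grad

# Hypothetical values, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
lambd, m = 0.7, 5

analytic = l2_penalty_grad(W, lambd, m)
numeric = numerical_grad(lambda V: l2_penalty(V, lambd, m), W)
max_abs_error = np.max(np.abs(analytic - numeric))
print("largest absolute gap between the two gradients:", max_abs_error)
```

Because the penalty is quadratic in `W`, the centered difference should agree with the analytic gradient up to floating-point rounding, which is why adding `lambd / m * W` to each `dW` is sufficient.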
263 | 
264 | 
265 | ```python
266 | # GRADED FUNCTION: backward_propagation_with_regularization
267 | 
268 | def backward_propagation_with_regularization(X, Y, cache, lambd):
269 |     """
270 |     Implements the backward propagation of our baseline model to which we added an L2 regularization.
271 |     
272 |     Arguments:
273 |     X -- input dataset, of shape (input size, number of examples)
274 |     Y -- "true" labels vector, of shape (output size, number of examples)
275 |     cache -- cache output from forward_propagation()
276 |     lambd -- regularization hyperparameter, scalar
277 |     
278 |     Returns:
279 |     gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
280 |     """
281 |     
282 |     m = X.shape[1]
283 |     (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
284 |     
285 |     dZ3 = A3 - Y
286 |     
287 |     ### START CODE HERE ### (approx. 1 line)
288 |     dW3 = 1./m * np.dot(dZ3, A2.T) + float(lambd) / m * W3
289 |     ### END CODE HERE ###
290 |     db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
291 |     
292 |     dA2 = np.dot(W3.T, dZ3)
293 |     dZ2 = np.multiply(dA2, np.int64(A2 > 0))
294 |     ### START CODE HERE ### (approx. 1 line)
295 |     dW2 = 1./m * np.dot(dZ2, A1.T) + float(lambd) / m * W2
296 |     ### END CODE HERE ###
297 |     db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
298 |     
299 |     dA1 = np.dot(W2.T, dZ2)
300 |     dZ1 = np.multiply(dA1, np.int64(A1 > 0))
301 |     ### START CODE HERE ### (approx. 1 line)
302 |     dW1 = 1./m * np.dot(dZ1, X.T) + float(lambd) / m * W1
303 |     ### END CODE HERE ###
304 |     db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
305 |     
306 |     gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
307 |                  "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
308 |                  "dZ1": dZ1, "dW1": dW1, "db1": db1}
309 |     
310 |     return gradients
311 | ```
312 | 
313 | 
314 | ```python
315 | X_assess, Y_assess, cache = backward_propagation_with_regularization_test_case()
316 | 
317 | grads = backward_propagation_with_regularization(X_assess, Y_assess, cache, lambd = 0.7)
318 | print ("dW1 = "+ str(grads["dW1"]))
319 | print ("dW2 = "+ str(grads["dW2"]))
320 | print ("dW3 = "+ str(grads["dW3"]))
321 | ```
322 | 
323 |     dW1 = [[-0.25604646  0.12298827 -0.28297129]
324 |      [-0.17706303  0.34536094 -0.4410571 ]]
325 |     dW2 = [[ 0.79276486  0.85133918]
326 |      [-0.0957219  -0.01720463]
327 |      [-0.13100772 -0.03750433]]
328 |     dW3 = [[-1.77691347 -0.11832879 -0.09397446]]
329 | 
330 | 
331 | **Expected Output**:
332 | 
333 | <table> 
334 |     <tr>
335 |     <td>
336 |     **dW1**
337 |     </td>
338 |         <td>
339 |     [[-0.25604646  0.12298827 -0.28297129]
340 |  [-0.17706303  0.34536094 -0.4410571 ]]
341 |     </td>
342 |     </tr>
343 |     <tr>
344 |     <td>
345 |     **dW2**
346 |     </td>
347 |         <td>
348 |     [[ 0.79276486  0.85133918]
349 |  [-0.0957219  -0.01720463]
350 |  [-0.13100772 -0.03750433]]
351 |     </td>
352 |     </tr>
353 |     <tr>
354 |     <td>
355 |     **dW3**
356 |     </td>
357 |         <td>
358 |     [[-1.77691347 -0.11832879 -0.09397446]]
359 |     </td>
360 |     </tr>
361 | </table> 
362 | 
363 | Let's now run the model with L2 regularization $(\lambda = 0.7)$. The `model()` function will call: 
364 | - `compute_cost_with_regularization` instead of `compute_cost`
365 | - `backward_propagation_with_regularization` instead of `backward_propagation`
366 | 
367 | 
368 | ```python
369 | parameters = model(train_X, train_Y, lambd = 0.7)
370 | print ("On the train set:")
371 | predictions_train = predict(train_X, train_Y, parameters)
372 | print ("On the test set:")
373 | predictions_test = predict(test_X, test_Y, parameters)
374 | ```
375 | 
376 |     Cost after iteration 0: 0.6974484493131264
377 |     Cost after iteration 10000: 0.2684918873282239
378 |     Cost after iteration 20000: 0.2680916337127301
379 | 
380 | 
381 | 
382 | ![png](output_22_1.png)
383 | 
384 | 
385 |     On the train set:
386 |     Accuracy: 0.938388625592
387 |     On the test set:
388 |     Accuracy: 0.93
389 | 
390 | 
391 | Congrats, the test set accuracy increased to 93%. You have saved the French football team!
392 | 
393 | You are not overfitting the training data anymore. Let's plot the decision boundary.
394 | 
395 | 
396 | ```python
397 | plt.title("Model with L2-regularization")
398 | axes = plt.gca()
399 | axes.set_xlim([-0.75,0.40])
400 | axes.set_ylim([-0.75,0.65])
401 | plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
402 | ```
403 | 
404 | 
405 | ![png](output_24_0.png)
406 | 
407 | 
408 | **Observations**:
409 | - The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
410 | - L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.
411 | 
412 | **What is L2-regularization actually doing?**:
413 | 
414 | L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. Large weights simply become too costly! This leads to a smoother model in which the output changes more slowly as the input changes. 
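
To make the penalty concrete, here is a minimal sketch of the extra cost term and the extra gradient term. The helper names `l2_penalty` and `l2_gradient_term` are hypothetical, and the $\frac{\lambda}{2m}$ scaling is an assumption based on a common convention; check it against the formula used earlier in this notebook.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Hypothetical helper: (lambd / (2 * m)) * sum of squared entries of all weight matrices."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient_term(W, lambd, m):
    """Hypothetical helper: derivative of the penalty above with respect to W."""
    return lambd / m * W

# The same network shape with small and with large weights (toy values)
W_small = [np.full((2, 2), 0.1)]
W_large = [np.full((2, 2), 1.0)]

penalty_small = l2_penalty(W_small, lambd=0.7, m=10)
penalty_large = l2_penalty(W_large, lambd=0.7, m=10)
print("penalty with small weights:", penalty_small)
print("penalty with large weights:", penalty_large)
```

Because the penalty grows with the square of each weight, scaling the weights by 10 multiplies the penalty by 100, which is why gradient descent is pushed toward small weights.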
415 | 
416 | <font color='blue'>
417 | **What you should remember** -- the implications of L2-regularization on:
418 | - The cost computation:
419 |     - A regularization term is added to the cost
420 | - The backpropagation function:
421 |     - There are extra terms in the gradients with respect to weight matrices
422 | - Weights end up smaller ("weight decay"): 
423 |     - Weights are pushed to smaller values.
424 | 
425 | ## 3 - Dropout
426 | 
427 | Finally, **dropout** is a widely used regularization technique that is specific to deep learning. 
428 | **It randomly shuts down some neurons in each iteration.** Watch these two videos to see what this means!
429 | 
430 | <!--
431 | To understand drop-out, consider this conversation with a friend:
432 | - Friend: "Why do you need all these neurons to train your network and classify images?". 
433 | - You: "Because each neuron contains a weight and can learn specific features/details/shape of an image. The more neurons I have, the more features my model learns!"
434 | - Friend: "I see, but are you sure that your neurons are learning different features and not all the same features?"
435 | - You: "Good point... Neurons in the same layer actually don't talk to each other. It should definitely be possible that they learn the same image features/shapes/forms/details... which would be redundant. There should be a solution."
436 | !--> 
437 | 
438 | 
439 | <center>
440 | <video width="620" height="440" src="images/dropout1_kiank.mp4" type="video/mp4" controls>
441 | </video>
442 | </center>
443 | <br>
444 | <caption><center> <u> Figure 2 </u>: Drop-out on the second hidden layer. <br> At each iteration, you shut down (= set to zero) each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$ (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration. </center></caption>
445 | 
446 | <center>
447 | <video width="620" height="440" src="images/dropout2_kiank.mp4" type="video/mp4" controls>
448 | </video>
449 | </center>
450 | 
451 | <caption><center> <u> Figure 3 </u>: Drop-out on the first and third hidden layers. <br> $1^{st}$ layer: we shut down on average 40% of the neurons.  $3^{rd}$ layer: we shut down on average 20% of the neurons. </center></caption>
452 | 
453 | 
454 | When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time. 
455 | 
456 | ### 3.1 - Forward propagation with dropout
457 | 
458 | **Exercise**: Implement the forward propagation with dropout. You are using a 3 layer neural network, and will add dropout to the first and second hidden layers. We will not apply dropout to the input layer or output layer. 
459 | 
460 | **Instructions**:
461 | You would like to shut down some neurons in the first and second layers. To do that, you are going to carry out 4 Steps:
462 | 1. In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$ using `np.random.rand()` to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}] $ of the same dimension as $A^{[1]}$.
463 | 2. Set each entry of $D^{[1]}$ to be 0 with probability (`1-keep_prob`) or 1 with probability (`keep_prob`), by thresholding values in $D^{[1]}$ appropriately. Hint: to set all the entries of a matrix X to 1 (if entry is less than 0.5) or 0 (otherwise) you would do: `X = (X < 0.5)`. Note that 0 and 1 are respectively equivalent to False and True.
464 | 3. Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. (You are shutting down some neurons). You can think of $D^{[1]}$ as a mask, so that when it is multiplied with another matrix, it shuts down some of the values.
465 | 4. Divide $A^{[1]}$ by `keep_prob`. By doing this you ensure that the cost will still have the same expected value as without dropout. (This technique is also called inverted dropout.)
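
The thresholding hint in Step 2 can be checked on a small matrix. This is only an illustrative sketch: the array sizes, the seed, and the constant activation value are arbitrary assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.RandomState(0)      # arbitrary seed, for reproducibility only
keep_prob = 0.7

X = rng.rand(2, 4)                  # uniform random numbers in [0, 1)
mask = X < keep_prob                # boolean matrix: True where the entry is below keep_prob

A = np.full((2, 4), 3.0)            # stand-in for an activation matrix
A_masked = A * mask                 # True acts as 1 and False as 0 in the product

print(mask)
print(A_masked)
```

Entries of `A_masked` are either 0 (neuron shut down) or the original activation (neuron kept); dividing by `keep_prob` afterwards is Step 4.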
466 | 
467 | 
468 | ```python
469 | # GRADED FUNCTION: forward_propagation_with_dropout
470 | 
471 | def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
472 |     """
473 |     Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
474 |     
475 |     Arguments:
476 |     X -- input dataset, of shape (2, number of examples)
477 |     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
478 |                     W1 -- weight matrix of shape (20, 2)
479 |                     b1 -- bias vector of shape (20, 1)
480 |                     W2 -- weight matrix of shape (3, 20)
481 |                     b2 -- bias vector of shape (3, 1)
482 |                     W3 -- weight matrix of shape (1, 3)
483 |                     b3 -- bias vector of shape (1, 1)
484 |     keep_prob -- probability of keeping a neuron active during drop-out, scalar
485 |     
486 |     Returns:
487 |     A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
488 |     cache -- tuple, information stored for computing the backward propagation
489 |     """
490 |     
491 |     np.random.seed(1)
492 |     
493 |     # retrieve parameters
494 |     W1 = parameters["W1"]
495 |     b1 = parameters["b1"]
496 |     W2 = parameters["W2"]
497 |     b2 = parameters["b2"]
498 |     W3 = parameters["W3"]
499 |     b3 = parameters["b3"]
500 |     
501 |     # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
502 |     Z1 = np.dot(W1, X) + b1
503 |     A1 = relu(Z1)
504 |     ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
505 |     D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
506 |     D1 = D1 < keep_prob                               # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
507 |     A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
508 |     A1 = A1 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
509 |     ### END CODE HERE ###
510 |     Z2 = np.dot(W2, A1) + b2
511 |     A2 = relu(Z2)
512 |     ### START CODE HERE ### (approx. 4 lines)
513 |     D2 = np.random.rand(A2.shape[0], A2.shape[1])     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
514 |     D2 = D2 < keep_prob                               # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
515 |     A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
516 |     A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
517 |     ### END CODE HERE ###
518 |     Z3 = np.dot(W3, A2) + b3
519 |     A3 = sigmoid(Z3)
520 |     
521 |     cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
522 |     
523 |     return A3, cache
524 | ```
525 | 
526 | 
527 | ```python
528 | X_assess, parameters = forward_propagation_with_dropout_test_case()
529 | 
530 | A3, cache = forward_propagation_with_dropout(X_assess, parameters, keep_prob = 0.7)
531 | print ("A3 = " + str(A3))
532 | ```
533 | 
534 |     A3 = [[ 0.36974721  0.00305176  0.04565099  0.49683389  0.36974721]]
535 | 
536 | 
537 | **Expected Output**: 
538 | 
539 | <table> 
540 |     <tr>
541 |     <td>
542 |     **A3**
543 |     </td>
544 |         <td>
545 |     [[ 0.36974721  0.00305176  0.04565099  0.49683389  0.36974721]]
546 |     </td>
547 |     
548 |     </tr>
549 | 
550 | </table> 
551 | 
552 | ### 3.2 - Backward propagation with dropout
553 | 
554 | **Exercise**: Implement the backward propagation with dropout. As before, you are training a 3 layer network. Add dropout to the first and second hidden layers, using the masks $D^{[1]}$ and $D^{[2]}$ stored in the cache. 
555 | 
556 | **Instruction**:
557 | Backpropagation with dropout is actually quite easy. You will have to carry out 2 Steps:
558 | 1. You had previously shut down some neurons during forward propagation, by applying a mask $D^{[1]}$ to `A1`. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[1]}$ to `dA1`. 
559 | 2. During forward propagation, you had divided `A1` by `keep_prob`. In backpropagation, you'll therefore have to divide `dA1` by `keep_prob` again (the calculus interpretation is that if $A^{[1]}$ is scaled by `1/keep_prob`, then its derivative $dA^{[1]}$ is also scaled by the same `1/keep_prob`).
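
The two steps follow from the chain rule. As a short sketch, write $\tilde{A}^{[1]}$ for the activation before dropout (this symbol is introduced here for illustration only and is not used elsewhere in the notebook):

$$A^{[1]} = \frac{D^{[1]} * \tilde{A}^{[1]}}{keep\_prob} \quad \Longrightarrow \quad d\tilde{A}^{[1]} = \frac{D^{[1]} * dA^{[1]}}{keep\_prob}$$

Each entry of $A^{[1]}$ depends only on the matching entry of $\tilde{A}^{[1]}$, with slope $D^{[1]} / keep\_prob$, so the incoming gradient is masked and rescaled in exactly the same way as the activation was.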
560 | 
561 | 
562 | 
563 | ```python
564 | # GRADED FUNCTION: backward_propagation_with_dropout
565 | 
566 | def backward_propagation_with_dropout(X, Y, cache, keep_prob):
567 |     """
568 |     Implements the backward propagation of our baseline model to which we added dropout.
569 |     
570 |     Arguments:
571 |     X -- input dataset, of shape (2, number of examples)
572 |     Y -- "true" labels vector, of shape (output size, number of examples)
573 |     cache -- cache output from forward_propagation_with_dropout()
574 |     keep_prob -- probability of keeping a neuron active during drop-out, scalar
575 |     
576 |     Returns:
577 |     gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
578 |     """
579 |     
580 |     m = X.shape[1]
581 |     (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
582 |     
583 |     dZ3 = A3 - Y
584 |     dW3 = 1./m * np.dot(dZ3, A2.T)
585 |     db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
586 |     dA2 = np.dot(W3.T, dZ3)
587 |     ### START CODE HERE ### (≈ 2 lines of code)
588 |     dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
589 |     dA2 = dA2 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
590 |     ### END CODE HERE ###
591 |     dZ2 = np.multiply(dA2, np.int64(A2 > 0))
592 |     dW2 = 1./m * np.dot(dZ2, A1.T)
593 |     db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
594 |     
595 |     dA1 = np.dot(W2.T, dZ2)
596 |     ### START CODE HERE ### (≈ 2 lines of code)
597 |     dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
598 |     dA1 = dA1 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
599 |     ### END CODE HERE ###
600 |     dZ1 = np.multiply(dA1, np.int64(A1 > 0))
601 |     dW1 = 1./m * np.dot(dZ1, X.T)
602 |     db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
603 |     
604 |     gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
605 |                  "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
606 |                  "dZ1": dZ1, "dW1": dW1, "db1": db1}
607 |     
608 |     return gradients
609 | ```
610 | 
611 | 
612 | ```python
613 | X_assess, Y_assess, cache = backward_propagation_with_dropout_test_case()
614 | 
615 | gradients = backward_propagation_with_dropout(X_assess, Y_assess, cache, keep_prob = 0.8)
616 | 
617 | print ("dA1 = " + str(gradients["dA1"]))
618 | print ("dA2 = " + str(gradients["dA2"]))
619 | ```
620 | 
621 |     dA1 = [[ 0.36544439  0.         -0.00188233  0.         -0.17408748]
622 |      [ 0.65515713  0.         -0.00337459  0.         -0.        ]]
623 |     dA2 = [[ 0.58180856  0.         -0.00299679  0.         -0.27715731]
624 |      [ 0.          0.53159854 -0.          0.53159854 -0.34089673]
625 |      [ 0.          0.         -0.00292733  0.         -0.        ]]
626 | 
627 | 
628 | **Expected Output**: 
629 | 
630 | <table> 
631 |     <tr>
632 |     <td>
633 |     **dA1**
634 |     </td>
635 |         <td>
636 |     [[ 0.36544439  0.         -0.00188233  0.         -0.17408748]
637 |  [ 0.65515713  0.         -0.00337459  0.         -0.        ]]
638 |     </td>
639 |     
640 |     </tr>
641 |     <tr>
642 |     <td>
643 |     **dA2**
644 |     </td>
645 |         <td>
646 |     [[ 0.58180856  0.         -0.00299679  0.         -0.27715731]
647 |  [ 0.          0.53159854 -0.          0.53159854 -0.34089673]
648 |  [ 0.          0.         -0.00292733  0.         -0.        ]]
649 |     </td>
650 |     
651 |     </tr>
652 | </table> 
653 | 
654 | Let's now run the model with dropout (`keep_prob = 0.86`). This means that at every iteration you shut down each neuron of layers 1 and 2 with 14% probability. The function `model()` will now call:
655 | - `forward_propagation_with_dropout` instead of `forward_propagation`.
656 | - `backward_propagation_with_dropout` instead of `backward_propagation`.
657 | 
658 | 
659 | ```python
660 | parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)
661 | 
662 | print ("On the train set:")
663 | predictions_train = predict(train_X, train_Y, parameters)
664 | print ("On the test set:")
665 | predictions_test = predict(test_X, test_Y, parameters)
666 | ```
667 | 
668 |     Cost after iteration 0: 0.6543912405149825
669 | 
670 | 
671 |     /home/jovyan/work/week5/Regularization/reg_utils.py:236: RuntimeWarning: divide by zero encountered in log
672 |       logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
673 |     /home/jovyan/work/week5/Regularization/reg_utils.py:236: RuntimeWarning: invalid value encountered in multiply
674 |       logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
675 | 
676 | 
677 |     Cost after iteration 10000: 0.06101698657490559
678 |     Cost after iteration 20000: 0.060582435798513114
679 | 
680 | 
681 | 
682 | ![png](output_35_3.png)
683 | 
684 | 
685 |     On the train set:
686 |     Accuracy: 0.928909952607
687 |     On the test set:
688 |     Accuracy: 0.95
689 | 
690 | 
691 | Dropout works great! The test accuracy has increased again (to 95%)! Your model is not overfitting the training set and does a great job on the test set. The French football team will be forever grateful to you! 
692 | 
693 | Run the code below to plot the decision boundary.
694 | 
695 | 
696 | ```python
697 | plt.title("Model with dropout")
698 | axes = plt.gca()
699 | axes.set_xlim([-0.75,0.40])
700 | axes.set_ylim([-0.75,0.65])
701 | plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
702 | ```
703 | 
704 | 
705 | ![png](output_37_0.png)
706 | 
707 | 
708 | **Note**:
709 | - A **common mistake** when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training. 
710 | - Deep learning frameworks like [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](http://doc.paddlepaddle.org/release_doc/0.9.0/doc/ui/api/trainer_config_helpers/attrs.html), [keras](https://keras.io/layers/core/#dropout) or [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks.
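
One way to avoid the train/test mistake is to make the mode explicit. The sketch below is a hypothetical helper (`dropout_layer` and its `training` flag are assumptions for illustration, not part of this assignment or of any framework's API):

```python
import numpy as np

def dropout_layer(A, keep_prob, training, rng):
    """Hypothetical helper: inverted dropout in training, identity at test time."""
    if not training:
        return A                              # test time: keep every neuron, no rescaling
    D = rng.rand(*A.shape) < keep_prob        # random mask, True with probability keep_prob
    return A * D / keep_prob                  # shut down neurons, rescale the survivors

rng = np.random.RandomState(1)                # arbitrary seed
A = np.arange(6, dtype=float).reshape(2, 3)

train_out = dropout_layer(A, keep_prob=0.5, training=True, rng=rng)
test_out = dropout_layer(A, keep_prob=0.5, training=False, rng=rng)
print(train_out)
print(test_out)
```

At test time the activations pass through unchanged, which is only valid because the training path already rescaled by `keep_prob` (inverted dropout).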
711 | 
712 | <font color='blue'>
713 | **What you should remember about dropout:**
714 | - Dropout is a regularization technique.
715 | - You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
716 | - Apply dropout both during forward and backward propagation.
717 | - During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5.  
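
The expected-value claim can be checked numerically. This is a rough Monte Carlo sketch with arbitrary sizes, seed, and activation value, not a proof:

```python
import numpy as np

rng = np.random.RandomState(42)               # arbitrary seed
A = np.full((50, 20000), 2.0)                 # constant activations with mean 2.0

means = {}
for keep_prob in (0.5, 0.8, 0.86):
    D = rng.rand(*A.shape) < keep_prob        # dropout mask
    A_dropped = A * D / keep_prob             # inverted dropout
    means[keep_prob] = A_dropped.mean()
    print("keep_prob =", keep_prob, "-> mean activation", means[keep_prob])
```

For every `keep_prob` the mean after dropout should stay close to the original mean of 2.0, up to sampling noise.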
718 | 
719 | ## 4 - Conclusions
720 | 
721 | **Here are the results of our three models**: 
722 | 
723 | <table> 
724 |     <tr>
725 |         <td>
726 |         **model**
727 |         </td>
728 |         <td>
729 |         **train accuracy**
730 |         </td>
731 |         <td>
732 |         **test accuracy**
733 |         </td>
734 |     </tr>
735 |     <tr>
736 |         <td>
737 |         3-layer NN without regularization
738 |         </td>
739 |         <td>
740 |         95%
741 |         </td>
742 |         <td>
743 |         91.5%
744 |         </td>
745 |     </tr><tr>
746 |         <td>
747 |         3-layer NN with L2-regularization
748 |         </td>
749 |         <td>
750 |         94%
751 |         </td>
752 |         <td>
753 |         93%
754 |         </td>
755 |     </tr>
756 |     <tr>
757 |         <td>
758 |         3-layer NN with dropout
759 |         </td>
760 |         <td>
761 |         93%
762 |         </td>
763 |         <td>
764 |         95%
765 |         </td>
766 |     </tr>
767 | </table> 
768 | 
769 | Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. 
770 | 
771 | Congratulations on finishing this assignment! And also on revolutionizing French football. :-) 
772 | 
773 | <font color='blue'>
774 | **What we want you to remember from this notebook**:
775 | - Regularization will help you reduce overfitting.
776 | - Regularization will drive your weights to lower values.
777 | - L2 regularization and Dropout are two very effective regularization techniques.
778 | 


--------------------------------------------------------------------------------
/Week 3 PA 1/Tensorflow+Tutorial.py:
--------------------------------------------------------------------------------
   1 | 
   2 | # coding: utf-8
   3 | 
   4 | # # TensorFlow Tutorial
   5 | # 
   6 | # Welcome to this week's programming assignment. Until now, you've always used numpy to build neural networks. Now we will step you through a deep learning framework that will allow you to build neural networks more easily. Machine learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can speed up your machine learning development significantly. All of these frameworks also have a lot of documentation, which you should feel free to read. In this assignment, you will learn to do the following in TensorFlow: 
   7 | # 
   8 | # - Initialize variables
   9 | # - Start your own session
  10 | # - Train algorithms 
  11 | # - Implement a Neural Network
  12 | # 
  13 | # Programming frameworks can not only shorten your coding time, but sometimes also perform optimizations that speed up your code. 
  14 | # 
  15 | # ## 1 - Exploring the Tensorflow Library
  16 | # 
  17 | # To start, you will import the library:
  18 | # 
  19 | 
  20 | # In[1]:
  21 | 
  22 | import math
  23 | import numpy as np
  24 | import h5py
  25 | import matplotlib.pyplot as plt
  26 | import tensorflow as tf
  27 | from tensorflow.python.framework import ops
  28 | from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict
  29 | 
  30 | get_ipython().magic('matplotlib inline')
  31 | np.random.seed(1)
  32 | 
  33 | 
  34 | # Now that you have imported the library, we will walk you through its different applications. You will start with an example, where we compute for you the loss of one training example. 
  35 | # $$loss = \mathcal{L}(\hat{y}, y) = (\hat y^{(i)} - y^{(i)})^2 \tag{1}$$
  36 | 
  37 | # In[2]:
  38 | 
  39 | y_hat = tf.constant(36, name='y_hat')            # Define y_hat constant. Set to 36.
  40 | y = tf.constant(39, name='y')                    # Define y. Set to 39
  41 | 
  42 | loss = tf.Variable((y - y_hat)**2, name='loss')  # Create a variable for the loss
  43 | 
  44 | init = tf.global_variables_initializer()         # When init is run later (session.run(init)),
  45 |                                                  # the loss variable will be initialized and ready to be computed
  46 | with tf.Session() as session:                    # Create a session and print the output
  47 |     session.run(init)                            # Initializes the variables
  48 |     print(session.run(loss))                     # Prints the loss
  49 | 
  50 | 
  51 | # Writing and running programs in TensorFlow has the following steps:
  52 | # 
  53 | # 1. Create Tensors (variables) that are not yet executed/evaluated. 
  54 | # 2. Write operations between those Tensors.
  55 | # 3. Initialize your Tensors. 
  56 | # 4. Create a Session. 
  57 | # 5. Run the Session. This will run the operations you'd written above. 
  58 | # 
  59 | # Therefore, when we created a variable for the loss, we simply defined the loss as a function of other quantities, but did not evaluate its value. To evaluate it, we had to create `init = tf.global_variables_initializer()` and run it with `session.run(init)`. That initialized the loss variable, and in the last line we were finally able to evaluate the value of `loss` and print its value.
  60 | # 
  61 | # Now let us look at an easy example. Run the cell below:
  62 | 
  63 | # In[3]:
  64 | 
  65 | a = tf.constant(2)
  66 | b = tf.constant(10)
  67 | c = tf.multiply(a,b)
  68 | print(c)
  69 | 
  70 | 
  71 | # As expected, you will not see 20! You got a tensor saying that the result is a tensor that does not have the shape attribute, and is of type "int32". All you did was put in the 'computation graph', but you have not run this computation yet. In order to actually multiply the two numbers, you will have to create a session and run it.
  72 | 
  73 | # In[4]:
  74 | 
  75 | sess = tf.Session()
  76 | print(sess.run(c))
  77 | 
  78 | 
  79 | # Great! To summarize, **remember to initialize your variables, create a session and run the operations inside the session**. 
  80 | # 
  81 | # Next, you'll also have to know about placeholders. A placeholder is an object whose value you can specify only later. 
  82 | # To specify values for a placeholder, you can pass in values by using a "feed dictionary" (`feed_dict` variable). Below, we created a placeholder for x. This allows us to pass in a number later when we run the session. 
  83 | 
  84 | # In[5]:
  85 | 
  86 | # Change the value of x in the feed_dict
  87 | 
  88 | x = tf.placeholder(tf.int64, name = 'x')
  89 | print(sess.run(2 * x, feed_dict = {x: 3}))
  90 | sess.close()
  91 | 
  92 | 
  93 | # When you first defined `x` you did not have to specify a value for it. A placeholder is simply a variable that you will assign data to only later, when running the session. We say that you **feed data** to these placeholders when running the session. 
  94 | # 
  95 | # Here's what's happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.
  96 | 
  97 | # ### 1.1 - Linear function
  98 | # 
  99 | # Let's start this programming exercise by computing the following equation: $Y = WX + b$, where $W$ and $X$ are random matrices and b is a random vector. 
 100 | # 
 101 | # **Exercise**: Compute $WX + b$ where $W, X$, and $b$ are drawn from a random normal distribution. W is of shape (4, 3), X is (3,1) and b is (4,1). As an example, here is how you would define a constant X that has shape (3,1):
 102 | # ```python
 103 | # X = tf.constant(np.random.randn(3,1), name = "X")
 104 | # 
 105 | # ```
 106 | # You might find the following functions helpful: 
 107 | # - tf.matmul(..., ...) to do a matrix multiplication
 108 | # - tf.add(..., ...) to do an addition
 109 | # - np.random.randn(...) to initialize randomly
 110 | # 
 111 | 
 112 | # In[6]:
 113 | 
 114 | # GRADED FUNCTION: linear_function
 115 | 
 116 | def linear_function():
 117 |     """
 118 |     Implements a linear function: 
 119 |             Initializes W to be a random tensor of shape (4,3)
 120 |             Initializes X to be a random tensor of shape (3,1)
 121 |             Initializes b to be a random tensor of shape (4,1)
 122 |     Returns: 
 123 |     result -- runs the session for Y = WX + b 
 124 |     """
 125 |     
 126 |     np.random.seed(1)
 127 |     
 128 |     ### START CODE HERE ### (4 lines of code)
 129 |     X = tf.constant(np.random.randn(3,1), name = "X")
 130 |     W = tf.constant(np.random.randn(4,3), name = "W")
 131 |     b = tf.constant(np.random.randn(4,1), name = "b")
 132 |     Y = tf.add(tf.matmul(W, X), b) 
 133 |     ### END CODE HERE ### 
 134 |     
 135 |     # Create the session using tf.Session() and run it with sess.run(...) on the variable you want to calculate
 136 |     
 137 |     ### START CODE HERE ###
 138 |     sess = tf.Session()
 139 |     result = sess.run(Y)
 140 |     ### END CODE HERE ### 
 141 |     
 142 |     # close the session 
 143 |     sess.close()
 144 | 
 145 |     return result
 146 | 
 147 | 
 148 | # In[7]:
 149 | 
 150 | print( "result = " + str(linear_function()))
 151 | 
 152 | 
 153 | # *** Expected Output ***: 
 154 | # 
 155 | # <table> 
 156 | # <tr> 
 157 | # <td>
 158 | # **result**
 159 | # </td>
 160 | # <td>
 161 | # [[-2.15657382]
 162 | #  [ 2.95891446]
 163 | #  [-1.08926781]
 164 | #  [-0.84538042]]
 165 | # </td>
 166 | # </tr> 
 167 | # 
 168 | # </table> 
 169 | 
 170 | # ### 1.2 - Computing the sigmoid 
 171 | # Great! You just implemented a linear function. Tensorflow offers a variety of commonly used neural network functions like `tf.sigmoid` and `tf.softmax`. For this exercise, let's compute the sigmoid function of an input. 
 172 | # 
 173 | # You will do this exercise using a placeholder variable `x`. When running the session, you should use the feed dictionary to pass in the input `z`. In this exercise, you will have to (i) create a placeholder `x`, (ii) define the operations needed to compute the sigmoid using `tf.sigmoid`, and then (iii) run the session. 
 174 | # 
 175 | # ** Exercise **: Implement the sigmoid function below. You should use the following: 
 176 | # 
 177 | # - `tf.placeholder(tf.float32, name = "...")`
 178 | # - `tf.sigmoid(...)`
 179 | # - `sess.run(..., feed_dict = {x: z})`
 180 | # 
 181 | # 
 182 | # Note that there are two typical ways to create and use sessions in tensorflow: 
 183 | # 
 184 | # **Method 1:**
 185 | # ```python
 186 | # sess = tf.Session()
 187 | # # Run the variables initialization (if needed), run the operations
 188 | # result = sess.run(..., feed_dict = {...})
 189 | # sess.close() # Close the session
 190 | # ```
 191 | # **Method 2:**
 192 | # ```python
 193 | # with tf.Session() as sess: 
 194 | #     # run the variables initialization (if needed), run the operations
 195 | #     result = sess.run(..., feed_dict = {...})
 196 | #     # This takes care of closing the session for you :)
 197 | # ```
 198 | # 
 199 | 
 200 | # In[8]:
 201 | 
 202 | # GRADED FUNCTION: sigmoid
 203 | 
 204 | def sigmoid(z):
 205 |     """
 206 |     Computes the sigmoid of z
 207 |     
 208 |     Arguments:
 209 |     z -- input value, scalar or vector
 210 |     
 211 |     Returns: 
 212 |     result -- the sigmoid of z
 213 |     """
 214 |     
 215 |     ### START CODE HERE ### ( approx. 4 lines of code)
 216 |     # Create a placeholder for x. Name it 'x'.
 217 |     x = tf.placeholder(tf.float32, name = "x")
 218 | 
 219 |     # compute sigmoid(x)
 220 |     sigmoid = tf.sigmoid(x)
 221 | 
 222 |     # Create a session, and run it. Please use the method 2 explained above. 
 223 |     # You should use a feed_dict to pass z's value to x. 
 224 |     with tf.Session() as sess:
 225 |         # Run session and call the output "result"
 226 |         result = sess.run(sigmoid, feed_dict = {x: z})
 227 |     
 228 |     ### END CODE HERE ###
 229 |     
 230 |     return result
 231 | 
 232 | 
 233 | # In[9]:
 234 | 
 235 | print ("sigmoid(0) = " + str(sigmoid(0)))
 236 | print ("sigmoid(12) = " + str(sigmoid(12)))
 237 | 
 238 | 
 239 | # *** Expected Output ***: 
 240 | # 
 241 | # <table> 
 242 | # <tr> 
 243 | # <td>
 244 | # **sigmoid(0)**
 245 | # </td>
 246 | # <td>
 247 | # 0.5
 248 | # </td>
 249 | # </tr>
 250 | # <tr> 
 251 | # <td>
 252 | # **sigmoid(12)**
 253 | # </td>
 254 | # <td>
 255 | # 0.999994
 256 | # </td>
 257 | # </tr> 
 258 | # 
 259 | # </table> 
 260 | 
 261 | # <font color='blue'>
 262 | # **To summarize, you now know how to**:
 263 | # 1. Create placeholders
 264 | # 2. Specify the computation graph corresponding to operations you want to compute
 265 | # 3. Create the session
 266 | # 4. Run the session, using a feed dictionary if necessary to specify placeholder variables' values. 
 267 | 
 268 | # ### 1.3 -  Computing the Cost
 269 | # 
 270 | # You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of $a^{[2](i)}$ and $y^{(i)}$ for i=1...m: 
 271 | # $$ J = - \frac{1}{m}  \sum_{i = 1}^m  \large ( \small y^{(i)} \log a^{ [2] (i)} + (1-y^{(i)})\log (1-a^{ [2] (i)} )\large )\small\tag{2}$$
 272 | # 
 273 | # you can do it in one line of code in tensorflow!
 274 | # 
 275 | # **Exercise**: Implement the cross entropy loss. The function you will use is: 
 276 | # 
 277 | # 
 278 | # - `tf.nn.sigmoid_cross_entropy_with_logits(logits = ...,  labels = ...)`
 279 | # 
 280 | # Your code should input `z`, compute the sigmoid (to get `a`) and then compute the cross entropy cost $J$. All this can be done using one call to `tf.nn.sigmoid_cross_entropy_with_logits`, which computes
 281 | # 
 282 | # $$- \frac{1}{m}  \sum_{i = 1}^m  \large ( \small y^{(i)} \log \sigma(z^{[2](i)}) + (1-y^{(i)})\log (1-\sigma(z^{[2](i)}))\large )\small\tag{2}$$
 283 | # 
 284 | # 
 285 | 
 286 | # In[10]:
 287 | 
 288 | # GRADED FUNCTION: cost
 289 | 
 290 | def cost(logits, labels):
 291 |     """
 292 |     Computes the cost using the sigmoid cross entropy
 293 |     
 294 |     Arguments:
 295 |     logits -- vector containing z, output of the last linear unit (before the final sigmoid activation)
 296 |     labels -- vector of labels y (1 or 0) 
 297 |     
 298 |     Note: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" 
 299 |     in the TensorFlow documentation. So logits will feed into z, and labels into y. 
 300 |     
 301 |     Returns:
 302 |     cost -- runs the session of the cost (formula (2))
 303 |     """
 304 |     
 305 |     ### START CODE HERE ### 
 306 |     
 307 |     # Create the placeholders for "logits" (z) and "labels" (y) (approx. 2 lines)
 308 |     z = tf.placeholder(tf.float32, name = "logits")
 309 |     y = tf.placeholder(tf.float32, name = "labels")
 310 |     
 311 |     # Use the loss function (approx. 1 line)
 312 |     cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = z, labels = y)
 313 |     
 314 |     # Create a session (approx. 1 line). See method 1 above.
 315 |     sess = tf.Session()
 316 |     
 317 |     # Run the session (approx. 1 line).
 318 |     cost = sess.run(cost, feed_dict = {z: logits, y: labels})
 319 |     
 320 |     # Close the session (approx. 1 line). See method 1 above.
 321 |     sess.close()
 322 |     
 323 |     ### END CODE HERE ###
 324 |     
 325 |     return cost
 326 | 
 327 | 
 328 | # In[11]:
 329 | 
 330 | logits = sigmoid(np.array([0.2,0.4,0.7,0.9]))
 331 | cost_value = cost(logits, np.array([0,0,1,1]))   # separate name so the cost() function is not overwritten on re-run
 332 | print ("cost = " + str(cost_value))
 333 | 
 334 | 
 335 | # **Expected Output**: 
 336 | # 
 337 | # <table> 
 338 | #     <tr> 
 339 | #         <td>
 340 | #             **cost**
 341 | #         </td>
 342 | #         <td>
 343 | #         [ 1.00538719  1.03664088  0.41385433  0.39956614]
 344 | #         </td>
 345 | #     </tr>
 346 | # 
 347 | # </table>
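# As a framework-free cross-check, the sketch below recomputes the expected values in the table above using only Python's standard library. The helper names (`sigmoid`, `sigmoid_cross_entropy`) are illustrative and not part of the assignment; the stable formula `max(z, 0) - z*y + log(1 + exp(-|z|))` is algebraically the same as the per-example term of formula (2). Note that, as in the test cell, the "logits" passed in are themselves sigmoid outputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(z, y):
    """Per-example loss -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Same inputs as the test cell: sigmoid outputs are fed in as logits.
logits = [sigmoid(v) for v in (0.2, 0.4, 0.7, 0.9)]
labels = [0, 0, 1, 1]

# One loss value per example, like the vector shown in the expected output.
per_example = [sigmoid_cross_entropy(z, y) for z, y in zip(logits, labels)]

# Averaging over the m examples gives the scalar cost J of formula (2).
mean_cost = sum(per_example) / len(per_example)

print(per_example)
print(mean_cost)
```

# The per-example list should agree with the expected output to within float32 rounding; the final average is the extra step needed to obtain the scalar cost $J$.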
 348 | 
 349 | # ### 1.4 - Using One Hot encodings
 350 | # 
 351 | # Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows:
 352 | # 
 353 | # 
 354 | # <img src="images/onehot.png" style="width:600px;height:150px;">
 355 | # 
 356 | # This is called a "one hot" encoding, because in the converted representation exactly one element of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code: 
 357 | # 
 358 | # - tf.one_hot(labels, depth, axis) 
 359 | # 
 360 | # **Exercise:** Implement the function below to take one vector of labels and the total number of classes $C$, and return the one hot encoding. Use `tf.one_hot()` to do this. 
 361 | 
 362 | # In[17]:
 363 | 
 364 | # GRADED FUNCTION: one_hot_matrix
 365 | 
 366 | def one_hot_matrix(labels, C):
 367 |     """
 368 |     Creates a matrix where the i-th row corresponds to the i-th class number and the j-th column
 369 |                      corresponds to the j-th training example. So if example j has label i, then entry (i,j) 
 370 |                      will be 1. 
 371 |                      
 372 |     Arguments:
 373 |     labels -- vector containing the labels 
 374 |     C -- number of classes, the depth of the one hot dimension
 375 |     
 376 |     Returns: 
 377 |     one_hot -- one hot matrix
 378 |     """
 379 |     
 380 |     ### START CODE HERE ###
 381 |     
 382 |     # Create a tf.constant equal to C (depth), name it 'C'. (approx. 1 line)
 383 |     depth = tf.constant(C, name = "C")
 384 |     
 385 |     # Use tf.one_hot, be careful with the axis (approx. 1 line)
 386 |     one_hot_matrix = tf.one_hot(labels, depth, axis = 0)
 387 |     
 388 |     # Create the session (approx. 1 line)
 389 |     sess = tf.Session()
 390 |     
 391 |     # Run the session (approx. 1 line)
 392 |     one_hot = sess.run(one_hot_matrix)
 393 |     
 394 |     # Close the session (approx. 1 line). See method 1 above.
 395 |     sess.close()
 396 |     
 397 |     ### END CODE HERE ###
 398 |     
 399 |     return one_hot
 400 | 
 401 | 
 402 | # In[18]:
 403 | 
 404 | labels = np.array([1,2,3,0,2,1])
 405 | one_hot = one_hot_matrix(labels, C = 4)
 406 | print ("one_hot = " + str(one_hot))
 407 | 
 408 | 
 409 | # **Expected Output**: 
 410 | # 
 411 | # <table> 
 412 | #     <tr> 
 413 | #         <td>
 414 | #             **one_hot**
 415 | #         </td>
 416 | #         <td>
 417 | #         [[ 0.  0.  0.  1.  0.  0.]
 418 | #  [ 1.  0.  0.  0.  0.  1.]
 419 | #  [ 0.  1.  0.  0.  1.  0.]
 420 | #  [ 0.  0.  1.  0.  0.  0.]]
 421 | #         </td>
 422 | #     </tr>
 423 | # 
 424 | # </table>
 425 | # 
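# To see what `axis = 0` means here, the following standard-library sketch builds the same matrix by hand. The function name `one_hot_columns` is hypothetical; it only illustrates the layout (classes along rows, examples along columns) and is not a replacement for `tf.one_hot`.

```python
def one_hot_columns(labels, num_classes):
    """Return a num_classes x len(labels) matrix as a list of rows.

    Entry (i, j) is 1.0 when example j has label i, which is the layout
    produced by tf.one_hot(labels, depth, axis=0).
    """
    return [[1.0 if label == i else 0.0 for label in labels]
            for i in range(num_classes)]

# Same labels and number of classes as the test cell above.
matrix = one_hot_columns([1, 2, 3, 0, 2, 1], 4)
for row in matrix:
    print(row)
```

# Each column contains exactly one 1.0, in the row whose index equals that example's label.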
 426 | 
 427 | # ### 1.5 - Initialize with zeros and ones
 428 | # 
 429 | # Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is `tf.ones()`. To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array of dimension shape full of zeros and ones respectively. 
 430 | # 
 431 | # **Exercise:** Implement the function below to take in a shape and return an array of ones with that shape. 
 432 | # 
 433 | #  - tf.ones(shape)
 434 | # 
 435 | 
 436 | # In[19]:
 437 | 
 438 | # GRADED FUNCTION: ones
 439 | 
 440 | def ones(shape):
 441 |     """
 442 |     Creates an array of ones of dimension shape
 443 |     
 444 |     Arguments:
 445 |     shape -- shape of the array you want to create
 446 |         
 447 |     Returns: 
 448 |     ones -- array containing only ones
 449 |     """
 450 |     
 451 |     ### START CODE HERE ###
 452 |     
 453 |     # Create "ones" tensor using tf.ones(...). (approx. 1 line)
 454 |     ones = tf.ones(shape)
 455 |     
 456 |     # Create the session (approx. 1 line)
 457 |     sess = tf.Session()
 458 |     
 459 |     # Run the session to compute 'ones' (approx. 1 line)
 460 |     ones = sess.run(ones)
 461 |     
 462 |     # Close the session (approx. 1 line). See method 1 above.
 463 |     sess.close()
 464 |     
 465 |     ### END CODE HERE ###
 466 |     return ones
 467 | 
 468 | 
 469 | # In[20]:
 470 | 
 471 | print ("ones = " + str(ones([3])))
 472 | 
 473 | 
 474 | # **Expected Output:**
 475 | # 
 476 | # <table> 
 477 | #     <tr> 
 478 | #         <td>
 479 | #             **ones**
 480 | #         </td>
 481 | #         <td>
 482 | #         [ 1.  1.  1.]
 483 | #         </td>
 484 | #     </tr>
 485 | # 
 486 | # </table>
 487 | 
 488 | # # 2 - Building your first neural network in tensorflow
 489 | # 
 490 | # In this part of the assignment you will build a neural network using tensorflow. Remember that there are two parts to implement a tensorflow model:
 491 | # 
 492 | # - Create the computation graph
 493 | # - Run the graph
 494 | # 
 495 | # Let's delve into the problem you'd like to solve!
 496 | # 
 497 | # ### 2.0 - Problem statement: SIGNS Dataset
 498 | # 
 499 | # One afternoon, with some friends we decided to teach our computers to decipher sign language. We spent a few hours taking pictures in front of a white wall and came up with the following dataset. It's now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn't understand sign language.
 500 | # 
 501 | # - **Training set**: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number).
 502 | # - **Test set**: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number).
 503 | # 
 504 | # Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs.
 505 | # 
 506 | # Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.
 507 | # <img src="images/hands.png" style="width:800px;height:350px;"><caption><center> <u><font color='purple'> **Figure 1**</u><font color='purple'>: SIGNS dataset <br> <font color='black'> </center>
 508 | # 
 509 | # 
 510 | # Run the following code to load the dataset.
 511 | 
 512 | # In[21]:
 513 | 
 514 | # Loading the dataset
 515 | X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
 516 | 
 517 | 
 518 | # Change the index below and run the cell to visualize some examples in the dataset.
 519 | 
 520 | # In[22]:
 521 | 
 522 | # Example of a picture
 523 | index = 0
 524 | plt.imshow(X_train_orig[index])
 525 | print ("y = " + str(np.squeeze(Y_train_orig[:, index])))
 526 | 
 527 | 
 528 | # As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so.
 529 | 
 530 | # In[23]:
 531 | 
 532 | # Flatten the training and test images
 533 | X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T
 534 | X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T
 535 | # Normalize image vectors
 536 | X_train = X_train_flatten/255.
 537 | X_test = X_test_flatten/255.
 538 | # Convert training and test labels to one hot matrices
 539 | Y_train = convert_to_one_hot(Y_train_orig, 6)
 540 | Y_test = convert_to_one_hot(Y_test_orig, 6)
 541 | 
 542 | print ("number of training examples = " + str(X_train.shape[1]))
 543 | print ("number of test examples = " + str(X_test.shape[1]))
 544 | print ("X_train shape: " + str(X_train.shape))
 545 | print ("Y_train shape: " + str(Y_train.shape))
 546 | print ("X_test shape: " + str(X_test.shape))
 547 | print ("Y_test shape: " + str(Y_test.shape))
 548 | 
 549 | 
 550 | # **Note** that 12288 comes from $64 \times 64 \times 3$. Each image is square, 64 by 64 pixels, and 3 is for the RGB colors. Please make sure all these shapes make sense to you before continuing.
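# The shape arithmetic can be checked on a single made-up image. This is a minimal sketch using nested Python lists instead of numpy arrays; the all-white image and the helper `flatten_image` are assumptions for illustration only.

```python
def flatten_image(image):
    """Flatten a height x width x channels nested list into one flat list."""
    return [value for row in image for pixel in row for value in pixel]

# A hypothetical all-white 64 x 64 RGB image with pixel values in 0..255.
image = [[[255, 255, 255] for _ in range(64)] for _ in range(64)]

flat = flatten_image(image)                     # 64 * 64 * 3 values
normalized = [value / 255. for value in flat]   # same scaling as X_train_flatten/255.

print(len(flat), min(normalized), max(normalized))
```

# Stacking one such column per example is what gives `X_train` its (12288, number of examples) shape.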
 551 | 
 552 | # **Your goal** is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you are going to build a tensorflow model that is almost the same as one you have previously built in numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your numpy implementation to the tensorflow one. 
 553 | # 
 554 | # **The model** is *LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX*. The SIGMOID output layer has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are more than two classes. 
 555 | 
 556 | # ### 2.1 - Create placeholders
 557 | # 
 558 | # Your first task is to create placeholders for `X` and `Y`. This will allow you to later pass your training data in when you run your session. 
 559 | # 
 560 | # **Exercise:** Implement the function below to create the placeholders in tensorflow.
 561 | 
 562 | # In[24]:
 563 | 
 564 | # GRADED FUNCTION: create_placeholders
 565 | 
 566 | def create_placeholders(n_x, n_y):
 567 |     """
 568 |     Creates the placeholders for the tensorflow session.
 569 |     
 570 |     Arguments:
 571 |     n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288)
 572 |     n_y -- scalar, number of classes (from 0 to 5, so -> 6)
 573 |     
 574 |     Returns:
 575 |     X -- placeholder for the data input, of shape [n_x, None] and dtype "float"
 576 |     Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float"
 577 |     
 578 |     Tips:
 579 |     - You will use None because it lets us be flexible on the number of examples fed to the placeholders.
 580 |       In fact, the number of examples during test/train is different.
 581 |     """
 582 | 
 583 |     ### START CODE HERE ### (approx. 2 lines)
 584 |     X = tf.placeholder(tf.float32, [n_x, None], name = "X")
 585 |     Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
 586 |     ### END CODE HERE ###
 587 |     
 588 |     return X, Y
 589 | 
 590 | 
 591 | # In[25]:
 592 | 
 593 | X, Y = create_placeholders(12288, 6)
 594 | print ("X = " + str(X))
 595 | print ("Y = " + str(Y))
 596 | 
 597 | 
 598 | # **Expected Output**: 
 599 | # 
 600 | # <table> 
 601 | #     <tr> 
 602 | #         <td>
 603 | #             **X**
 604 | #         </td>
 605 | #         <td>
 606 | #         Tensor("Placeholder_1:0", shape=(12288, ?), dtype=float32) (not necessarily Placeholder_1)
 607 | #         </td>
 608 | #     </tr>
 609 | #     <tr> 
 610 | #         <td>
 611 | #             **Y**
 612 | #         </td>
 613 | #         <td>
 614 | #         Tensor("Placeholder_2:0", shape=(6, ?), dtype=float32) (not necessarily Placeholder_2)
 615 | #         </td>
 616 | #     </tr>
 617 | # 
 618 | # </table>
 619 | 
 620 | # ### 2.2 - Initializing the parameters
 621 | # 
 622 | # Your second task is to initialize the parameters in tensorflow.
 623 | # 
 624 | # **Exercise:** Implement the function below to initialize the parameters in tensorflow. You are going to use Xavier Initialization for weights and Zero Initialization for biases. The shapes are given below. As an example, to help you, for W1 and b1 you could use: 
 625 | # 
 626 | # ```python
 627 | # W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 628 | # b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
 629 | # ```
 630 | # Please use `seed = 1` to make sure your results match ours.
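# For intuition about what Xavier initialization does, here is a minimal standard-library sketch. It assumes the uniform variant, where weights are drawn from U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)); the function `xavier_uniform` is hypothetical, and its random numbers will not match the values TensorFlow produces with `seed = 1`.

```python
import math
import random

def xavier_uniform(n_out, n_in, seed=1):
    """Sample an n_out x n_in weight matrix from U(-limit, limit).

    limit = sqrt(6 / (n_in + n_out)) keeps the scale of activations roughly
    constant from layer to layer (Xavier/Glorot uniform initialization).
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, limit

# Shapes of the second layer in this assignment: W2 is [12, 25], b2 is [12, 1].
W2, limit = xavier_uniform(12, 25)
b2 = [[0.0] for _ in range(12)]   # zero initialization for the bias column

print(len(W2), len(W2[0]), limit)
```

# The wider the layer (larger `n_in + n_out`), the smaller the sampling range, which is the point of the scheme.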
 631 | 
 632 | # In[26]:
 633 | 
 634 | # GRADED FUNCTION: initialize_parameters
 635 | 
 636 | def initialize_parameters():
 637 |     """
 638 |     Initializes parameters to build a neural network with tensorflow. The shapes are:
 639 |                         W1 : [25, 12288]
 640 |                         b1 : [25, 1]
 641 |                         W2 : [12, 25]
 642 |                         b2 : [12, 1]
 643 |                         W3 : [6, 12]
 644 |                         b3 : [6, 1]
 645 |     
 646 |     Returns:
 647 |     parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
 648 |     """
 649 |     
 650 |     tf.set_random_seed(1)                   # so that your "random" numbers match ours
 651 |         
 652 |     ### START CODE HERE ### (approx. 6 lines of code)
 653 |     W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 654 |     b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
 655 |     W2 = tf.get_variable("W2", [12, 25], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 656 |     b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer())
 657 |     W3 = tf.get_variable("W3", [6, 12], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 658 |     b3 = tf.get_variable("b3", [6,1], initializer = tf.zeros_initializer())
 659 |     ### END CODE HERE ###
 660 | 
 661 |     parameters = {"W1": W1,
 662 |                   "b1": b1,
 663 |                   "W2": W2,
 664 |                   "b2": b2,
 665 |                   "W3": W3,
 666 |                   "b3": b3}
 667 |     
 668 |     return parameters
 669 | 
 670 | 
 671 | # In[27]:
 672 | 
 673 | tf.reset_default_graph()
 674 | with tf.Session() as sess:
 675 |     parameters = initialize_parameters()
 676 |     print("W1 = " + str(parameters["W1"]))
 677 |     print("b1 = " + str(parameters["b1"]))
 678 |     print("W2 = " + str(parameters["W2"]))
 679 |     print("b2 = " + str(parameters["b2"]))
 680 | 
 681 | 
 682 | # **Expected Output**: 
 683 | # 
 684 | # <table> 
 685 | #     <tr> 
 686 | #         <td>
 687 | #             **W1**
 688 | #         </td>
 689 | #         <td>
 690 | #          < tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref >
 691 | #         </td>
 692 | #     </tr>
 693 | #     <tr> 
 694 | #         <td>
 695 | #             **b1**
 696 | #         </td>
 697 | #         <td>
 698 | #         < tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref >
 699 | #         </td>
 700 | #     </tr>
 701 | #     <tr> 
 702 | #         <td>
 703 | #             **W2**
 704 | #         </td>
 705 | #         <td>
 706 | #         < tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref >
 707 | #         </td>
 708 | #     </tr>
 709 | #     <tr> 
 710 | #         <td>
 711 | #             **b2**
 712 | #         </td>
 713 | #         <td>
 714 | #         < tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref >
 715 | #         </td>
 716 | #     </tr>
 717 | # 
 718 | # </table>
 719 | 
 720 | # As expected, the parameters haven't been evaluated yet.
 721 | 
 722 | # ### 2.3 - Forward propagation in tensorflow 
 723 | # 
 724 | # You will now implement the forward propagation module in tensorflow. The function will take in a dictionary of parameters and it will complete the forward pass. The functions you will be using are: 
 725 | # 
 726 | # - `tf.add(...,...)` to do an addition
 727 | # - `tf.matmul(...,...)` to do a matrix multiplication
 728 | # - `tf.nn.relu(...)` to apply the ReLU activation
 729 | # 
 730 | # **Question:** Implement the forward pass of the neural network. We commented for you the numpy equivalents so that you can compare the tensorflow implementation to numpy. It is important to note that the forward propagation stops at `z3`. The reason is that in tensorflow the last linear layer output is given as input to the function computing the loss. Therefore, you don't need `a3`!
 731 | # 
 732 | # 
 733 | 
 734 | # In[28]:
 735 | 
 736 | # GRADED FUNCTION: forward_propagation
 737 | 
 738 | def forward_propagation(X, parameters):
 739 |     """
 740 |     Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
 741 |     
 742 |     Arguments:
 743 |     X -- input dataset placeholder, of shape (input size, number of examples)
 744 |     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
 745 |                   the shapes are given in initialize_parameters
 746 | 
 747 |     Returns:
 748 |     Z3 -- the output of the last LINEAR unit
 749 |     """
 750 |     
 751 |     # Retrieve the parameters from the dictionary "parameters" 
 752 |     W1 = parameters['W1']
 753 |     b1 = parameters['b1']
 754 |     W2 = parameters['W2']
 755 |     b2 = parameters['b2']
 756 |     W3 = parameters['W3']
 757 |     b3 = parameters['b3']
 758 |     
 759 |     ### START CODE HERE ### (approx. 5 lines)              # Numpy Equivalents:
 760 |     Z1 = tf.add(tf.matmul(W1, X), b1)                      # Z1 = np.dot(W1, X) + b1
 761 |     A1 = tf.nn.relu(Z1)                                    # A1 = relu(Z1)
 762 |     Z2 = tf.add(tf.matmul(W2, A1), b2)                     # Z2 = np.dot(W2, A1) + b2
 763 |     A2 = tf.nn.relu(Z2)                                    # A2 = relu(Z2)
 764 |     Z3 = tf.add(tf.matmul(W3, A2), b3)                     # Z3 = np.dot(W3, A2) + b3
 765 |     ### END CODE HERE ###
 766 |     
 767 |     return Z3
 768 | 
 769 | 
 770 | # In[29]:
 771 | 
 772 | tf.reset_default_graph()
 773 | 
 774 | with tf.Session() as sess:
 775 |     X, Y = create_placeholders(12288, 6)
 776 |     parameters = initialize_parameters()
 777 |     Z3 = forward_propagation(X, parameters)
 778 |     print("Z3 = " + str(Z3))
 779 | 
 780 | 
 781 | # **Expected Output**: 
 782 | # 
 783 | # <table> 
 784 | #     <tr> 
 785 | #         <td>
 786 | #             **Z3**
 787 | #         </td>
 788 | #         <td>
 789 | #         Tensor("Add_2:0", shape=(6, ?), dtype=float32)
 790 | #         </td>
 791 | #     </tr>
 792 | # 
 793 | # </table>
 794 | 
 795 | # You may have noticed that the forward propagation doesn't output any cache. You will understand why below, when we get to backpropagation.
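# To make the LINEAR -> RELU -> LINEAR -> RELU -> LINEAR chain concrete, the sketch below runs the same five steps on a tiny hand-made network with plain Python lists. The layer sizes, weights and helper functions (`matmul`, `add_bias`, `relu`) are made up for illustration; only the order of operations mirrors `forward_propagation`.

```python
def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_bias(Z, b):
    """Add a column vector b (shape n x 1) to every column of Z."""
    return [[z + b_row[0] for z in z_row] for z_row, b_row in zip(Z, b)]

def relu(Z):
    """Element-wise max(z, 0)."""
    return [[max(z, 0.0) for z in row] for row in Z]

# Hypothetical tiny network: 2 inputs -> 2 units -> 2 units -> 3 outputs, 2 examples.
X = [[1.0, -1.0],
     [2.0, 0.5]]                      # shape (2 features, 2 examples)
W1 = [[1.0, 0.0], [0.0, -1.0]]
b1 = [[0.0], [0.0]]
W2 = [[1.0, 1.0], [1.0, -1.0]]
b2 = [[0.0], [1.0]]
W3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b3 = [[0.0], [0.0], [0.5]]

A1 = relu(add_bias(matmul(W1, X), b1))
A2 = relu(add_bias(matmul(W2, A1), b2))
Z3 = add_bias(matmul(W3, A2), b3)     # stop at Z3: the loss op applies the softmax

print(Z3)
```

# The result has one row per class and one column per example, which is the (6, ?) layout of `Z3` in the assignment.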
 796 | 
 797 | # ### 2.4 Compute cost
 798 | # 
 799 | # As seen before, it is very easy to compute the cost using:
 800 | # ```python
 801 | # tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))
 802 | # ```
 803 | # **Question**: Implement the cost function below. 
 804 | # - It is important to know that the "`logits`" and "`labels`" inputs of `tf.nn.softmax_cross_entropy_with_logits` are expected to be of shape (number of examples, num_classes). We have thus transposed Z3 and Y for you.
 805 | # - Besides, `tf.reduce_mean` averages the per-example losses over the examples.
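# 
# The two steps (per-example softmax cross entropy, then an average over examples) can be sketched without TensorFlow as follows. The logits and labels are made-up values and `softmax_cross_entropy` is an illustrative helper, laid out as (number of examples, num_classes) like the transposed inputs above.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross entropy between softmax(logits) and a one-hot label, for one example."""
    shift = max(logits)   # subtract the max before exponentiating, for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # log softmax(z)_k = z_k - log_sum_exp; the one-hot label selects the true class.
    return -sum(y * (z - log_sum_exp) for z, y in zip(logits, one_hot_label))

# Two hypothetical examples with three classes each.
logits = [[2.0, 1.0, 0.1],
          [0.0, 0.0, 0.0]]
labels = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]

losses = [softmax_cross_entropy(z, y) for z, y in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)   # the role played by tf.reduce_mean

print(losses, mean_loss)
```

# The second example has equal logits, so its loss is log(3): the model is spreading probability evenly over the three classes.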
 806 | 
 807 | # In[30]:
 808 | 
 809 | # GRADED FUNCTION: compute_cost 
 810 | 
 811 | def compute_cost(Z3, Y):
 812 |     """
 813 |     Computes the cost
 814 |     
 815 |     Arguments:
 816 |     Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
 817 |     Y -- "true" labels vector placeholder, same shape as Z3
 818 |     
 819 |     Returns:
 820 |     cost -- Tensor of the cost function
 821 |     """
 822 |     
 823 |     # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
 824 |     logits = tf.transpose(Z3)
 825 |     labels = tf.transpose(Y)
 826 |     
 827 |     ### START CODE HERE ### (1 line of code)
 828 |     cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
 829 |     ### END CODE HERE ###
 830 |     
 831 |     return cost
 832 | 
 833 | 
 834 | # In[31]:
 835 | 
 836 | tf.reset_default_graph()
 837 | 
 838 | with tf.Session() as sess:
 839 |     X, Y = create_placeholders(12288, 6)
 840 |     parameters = initialize_parameters()
 841 |     Z3 = forward_propagation(X, parameters)
 842 |     cost = compute_cost(Z3, Y)
 843 |     print("cost = " + str(cost))
 844 | 
 845 | 
 846 | # **Expected Output**: 
 847 | # 
 848 | # <table> 
 849 | #     <tr> 
 850 | #         <td>
 851 | #             **cost**
 852 | #         </td>
 853 | #         <td>
 854 | #         Tensor("Mean:0", shape=(), dtype=float32)
 855 | #         </td>
 856 | #     </tr>
 857 | # 
 858 | # </table>
 859 | 
 860 | # ### 2.5 - Backward propagation & parameter updates
 861 | # 
 862 | # This is where you become grateful to programming frameworks. All the backpropagation and the parameter updates are taken care of in one line of code. It is very easy to incorporate this line in the model.
 863 | # 
 864 | # After you compute the cost function, you will create an "`optimizer`" object. You have to call this object along with the cost when running the `tf.Session`. When called, it will perform an optimization on the given cost with the chosen method and learning rate.
 865 | # 
 866 | # For instance, for gradient descent the optimizer would be:
 867 | # ```python
 868 | # optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)
 869 | # ```
 870 | # 
 871 | # To make the optimization you would do:
 872 | # ```python
 873 | # _ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
 874 | # ```
 875 | # 
 876 | # This computes the backpropagation by passing through the tensorflow graph in reverse order, from cost to inputs.
 877 | # 
 878 | # **Note** When coding, we often use `_` as a "throwaway" variable to store values that we won't need to use later. Here, `_` takes on the evaluated value of `optimizer`, which we don't need (and `c` takes the value of the `cost` variable). 
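# What one run of the optimizer does can be illustrated on a toy problem. The sketch below is plain gradient descent written by hand on a made-up one-parameter cost; it is not TensorFlow's implementation, and in TensorFlow the gradient is derived from the graph automatically rather than coded as a `grad` function.

```python
def cost(w):
    """Toy cost J(w) = (w - 3)^2, minimized at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative dJ/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
history = []
for step in range(100):
    c = cost(w)                       # plays the role of `c` returned by sess.run
    history.append(c)
    w = w - learning_rate * grad(w)   # the parameter update the optimizer performs

print(w, history[0], history[-1])
```

# Each loop iteration corresponds to one `sess.run([optimizer, cost], ...)` call: the cost is evaluated and the parameters move one step against the gradient.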
 879 | 
 880 | # ### 2.6 - Building the model
 881 | # 
 882 | # Now, you will bring it all together! 
 883 | # 
 884 | # **Exercise:** Implement the model. You will be calling the functions you had previously implemented.
 885 | 
 886 | # In[32]:
 887 | 
 888 | def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
 889 |           num_epochs = 1500, minibatch_size = 32, print_cost = True):
 890 |     """
 891 |     Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
 892 |     
 893 |     Arguments:
 894 |     X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
 895 |     Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
 896 |     X_test -- test set, of shape (input size = 12288, number of test examples = 120)
 897 |     Y_test -- test labels, of shape (output size = 6, number of test examples = 120)
 898 |     learning_rate -- learning rate of the optimization
 899 |     num_epochs -- number of epochs of the optimization loop
 900 |     minibatch_size -- size of a minibatch
 901 |     print_cost -- True to print the cost every 100 epochs
 902 |     
 903 |     Returns:
 904 |     parameters -- parameters learnt by the model. They can then be used to predict.
 905 |     """
 906 |     
 907 |     ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
 908 |     tf.set_random_seed(1)                             # to keep consistent results
 909 |     seed = 3                                          # to keep consistent results
 910 |     (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
 911 |     n_y = Y_train.shape[0]                            # n_y : output size
 912 |     costs = []                                        # To keep track of the cost
 913 |     
 914 |     # Create Placeholders of shape (n_x, n_y)
 915 |     ### START CODE HERE ### (1 line)
 916 |     X, Y = create_placeholders(n_x, n_y)
 917 |     ### END CODE HERE ###
 918 | 
 919 |     # Initialize parameters
 920 |     ### START CODE HERE ### (1 line)
 921 |     parameters = initialize_parameters()
 922 |     ### END CODE HERE ###
 923 |     
 924 |     # Forward propagation: Build the forward propagation in the tensorflow graph
 925 |     ### START CODE HERE ### (1 line)
 926 |     Z3 = forward_propagation(X, parameters)
 927 |     ### END CODE HERE ###
 928 |     
 929 |     # Cost function: Add cost function to tensorflow graph
 930 |     ### START CODE HERE ### (1 line)
 931 |     cost = compute_cost(Z3, Y)
 932 |     ### END CODE HERE ###
 933 |     
 934 |     # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
 935 |     ### START CODE HERE ### (1 line)
 936 |     optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
 937 |     ### END CODE HERE ###
 938 |     
 939 |     # Initialize all the variables
 940 |     init = tf.global_variables_initializer()
 941 | 
 942 |     # Start the session to compute the tensorflow graph
 943 |     with tf.Session() as sess:
 944 |         
 945 |         # Run the initialization
 946 |         sess.run(init)
 947 |         
 948 |         # Do the training loop
 949 |         for epoch in range(num_epochs):
 950 | 
 951 |             epoch_cost = 0.                       # Defines a cost related to an epoch
 952 |             num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
 953 |             seed = seed + 1
 954 |             minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)
 955 | 
 956 |             for minibatch in minibatches:
 957 | 
 958 |                 # Select a minibatch
 959 |                 (minibatch_X, minibatch_Y) = minibatch
 960 |                 
 961 |                 # IMPORTANT: The line that runs the graph on a minibatch.
 962 |                 # Run the session to execute the "optimizer" and the "cost", the feed_dict should contain a minibatch for (X,Y).
 963 |                 ### START CODE HERE ### (1 line)
 964 |                 _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
 965 |                 ### END CODE HERE ###
 966 |                 
 967 |                 epoch_cost += minibatch_cost / num_minibatches
 968 | 
 969 |             # Print the cost every 100 epochs and record it every 5 epochs
 970 |             if print_cost == True and epoch % 100 == 0:
 971 |                 print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
 972 |             if print_cost == True and epoch % 5 == 0:
 973 |                 costs.append(epoch_cost)
 974 |                 
 975 |         # plot the cost
 976 |         plt.plot(np.squeeze(costs))
 977 |         plt.ylabel('cost')
 978 |         plt.xlabel('epochs (per fives)')
 979 |         plt.title("Learning rate =" + str(learning_rate))
 980 |         plt.show()
 981 | 
 982 |         # lets save the parameters in a variable
 983 |         parameters = sess.run(parameters)
 984 |         print ("Parameters have been trained!")
 985 | 
 986 |         # Calculate the correct predictions
 987 |         correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))
 988 | 
 989 |         # Calculate accuracy (fraction of correct predictions on whatever data is fed in)
 990 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
 991 | 
 992 |         print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
 993 |         print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
 994 |         
 995 |         return parameters
 996 | 
 997 | 
 998 | # Run the following cell to train your model! On our machine it takes about 5 minutes. Your "Cost after epoch 100" should be 1.016458. If it's not, don't waste time; interrupt the training by clicking on the square (⬛) in the upper bar of the notebook, and try to correct your code. If it is the correct cost, take a break and come back in 5 minutes!
 999 | 
1000 | # In[33]:
1001 | 
1002 | parameters = model(X_train, Y_train, X_test, Y_test)
1003 | 
1004 | 
1005 | # **Expected Output**:
1006 | # 
1007 | # <table> 
1008 | #     <tr> 
1009 | #         <td>
1010 | #             **Train Accuracy**
1011 | #         </td>
1012 | #         <td>
1013 | #         0.999074
1014 | #         </td>
1015 | #     </tr>
1016 | #     <tr> 
1017 | #         <td>
1018 | #             **Test Accuracy**
1019 | #         </td>
1020 | #         <td>
1021 | #         0.716667
1022 | #         </td>
1023 | #     </tr>
1024 | # 
1025 | # </table>
1026 | # 
1027 | # Amazing, your algorithm can recognize a sign representing a figure between 0 and 5 with 71.7% accuracy.
1028 | # 
1029 | # **Insights**:
1030 | # - Your model seems big enough to fit the training set well. However, given the difference between train and test accuracy, you could try to add L2 or dropout regularization to reduce overfitting. 
1031 | # - Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.
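# 
# The accuracy computed at the end of `model()` compares the arg-max class of `Z3` with the arg-max class of `Y`, column by column, and averages the matches. A minimal sketch with made-up numbers (3 classes, 4 examples) and a hypothetical helper `argmax_per_column`:

```python
def argmax_per_column(matrix):
    """For a (num_classes x num_examples) list of rows, return the index of the largest entry in each column."""
    num_classes, num_examples = len(matrix), len(matrix[0])
    return [max(range(num_classes), key=lambda i: matrix[i][j])
            for j in range(num_examples)]

# Hypothetical logits and one-hot labels: classes along rows, examples along columns.
Z3 = [[2.0, 0.1, 0.3, 1.5],
      [0.5, 1.2, 0.2, 1.7],
      [0.1, 0.3, 0.9, 0.2]]
Y = [[1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

predictions = argmax_per_column(Z3)   # like tf.argmax(Z3) along axis 0
targets = argmax_per_column(Y)        # like tf.argmax(Y) along axis 0
correct = [float(p == t) for p, t in zip(predictions, targets)]
accuracy = sum(correct) / len(correct)

print(predictions, targets, accuracy)
```

# Because the softmax is monotonic, taking the arg-max of the logits gives the same predicted class as taking the arg-max of the softmax probabilities, which is why `Z3` can be used directly.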
1032 | 
1033 | # ### 2.7 - Test with your own image (optional / ungraded exercise)
1034 | # 
1035 | # Congratulations on finishing this assignment. You can now take a picture of your hand and see the output of your model. To do that:
1036 | #     1. Click on "File" in the upper bar of this notebook, then click "Open" to go to your Coursera Hub.
1037 | #     2. Add your image to this Jupyter Notebook's directory, in the "images" folder
1038 | #     3. Write your image's name in the following code
1039 | #     4. Run the code and check if the algorithm is right!
1040 | 
1041 | # In[ ]:
1042 | 
1043 | import scipy
1044 | from PIL import Image
1045 | from scipy import ndimage
1046 | 
1047 | ## START CODE HERE ## (PUT YOUR IMAGE NAME) 
1048 | my_image = "thumbs_up.jpg"
1049 | ## END CODE HERE ##
1050 | 
1051 | # We preprocess your image to fit your algorithm.
1052 | fname = "images/" + my_image
1053 | image = np.array(ndimage.imread(fname, flatten=False))
1054 | my_image = scipy.misc.imresize(image, size=(64,64)).reshape((1, 64*64*3)).T
1055 | my_image_prediction = predict(my_image, parameters)
1056 | 
1057 | plt.imshow(image)
1058 | print("Your algorithm predicts: y = " + str(np.squeeze(my_image_prediction)))
1059 | 
1060 | 
1061 | # You indeed deserved a "thumbs-up" although as you can see the algorithm seems to classify it incorrectly. The reason is that the training set doesn't contain any "thumbs-up", so the model doesn't know how to deal with it! We call that a "mismatched data distribution" and it is one of the various topics of the next course on "Structuring Machine Learning Projects".
1062 | 
1063 | # <font color='blue'>
1064 | # **What you should remember**:
1065 | # - Tensorflow is a programming framework used in deep learning
1066 | # - The two main object classes in tensorflow are Tensors and Operators. 
1067 | # - When you code in tensorflow you have to take the following steps:
1068 | #     - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
1069 | #     - Create a session
1070 | #     - Initialize the session
1071 | #     - Run the session to execute the graph
1072 | # - You can execute the graph multiple times as you've seen in model()
1073 | # - The backpropagation and optimization is automatically done when running the session on the "optimizer" object.
1074 | 


--------------------------------------------------------------------------------
/Week 3 PA 1/Tensorflow+Tutorial/Tensorflow Tutorial.md:
--------------------------------------------------------------------------------
   1 | 
   2 | # TensorFlow Tutorial
   3 | 
   4 | Welcome to this week's programming assignment. Until now, you've always used numpy to build neural networks. Now we will step you through a deep learning framework that will allow you to build neural networks more easily. Machine learning frameworks like TensorFlow, PaddlePaddle, Torch, Caffe, Keras, and many others can speed up your machine learning development significantly. All of these frameworks also have a lot of documentation, which you should feel free to read. In this assignment, you will learn to do the following in TensorFlow: 
   5 | 
   6 | - Initialize variables
   7 | - Start your own session
   8 | - Train algorithms 
   9 | - Implement a Neural Network
  10 | 
  11 | Programming frameworks can not only shorten your coding time, but sometimes also perform optimizations that speed up your code. 
  12 | 
  13 | ## 1 - Exploring the Tensorflow Library
  14 | 
  15 | To start, you will import the library:
  16 | 
  17 | 
  18 | 
  19 | ```python
  20 | import math
  21 | import numpy as np
  22 | import h5py
  23 | import matplotlib.pyplot as plt
  24 | import tensorflow as tf
  25 | from tensorflow.python.framework import ops
  26 | from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict
  27 | 
  28 | %matplotlib inline
  29 | np.random.seed(1)
  30 | ```
  31 | 
  32 | Now that you have imported the library, we will walk you through its different applications. You will start with an example, where we compute for you the loss of one training example. 
  33 | $$loss = \mathcal{L}(\hat{y}, y) = (\hat y^{(i)} - y^{(i)})^2 \tag{1}$$
  34 | 
  35 | 
  36 | ```python
  37 | y_hat = tf.constant(36, name='y_hat')            # Define y_hat constant. Set to 36.
  38 | y = tf.constant(39, name='y')                    # Define y. Set to 39
  39 | 
  40 | loss = tf.Variable((y - y_hat)**2, name='loss')  # Create a variable for the loss
  41 | 
  42 | init = tf.global_variables_initializer()         # When init is run later (session.run(init)),
  43 |                                                  # the loss variable will be initialized and ready to be computed
  44 | with tf.Session() as session:                    # Create a session and print the output
  45 |     session.run(init)                            # Initializes the variables
  46 |     print(session.run(loss))                     # Prints the loss
  47 | ```
  48 | 
  49 |     9
  50 | 
  51 | 
  52 | Writing and running programs in TensorFlow has the following steps:
  53 | 
  54 | 1. Create Tensors (variables) that are not yet executed/evaluated. 
  55 | 2. Write operations between those Tensors.
  56 | 3. Initialize your Tensors. 
  57 | 4. Create a Session. 
  58 | 5. Run the Session. This will run the operations you'd written above. 
  59 | 
  60 | Therefore, when we created a variable for the loss, we simply defined the loss as a function of other quantities, but did not evaluate its value. To evaluate it, we had to create the initializer with `init = tf.global_variables_initializer()` and run it with `session.run(init)`. That initialized the loss variable, and in the last line we were finally able to evaluate the value of `loss` and print it.
  61 | 
  62 | Now let us look at an easy example. Run the cell below:
  63 | 
  64 | 
  65 | ```python
  66 | a = tf.constant(2)
  67 | b = tf.constant(10)
  68 | c = tf.multiply(a,b)
  69 | print(c)
  70 | ```
  71 | 
  72 |     Tensor("Mul:0", shape=(), dtype=int32)
  73 | 
  74 | 
  75 | As expected, you will not see 20! You got a tensor saying that the result is a scalar tensor (its shape `()` is empty) of type "int32". All you did was put the multiplication in the 'computation graph', but you have not run this computation yet. In order to actually multiply the two numbers, you will have to create a session and run it.
  76 | 
  77 | 
  78 | ```python
  79 | sess = tf.Session()
  80 | print(sess.run(c))
  81 | ```
  82 | 
  83 |     20
  84 | 
  85 | 
  86 | Great! To summarize, **remember to initialize your variables, create a session and run the operations inside the session**. 
  87 | 
  88 | Next, you'll also have to know about placeholders. A placeholder is an object whose value you can specify only later. 
  89 | To specify values for a placeholder, you can pass in values by using a "feed dictionary" (`feed_dict` variable). Below, we created a placeholder for x. This allows us to pass in a number later when we run the session. 
  90 | 
  91 | 
  92 | ```python
  93 | # Change the value of x in the feed_dict
  94 | 
  95 | x = tf.placeholder(tf.int64, name = 'x')
  96 | print(sess.run(2 * x, feed_dict = {x: 3}))
  97 | sess.close()
  98 | ```
  99 | 
 100 |     6
 101 | 
 102 | 
 103 | When you first defined `x` you did not have to specify a value for it. A placeholder is simply a variable that you will assign data to only later, when running the session. We say that you **feed data** to these placeholders when running the session. 
 104 | 
 105 | Here's what's happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.
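
The "build the graph first, run it later" idea can be sketched in plain Python. This is only a toy illustration under our own assumptions: the classes `Constant`, `Placeholder`, `Multiply` and the function `run` are hypothetical names invented here, and this is not how TensorFlow is implemented internally.

```python
# Toy deferred-evaluation graph (hypothetical classes, not the TensorFlow API).

class Constant:
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed_dict):
        return self.value


class Placeholder:
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed_dict):
        # The value is only known when the graph is run.
        return feed_dict[self]


class Multiply:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self, feed_dict):
        return self.left.evaluate(feed_dict) * self.right.evaluate(feed_dict)


def run(node, feed_dict=None):
    """Plays the role of sess.run: walks the graph and computes a value."""
    return node.evaluate(feed_dict or {})


x = Placeholder("x")
doubled = Multiply(Constant(2), x)       # builds the graph, computes nothing yet
result = run(doubled, feed_dict={x: 3})  # the value of x is fed in only now
print(result)
```

Constructing `doubled` does no arithmetic; the multiplication only happens inside `run`, once a value for `x` has been fed in.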
 106 | 
 107 | ### 1.1 - Linear function
 108 | 
 109 | Let's start this programming exercise by computing the following equation: $Y = WX + b$, where $W$ and $X$ are random matrices and b is a random vector. 
 110 | 
 111 | **Exercise**: Compute $WX + b$ where $W, X$, and $b$ are drawn from a random normal distribution. W is of shape (4, 3), X is (3,1) and b is (4,1). As an example, here is how you would define a constant X that has shape (3,1):
 112 | ```python
 113 | X = tf.constant(np.random.randn(3,1), name = "X")
 114 | 
 115 | ```
 116 | You might find the following functions helpful: 
 117 | - tf.matmul(..., ...) to do a matrix multiplication
 118 | - tf.add(..., ...) to do an addition
 119 | - np.random.randn(...) to initialize randomly
 120 | 
 121 | 
 122 | 
 123 | ```python
 124 | # GRADED FUNCTION: linear_function
 125 | 
 126 | def linear_function():
 127 |     """
 128 |     Implements a linear function: 
 129 |             Initializes W to be a random tensor of shape (4,3)
 130 |             Initializes X to be a random tensor of shape (3,1)
 131 |             Initializes b to be a random tensor of shape (4,1)
 132 |     Returns: 
 133 |     result -- runs the session for Y = WX + b 
 134 |     """
 135 |     
 136 |     np.random.seed(1)
 137 |     
 138 |     ### START CODE HERE ### (4 lines of code)
 139 |     X = tf.constant(np.random.randn(3,1), name = "X")
 140 |     W = tf.constant(np.random.randn(4,3), name = "W")
 141 |     b = tf.constant(np.random.randn(4,1), name = "b")
 142 |     Y = tf.add(tf.matmul(W, X), b) 
 143 |     ### END CODE HERE ### 
 144 |     
 145 |     # Create the session using tf.Session() and run it with sess.run(...) on the variable you want to calculate
 146 |     
 147 |     ### START CODE HERE ###
 148 |     sess = tf.Session()
 149 |     result = sess.run(Y)
 150 |     ### END CODE HERE ### 
 151 |     
 152 |     # close the session 
 153 |     sess.close()
 154 | 
 155 |     return result
 156 | ```
 157 | 
 158 | 
 159 | ```python
 160 | print( "result = " + str(linear_function()))
 161 | ```
 162 | 
 163 |     result = [[-2.15657382]
 164 |      [ 2.95891446]
 165 |      [-1.08926781]
 166 |      [-0.84538042]]
 167 | 
 168 | 
 169 | **Expected Output**: 
 170 | 
 171 | <table> 
 172 | <tr> 
 173 | <td>
 174 | **result**
 175 | </td>
 176 | <td>
 177 | [[-2.15657382]
 178 |  [ 2.95891446]
 179 |  [-1.08926781]
 180 |  [-0.84538042]]
 181 | </td>
 182 | </tr> 
 183 | 
 184 | </table> 
 185 | 
 186 | ### 1.2 - Computing the sigmoid 
 187 | Great! You just implemented a linear function. Tensorflow offers a variety of commonly used neural network functions like `tf.sigmoid` and `tf.softmax`. For this exercise let's compute the sigmoid function of an input. 
 188 | 
 189 | You will do this exercise using a placeholder variable `x`. When running the session, you should use the feed dictionary to pass in the input `z`. In this exercise, you will have to (i) create a placeholder `x`, (ii) define the operations needed to compute the sigmoid using `tf.sigmoid`, and then (iii) run the session. 
 190 | 
 191 | **Exercise**: Implement the sigmoid function below. You should use the following: 
 192 | 
 193 | - `tf.placeholder(tf.float32, name = "...")`
 194 | - `tf.sigmoid(...)`
 195 | - `sess.run(..., feed_dict = {x: z})`
 196 | 
 197 | 
 198 | Note that there are two typical ways to create and use sessions in tensorflow: 
 199 | 
 200 | **Method 1:**
 201 | ```python
 202 | sess = tf.Session()
 203 | # Run the variables initialization (if needed), run the operations
 204 | result = sess.run(..., feed_dict = {...})
 205 | sess.close() # Close the session
 206 | ```
 207 | **Method 2:**
 208 | ```python
 209 | with tf.Session() as sess: 
 210 |     # run the variables initialization (if needed), run the operations
 211 |     result = sess.run(..., feed_dict = {...})
 212 |     # This takes care of closing the session for you :)
 213 | ```
 214 | 
 215 | 
 216 | 
 217 | ```python
 218 | # GRADED FUNCTION: sigmoid
 219 | 
 220 | def sigmoid(z):
 221 |     """
 222 |     Computes the sigmoid of z
 223 |     
 224 |     Arguments:
 225 |     z -- input value, scalar or vector
 226 |     
 227 |     Returns: 
 228 |     results -- the sigmoid of z
 229 |     """
 230 |     
 231 |     ### START CODE HERE ### ( approx. 4 lines of code)
 232 |     # Create a placeholder for x. Name it 'x'.
 233 |     x = tf.placeholder(tf.float32, name = "x")
 234 | 
 235 |     # compute sigmoid(x)
 236 |     sigmoid = tf.sigmoid(x)
 237 | 
 238 |     # Create a session, and run it. Please use the method 2 explained above. 
 239 |     # You should use a feed_dict to pass z's value to x. 
 240 |     with tf.Session() as sess:
 241 |         # Run session and call the output "result"
 242 |         result = sess.run(sigmoid, feed_dict = {x: z})
 243 |     
 244 |     ### END CODE HERE ###
 245 |     
 246 |     return result
 247 | ```
 248 | 
 249 | 
 250 | ```python
 251 | print ("sigmoid(0) = " + str(sigmoid(0)))
 252 | print ("sigmoid(12) = " + str(sigmoid(12)))
 253 | ```
 254 | 
 255 |     sigmoid(0) = 0.5
 256 |     sigmoid(12) = 0.999994
 257 | 
 258 | 
 259 | **Expected Output**: 
 260 | 
 261 | <table> 
 262 | <tr> 
 263 | <td>
 264 | **sigmoid(0)**
 265 | </td>
 266 | <td>
 267 | 0.5
 268 | </td>
 269 | </tr>
 270 | <tr> 
 271 | <td>
 272 | **sigmoid(12)**
 273 | </td>
 274 | <td>
 275 | 0.999994
 276 | </td>
 277 | </tr> 
 278 | 
 279 | </table> 
 280 | 
 281 | <font color='blue'>
 282 | **To summarize, you now know how to**:
 283 | 1. Create placeholders
 284 | 2. Specify the computation graph corresponding to operations you want to compute
 285 | 3. Create the session
 286 | 4. Run the session, using a feed dictionary if necessary to specify placeholder variables' values. 
 287 | 
 288 | ### 1.3 -  Computing the Cost
 289 | 
 290 | You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of $a^{[2](i)}$ and $y^{(i)}$ for i=1...m: 
 291 | $$ J = - \frac{1}{m}  \sum_{i = 1}^m  \large ( \small y^{(i)} \log a^{ [2] (i)} + (1-y^{(i)})\log (1-a^{ [2] (i)} )\large )\small\tag{2}$$
 292 | 
 293 | you can do it in one line of code in tensorflow!
 294 | 
 295 | **Exercise**: Implement the cross entropy loss. The function you will use is: 
 296 | 
 297 | 
 298 | - `tf.nn.sigmoid_cross_entropy_with_logits(logits = ...,  labels = ...)`
 299 | 
 300 | Your code should input `z`, compute the sigmoid (to get `a`) and then compute the cross entropy cost $J$. All this can be done using one call to `tf.nn.sigmoid_cross_entropy_with_logits`, which returns the individual loss terms, one per example, without averaging them. Taking the mean of those terms over the $m$ examples gives
 301 | 
 302 | $$- \frac{1}{m}  \sum_{i = 1}^m  \large ( \small y^{(i)} \log \sigma(z^{[2](i)}) + (1-y^{(i)})\log (1-\sigma(z^{[2](i)}))\large )\small\tag{2}$$
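
As a rough NumPy sketch of what such a function computes per example, the snippet below uses the algebraically equivalent form `max(z, 0) - z*y + log(1 + exp(-|z|))`, which avoids overflow for large `|z|`. The helper names `sigmoid_cross_entropy` and `naive_cross_entropy` are ours, chosen for illustration; the second one is the textbook formula, included only to check the first.

```python
import numpy as np

def sigmoid_cross_entropy(z, y):
    """Per-example sigmoid cross entropy from logits z and labels y (no averaging)."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Stable rewrite of -(y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)))
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

def naive_cross_entropy(z, y):
    """Direct formula, fine for moderate z, used here as a reference."""
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([0.2, 0.4, 0.7, 0.9])
y = np.array([0, 0, 1, 1])

losses = sigmoid_cross_entropy(z, y)   # vector with one loss per example
cost_J = losses.mean()                 # scalar cost J
print(losses, cost_J)
```

The result of `sigmoid_cross_entropy` is a vector; the scalar cost $J$ only appears after the explicit `.mean()`.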
 303 | 
 304 | 
 305 | 
 306 | 
 307 | ```python
 308 | # GRADED FUNCTION: cost
 309 | 
 310 | def cost(logits, labels):
 311 |     """
 312 |     Computes the cost using the sigmoid cross entropy
 313 |     
 314 |     Arguments:
 315 |     logits -- vector containing z, output of the last linear unit (before the final sigmoid activation)
 316 |     labels -- vector of labels y (1 or 0) 
 317 |     
 318 |     Note: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" 
 319 |     in the TensorFlow documentation. So logits will feed into z, and labels into y. 
 320 |     
 321 |     Returns:
 322 |     cost -- runs the session of the cost (formula (2))
 323 |     """
 324 |     
 325 |     ### START CODE HERE ### 
 326 |     
 327 |     # Create the placeholders for "logits" (z) and "labels" (y) (approx. 2 lines)
 328 |     z = tf.placeholder(tf.float32, name = "logits")
 329 |     y = tf.placeholder(tf.float32, name = "labels")
 330 |     
 331 |     # Use the loss function (approx. 1 line)
 332 |     cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = z, labels = y)
 333 |     
 334 |     # Create a session (approx. 1 line). See method 1 above.
 335 |     sess = tf.Session()
 336 |     
 337 |     # Run the session (approx. 1 line).
 338 |     cost = sess.run(cost, feed_dict = {z: logits, y: labels})
 339 |     
 340 |     # Close the session (approx. 1 line). See method 1 above.
 341 |     sess.close()
 342 |     
 343 |     ### END CODE HERE ###
 344 |     
 345 |     return cost
 346 | ```
 347 | 
 348 | 
 349 | ```python
 350 | logits = sigmoid(np.array([0.2,0.4,0.7,0.9]))
 351 | cost = cost(logits, np.array([0,0,1,1]))
 352 | print ("cost = " + str(cost))
 353 | ```
 354 | 
 355 |     cost = [ 1.00538719  1.03664088  0.41385433  0.39956614]
 356 | 
 357 | 
 358 | **Expected Output**: 
 359 | 
 360 | <table> 
 361 |     <tr> 
 362 |         <td>
 363 |             **cost**
 364 |         </td>
 365 |         <td>
 366 |         [ 1.00538719  1.03664088  0.41385433  0.39956614]
 367 |         </td>
 368 |     </tr>
 369 | 
 370 | </table>
 371 | 
 372 | ### 1.4 - Using One Hot encodings
 373 | 
 374 | Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows:
 375 | 
 376 | 
 377 | <img src="images/onehot.png" style="width:600px;height:150px;">
 378 | 
 379 | This is called a "one hot" encoding, because in the converted representation exactly one element of each column is "hot" (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code: 
 380 | 
 381 | - tf.one_hot(labels, depth, axis) 
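
For reference, here is one way the "few lines" of NumPy might look. It is a sketch with a helper name of our own (`one_hot_numpy`), using the same layout as above: one row per class and one column per example.

```python
import numpy as np

def one_hot_numpy(labels, num_classes):
    """Return a (num_classes, num_examples) matrix with a single 1 per column."""
    labels = np.asarray(labels).reshape(-1)
    # Row k of the identity matrix is the one-hot vector for class k;
    # indexing picks one row per example, the transpose puts examples in columns.
    return np.eye(num_classes)[labels].T

labels = np.array([1, 2, 3, 0, 2, 1])
one_hot = one_hot_numpy(labels, 4)
print(one_hot)
```

Entry `(i, j)` is 1 exactly when example `j` has label `i`, which matches the layout produced by `axis = 0` in `tf.one_hot`.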
 382 | 
 383 | **Exercise:** Implement the function below to take one vector of labels and the total number of classes $C$, and return the one hot encoding. Use `tf.one_hot()` to do this. 
 384 | 
 385 | 
 386 | ```python
 387 | # GRADED FUNCTION: one_hot_matrix
 388 | 
 389 | def one_hot_matrix(labels, C):
 390 |     """
 391 |     Creates a matrix where the i-th row corresponds to the i-th class number and the j-th column
 392 |                      corresponds to the j-th training example. So if example j has label i, then entry (i,j) 
 393 |                      will be 1. 
 394 |                      
 395 |     Arguments:
 396 |     labels -- vector containing the labels 
 397 |     C -- number of classes, the depth of the one hot dimension
 398 |     
 399 |     Returns: 
 400 |     one_hot -- one hot matrix
 401 |     """
 402 |     
 403 |     ### START CODE HERE ###
 404 |     
 405 |     # Create a tf.constant equal to C (depth), name it 'C'. (approx. 1 line)
 406 |     depth = tf.constant(C, name = "C")
 407 |     
 408 |     # Use tf.one_hot, be careful with the axis (approx. 1 line)
 409 |     one_hot_matrix = tf.one_hot(labels, depth, axis = 0)
 410 |     
 411 |     # Create the session (approx. 1 line)
 412 |     sess = tf.Session()
 413 |     
 414 |     # Run the session (approx. 1 line)
 415 |     one_hot = sess.run(one_hot_matrix)
 416 |     
 417 |     # Close the session (approx. 1 line). See method 1 above.
 418 |     sess.close()
 419 |     
 420 |     ### END CODE HERE ###
 421 |     
 422 |     return one_hot
 423 | ```
 424 | 
 425 | 
 426 | ```python
 427 | labels = np.array([1,2,3,0,2,1])
 428 | one_hot = one_hot_matrix(labels, C = 4)
 429 | print ("one_hot = " + str(one_hot))
 430 | ```
 431 | 
 432 |     one_hot = [[ 0.  0.  0.  1.  0.  0.]
 433 |      [ 1.  0.  0.  0.  0.  1.]
 434 |      [ 0.  1.  0.  0.  1.  0.]
 435 |      [ 0.  0.  1.  0.  0.  0.]]
 436 | 
 437 | 
 438 | **Expected Output**: 
 439 | 
 440 | <table> 
 441 |     <tr> 
 442 |         <td>
 443 |             **one_hot**
 444 |         </td>
 445 |         <td>
 446 |         [[ 0.  0.  0.  1.  0.  0.]
 447 |  [ 1.  0.  0.  0.  0.  1.]
 448 |  [ 0.  1.  0.  0.  1.  0.]
 449 |  [ 0.  0.  1.  0.  0.  0.]]
 450 |         </td>
 451 |     </tr>
 452 | 
 453 | </table>
 454 | 
 455 | 
 456 | ### 1.5 - Initialize with zeros and ones
 457 | 
 458 | Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is `tf.ones()`. To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array of dimension shape full of zeros and ones respectively. 
 459 | 
 460 | **Exercise:** Implement the function below to take in a shape and to return an array of ones with that shape. 
 461 | 
 462 |  - tf.ones(shape)
 463 | 
 464 | 
 465 | 
 466 | ```python
 467 | # GRADED FUNCTION: ones
 468 | 
 469 | def ones(shape):
 470 |     """
 471 |     Creates an array of ones of dimension shape
 472 |     
 473 |     Arguments:
 474 |     shape -- shape of the array you want to create
 475 |         
 476 |     Returns: 
 477 |     ones -- array containing only ones
 478 |     """
 479 |     
 480 |     ### START CODE HERE ###
 481 |     
 482 |     # Create "ones" tensor using tf.ones(...). (approx. 1 line)
 483 |     ones = tf.ones(shape)
 484 |     
 485 |     # Create the session (approx. 1 line)
 486 |     sess = tf.Session()
 487 |     
 488 |     # Run the session to compute 'ones' (approx. 1 line)
 489 |     ones = sess.run(ones)
 490 |     
 491 |     # Close the session (approx. 1 line). See method 1 above.
 492 |     sess.close()
 493 |     
 494 |     ### END CODE HERE ###
 495 |     return ones
 496 | ```
 497 | 
 498 | 
 499 | ```python
 500 | print ("ones = " + str(ones([3])))
 501 | ```
 502 | 
 503 |     ones = [ 1.  1.  1.]
 504 | 
 505 | 
 506 | **Expected Output:**
 507 | 
 508 | <table> 
 509 |     <tr> 
 510 |         <td>
 511 |             **ones**
 512 |         </td>
 513 |         <td>
 514 |         [ 1.  1.  1.]
 515 |         </td>
 516 |     </tr>
 517 | 
 518 | </table>
 519 | 
 520 | ## 2 - Building your first neural network in tensorflow
 521 | 
 522 | In this part of the assignment you will build a neural network using tensorflow. Remember that there are two parts to implement a tensorflow model:
 523 | 
 524 | - Create the computation graph
 525 | - Run the graph
 526 | 
 527 | Let's delve into the problem you'd like to solve!
 528 | 
 529 | ### 2.0 - Problem statement: SIGNS Dataset
 530 | 
 531 | One afternoon, with some friends we decided to teach our computers to decipher sign language. We spent a few hours taking pictures in front of a white wall and came up with the following dataset. It's now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn't understand sign language.
 532 | 
 533 | - **Training set**: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number).
 534 | - **Test set**: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number).
 535 | 
 536 | Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs.
 537 | 
 538 | Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.
 539 | <img src="images/hands.png" style="width:800px;height:350px;"><caption><center> <u><font color='purple'> **Figure 1**</u><font color='purple'>: SIGNS dataset <br> <font color='black'> </center>
 540 | 
 541 | 
 542 | Run the following code to load the dataset.
 543 | 
 544 | 
 545 | ```python
 546 | # Loading the dataset
 547 | X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
 548 | ```
 549 | 
 550 | Change the index below and run the cell to visualize some examples in the dataset.
 551 | 
 552 | 
 553 | ```python
 554 | # Example of a picture
 555 | index = 0
 556 | plt.imshow(X_train_orig[index])
 557 | print ("y = " + str(np.squeeze(Y_train_orig[:, index])))
 558 | ```
 559 | 
 560 |     y = 5
 561 | 
 562 | 
 563 | 
 564 | ![png](output_35_1.png)
 565 | 
 566 | 
 567 | As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so.
 568 | 
 569 | 
 570 | ```python
 571 | # Flatten the training and test images
 572 | X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T
 573 | X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T
 574 | # Normalize image vectors
 575 | X_train = X_train_flatten/255.
 576 | X_test = X_test_flatten/255.
 577 | # Convert training and test labels to one hot matrices
 578 | Y_train = convert_to_one_hot(Y_train_orig, 6)
 579 | Y_test = convert_to_one_hot(Y_test_orig, 6)
 580 | 
 581 | print ("number of training examples = " + str(X_train.shape[1]))
 582 | print ("number of test examples = " + str(X_test.shape[1]))
 583 | print ("X_train shape: " + str(X_train.shape))
 584 | print ("Y_train shape: " + str(Y_train.shape))
 585 | print ("X_test shape: " + str(X_test.shape))
 586 | print ("Y_test shape: " + str(Y_test.shape))
 587 | ```
 588 | 
 589 |     number of training examples = 1080
 590 |     number of test examples = 120
 591 |     X_train shape: (12288, 1080)
 592 |     Y_train shape: (6, 1080)
 593 |     X_test shape: (12288, 120)
 594 |     Y_test shape: (6, 120)
 595 | 
 596 | 
 597 | **Note** that 12288 comes from $64 \times 64 \times 3$. Each image is square, 64 by 64 pixels, and 3 is for the RGB colors. Please make sure all these shapes make sense to you before continuing.
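
To make the shapes concrete, here is a small sketch of the flatten-and-normalize step on made-up data. The array `images` is random stand-in data with 5 examples, not the SIGNS dataset.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-in for X_train_orig: 5 RGB images of 64 x 64 pixels, values in 0..255
images = rng.randint(0, 256, size=(5, 64, 64, 3)).astype(np.uint8)

# One row per image after reshape, then transpose so each COLUMN is one image
flat = images.reshape(images.shape[0], -1).T
normalized = flat / 255.0

print(flat.shape)        # (64*64*3, number of examples)
```

After the transpose, column `j` of `flat` contains all pixels of image `j`, which is the (features, examples) layout the rest of the assignment relies on.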
 598 | 
 599 | **Your goal** is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you are going to build a tensorflow model that is almost the same as one you have previously built in numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your numpy implementation to the tensorflow one. 
 600 | 
 601 | **The model** is *LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX*. The SIGMOID output layer has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are more than two classes. 
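
The claim that SOFTMAX generalizes SIGMOID can be checked numerically. In the sketch below (helper names are ours), a two-class softmax whose second logit is fixed at 0 gives the first class the probability `sigmoid(z)`.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (num_classes, num_examples)."""
    shifted = z - z.max(axis=0, keepdims=True)   # subtract the max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 0.5, 3.0])
two_class_logits = np.vstack([z, np.zeros_like(z)])   # class 0 logit = z, class 1 logit = 0
probs = softmax(two_class_logits)

print(np.allclose(probs[0], sigmoid(z)))
```

This works because $e^{z} / (e^{z} + e^{0}) = 1 / (1 + e^{-z})$.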
 602 | 
 603 | ### 2.1 - Create placeholders
 604 | 
 605 | Your first task is to create placeholders for `X` and `Y`. This will allow you to later pass your training data in when you run your session. 
 606 | 
 607 | **Exercise:** Implement the function below to create the placeholders in tensorflow.
 608 | 
 609 | 
 610 | ```python
 611 | # GRADED FUNCTION: create_placeholders
 612 | 
 613 | def create_placeholders(n_x, n_y):
 614 |     """
 615 |     Creates the placeholders for the tensorflow session.
 616 |     
 617 |     Arguments:
 618 |     n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288)
 619 |     n_y -- scalar, number of classes (from 0 to 5, so -> 6)
 620 |     
 621 |     Returns:
 622 |     X -- placeholder for the data input, of shape [n_x, None] and dtype "float"
 623 |     Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float"
 624 |     
 625 |     Tips:
 626 |     - You will use None because it lets us be flexible on the number of examples you will feed to the placeholders.
 627 |       In fact, the number of examples during test/train is different.
 628 |     """
 629 | 
 630 |     ### START CODE HERE ### (approx. 2 lines)
 631 |     X = tf.placeholder(tf.float32, [n_x, None], name = "X")
 632 |     Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
 633 |     ### END CODE HERE ###
 634 |     
 635 |     return X, Y
 636 | ```
 637 | 
 638 | 
 639 | ```python
 640 | X, Y = create_placeholders(12288, 6)
 641 | print ("X = " + str(X))
 642 | print ("Y = " + str(Y))
 643 | ```
 644 | 
 645 |     X = Tensor("X_1:0", shape=(12288, ?), dtype=float32)
 646 |     Y = Tensor("Y:0", shape=(6, ?), dtype=float32)
 647 | 
 648 | 
 649 | **Expected Output**: 
 650 | 
 651 | <table> 
 652 |     <tr> 
 653 |         <td>
 654 |             **X**
 655 |         </td>
 656 |         <td>
 657 |         Tensor("Placeholder_1:0", shape=(12288, ?), dtype=float32) (not necessarily Placeholder_1)
 658 |         </td>
 659 |     </tr>
 660 |     <tr> 
 661 |         <td>
 662 |             **Y**
 663 |         </td>
 664 |         <td>
 665 |         Tensor("Placeholder_2:0", shape=(6, ?), dtype=float32) (not necessarily Placeholder_2)
 666 |         </td>
 667 |     </tr>
 668 | 
 669 | </table>
 670 | 
 671 | ### 2.2 - Initializing the parameters
 672 | 
 673 | Your second task is to initialize the parameters in tensorflow.
 674 | 
 675 | **Exercise:** Implement the function below to initialize the parameters in tensorflow. You are going to use Xavier Initialization for weights and Zero Initialization for biases. The shapes are given below. As an example, to help you, for W1 and b1 you could use: 
 676 | 
 677 | ```python
 678 | W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 679 | b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
 680 | ```
 681 | Please use `seed = 1` to make sure your results match ours.
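
To give a feel for what Xavier initialization does, here is a NumPy sketch of the uniform variant, which draws from $[-l, l]$ with $l = \sqrt{6 / (n_{in} + n_{out})}$. We are assuming this is the variant the initializer above defaults to; the helper `xavier_uniform` is our own name, and the sampled values will not match TensorFlow's even with the same seed.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Sample a (fan_out, fan_in) weight matrix from the Xavier/Glorot uniform range."""
    fan_out, fan_in = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.RandomState(1)
W1 = xavier_uniform((25, 12288), rng)   # same shape as W1 above
b1 = np.zeros((25, 1))                  # zero initialization for the bias

print(W1.shape, b1.shape, W1.std())
```

The range shrinks as layers get wider, which keeps the variance of the weights near $2 / (n_{in} + n_{out})$ regardless of layer size.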
 682 | 
 683 | 
 684 | ```python
 685 | # GRADED FUNCTION: initialize_parameters
 686 | 
 687 | def initialize_parameters():
 688 |     """
 689 |     Initializes parameters to build a neural network with tensorflow. The shapes are:
 690 |                         W1 : [25, 12288]
 691 |                         b1 : [25, 1]
 692 |                         W2 : [12, 25]
 693 |                         b2 : [12, 1]
 694 |                         W3 : [6, 12]
 695 |                         b3 : [6, 1]
 696 |     
 697 |     Returns:
 698 |     parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
 699 |     """
 700 |     
 701 |     tf.set_random_seed(1)                   # so that your "random" numbers match ours
 702 |         
 703 |     ### START CODE HERE ### (approx. 6 lines of code)
 704 |     W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 705 |     b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
 706 |     W2 = tf.get_variable("W2", [12, 25], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 707 |     b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer())
 708 |     W3 = tf.get_variable("W3", [6, 12], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
 709 |     b3 = tf.get_variable("b3", [6,1], initializer = tf.zeros_initializer())
 710 |     ### END CODE HERE ###
 711 | 
 712 |     parameters = {"W1": W1,
 713 |                   "b1": b1,
 714 |                   "W2": W2,
 715 |                   "b2": b2,
 716 |                   "W3": W3,
 717 |                   "b3": b3}
 718 |     
 719 |     return parameters
 720 | ```
 721 | 
 722 | 
 723 | ```python
 724 | tf.reset_default_graph()
 725 | with tf.Session() as sess:
 726 |     parameters = initialize_parameters()
 727 |     print("W1 = " + str(parameters["W1"]))
 728 |     print("b1 = " + str(parameters["b1"]))
 729 |     print("W2 = " + str(parameters["W2"]))
 730 |     print("b2 = " + str(parameters["b2"]))
 731 | ```
 732 | 
 733 |     W1 = <tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref>
 734 |     b1 = <tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref>
 735 |     W2 = <tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref>
 736 |     b2 = <tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref>
 737 | 
 738 | 
 739 | **Expected Output**: 
 740 | 
 741 | <table> 
 742 |     <tr> 
 743 |         <td>
 744 |             **W1**
 745 |         </td>
 746 |         <td>
 747 |          < tf.Variable 'W1:0' shape=(25, 12288) dtype=float32_ref >
 748 |         </td>
 749 |     </tr>
 750 |     <tr> 
 751 |         <td>
 752 |             **b1**
 753 |         </td>
 754 |         <td>
 755 |         < tf.Variable 'b1:0' shape=(25, 1) dtype=float32_ref >
 756 |         </td>
 757 |     </tr>
 758 |     <tr> 
 759 |         <td>
 760 |             **W2**
 761 |         </td>
 762 |         <td>
 763 |         < tf.Variable 'W2:0' shape=(12, 25) dtype=float32_ref >
 764 |         </td>
 765 |     </tr>
 766 |     <tr> 
 767 |         <td>
 768 |             **b2**
 769 |         </td>
 770 |         <td>
 771 |         < tf.Variable 'b2:0' shape=(12, 1) dtype=float32_ref >
 772 |         </td>
 773 |     </tr>
 774 | 
 775 | </table>
 776 | 
 777 | As expected, the parameters haven't been evaluated yet.
 778 | 
 779 | ### 2.3 - Forward propagation in tensorflow 
 780 | 
 781 | You will now implement the forward propagation module in tensorflow. The function will take in a dictionary of parameters and it will complete the forward pass. The functions you will be using are: 
 782 | 
 783 | - `tf.add(...,...)` to do an addition
 784 | - `tf.matmul(...,...)` to do a matrix multiplication
 785 | - `tf.nn.relu(...)` to apply the ReLU activation
 786 | 
 787 | **Question:** Implement the forward pass of the neural network. We commented for you the numpy equivalents so that you can compare the tensorflow implementation to numpy. It is important to note that the forward propagation stops at `Z3`. The reason is that in tensorflow the last linear layer output is given as input to the function computing the loss. Therefore, you don't need `A3`!
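
For the comparison with numpy, the whole forward pass might look like the sketch below. The function name `forward_propagation_numpy` and the random stand-in parameters are ours; only the layer shapes are taken from this assignment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward_propagation_numpy(X, params):
    """LINEAR -> RELU -> LINEAR -> RELU -> LINEAR, stopping at Z3 (no softmax)."""
    Z1 = np.dot(params["W1"], X) + params["b1"]
    A1 = relu(Z1)
    Z2 = np.dot(params["W2"], A1) + params["b2"]
    A2 = relu(Z2)
    Z3 = np.dot(params["W3"], A2) + params["b3"]
    return Z3

rng = np.random.RandomState(1)
params = {
    "W1": rng.randn(25, 12288) * 0.01, "b1": np.zeros((25, 1)),
    "W2": rng.randn(12, 25) * 0.01,    "b2": np.zeros((12, 1)),
    "W3": rng.randn(6, 12) * 0.01,     "b3": np.zeros((6, 1)),
}
X = rng.rand(12288, 4)                  # 4 made-up examples, one per column

Z3 = forward_propagation_numpy(X, params)
print(Z3.shape)
```

Each bias of shape `(n, 1)` is broadcast across the example columns, so the output has one column of 6 logits per example.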
 788 | 
 789 | 
 790 | 
 791 | 
 792 | ```python
 793 | # GRADED FUNCTION: forward_propagation
 794 | 
 795 | def forward_propagation(X, parameters):
 796 |     """
 797 |     Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
 798 |     
 799 |     Arguments:
 800 |     X -- input dataset placeholder, of shape (input size, number of examples)
 801 |     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
 802 |                   the shapes are given in initialize_parameters
 803 | 
 804 |     Returns:
 805 |     Z3 -- the output of the last LINEAR unit
 806 |     """
 807 |     
 808 |     # Retrieve the parameters from the dictionary "parameters" 
 809 |     W1 = parameters['W1']
 810 |     b1 = parameters['b1']
 811 |     W2 = parameters['W2']
 812 |     b2 = parameters['b2']
 813 |     W3 = parameters['W3']
 814 |     b3 = parameters['b3']
 815 |     
 816 |     ### START CODE HERE ### (approx. 5 lines)              # Numpy Equivalents:
 817 |     Z1 = tf.add(tf.matmul(W1, X), b1)                      # Z1 = np.dot(W1, X) + b1
 818 |     A1 = tf.nn.relu(Z1)                                    # A1 = relu(Z1)
 819 |     Z2 = tf.add(tf.matmul(W2, A1), b2)                     # Z2 = np.dot(W2, A1) + b2
 820 |     A2 = tf.nn.relu(Z2)                                    # A2 = relu(Z2)
 821 |     Z3 = tf.add(tf.matmul(W3, A2), b3)                     # Z3 = np.dot(W3, A2) + b3
 822 |     ### END CODE HERE ###
 823 |     
 824 |     return Z3
 825 | ```
 826 | 
 827 | 
 828 | ```python
 829 | tf.reset_default_graph()
 830 | 
 831 | with tf.Session() as sess:
 832 |     X, Y = create_placeholders(12288, 6)
 833 |     parameters = initialize_parameters()
 834 |     Z3 = forward_propagation(X, parameters)
 835 |     print("Z3 = " + str(Z3))
 836 | ```
 837 | 
 838 |     Z3 = Tensor("Add_2:0", shape=(6, ?), dtype=float32)
 839 | 
 840 | 
 841 | **Expected Output**: 
 842 | 
 843 | <table> 
 844 |     <tr> 
 845 |         <td>
 846 |             **Z3**
 847 |         </td>
 848 |         <td>
 849 |         Tensor("Add_2:0", shape=(6, ?), dtype=float32)
 850 |         </td>
 851 |     </tr>
 852 | 
 853 | </table>
 854 | 
 855 | You may have noticed that the forward propagation doesn't output any cache. You will understand why below, when we get to backpropagation.
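For comparison, the numpy equivalents listed in the comments of `forward_propagation` can be run on their own. The following is a minimal sketch, not part of the graded assignment: the layer sizes, the `toy_parameters` dictionary and the helper name `forward_propagation_numpy` are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU, the numpy counterpart of tf.nn.relu."""
    return np.maximum(0, z)

def forward_propagation_numpy(X, parameters):
    """LINEAR -> RELU -> LINEAR -> RELU -> LINEAR, stopping at Z3 (no softmax)."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]

    Z1 = np.dot(W1, X) + b1    # tf.add(tf.matmul(W1, X), b1)
    A1 = relu(Z1)              # tf.nn.relu(Z1)
    Z2 = np.dot(W2, A1) + b2   # tf.add(tf.matmul(W2, A1), b2)
    A2 = relu(Z2)              # tf.nn.relu(Z2)
    Z3 = np.dot(W3, A2) + b3   # tf.add(tf.matmul(W3, A2), b3)
    return Z3

# Hypothetical toy sizes (the assignment uses much larger layers).
rng = np.random.default_rng(0)
n_x, n_h1, n_h2, n_y, m = 8, 5, 4, 6, 3
toy_parameters = {
    "W1": rng.standard_normal((n_h1, n_x)) * 0.1, "b1": np.zeros((n_h1, 1)),
    "W2": rng.standard_normal((n_h2, n_h1)) * 0.1, "b2": np.zeros((n_h2, 1)),
    "W3": rng.standard_normal((n_y, n_h2)) * 0.1, "b3": np.zeros((n_y, 1)),
}
X_toy = rng.standard_normal((n_x, m))

Z3_toy = forward_propagation_numpy(X_toy, toy_parameters)
print(Z3_toy.shape)  # (6, 3): one column of 6 logits per example
```

The output keeps the (number of classes, number of examples) layout, which is why the cost function later transposes it.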
 856 | 
 857 | ### 2.4 Compute cost
 858 | 
 859 | As seen before, it is very easy to compute the cost using:
 860 | ```python
 861 | tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))
 862 | ```
 863 | **Question**: Implement the cost function below. 
 864 | - It is important to know that the "`logits`" and "`labels`" inputs of `tf.nn.softmax_cross_entropy_with_logits` are expected to be of shape (number of examples, num_classes). We have thus transposed Z3 and Y for you.
 865 | - Besides, `tf.reduce_mean` averages the per-example losses over the examples.
 866 | 
 867 | 
 868 | ```python
 869 | # GRADED FUNCTION: compute_cost 
 870 | 
 871 | def compute_cost(Z3, Y):
 872 |     """
 873 |     Computes the cost
 874 |     
 875 |     Arguments:
 876 |     Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
 877 |     Y -- "true" labels vector placeholder, same shape as Z3
 878 |     
 879 |     Returns:
 880 |     cost -- Tensor of the cost function
 881 |     """
 882 |     
 883 |     # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
 884 |     logits = tf.transpose(Z3)
 885 |     labels = tf.transpose(Y)
 886 |     
 887 |     ### START CODE HERE ### (1 line of code)
 888 |     cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
 889 |     ### END CODE HERE ###
 890 |     
 891 |     return cost
 892 | ```
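To make the transposition and the averaging concrete, here is a minimal numpy sketch of what the cost above computes, assuming one-hot labels. The helper name `softmax_cross_entropy_numpy` and the toy inputs are hypothetical; the real tensorflow op also handles details not shown here.

```python
import numpy as np

def softmax_cross_entropy_numpy(Z3, Y):
    """Mean softmax cross-entropy for logits Z3 and one-hot labels Y,
    both of shape (num_classes, number of examples)."""
    logits = Z3.T   # (number of examples, num_classes), as the tensorflow op expects
    labels = Y.T

    # Log-softmax computed in a numerically stable way (subtract the row maximum).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    per_example = -(labels * log_probs).sum(axis=1)   # one loss value per example
    return per_example.mean()                         # the role of tf.reduce_mean

# With all-zero logits every class gets probability 1/6, so the cost is log(6).
Z3_toy = np.zeros((6, 4))
Y_toy = np.eye(6)[:, :4]          # 4 one-hot columns for classes 0..3
cost_toy = softmax_cross_entropy_numpy(Z3_toy, Y_toy)
print(cost_toy)                   # approximately 1.7918, i.e. log(6)
```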
 893 | 
 894 | 
 895 | ```python
 896 | tf.reset_default_graph()
 897 | 
 898 | with tf.Session() as sess:
 899 |     X, Y = create_placeholders(12288, 6)
 900 |     parameters = initialize_parameters()
 901 |     Z3 = forward_propagation(X, parameters)
 902 |     cost = compute_cost(Z3, Y)
 903 |     print("cost = " + str(cost))
 904 | ```
 905 | 
 906 |     cost = Tensor("Mean:0", shape=(), dtype=float32)
 907 | 
 908 | 
 909 | **Expected Output**: 
 910 | 
 911 | <table> 
 912 |     <tr> 
 913 |         <td>
 914 |             **cost**
 915 |         </td>
 916 |         <td>
 917 |         Tensor("Mean:0", shape=(), dtype=float32)
 918 |         </td>
 919 |     </tr>
 920 | 
 921 | </table>
 922 | 
 923 | ### 2.5 - Backward propagation & parameter updates
 924 | 
 925 | This is where you become grateful to programming frameworks. All the backpropagation and the parameter updates are taken care of in 1 line of code. It is very easy to incorporate this line in the model.
 926 | 
 927 | After you compute the cost function, you will create an "`optimizer`" object. You have to call this object along with the cost when running the `tf.Session`. When called, it will perform an optimization on the given cost with the chosen method and learning rate.
 928 | 
 929 | For instance, for gradient descent the optimizer would be:
 930 | ```python
 931 | optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)
 932 | ```
 933 | 
 934 | To make the optimization you would do:
 935 | ```python
 936 | _ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
 937 | ```
 938 | 
 939 | This computes the backpropagation by passing through the tensorflow graph in the reverse order, from the cost back to the inputs.
 940 | 
 941 | **Note** When coding, we often use `_` as a "throwaway" variable to store values that we won't need to use later. Here, `_` takes on the evaluated value of `optimizer`, which we don't need (and `c` takes the value of the `cost` variable). 
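To give a sense of what a single `sess.run([optimizer, cost], ...)` call does for plain gradient descent, the sketch below performs one hand-written update step in numpy. It is only an illustration under simplifying assumptions: a single linear layer followed by softmax rather than the three-layer network, with hypothetical toy data and a hypothetical helper `softmax_cost_and_grads`.

```python
import numpy as np

def softmax_cost_and_grads(W, b, X, Y):
    """Cost and gradients for a single LINEAR -> SOFTMAX layer.
    X has shape (n_x, m); Y is one-hot with shape (n_y, m)."""
    m = X.shape[1]
    Z = np.dot(W, X) + b
    Z = Z - Z.max(axis=0, keepdims=True)                    # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)    # softmax probabilities
    cost = -np.sum(Y * np.log(P)) / m

    dZ = (P - Y) / m                        # gradient of the mean cost w.r.t. Z
    dW = np.dot(dZ, X.T)
    db = dZ.sum(axis=1, keepdims=True)
    return cost, dW, db

# Hypothetical toy problem: 4 features, 3 classes, 10 examples.
rng = np.random.default_rng(1)
X_toy = rng.standard_normal((4, 10))
Y_toy = np.eye(3)[:, rng.integers(0, 3, size=10)]
W, b = np.zeros((3, 4)), np.zeros((3, 1))
learning_rate = 0.1

cost_before, dW, db = softmax_cost_and_grads(W, b, X_toy, Y_toy)

# One gradient descent update: this is the part the optimizer automates.
W = W - learning_rate * dW
b = b - learning_rate * db

cost_after, _, _ = softmax_cost_and_grads(W, b, X_toy, Y_toy)
print(cost_before, cost_after)   # cost_before is log(3); cost_after is smaller
```

In tensorflow the gradients are derived from the graph automatically, which is why no hand-written backward pass appears in the assignment.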
 942 | 
 943 | ### 2.6 - Building the model
 944 | 
 945 | Now, you will bring it all together! 
 946 | 
 947 | **Exercise:** Implement the model. You will be calling the functions you had previously implemented.
 948 | 
 949 | 
 950 | ```python
 951 | def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
 952 |           num_epochs = 1500, minibatch_size = 32, print_cost = True):
 953 |     """
 954 |     Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
 955 |     
 956 |     Arguments:
 957 |     X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
 958 |     Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
 959 |     X_test -- test set, of shape (input size = 12288, number of test examples = 120)
 960 |     Y_test -- test set, of shape (output size = 6, number of test examples = 120)
 961 |     learning_rate -- learning rate of the optimization
 962 |     num_epochs -- number of epochs of the optimization loop
 963 |     minibatch_size -- size of a minibatch
 964 |     print_cost -- True to print the cost every 100 epochs
 965 |     
 966 |     Returns:
 967 |     parameters -- parameters learnt by the model. They can then be used to predict.
 968 |     """
 969 |     
 970 |     ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
 971 |     tf.set_random_seed(1)                             # to keep consistent results
 972 |     seed = 3                                          # to keep consistent results
 973 |     (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
 974 |     n_y = Y_train.shape[0]                            # n_y : output size
 975 |     costs = []                                        # To keep track of the cost
 976 |     
 977 |     # Create Placeholders of shape (n_x, n_y)
 978 |     ### START CODE HERE ### (1 line)
 979 |     X, Y = create_placeholders(n_x, n_y)
 980 |     ### END CODE HERE ###
 981 | 
 982 |     # Initialize parameters
 983 |     ### START CODE HERE ### (1 line)
 984 |     parameters = initialize_parameters()
 985 |     ### END CODE HERE ###
 986 |     
 987 |     # Forward propagation: Build the forward propagation in the tensorflow graph
 988 |     ### START CODE HERE ### (1 line)
 989 |     Z3 = forward_propagation(X, parameters)
 990 |     ### END CODE HERE ###
 991 |     
 992 |     # Cost function: Add cost function to tensorflow graph
 993 |     ### START CODE HERE ### (1 line)
 994 |     cost = compute_cost(Z3, Y)
 995 |     ### END CODE HERE ###
 996 |     
 997 |     # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
 998 |     ### START CODE HERE ### (1 line)
 999 |     optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
1000 |     ### END CODE HERE ###
1001 |     
1002 |     # Initialize all the variables
1003 |     init = tf.global_variables_initializer()
1004 | 
1005 |     # Start the session to compute the tensorflow graph
1006 |     with tf.Session() as sess:
1007 |         
1008 |         # Run the initialization
1009 |         sess.run(init)
1010 |         
1011 |         # Do the training loop
1012 |         for epoch in range(num_epochs):
1013 | 
1014 |             epoch_cost = 0.                       # Defines a cost related to an epoch
1015 |             num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
1016 |             seed = seed + 1
1017 |             minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)
1018 | 
1019 |             for minibatch in minibatches:
1020 | 
1021 |                 # Select a minibatch
1022 |                 (minibatch_X, minibatch_Y) = minibatch
1023 |                 
1024 |                 # IMPORTANT: The line that runs the graph on a minibatch.
1025 |                 # Run the session to execute the "optimizer" and the "cost", the feed_dict should contain a minibatch for (X,Y).
1026 |                 ### START CODE HERE ### (1 line)
1027 |                 _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
1028 |                 ### END CODE HERE ###
1029 |                 
1030 |                 epoch_cost += minibatch_cost / num_minibatches
1031 | 
1032 |             # Print the cost every 100 epochs and record it every 5 epochs
1033 |             if print_cost == True and epoch % 100 == 0:
1034 |                 print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
1035 |             if print_cost == True and epoch % 5 == 0:
1036 |                 costs.append(epoch_cost)
1037 |                 
1038 |         # plot the cost
1039 |         plt.plot(np.squeeze(costs))
1040 |         plt.ylabel('cost')
1041 |         plt.xlabel('epochs (per fives)')
1042 |         plt.title("Learning rate =" + str(learning_rate))
1043 |         plt.show()
1044 | 
1045 |         # lets save the parameters in a variable
1046 |         parameters = sess.run(parameters)
1047 |         print ("Parameters have been trained!")
1048 | 
1049 |         # Calculate the correct predictions
1050 |         correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))
1051 | 
1052 |         # Calculate accuracy (evaluated below on both the train and the test set)
1053 |         accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
1054 | 
1055 |         print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
1056 |         print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
1057 |         
1058 |         return parameters
1059 | ```
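The training loop relies on `random_mini_batches`, a helper from the assignment's utility code that is not shown in this section. As a rough sketch of what such a helper might do, the hypothetical `random_mini_batches_sketch` below shuffles the columns and slices them into minibatches, keeping the final partial one. This is an assumption about the helper's behaviour, not its actual implementation.

```python
import numpy as np

def random_mini_batches_sketch(X, Y, mini_batch_size=32, seed=0):
    """Shuffle the examples (columns) and split them into minibatches.
    Hypothetical stand-in for the assignment's random_mini_batches helper."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size   # slicing clips the last, smaller batch
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

# Same number of examples and minibatch size as in model(); small feature size for speed.
m = 1080
X_toy = np.zeros((3, m))
Y_toy = np.zeros((6, m))
batches = random_mini_batches_sketch(X_toy, Y_toy, mini_batch_size=32, seed=4)

print(len(batches))           # 34 minibatches: 33 full ones plus one partial
print(int(m / 32))            # 33, the value of num_minibatches in model()
print(batches[-1][0].shape)   # (3, 24): the last minibatch holds the 24 leftover examples
```

If the course helper behaves the same way, `epoch_cost` in `model()` sums 34 minibatch costs but divides by 33, so it slightly overestimates the true average. That does not affect training, only the reported value.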
1060 | 
1061 | Run the following cell to train your model! On our machine it takes about 5 minutes. Your "Cost after epoch 100" should be 1.016458. If it's not, don't waste time; interrupt the training by clicking on the square (⬛) in the upper bar of the notebook, and try to correct your code. If it is the correct cost, take a break and come back in 5 minutes!
1062 | 
1063 | 
1064 | ```python
1065 | parameters = model(X_train, Y_train, X_test, Y_test)
1066 | ```
1067 | 
1068 |     Cost after epoch 0: 1.855702
1069 |     Cost after epoch 100: 1.016458
1070 |     Cost after epoch 200: 0.733102
1071 |     Cost after epoch 300: 0.572940
1072 |     Cost after epoch 400: 0.468774
1073 |     Cost after epoch 500: 0.381021
1074 |     Cost after epoch 600: 0.313822
1075 |     Cost after epoch 700: 0.254158
1076 |     Cost after epoch 800: 0.203829
1077 |     Cost after epoch 900: 0.166421
1078 |     Cost after epoch 1000: 0.141486
1079 |     Cost after epoch 1100: 0.107580
1080 |     Cost after epoch 1200: 0.086270
1081 |     Cost after epoch 1300: 0.059371
1082 |     Cost after epoch 1400: 0.052228
1083 | 
1084 | 
1085 | 
1086 | ![png](output_62_1.png)
1087 | 
1088 | 
1089 |     Parameters have been trained!
1090 |     Train Accuracy: 0.999074
1091 |     Test Accuracy: 0.716667
1092 | 
1093 | 
1094 | **Expected Output**:
1095 | 
1096 | <table> 
1097 |     <tr> 
1098 |         <td>
1099 |             **Train Accuracy**
1100 |         </td>
1101 |         <td>
1102 |         0.999074
1103 |         </td>
1104 |     </tr>
1105 |     <tr> 
1106 |         <td>
1107 |             **Test Accuracy**
1108 |         </td>
1109 |         <td>
1110 |         0.716667
1111 |         </td>
1112 |     </tr>
1113 | 
1114 | </table>
1115 | 
1116 | Amazing, your algorithm can recognize a sign representing a figure between 0 and 5 with 71.7% accuracy.
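The accuracy reported above comes from comparing `tf.argmax(Z3)` with `tf.argmax(Y)`, which by default take the argmax along axis 0, i.e. over the classes of each example (column). A minimal numpy sketch of the same computation, with hypothetical toy values and a hypothetical helper name:

```python
import numpy as np

def accuracy_numpy(Z3, Y):
    """Fraction of examples whose highest logit matches the one-hot label.
    Z3 and Y both have shape (num_classes, number of examples)."""
    predicted = np.argmax(Z3, axis=0)   # predicted class per column
    actual = np.argmax(Y, axis=0)       # true class per column
    return np.mean(predicted == actual)

# 3 classes, 4 examples; the last example is misclassified.
Z3_toy = np.array([[2.0, 0.1, 0.3, 0.0],
                   [0.5, 1.5, 0.2, 0.9],
                   [0.1, 0.2, 0.9, 0.4]])
Y_toy = np.array([[1, 0, 0, 1],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0]])

print(accuracy_numpy(Z3_toy, Y_toy))   # 0.75
```

Because softmax preserves the ordering of the logits, taking the argmax of `Z3` gives the same prediction as taking the argmax of the softmax output.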
1117 | 
1118 | **Insights**:
1119 | - Your model seems big enough to fit the training set well. However, given the difference between train and test accuracy, you could try to add L2 or dropout regularization to reduce overfitting. 
1120 | - Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.
1121 | 
1122 | ### 2.7 - Test with your own image (optional / ungraded exercise)
1123 | 
1124 | Congratulations on finishing this assignment. You can now take a picture of your hand and see the output of your model. To do that:
1125 |     1. Click on "File" in the upper bar of this notebook, then click "Open" to go to your Coursera Hub.
1126 |     2. Add your image to this Jupyter Notebook's directory, in the "images" folder
1127 |     3. Write your image's name in the following code
1128 |     4. Run the code and check if the algorithm is right!
1129 | 
1130 | 
1131 | ```python
1132 | import scipy
1133 | from PIL import Image
1134 | from scipy import ndimage
1135 | 
1136 | ## START CODE HERE ## (PUT YOUR IMAGE NAME) 
1137 | my_image = "thumbs_up.jpg"
1138 | ## END CODE HERE ##
1139 | 
1140 | # We preprocess your image to fit your algorithm.
1141 | fname = "images/" + my_image
1142 | image = np.array(ndimage.imread(fname, flatten=False))
1143 | my_image = scipy.misc.imresize(image, size=(64,64)).reshape((1, 64*64*3)).T
1144 | my_image_prediction = predict(my_image, parameters)
1145 | 
1146 | plt.imshow(image)
1147 | print("Your algorithm predicts: y = " + str(np.squeeze(my_image_prediction)))
1148 | ```
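Note that `scipy.ndimage.imread` and `scipy.misc.imresize` were removed from later SciPy releases, so the cell above only runs with an older SciPy. A possible substitute, sketched under the assumption that Pillow is installed, is shown below. The helper names `load_image_pillow` and `to_column_vector` are hypothetical, and Pillow's default resampling may not produce exactly the same pixels as `imresize`.

```python
import numpy as np

def load_image_pillow(fname, size=(64, 64)):
    """Load an image file as an RGB array resized to `size` (assumes Pillow is installed)."""
    from PIL import Image   # imported here so to_column_vector works without Pillow
    with Image.open(fname) as img:
        resized = img.convert("RGB").resize(size)
        return np.array(resized)        # shape (64, 64, 3), dtype uint8

def to_column_vector(pixels):
    """Flatten an image array into a single column, as done in the cell above."""
    return pixels.reshape((1, -1)).T

# Stand-in for a loaded image, so the reshaping can be checked without a file.
fake_pixels = np.zeros((64, 64, 3), dtype=np.uint8)
column = to_column_vector(fake_pixels)
print(column.shape)   # (12288, 1), matching the input size of the model
```

With a real file, something like `to_column_vector(load_image_pillow("images/" + my_image))` would stand in for the `imread` and `imresize` lines.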
1149 | 
1150 | You indeed deserve a "thumbs-up", although as you can see the algorithm seems to classify it incorrectly. The reason is that the training set doesn't contain any "thumbs-up", so the model doesn't know how to deal with it! We call that a "mismatched data distribution" and it is one of the topics of the next course on "Structuring Machine Learning Projects".
1151 | 
1152 | <font color='blue'>
1153 | **What you should remember**:
1154 | - Tensorflow is a programming framework used in deep learning
1155 | - The two main object classes in tensorflow are Tensors and Operations. 
1156 | - When you code in tensorflow you have to take the following steps:
1157 |     - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
1158 |     - Create a session
1159 |     - Initialize the session
1160 |     - Run the session to execute the graph
1161 | - You can execute the graph multiple times as you've seen in model()
1162 | - The backpropagation and optimization are done automatically when running the session on the "optimizer" object.
1163 | 


--------------------------------------------------------------------------------