├── Padding.png
├── Stride.png
├── MaxPooling.PNG
├── Single_layer.png
├── ConvolutionGS.png
├── ConvolutionRGB.png
├── Single_layer_LR.png
├── ConvolutionExample.PNG
├── Two_hidden_layers-01.png
├── Single_layer_LR_with_values.png
├── KRG elegant logo for light BG-01.png
├── Stylesheet for R Markdown.txt
├── MultipleLinearRegression.csv
├── README.md
├── MNIST_model_1.R
├── MNIST_model_2.R
├── Implementing improvements using tfruns.R
├── MNIST_model_3.R
├── MNIST_base_file.R
├── Implementing training improvements.R
├── Demonstrating Keras in R.Rmd
├── LogisticRegression.csv
├── Dropout.Rmd
├── Regression as a shallow network.Rmd
├── Cross_entropy.Rmd
├── The basics of a neural network.Rmd
├── Example_of_a_convolutional_neural_network.Rmd
├── Deep neural networks for regression problems.Rmd
├── Introduction to convolutional neural networks.Rmd
├── Predicting skin.Rmd
├── Regularization.Rmd
├── Poor performance of a deep learning model.Rmd
├── Linear regression as a simple learner.Rmd
├── Regression.Rmd
├── Improving training of a neural network.Rmd
├── Implementing regularization and dropout.Rmd
├── A brief introduction to R.Rmd
└── Deep neural network example using R.Rmd
/Padding.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Padding.png
--------------------------------------------------------------------------------
/Stride.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Stride.png
--------------------------------------------------------------------------------
/MaxPooling.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/MaxPooling.PNG
--------------------------------------------------------------------------------
/Single_layer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Single_layer.png
--------------------------------------------------------------------------------
/ConvolutionGS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/ConvolutionGS.png
--------------------------------------------------------------------------------
/ConvolutionRGB.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/ConvolutionRGB.png
--------------------------------------------------------------------------------
/Single_layer_LR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Single_layer_LR.png
--------------------------------------------------------------------------------
/ConvolutionExample.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/ConvolutionExample.PNG
--------------------------------------------------------------------------------
/Two_hidden_layers-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Two_hidden_layers-01.png
--------------------------------------------------------------------------------
/Single_layer_LR_with_values.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/Single_layer_LR_with_values.png
--------------------------------------------------------------------------------
/KRG elegant logo for light BG-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/juanklopper/Deep-learning-using-R/master/KRG elegant logo for light BG-01.png
--------------------------------------------------------------------------------
/Stylesheet for R Markdown.txt:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/MultipleLinearRegression.csv:
--------------------------------------------------------------------------------
1 | x1,x2,x3,y
2 | 20.1,39.3,1.3,394.5
3 | 23.6,31.6,1.5,211.4
4 | 29.2,36.9,1.4,251.4
5 | 29.3,34.1,1.2,85.4
6 | 30,37.2,1.2,248.6
7 | 22.9,39.3,1.9,46
8 | 25.1,33,1.3,252.5
9 | 27.7,36,2,315.4
10 | 24.7,34.5,1.3,120.5
11 | 24.2,39.8,1.5,110.1
12 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep-learning-using-R
2 |
3 | RStudio notebooks for deep learning in R
4 |
5 | This repository contains my R Markdown files on deep learning. They are also available on RPubs at https://rpubs.com/juanhklopper
6 |
7 | Videos based on these notebooks are available at https://www.youtube.com/playlist?list=PLsu0TcgLDUiIKPMXu1k_rItoTV8xPe1cj
8 |
9 | Juan.
10 |
--------------------------------------------------------------------------------
/MNIST_model_1.R:
--------------------------------------------------------------------------------
1 | model <- keras_model_sequential()
2 | model %>%
3 | layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
4 | layer_dense(units = 128, activation = "relu") %>%
5 | layer_dense(units = 10, activation = "softmax")
6 | model %>% compile(loss = "categorical_crossentropy",
7 | optimizer = "rmsprop",
8 | metrics = c("accuracy"))
9 | history <- model %>% fit(x_train,
10 | y_train,
11 | batch_size = 256,
12 | epochs = 50,
13 | callbacks = list(callback_early_stopping(monitor = "loss",
14 | patience = 2)),
15 | verbose = 2,
16 | validation_data = list(x_test,
17 | y_test))
--------------------------------------------------------------------------------
/MNIST_model_2.R:
--------------------------------------------------------------------------------
1 | model <- keras_model_sequential()
2 | model %>%
3 | layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
4 | layer_dropout(0.2) %>%
5 | layer_dense(units = 128, activation = "relu") %>%
6 | layer_dropout(0.2) %>%
7 | layer_dense(units = 10, activation = "softmax")
8 | model %>% compile(loss = "categorical_crossentropy",
9 | optimizer = "rmsprop",
10 | metrics = c("accuracy"))
11 | history <- model %>% fit(x_train,
12 | y_train,
13 | batch_size = 512,
14 | epochs = 50,
15 | callbacks = list(callback_early_stopping(monitor = "loss",
16 | patience = 3)),
17 | verbose = 2,
18 | validation_data = list(x_test,
19 | y_test))
--------------------------------------------------------------------------------
/Implementing improvements using tfruns.R:
--------------------------------------------------------------------------------
1 | # The model
2 |
3 | init_W = initializer_lecun_normal(seed = 123)
4 | init_B = initializer_zeros()
5 |
6 | baseline_model <-
7 | keras_model_sequential() %>%
8 | layer_dense(units = 12,
9 | activation = "relu",
10 | kernel_initializer = init_W,
11 | input_shape = c(12)) %>%
12 | layer_dense(units = 12,
13 | activation = "relu") %>%
14 | layer_dense(units = 1,
15 | activation = "sigmoid")
16 |
17 | baseline_model %>% compile(
18 | optimizer = optimizer_rmsprop(lr = 0.0005,
19 | rho = 0.95),
20 | loss = "binary_crossentropy",
21 | metrics = list("accuracy")
22 | )
23 |
24 | baseline_history <- baseline_model %>%
25 | fit(train_data_normalized,
26 | train_labels,
27 | epochs = 40,
28 | batch_size = 512,
29 | validation_data = list(test_data_normalized, test_labels),
30 | verbose = 2)
31 |
32 |
--------------------------------------------------------------------------------
/MNIST_model_3.R:
--------------------------------------------------------------------------------
1 | model <- keras_model_sequential()
2 | model %>%
3 | layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
4 | layer_dropout(0.2) %>%
5 | layer_dense(units = 128, activation = "relu") %>%
6 | layer_dropout(0.2) %>%
7 | layer_dense(units = 10, activation = "softmax")
8 | model %>% compile(loss = "categorical_crossentropy",
9 | optimizer = optimizer_adam(lr = 0.003,
10 | beta_1 = 0.92,
11 | beta_2 = 0.95) ,
12 | metrics = c("accuracy"))
13 | history <- model %>% fit(x_train,
14 | y_train,
15 | batch_size = 512,
16 | epochs = 50,
17 | callbacks = list(callback_early_stopping(monitor = "loss",
18 | patience = 3)),
19 | verbose = 2,
20 | validation_data = list(x_test,
21 | y_test))
--------------------------------------------------------------------------------
/MNIST_base_file.R:
--------------------------------------------------------------------------------
1 | # Set working directory
2 | setwd("C:\\Users\\juank\\OneDrive\\R\\Deep learning")
3 |
4 | # Import keras and plotly
5 | library(keras)
6 | suppressMessages(library(plotly))
7 |
8 |
9 | # THE DATA
10 |
11 | # Importing the built-in MNIST dataset
12 | c(c(x_train, y_train), c(x_test, y_test)) %<-% dataset_mnist()
13 |
14 | # The shape of x_train shows 60000 images of 28 by 28 pixels
15 | dim(x_train)
16 |
17 | # Reshape the square image pixel values to vectors
18 | x_train <- array_reshape(x_train, c(nrow(x_train), 784))
19 | x_test <- array_reshape(x_test, c(nrow(x_test), 784))
20 |
21 | # The dimensions of the training dataset after reshaping
22 | dim(x_train)
23 |
24 | # Normalize the pixel values by dividing by the maximum value
25 | x_train <- x_train / 255
26 | x_test <- x_test / 255
27 |
28 | # Showing the first training target class
29 | y_train[1]
30 |
31 | # One-hot-encoding of the target variables, indicating the 10 classes
32 | y_train <- to_categorical(y_train, 10)
33 | y_test <- to_categorical(y_test, 10)
34 |
35 | # Showing the first training sample
36 | y_train[1, ]
37 |
38 | ## USING tfruns
39 |
40 | library(tfruns)
41 | tfruns::training_run("MNIST_model_1.R")
42 |
43 | training_run("MNIST_model_2.R")
44 |
45 | compare_runs()
46 |
47 | training_run("MNIST_model_3.R")
48 |
49 | compare_runs()
50 |
--------------------------------------------------------------------------------
/Implementing training improvements.R:
--------------------------------------------------------------------------------
1 | setwd("C:\\Users\\juank\\OneDrive\\R\\Deep learning")
2 |
3 | suppressWarnings(library(keras))
4 | suppressMessages(library(readr))
5 |
6 | train.import <- read_csv("ImprovementsTrain.csv")
7 | test.import <- read_csv("ImprovementsTest.csv")
8 |
9 | # Cast dataframe as a matrix and remove column names
10 | train.import <- as.matrix(train.import)
11 | dimnames(train.import) = NULL
12 | test.import <- as.matrix(test.import)
13 | dimnames(test.import) = NULL
14 |
15 | # Create train and test sets
16 | train_data <- train.import[, 1:12]
17 | train_labels <- train.import[, 13]
18 | test_data <- test.import[, 1:12]
19 | test_labels <- test.import[, 13]
20 |
21 | # Calculating the means of each of the training set feature variables
22 | feature.means = vector(length = ncol(train_data))
23 | for (i in 1:length(feature.means)){
24 | feature.means[i] = mean(train_data[, i])
25 | }
26 |
27 | # Calculating the standard deviations of each of the training set feature variables
28 | feature.sds = vector(length = ncol(train_data))
29 | for (i in 1:length(feature.sds)){
30 | feature.sds[i] = sd(train_data[, i])
31 | }
32 |
33 | # Normalizing the feature variables in the training set
34 | train_data_normalized <- matrix(nrow = nrow(train_data),
35 | ncol = ncol(train_data))
36 | for (n in 1:ncol(train_data)){
37 | for (m in 1:nrow(train_data)){
38 | train_data_normalized[m, n] = (train_data[m, n] - feature.means[n]) / feature.sds[n]
39 | }
40 | }
41 |
42 | # Normalizing the feature variables in the test set
43 | test_data_normalized <- matrix(nrow = nrow(test_data), ncol = ncol(test_data))
44 | for (n in 1:ncol(test_data)){
45 | for (m in 1:nrow(test_data)){
46 | test_data_normalized[m, n] = (test_data[m, n] - feature.means[n]) / feature.sds[n]
47 | }
48 | }
49 |
50 | suppressMessages(library(tfruns))
51 |
52 | tfruns::training_run(file = "Implementing improvements using tfruns.R")
53 |
54 | latest_run()
55 |
56 | training_run(file = "Implementing improvements using tfruns.R")
57 |
58 | compare_runs()
59 |
60 | training_run(file = "Implementing improvements using tfruns.R")
61 |
62 | compare_runs()
63 |
--------------------------------------------------------------------------------
/Demonstrating Keras in R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to Keras"
3 | author: "Dr Juan H Klopper"
4 | output: html_document
5 | ---
6 |
7 | ```{r setup, include=FALSE}
8 | knitr::opts_chunk$set(echo = TRUE)
9 | setwd(getwd())
10 | library(keras)
11 | ```
12 |
13 | ## Preparing the data
14 |
15 | ```{r}
16 | mnist <- dataset_mnist()
17 | ```
18 |
19 | ```{r}
20 | names(mnist)
21 | ```
22 |
23 | ```{r}
24 | x_train <- mnist$train$x
25 | y_train <- mnist$train$y
26 | x_test <- mnist$test$x
27 | y_test <- mnist$test$y
28 | ```
29 |
30 | ```{r}
31 | dim(x_train)
32 | ```
33 |
34 | ```{r}
35 | dim(y_train)
36 | ```
37 |
38 | ```{r}
39 | # reshape
40 | x_train <- array_reshape(x_train, c(nrow(x_train), 784))
41 | x_test <- array_reshape(x_test, c(nrow(x_test), 784))
42 | # rescale
43 | x_train <- x_train / 255
44 | x_test <- x_test / 255
45 | ```
46 |
47 | One-hot encoding
48 |
49 | ```{r}
50 | y_train <- to_categorical(y_train, 10)
51 | y_test <- to_categorical(y_test, 10)
52 | ```
53 |
54 | ## Building the model
55 |
56 | ```{r}
57 | model <- keras_model_sequential()
58 | model %>%
59 | layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
60 | layer_dropout(rate = 0.4) %>%
61 | layer_dense(units = 128, activation = 'relu') %>%
62 | layer_dropout(rate = 0.3) %>%
63 | layer_dense(units = 10, activation = 'softmax')
64 | ```
65 |
66 | A summary of the model can be viewed. This shows the shape of each layer and the number of trainable and non-trainable parameters.
67 |
68 | ```{r}
69 | summary(model)
70 | ```
71 |
72 | Next up is the compilation of the model, providing the loss function, the optimizer, and the metric to be displayed during each epoch.
73 |
74 | ```{r}
75 | model %>% compile(
76 | loss = 'categorical_crossentropy',
77 | optimizer = optimizer_rmsprop(),
78 | metrics = c('accuracy')
79 | )
80 | ```
81 |
82 | ## Training and evaluating
83 |
84 | ```{r}
85 | history <- model %>% fit(
86 | x_train, y_train,
87 | epochs = 30, batch_size = 128,
88 | validation_split = 0.2
89 | )
90 | ```
91 |
92 | ```{r}
93 | plot(history)
94 | ```
95 |
96 | ## Evaluating on the test set
97 |
98 | ```{r}
99 | model %>%
100 | evaluate(x_test,
101 | y_test)
102 | ```
103 |
104 |
--------------------------------------------------------------------------------
/LogisticRegression.csv:
--------------------------------------------------------------------------------
1 | x1,x2,x3,x4,y
2 | 15.5,110,2.5,52.6,1
3 | 14.8,95.5,8.1,20.1,1
4 | 11.5,110.5,9.5,6.7,1
5 | 16.6,108,9.5,19.2,1
6 | 17.2,110.2,4.6,30.2,1
7 | 12.8,103,2.1,11.5,0
8 | 19.3,94.2,3.8,7.9,0
9 | 10.4,98.7,3.1,14.7,0
10 | 19.4,96.5,4.9,7.9,0
11 | 14.5,115.3,7.9,3.6,1
12 | 14.1,104.4,2,16.1,0
13 | 11.6,108.5,7.2,15.2,1
14 | 11.7,82.2,8,0.8,1
15 | 10.8,99.9,7.6,7.7,1
16 | 18.9,97.9,1.6,32.7,0
17 | 16.9,112.6,9.9,14,1
18 | 12,99.2,8.9,8.5,1
19 | 11.8,102.1,8.2,24.2,1
20 | 10.1,82.4,0.9,31.8,0
21 | 11.3,98.6,10,32.5,1
22 | 12.1,104,3.6,20.6,1
23 | 15.8,109.4,2.1,28.4,0
24 | 11.7,82.5,8.5,16.1,1
25 | 19.9,105.1,8.3,26.8,1
26 | 18.2,101.2,0.3,0.9,0
27 | 11.3,94.2,5.9,12.3,1
28 | 16.2,101.9,4.1,13.9,1
29 | 16.4,99.4,6.7,66.1,1
30 | 17.7,115.8,5.5,2.4,1
31 | 15.6,107,2.8,0.3,0
32 | 19.3,111.6,1.2,9.6,0
33 | 14,112.4,2.3,21.7,0
34 | 14.1,104.9,2.3,16.7,0
35 | 16.4,106.6,0.2,21,0
36 | 14.7,82.8,5,27,0
37 | 16.5,93.2,1.9,6.6,0
38 | 10.6,101.4,9.8,26.3,1
39 | 12.8,107.6,9.1,2,1
40 | 16.2,114.8,2.5,5.6,0
41 | 14.7,97.1,4.4,22.3,1
42 | 15.2,109,0.4,24.1,0
43 | 14.5,102.8,7.5,49,1
44 | 10.2,90.6,7.4,11.1,1
45 | 10.9,81,1.4,9.1,0
46 | 13.3,92.6,6.8,108.9,1
47 | 19.8,108.3,3.3,14.4,0
48 | 10,88.9,1.9,14.7,0
49 | 16,87.4,3.3,11.8,0
50 | 12.5,104.8,6,2.7,1
51 | 12.5,94.6,5.6,22.9,1
52 | 16.4,104.4,4.3,3.9,0
53 | 11.8,112,7.2,46.8,1
54 | 18.7,99.7,0.5,60,0
55 | 13.6,105.3,0.8,15.6,0
56 | 19.9,102.1,7.4,1.8,1
57 | 10.3,103.9,9.2,6.3,1
58 | 12,107.1,0.7,14.3,0
59 | 14.6,94,3.6,6.7,0
60 | 18.2,92.8,0.1,34.9,0
61 | 19.1,107,9.2,14.4,1
62 | 10.2,99.1,9.7,20.3,1
63 | 13.9,96.9,1.5,1.4,0
64 | 17.4,92.2,7.8,30.4,1
65 | 12.5,94.7,4.8,10.6,1
66 | 18,91.9,3.8,30.2,0
67 | 19.8,81.7,1.3,23.2,0
68 | 19,104.8,9.4,1.4,1
69 | 17.9,112.3,0.6,3.7,0
70 | 13,100.3,5.4,14,1
71 | 18,84.5,0.2,22.7,0
72 | 20,112.1,8.5,25,1
73 | 16.5,98,0.3,27.1,0
74 | 15.7,102.8,5.1,0.2,0
75 | 12.9,100.4,8,32.4,1
76 | 16.9,96.9,8.6,1,1
77 | 11.1,111.5,6.7,10.5,1
78 | 17.7,100.8,5,10.3,1
79 | 12.1,107.5,1.1,41.8,1
80 | 18.5,101.4,2.9,3.7,0
81 | 11.6,99,5.1,3.4,1
82 | 15.7,94.6,8.6,16.3,1
83 | 10.9,113.5,6.6,6.5,1
84 | 13.2,96.1,5.8,34.4,1
85 | 10.4,102.9,4.2,17.4,1
86 | 19,94.4,2.3,4.2,0
87 | 19.5,100.9,4.8,0.6,0
88 | 11.5,102.4,7.8,27.9,1
89 | 17.7,107.8,9.6,81.1,1
90 | 10.1,94,8.6,8.8,1
91 | 11.4,100.5,5.5,4.6,1
92 | 13.7,97.7,7.7,29.8,1
93 | 15,101.1,9.1,1.3,1
94 | 11.7,106.4,6.5,6.4,1
95 | 18.6,101.4,5.9,6.3,1
96 | 13.1,105.8,2.7,51,0
97 | 14.2,110.8,4.8,6.9,1
98 | 15.6,94.1,8.1,17.1,1
99 | 16.1,96.1,4,8,0
100 | 12.8,120,9.7,15.3,1
101 | 16.1,107.2,7.9,2,1
102 | 13.8,104.8,6.5,0.9,1
103 | 18.2,91.9,4.4,13.7,0
104 | 17.2,108.8,4.4,60.2,1
105 | 18.1,94,9.4,1,1
106 | 16,108.2,1.2,29.1,0
107 | 15.2,108.1,1.5,11.9,0
108 | 18.8,121.5,4.5,1.9,0
109 | 19.2,117.2,4.5,23.5,0
110 | 14.2,106.3,0.6,8.9,0
111 | 12,102.3,2.8,23.8,0
112 | 19.1,89.7,10,59.2,1
113 | 20,103.6,7.7,21.2,1
114 | 15.6,88.4,3.3,8.5,0
115 | 18,79.1,5.8,4,0
116 | 15.4,88.2,9,12.7,1
117 | 11,115.2,7.1,25.5,1
118 | 19.5,97.8,0.2,2.1,0
119 | 14.7,95,5.9,33.1,1
120 | 19.7,109.3,3.1,12,0
121 | 19.3,102.2,5.5,39.4,1
122 | 19,122,5.3,18.3,1
123 | 10.5,88.7,6.4,2.5,1
124 | 15.6,105.5,4.8,10.2,1
125 | 19.7,100.9,8,0.3,1
126 | 18.6,99.4,4.1,9.2,0
127 | 12.7,85,1.4,22.6,0
128 | 16.6,85.4,2.7,14.8,0
129 | 10.9,98.5,3.6,18.7,1
130 | 14.4,106.7,4.5,0.4,0
131 | 17.6,97.8,4.4,18,0
132 | 13.4,100.5,5.1,91.4,1
133 | 19.6,82.9,6.9,5.7,1
134 | 15.8,95.7,1.2,19.9,0
135 | 12.7,89.5,3.8,16.7,1
136 | 13.9,99.7,0.4,10.6,0
137 | 14.9,90,8.4,10.2,1
138 | 14,89.1,2,3.2,0
139 | 14.9,104.7,7.5,6.1,1
140 | 11.8,94.7,8.7,12.8,1
141 | 16.1,83,8.4,2,1
142 | 11.5,100,3.4,4,0
143 | 12.3,98,7.2,34.4,1
144 | 13.9,92.9,9.5,36.7,1
145 | 14.9,95.2,6.1,26.4,1
146 | 12.4,105.3,1.7,39.5,0
147 | 11.5,95.6,8.7,1.1,1
148 | 15.5,93.7,8,28.9,1
149 | 14.1,97.4,7.2,27.3,1
150 | 18.1,123.5,2.7,16.4,0
151 | 19.5,116.9,7.8,65,1
152 |
--------------------------------------------------------------------------------
/Dropout.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Dropout"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | Another method to reduce high variance, where the network _fits_ the training set too well (thereby not generalizing to a test set or real-world data), is _dropout_.
31 |
32 | Dropout can be viewed as a form of regularization, whereby the network is forced to be _simpler_, thereby constraining the hypothesis space. As such, and similar to regularization, this technique must only be implemented when there is overfitting.
33 |
34 | ## Dropout
35 |
36 | Dropout _removes_ some nodes at random during each epoch of training. This removal is done by setting the node value to $0$ and by scaling up non-zero valued nodes. Note that the zero values are used during forward propagation and backpropagation.
37 |
38 | There are a variety of ways to implement dropout. This chapter describes the common method of _inverted dropout_.
39 |
40 | ### Inverted dropout
41 |
42 | This technique creates a vector with the same number of elements as there are nodes in a layer. Each of the elements in this vector will be either $0$ or $1$, with these values assigned at random given a probability for each.
43 |
44 | The code for this technique usually involves creating a random real number in the domain $\left[ 0,1 \right]$. A threshold is set, e.g. $0.2$. If the random real number is less than the threshold, then the corresponding element becomes $0$, whereas if the value is equal to or greater than $0.2$, then the element becomes $1$.
45 |
46 | This value of $0.2$ used above is actually subtracted from $1$, i.e. $1 - 0.2 = 0.8$. The latter is known as the _keep probability_, denoted in this text by $\kappa$.
47 |
48 | Element-wise multiplication then takes place between this vector containing zeros and ones and the vector of node values (after activation).
49 |
50 | The last step, which marks this technique as inverted dropout, divides each element by $\kappa$. Because dropout reduces the expected sum of the layer's values by a factor of $\kappa$, the remaining values must be _increased_ so as to maintain the expected value for the next layer, i.e. no further scaling is required during activation.
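A minimal numerical sketch of these steps is shown below. The node values and the keep probability are arbitrary, chosen only for illustration.

```{r Inverted dropout sketch}
set.seed(123)

n <- c(0.9, 1.4, 0.0, 2.2, 0.7)  # node values after activation
kappa <- 0.8                     # keep probability

# Each element is kept (1) with probability kappa, otherwise dropped (0)
mask <- as.numeric(runif(length(n)) < kappa)

# Element-wise multiplication, followed by division by kappa (the inversion step)
(n * mask) / kappa
```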
51 |
52 | Remember that a node receives various inputs, which are the sums of various node-weight-value multiplications. With some of them removed, the sum total will be less and hence activation, i.e. using a rectified linear unit function, will result in a different output value.
53 |
54 | 
55 |
56 | It should also be intuitive to see that reliance on a specific input, which might lead to high variance, is removed due to the fact that the specific input might _disappear_. The effect is the same as $\ell_2$-regularization seen in the preceding chapter, where the values of some weights were _driven_ toward $0$. In this sense, the values of $\kappa$ and $\lambda$ play the same role.
57 |
58 | The value of $\kappa$ can be different for each layer. In general, it is set lower for layers with a higher number of parameters. With more parameters comes a greater chance of overfitting.
59 |
60 | Input features can also have dropout, although this is usually not implemented.
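In Keras, dropout is implemented by adding a `layer_dropout()` after the layer whose outputs it should affect. A minimal sketch is given below (assuming the `keras` package is installed; the layer sizes and the $12$-feature input are arbitrary). Note that `layer_dropout()` takes the _drop_ probability, i.e. $1 - \kappa$, so a rate of $0.2$ corresponds to $\kappa = 0.8$.

```{r Dropout in Keras, eval=FALSE}
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 12, activation = "relu", input_shape = c(12)) %>%
  layer_dropout(rate = 0.2) %>%   # drop probability of 0.2, i.e. kappa = 0.8
  layer_dense(units = 1, activation = "sigmoid")
```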
61 |
62 | Dropout by its nature creates a cost function that is not well-defined. As a result, a graph showing a steady decline in the cost function can no longer be guaranteed. In practice this might require first training without dropout to ensure that the network performs properly (as indicated by a monotonically decreasing cost function value). Once this has been established, dropout can be implemented in an attempt to reduce overfitting.
63 |
64 | ## Conclusion
65 |
66 | Dropout is a regularization technique. It is used when overfitting of the training set exists. By randomly _removing_ nodes, the hypothesis space is reduced due to the creation of a simpler network. Success is measured by a better fit to the test set or real-world data.
--------------------------------------------------------------------------------
/Regression as a shallow network.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Regression as a shallow neural network"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 | ```{r}
15 | library(readr)
16 | library(tibble)
17 | ```
18 |
19 |
20 |
31 |
32 | 
33 |
34 | Watch the video series on YouTube at https://www.youtube.com/watch?v=9-QYsN_knG4&list=PLsu0TcgLDUiIKPMXu1k_rItoTV8xPe1cj
35 | Download the files at https://github.com/juanklopper/Deep-learning-using-R
36 |
37 | ## Introduction
38 |
39 | With the knowledge of the preceding chapters, it is time to view regression as a shallow network. All the calculations remain exactly as before. Conceptualizing the process as a neural network is the aim of this chapter.
40 |
41 | ## Multiple linear regression as a network
42 |
43 | Whereas the previous chapter viewed only a single feature variable, the dataset below represents three feature variables, all of continuous numerical data type, and a target variable, similarly of continuous numerical data type.
44 |
45 | ```{r}
46 | # Import a spreadsheet file
47 | df <- read_csv("MultipleLinearRegression.csv")
48 | df
49 | ```
50 |
51 | Note that there are $10$ samples and $3$ feature variables. The aim is to calculate values for $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ that will minimize a specific cost function.
52 |
53 | If each of the variables is seen as a column vector, expressed with underlines, $\underline{x}_1$, $\underline{x}_2$, and $\underline{x}_3$, the requirement is to find values such that the predicted target is as close as possible to the ground-truth values in the column vector $\underline{y}$, as seen in equation (1).
54 |
55 | $$\beta_0 + \beta_1 \underline{x}_1 + \beta_2 \underline{x}_2 + \beta_3 \underline{x}_3 \approx \underline{y} \tag{1}$$
56 |
57 | As before, the loss function is calculated for every sample (row in the table above). The notation changes to a superscript $\left( i \right)$ to indicate every sample and the loss function is given in equation (2).
58 |
59 | $$L^{\left( i \right)} \left( \beta_0 , \beta_1 , \beta_2 , \beta_3 \right) = { \left( \beta_0 + \beta_1 x_1^{\left( i \right)} + \beta_2 x_2^{\left( i \right)} + \beta_3 x_3^{\left( i \right)} - y^{\left( i \right)} \right) }^{2} \tag{2}$$
60 |
61 | The cost function will average all of the losses and finally the process of gradient descent will result in optimal values for $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$.
62 |
63 | Using the `lm()` function in `R` provides a quick and easy way of calculating the optimal values.
64 |
65 | ```{r}
66 | model <- lm(y ~ x1 + x2 + x3,
67 | data = df)
68 | summary(model)
69 | ```
70 |
71 | The solution shows $\beta_0 = 342.7$, $\beta_1 = -3.6$, $\beta_2 = 0.2$, and $\beta_3 = -38.0$. (Note the poor Multiple R-squared value and the high _p_ values. The data was created randomly, so even the best values for all the $\beta$ parameters are still going to be a bit off the mark with their predicted values).
72 |
73 | ## Flow diagram
74 |
75 | The above model can be represented as a flow-diagram. This allows for the introduction of the concept of neurons called _perceptrons_ (more about this in upcoming chapters).
76 |
77 | The diagram below expresses all of the calculations above. The feature variables are turned on their sides (transposed) and represent the input state of the network.
78 |
79 | 
80 |
81 | Following this is a layer of three _neurons_. The value that each of the three neurons, called _nodes_, takes is the product of the corresponding input node and the associated parameter, called a _weight_.
82 |
83 | The last layer is a single node and takes as input the sum of all three previous nodes plus the value held in $\beta_0$, called the _bias node_. This is the predicted or output value and will be compared to the target value.
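A minimal sketch of this forward pass for the first row of the dataset is shown below, reusing the coefficients estimated by `lm()` above. The object names are purely illustrative.

```{r Forward pass for the first sample}
b <- coef(model)                          # beta_0, beta_1, beta_2, beta_3
x <- unlist(df[1, c("x1", "x2", "x3")])   # input layer: features of the first row

hidden <- b[2:4] * x                      # hidden nodes: input value times weight
output <- sum(hidden) + b[1]              # output node: sum plus the bias node

unname(output)
predict(model, df[1, ])                   # the same value, straight from lm()
```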
84 |
85 | And that is it! A neural network (of sorts) with a single deep (hidden) layer. This layer holds values that are different from the actual input values (the actual feature variable values for each subject or row).
86 |
87 | As before, all of the rows will pass through the _network_ and a cost function will be created, consisting of four unknowns, $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$.
88 |
89 | This whole process as described is known as _forward propagation_. This is followed by the process of _backpropagation_. This uses the process of gradient descent to update the values of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$. These new values are then used in another forward propagation. Each pair of forward propagation and backpropagation is known as an _epoch_.
90 |
91 | During the first forward pass, the values for $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$, called the _weights_, are _initialized_. This is the process of providing each of the weights with a random value, allowing for the calculation of actual values for the hidden nodes and the output node. Backpropagation through gradient descent then updates all of these values with (hopefully) better ones. Below is a graph of the first row of data.
92 |
93 | 
94 |
95 | ## Conclusion
96 |
97 | The process of forward propagation and backpropagation allows a neural network to _learn_ the optimal values of the parameters such that the best possible prediction for a target variable can be made.
98 |
99 | It is, simply stated, a very elegant process, transforming the idea of learning into mathematical functions.
100 |
101 | The important take-aways from this chapter are two-fold:
102 |
103 | 1. Defining new terms for old concepts, introduced in the preceding chapters, i.e. unknowns or parameters are called _weights_.
104 | 2. The idea of hidden layers consisting of nodes that hold the values from which the weights can be learned through numerous epochs.
--------------------------------------------------------------------------------
/Cross_entropy.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Cross-entropy"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | When dealing with classification problems, a special type of loss function is required. Whereas it is easy to conceptualize the difference between a numerical predicted and actual value, the same cannot be said for the difference between a predicted and actual categorical data point.
31 |
32 | _Cross-entropy_ provides a numerical representation of the difference between a predicted and an actual category. The aim of this chapter is to create an understanding of this method.
33 |
34 | ## Entropy
35 |
36 | To understand cross-entropy, we first need to understand _entropy_. Entropy is a term that stems from physics and expresses the disorder in a system. Without external energy input, the entropy of a system increases (from order to chaos). A heap of bricks and bags of cement lying by the side of the road will not, by itself, construct a house. A house, on the other hand, will slowly decay over time until only cement dust and broken bricks remain.
37 |
38 | The idea of entropy can also be used to express an amount of information. Consider a random set of elements $A=\left\{\text{car},\text{car},\text{car},\text{car},\text{car},\text{car}, \text{bus}, \text{bus}, \text{bus}, \text{bus}, \text{human},\text{stop sign}\right\}$. There are $12$ elements. Imagine repeatedly drawing one of these elements at random. At each turn, an observer without knowledge of the item can ask a set of questions to gain the ultimate information, which is the item itself. The average number of questions that must be asked to determine the selection can be calculated.
39 |
40 | The image below shows such a list of questions and how many questions it would take to discover the randomly chosen object. (Questions are in red.)
41 |
42 | 
43 |
44 | By clever decisions, note that with a single question, we can find out if the element chosen was a car. It takes two questions to discover that it was a bus and it takes three questions for both a human and a stop sign. The structure comes from the probability of choosing any of the elements at random. They have a distribution of $P\left\{\text{car},\text{bus},\text{human},\text{stop sign}\right\}=\left(\frac{1}{2},\frac{1}{3},\frac{1}{12},\frac{1}{12}\right)$. Multiplying the relevant number of questions by the probability of each selection is shown below.
45 |
46 | ```{r Average number of questions}
47 | 1*(1/2)+2*(1/3)+3*(1/12)+3*(1/12)
48 | ```
49 |
50 | Entropy in information theory sets the lower bound for this number of questions to gain information. When simple examples are constructed, the entropy equals the number of questions to be asked, but this is not generally so. The equation for entropy in information gain is given in equation (1).
51 |
52 | $$E \left( \underline{d} \right) = - \sum_{i=1}^{k}\left[ p_i \log_{2} \left( p_i \right) \right] \tag{1}$$
53 |
54 | Here, $E$ is the entropy of a vector of elements, $\underline{d}$. The probability of each of the elements is given as $p_i$.
55 |
56 | The code below creates a function to calculate entropy given a vector of values.
57 |
58 | ```{r Entropy function}
59 | entropy <- function(d){
60 | x <- 0
61 | for (i in d){
62 | x <- x + (i * log(i, base = 2))
63 | }
64 | return(-x)
65 | }
66 | ```
67 |
68 | The theoretical minimum average information (in bits, i.e. the number of questions) required for the problem above can now be calculated.
69 |
70 | ```{r Entropy of example problem}
71 | entropy(c(1/2, 1/3, 1/12, 1/12))
72 | ```
73 |
74 | ## Cross-entropy
75 |
76 | Cross-entropy measures the distance between two distributions. Fortunately, categorical variables are commonly multi-hot-encoded or one-hot-encoded. Let's then take one-hot-encoding as an example. The actual target variable data point values might be $y=\left(0,1,0\right)$. This states that the sample space of the target variable has three elements and that the current sample was of the second element type. This is, in fact, a distribution. The softmax function might give a prediction of $\hat{y}=\left(0.1,0.8,0.1\right)$, another distribution. We can now use categorical cross-entropy to calculate the difference between these two distributions. The equation for categorical cross-entropy is given in (2) below.
77 |
78 | $$H \left(y, \hat{y} \right)= - \sum_{i=1}^k \left[ y_i \ln \left( \hat{y}_i\right) \right]\tag{2}$$
79 |
80 | Note the use of the natural logarithm. This is just for convenience. Remember the logarithmic identity given in equation (3) below.
81 |
82 | $$\log_{a}b = \frac{\log{b}}{\log{a}}\tag{3}$$
83 | If $a=2$ as with entropy above, the denominator would simply be $\log{2}$, i.e. a constant.
84 |
85 | For the example of $y$ and $\hat{y}$ above, a function is constructed below.
86 |
87 | ```{r Cross-entropy}
88 | cross.entropy <- function(p, phat){
89 | x <- 0
90 | for (i in 1:length(p)){
91 | x <- x + (p[i] * log(phat[i]))
92 | }
93 | return(-x)
94 | }
95 | ```
96 |
97 | The cross-entropy (difference between the two distributions or the two categorical types) is thus:
98 |
99 | ```{r Example problem cross-entropy}
100 | cross.entropy(c(0, 1, 0),
101 | c(0.1, 0.8, 0.1))
102 | ```
103 |
104 | Note that only the second element contributes any value, as the products for elements $1$ and $3$ contain a zero.
105 |
106 | The derivative required to do backpropagation and gradient descent is quite simple. The derivative of the $\ln$ function is given in equation (4).
107 |
108 | $$\frac{d}{dz} \ln{z} = \frac{1}{z}\tag{4}$$
109 |
110 | ## Conclusion
111 |
112 | Cross-entropy provides an elegant solution for determining the difference between actual and predicted categorical data point values.
113 |
114 |
115 |
--------------------------------------------------------------------------------
/The basics of a neural network.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "A basic neural network"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 | ```{r}
15 | suppressMessages(library(plotly))
16 | suppressMessages(library(sigmoid))
17 | ```
18 |
19 |
20 |
31 |
32 | 
33 |
34 | ## Introduction
35 |
36 | This chapter introduces the basics of a proper neural network, more specifically, a densely connected single-layer neural network.
37 |
38 | The preceding chapter already contained all the information required to readily grasp the creation of these networks. The only new concept introduced here is the way in which the weights and the input feature variables are _connected_.
39 |
40 | ## The perceptron
41 |
42 | The idea of a deep neural network is loosely based on the structure and function of a neuron (a brain cell). A schematic depiction of a neuron is shown below (By Dhp1080, svg adaptation by Actam - Image:Neuron.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=4293768).
43 |
44 | 
45 |
46 | The neuron body accepts the transmission of impulses, which travel along the axon to terminal ends, each connecting with many other neurons via their dendrites. The important analogy here is that there are many connections, unlike the examples shown with regression in the preceding chapters. What is not so clear from this analogy is that each neuron controls the _impulse_ that it transmits down the line.
47 |
48 | ## A densely-connected deep layer
49 |
50 | The diagram below shows exactly how the single deep layer is connected to the input layer.
51 |
52 | 
53 |
54 | The input layer consists of three feature variables and the single deep (hidden) layer consists of four nodes (neurons). The number of nodes in the hidden layer is known as a _hyperparameter_. This distinguishes it from the term _parameter_ that has been used up until now. A hyperparameter is a value that the designer of a neural network decides on. A parameter is a value that the network learns (optimizes) through gradient descent. The choice of four nodes is completely arbitrary in this instance.
55 |
56 | Since there are three input nodes, each with four connections, there are a total of $3 \times 4 = 12$ weights. These values will no longer be referred to by the symbol $\beta_i$. Linear algebra is used in calculating the product between the input variables and the weights.
57 |
58 | In equation (1) below, the three input variable values are depicted as a column vector, $\underline{x}$. The $x_i, \quad i \in \left\{ 1,2,3 \right\}$, represent a row in a dataset, i.e. the data point values for the three feature variables for a specific subject.
59 |
60 | $$\underline{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{1}$$
61 |
62 | The resultant values in the hidden layer are represented in equation (2) as a $4 \times 1$ column vector, $\underline{z}$.
63 |
64 | $$\underline{z} = \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} \tag{2}$$
65 |
66 | The twelve weight values can be represented as a rank-2 tensor (a matrix) with dimension $3 \times 4$, as shown in equation (3).
67 |
68 | $$ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} \tag{3}$$
69 |
70 | The product of these differently ranked tensors results in a $4 \times 1$ column vector, as shown in equation (4). Note that it requires taking the transpose of the weight matrix, which turns the number of rows into $4$ and the number of columns into $3$.
71 |
72 | $$W^T_{4 \times 3} \cdot \underline{x}_{3 \times 1} \tag{4}$$
73 |
74 | A bias term can be added to this and is given in equation (5).
75 |
76 | $$W^T_{4 \times 3} \cdot \underline{x}_{3 \times 1} + \underline{b}_{4 \times 1} = \underline{z}_{4 \times 1} \tag{5}$$
77 |
78 | Note that to maintain dimensionality, the bias-vector must have a dimension of $4 \times 1$.
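Equation (5) can be illustrated numerically with a short sketch. The input, weight, and bias values below are arbitrary and serve only to show the dimensions involved.

```{r Numerical example of equation 5}
x <- c(2.1, 0.5, 1.7)               # 3 x 1 input vector

# 3 x 4 weight matrix; each group of three values below forms one column,
# i.e. the weights connecting the three inputs to one hidden node
W <- matrix(c( 0.2,  0.4, -0.6,
              -0.5,  0.3,  0.1,
               0.1, -0.7,  0.5,
               0.3, -0.2,  0.2),
            nrow = 3, ncol = 4)

b <- rep(0.1, 4)                    # 4 x 1 bias vector

z <- t(W) %*% x + b                 # 4 x 1 column vector, as in equation (5)
z
```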
79 |
80 | ## The activation function
81 |
82 | The final step in determining the actual value that each hidden node will have and pass to the output is an activation function. The preceding chapter on logistic regression introduced the sigmoid function. The most common activation function used today, though, is the rectified linear unit (ReLU) function. It is depicted in the graph below.
83 |
84 | ```{r}
85 | x = seq(-3, 3, 0.01)
86 | y = relu(x)
87 |
88 | p <- plot_ly(x = x,
89 | y = y,
90 | name = "ReLU function",
91 | type = "scatter",
92 | mode = "lines") %>%
93 | layout(title = "ReLU function")
94 |
95 | p
96 | ```
97 |
98 | If the ReLU activation function is written as $g \left(z_i \right)$, then the output of the four hidden nodes is given in equation (6).
99 |
100 | $${\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{4 \times 1} = {g \left( \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} \right)}_{4 \times 1} \tag{6}$$
101 |
102 | For each value $z_i$ calculated above, the activation function returns $0$ if the value is $0$ or less, and the value itself if it is greater than $0$.
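Continuing the numerical sketch from equation (5) above, the `relu()` function from the `sigmoid` package loaded at the start of this chapter applies this rule element-wise to $\underline{z}$.

```{r Activating the hidden nodes}
relu(z)   # values of 0 or less become 0; positive values pass through unchanged
```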
103 |
104 | The aim of an activation function is to introduce non-linearity. This is another departure from simple linear and logistic functions that expressed a straight line, or plane, or hyperplane (a linear function). This allows a neural network to learn more complex solutions when optimizing for the values of the weights.
105 |
106 | ## The output layer
107 |
108 | The output can be a single node. It takes as input the sum of the values held in the nodes $n_i$ (after applying the activation function). If the problem is a binary classification problem, then this layer will use a sigmoid activation function.
109 |
110 | ## Conclusion
111 |
112 | A single hidden layer neural network is very similar in concept to linear and logistic regression models. The differences lie in the dense connections formed between the input and hidden layer nodes and the activation applied to each value in the hidden layer.
--------------------------------------------------------------------------------
/Example_of_a_convolutional_neural_network.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Example of a convolutional neural network"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 | ```{r Import libraries}
15 | library(keras)
16 | ```
17 |
18 |
19 |
30 |
31 | 
32 |
33 | ## Introduction
34 |
35 | The classification of images is best managed by convolutional neural networks (CNNs). Before embarking on the use of novel images, it is best to look at the built-in images provided as datasets in Keras.
36 |
37 | ## The MNIST dataset
38 |
39 | The MNIST dataset contains small $28 \times 28$ pixel grey scale images of handwritten digits that have been classified by humans.
40 |
41 | This dataset can be directly imported for use.
42 |
43 | ## Importing the data
44 |
45 | The `dataset_mnist()` function is built into Keras and its result can be assigned to a computer variable.
46 |
47 | ```{r Importing the data from the web}
48 | mnist <- dataset_mnist()
49 | ```
50 |
51 | ## Preparing the data
52 |
53 | The dataset has already been divided into a training and a test set, each with a set of feature variables (the images) and a set of target values.
54 |
55 | ```{r Splitting the data}
56 | x_train <- mnist$train$x
57 | y_train <- mnist$train$y
58 | x_test <- mnist$test$x
59 | y_test <- mnist$test$y
60 | ```
61 |
62 | The dimensions of the training feature set (the images) are given below.
63 |
64 | ```{r Dimensions of the training features}
65 | dim(x_train)
66 | ```
67 |
68 | Note that there are $60000$ grey scale images, each of pixel size $28 \times 28$. These values are assigned to computer variables below.
69 |
70 | ```{r Setting dimensions}
71 | img_rows <- 28
72 | img_cols <- 28
73 | ```
74 |
75 | These images are not in the correct _shape_ as tensors, as the number of channels is missing. This can be corrected for both the training and test sets by using the `array_reshape()` function. The code below also creates the `input_shape` variable to hold the correct dimensions of the images.
76 |
77 | ```{r Redefine dimensions to include channel}
78 | x_train <- array_reshape(x_train,
79 | c(nrow(x_train),
80 | img_rows,
81 | img_cols, 1))
82 | x_test <- array_reshape(x_test,
83 | c(nrow(x_test),
84 | img_rows,
85 | img_cols, 1))
86 | input_shape <- c(img_rows,
87 | img_cols, 1)
88 | ```
89 |
90 | ```{r New dimensions}
91 | dim(x_train)
92 | ```
93 |
94 | As with all neural networks thus far, the data must be normalized. Since the pixel values represent brightness on a scale from $0$ (black) to $255$ (white), they can all be rescaled by dividing each by the maximum value of $255$.
95 |
96 | ```{r Transform the brightness values}
97 | x_train <- x_train / 255
98 | x_test <- x_test / 255
99 | ```
100 |
101 | The sample space of the target variable contains $10$ elements, i.e. there are $10$ classes in the target variable. These can be one-hot-encoded using the `to_categorical()` function.
102 |
103 | ```{r One-hot encoding of target variable}
104 | num_classes = 10
105 | y_train <- to_categorical(y_train, num_classes)
106 | y_test <- to_categorical(y_test, num_classes)
107 | ```
108 |
109 | The first image in the training set is a $5$ (the count starts at zero).
110 |
111 | ```{r}
112 | y_train[1,]
113 | ```
114 |
115 | ## The model
116 |
117 | Below is a simple CNN. It contains two convolutional layers. The first has $32$ filters, each of size $3 \times 3$, and uses the rectified linear unit activation function. The second uses $64$ similarly sized filters and the same activation function.
118 |
119 | This is followed by a max pooling layer with a grid size of $2 \times 2$. Next up is a dropout layer, set to a rate of $0.25$.
120 |
121 | The last resultant image is flattened before passing through a single densely connected layer with $128$ nodes using the rectified linear unit activation function. A $0.5$ dropout is used to combat overfitting. The output layer has $10$ nodes (as there are $10$ classes) and uses the softmax activation function.
122 |
123 | ### Creating the model
124 |
125 | ```{r Creating the CNN}
126 | model <- keras_model_sequential() %>%
127 | layer_conv_2d(filters = 32,
128 | kernel_size = c(3,3),
129 | activation = 'relu',
130 | input_shape = input_shape) %>%
131 | layer_conv_2d(filters = 64,
132 | kernel_size = c(3,3),
133 | activation = 'relu') %>%
134 | layer_max_pooling_2d(pool_size = c(2, 2)) %>%
135 | layer_dropout(rate = 0.25) %>%
136 | layer_flatten() %>%
137 | layer_dense(units = 128,
138 | activation = 'relu') %>%
139 | layer_dropout(rate = 0.5) %>%
140 | layer_dense(units = num_classes,
141 | activation = 'softmax')
142 | ```
143 |
144 | A summary of the model shows $1199882$ learnable parameters.
145 |
146 | ```{r}
147 | model %>% summary()
148 | ```
149 |
150 | ### Compiling
151 |
152 | Categorical cross-entropy serves as the loss function. Adadelta optimizes the gradient descent and accuracy serves as the metric.
153 |
154 | ```{r Compiling the model}
155 | model %>% compile(
156 | loss = loss_categorical_crossentropy,
157 | optimizer = optimizer_adadelta(),
158 | metrics = c('accuracy')
159 | )
160 | ```
161 |
162 | ## Training
163 |
164 | A mini-batch size of $128$ will allow the tensors to fit into the memory of the NVidia graphics processing unit of the current machine. The model will run over $12$ epochs, with a validation split set at $0.2$.
165 |
166 | ```{r Training the model}
167 | batch_size <- 128
168 | epochs <- 12
169 |
170 | # Train model
171 | model %>% fit(
172 | x_train, y_train,
173 | batch_size = batch_size,
174 | epochs = epochs,
175 | validation_split = 0.2
176 | )
177 |
178 | ```
179 |
180 | ## Evaluating the accuracy
181 |
182 | The model can be evaluated using the test data.
183 |
184 | ```{r Evaluating the model}
185 | score <- model %>% evaluate(x_test,
186 | y_test)
187 |
188 | cat('Test loss: ', score$loss, "\n")
189 | cat('Test accuracy: ', score$acc, "\n")
190 | ```
191 |
192 |
193 |
194 |
--------------------------------------------------------------------------------
/Deep neural networks for regression problems.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Deep neural networks for regression problems"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | setwd(getwd())
13 | ```
14 |
15 | ```{r library import, message=FALSE, warning=FALSE}
16 | library(readr)
17 | library(keras)
18 | library(plotly)
19 | ```
20 |
21 |
32 |
33 | 
34 |
35 | ## Introduction
36 |
37 | Unlike classification problems, where the target variable is categorical in nature, regression problems have a numerical variable as the target.
38 |
39 | This chapter creates a deep neural network to predict a numerical outcome.
40 |
41 | ## Data
42 |
43 | The dataset contains simulated data. There are $4898$ samples over $10$ feature variables and a single target variable. This data is saved in a `.csv` file in the same folder as this `R` markdown file.
44 |
45 | ```{r data import, message=FALSE, warning=FALSE}
46 | data.set <- read_csv("RegressionData.csv",
47 | col_names = FALSE)
48 | ```
49 |
50 | The dimensions are confirmed below.
51 |
52 | ```{r dimensions of the dataset}
53 | dim(data.set)
54 | ```
55 |
56 | ### Transformation into a matrix
57 |
58 | The data structure is transformed into a _mathematical_ matrix using the `as.matrix()` function before removing the variable (column) names.
59 |
60 | ```{r}
61 | # Cast dataframe as a matrix
62 | data.set <- as.matrix(data.set)
63 |
64 | # Remove column names
65 | dimnames(data.set) = NULL
66 | ```
67 |
68 | ### Distribution of the target variable
69 |
70 | The summary statistics of the target variable are shown below.
71 |
72 | ```{r summary of target variable}
73 | summary(data.set[, 11])
74 | ```
75 |
76 | This can be represented as a histogram, as is shown in __figure 1__ below.
77 |
78 | ```{r target variable histogram, fig.cap="Fig 1 Histogram of the target variable"}
79 | f1 <- plot_ly() %>%
80 | add_histogram(x = ~data.set[, 11],
81 | name = "Target variable") %>%
82 | layout(title = "Target variable",
83 | xaxis = list(title = "Values",
84 | zeroline = FALSE),
85 | yaxis = list(title = "Count",
86 | zeroline = FALSE))
87 | f1
88 | ```
89 |
90 | Note that the values range from $2.5$ to $9.3$.
91 |
92 | ### Train and test split
93 |
94 | The dataset, which now exists as a matrix, must be split into a training and a test set. There are various ways in `R` to perform this split. The method employed in previous chapters is used here. With such a small dataset, the test set will comprise $20\%$ of the samples.
95 |
96 | ```{r create index for splitting}
97 | # Split for train and test data
98 | set.seed(123)
99 | indx <- sample(2,
100 | nrow(data.set),
101 | replace = TRUE,
102 | prob = c(0.8, 0.2)) # Makes index with values 1 and 2
103 | ```
104 |
105 | ```{r splitting the data}
106 | x_train <- data.set[indx == 1, 1:10]
107 | x_test <- data.set[indx == 2, 1:10]
108 | y_train <- data.set[indx == 1, 11]
109 | y_test <- data.set[indx == 2, 11]
110 | ```
111 |
112 | ### Normalizing the data
113 |
114 | To improve learning, the feature variables must be normalized. As before, the method of standardization is used.
115 |
116 | The mean and standard deviation of the feature variables are calculated and stored in the objects `mean.train` and `sd.train`. The `apply()` function calculates the required statistic along the specified axis (the `2` indicating columns). Finally, the `scale()` function performs the standardization.
117 |
118 | ```{r normalizing the test data}
119 | mean.train <- apply(x_train,
120 | 2,
121 | mean)
122 | sd.train <- apply(x_train,
123 | 2,
124 | sd)
125 | x_test <- scale(x_test,
126 | center = mean.train,
127 | scale = sd.train)
128 | ```
129 |
130 | The training data is standardized with a simple use of the `scale()` function.
131 |
132 | ```{r normalizing the train data}
133 | x_train <- scale(x_train)
134 | ```
135 |
136 | ## The model
137 |
138 | The code below is used to create a densely connected deep neural network with three hidden layers and an output layer.
139 |
140 | ### Creating the model
141 |
142 | Note that there is no activation function for the output layer. Each hidden layer has $25$ nodes and the rectified linear unit is used as activation function. Dropout is employed to prevent overfitting.
143 |
144 | ```{r model}
145 | model <- keras_model_sequential() %>%
146 | layer_dense(units = 25,
147 | activation = "relu",
148 | input_shape = c(10)) %>%
149 | layer_dropout(0.2) %>%
150 | layer_dense(units = 25,
151 | activation = "relu") %>%
152 | layer_dropout(0.2) %>%
153 | layer_dense(units = 25,
154 | activation = "relu") %>%
155 | layer_dropout(0.2) %>%
156 | layer_dense(units = 1)
157 | ```
158 |
159 | The summary of the model shows $1601$ learnable parameters.
160 |
161 | ```{r model summary}
162 | model %>% summary()
163 | ```
164 |
165 | Detailed information that shows all the arguments (including those that were left at their default values) can be viewed with the `get_config()` function.
166 |
167 |
168 | ```{r}
169 | model %>% get_config()
170 | ```
171 |
172 | ### Compiling the model
173 |
174 | Since this is a regression problem, the mean squared error is used as the loss function. The `rmsprop` optimizer is used, with its default values, i.e. `lr = 0.001, rho = 0.9, epsilon = NULL, decay = 0, clipnorm = NULL, clipvalue = NULL`.
175 |
176 | ```{r compiling the model}
177 | model %>% compile(loss = "mse",
178 | optimizer = optimizer_rmsprop(),
179 | metrics = c("mean_absolute_error"))
180 | ```
181 |
182 | ### Fitting the data
183 |
184 | All that remains is to fit the data, with $0.1$ of the training data reserved as a validation set. A mini-batch size of $32$ is used. To avoid overfitting (and prevent an unnecessarily long run), early stopping is employed. The mean absolute error of the validation set is used as the callback monitor, with a patience level of five.
185 |
186 | ```{r fit the model, message=FALSE, warning=FALSE}
187 | history <- model %>%
188 | fit(x_train,
189 | y_train,
190 | epoch = 50,
191 | batch_size = 32,
192 | validation_split = 0.1,
193 | callbacks = c(callback_early_stopping(monitor = "val_mean_absolute_error",
194 | patience = 5)),
195 | verbose = 2)
196 | ```
197 |
198 | ## Testing the model
199 |
200 | The test data can be used to show the loss and the mean absolute error of the model. The code chunk below creates two objects, `loss` and `mae`, to hold these values. The mean absolute error is formatted with the `sprintf()` function and combined with a text string using the `paste0()` function. The `"%.2f"` argument stipulates two decimal places.
201 |
202 | ```{r}
203 | c(loss, mae) %<-% (model %>% evaluate(x_test, y_test, verbose = 0))
204 |
205 | paste0("Mean absolute error on test set: ", sprintf("%.2f", mae))
206 | ```
207 |
208 | ## Conclusion
209 |
210 | The chapter introduced a model to solve a regression problem. The following are important notes when dealing with regression models:
211 |
212 | 1. The feature variables were standardized according to the mean and standard deviation of the training set
213 | 2. No activation function is used in the output layer
214 | 3. The mean squared error is a typical loss function in this setting
215 | 4. The mean absolute error is a useful metric
--------------------------------------------------------------------------------
/Introduction to convolutional neural networks.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to convolutional neural networks"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | Many forms of data can be managed by densely connected neural networks. In a previous chapter, images were transformed into vectors for input into a neural network.
31 |
32 | Depending on the dimensions of the images, these vectors can be very large. This size can lead to parameter counts that are simply too computationally expensive. A small $100 \times 100$ pixel image transformed into an input vector already has $10000$ nodes.
33 |
34 | Convolutional networks solve this problem through the use of the convolution operation. This makes convolutional neural networks (CNN) ideal for image classification problems.
35 |
36 | ## Numerical representation of images
37 |
38 | An image on a computer screen is made up of pixels (dots). Each of these pixels is merely a brightness level. For a gray scale (black-and-white or monochrome) image these typically range from $0$ for no brightness (black) to $255$ for full brightness (white).
39 |
40 | A gray scale image can therefore be represented as a rank-2 tensor (a matrix), with row and column numbers equal to the height and width of the image in pixels.
41 |
42 | A $5 \times 5$ gray scale image that is totally black is shown as a matrix in (1) below.
43 |
44 | $$ \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \tag{1} $$
45 |
46 | Color images are made up of three such brightness value layers, one each for red, green, and blue. This is represented numerically by a rank-3 tensor. The three layers are referred to as _channels_. The same black image above, when viewed as a color image, would have dimensions of $5 \times 5 \times 3$.
47 |
48 | ## The convolution operation
49 |
50 | The convolution operation forms the basis of a CNN. The definition of the word convolution is _a thing that is complex and difficult to follow_. Fortunately, the convolution operation is rather simple to understand.
51 |
52 | Consider the two $3 \times 3$ matrices in (2) below.
53 |
54 | $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 3 & 3 \\ 3 & 4 & 2 \end{bmatrix} \begin{bmatrix} 3 & 3 & 2 \\ 1 & 1 & 2 \\ 7 & 2 & 2 \end{bmatrix} \tag{2}$$
55 |
56 | The convolution operation multiplies the corresponding values (by index or address) and adds all these products. The result is shown in (3) below.
57 |
58 | $$\left(1 \times 3\right) + \left(2 \times 3\right) + \left(3 \times 2\right) + \\ \left(4 \times 1\right) + \left(3 \times 1\right) + \left(3 \times 2\right) + \\ \left(3 \times 7\right) + \left(4 \times 2\right) + \left(2 \times 2\right) \\ = 3 + 6 + 6 + 4 + 3 + 6 + 21 + 8 + 4 \\ = 61 \tag{3}$$
59 |
60 | Note that this is not matrix multiplication, where a $3 \times 3$ matrix multiplied by a $3 \times 3$ matrix results in a $3 \times 3$ matrix.
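
The operation in (3) is easy to verify in `R` with element-wise multiplication followed by a sum (a quick check, not part of a `Keras` workflow):

```{r}
A <- matrix(c(1, 2, 3,
              4, 3, 3,
              3, 4, 2), nrow = 3, byrow = TRUE)
B <- matrix(c(3, 3, 2,
              1, 1, 2,
              7, 2, 2), nrow = 3, byrow = TRUE)
sum(A * B)  # sum of the element-wise products, giving 61
```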
61 |
62 | There is more to the convolution operation, though. Images are usually larger than $3$ pixels wide and high. Consider then a very small, square image that is $10$ pixels on each side. The first $3 \times 3$ matrix in the example above is placed in the upper left corner of the image, so that a $3 \times 3$ area overlaps. A similar multiplication and addition operation ensues, resulting in a scalar value. This becomes the top-left pixel in the resultant _image_. In this course the _resultant image_ will refer to the data for the next layer. The $3 \times 3$ matrix now moves on one pixel to the right of the image and the calculation is repeated, resulting in the second pixel value of the resultant image. When the $3 \times 3$ matrix runs up against the right edge of the image and performs the same calculation, it then moves one pixel down and jumps all the way to the left. This process continues until the $3 \times 3$ matrix ends up at the bottom right of the image. This is the convolution operation and is depicted below.
63 |
64 | 
65 |
66 | Other than solving the problem of too many nodes when using a densely connected neural network, the convolution operation has the ability to detect edges. As more convolutional layers are added, the edges form shapes, and eventually a representation of the original image that can be classified.
67 |
68 | A video tutorial by this author using Microsoft Excel to explain the concept of the convolution operation, and how it detects edges, is available at https://www.youtube.com/watch?v=kgp58cLaFHs.
69 |
70 | ## The filter and resultant image
71 |
72 | The $3 \times 3$ matrix in the example above is called a _kernel_ or a _filter_. Filters of size $3 \times 3$ are commonly used.
73 |
74 | The values in the kernel are akin to the weights in a densely connected neural network. More than one filter can be (and usually is) used in a convolutional layer. Over many epochs, their weight values update to discern the edges and shapes discussed above.
75 |
76 | The resultant image is necessarily smaller than the original image (or the prior resultant _image_ deeper in the network). If the original image is square and of pixel size $n \times n$ and the kernel is of size $m \times m$, then moving along one pixel at a time results in an image of the size given in equation (1).
77 |
78 | $$\left( n - m +1 \right) \times \left( n - m + 1 \right)\tag{1}$$
79 |
80 | If $p$ filters are used, the resultant _image_ is built up of a tensor of the dimension given in equation (2).
81 |
82 | $$\left( n - m +1 \right) \times \left( n - m + 1 \right) \times p \tag{2}$$
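
A small helper function makes it easy to play with these sizes (a sketch; the function name is illustrative):

```{r}
# Size of one side of the resultant image for an n x n image and an m x m kernel,
# moving one pixel at a time without padding
conv_output_size <- function(n, m) {
  n - m + 1
}
conv_output_size(10, 3)  # a 10 x 10 image with a 3 x 3 kernel gives an 8 x 8 result
```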
83 |
84 | When using color images with red, green, and blue channels, the kernel must similarly have a third axis of the same size, namely three. The process is depicted below.
85 |
86 | 
87 |
88 | The original image of size $6 \times 6$ convolved with a $3 \times 3$ filter (without padding and a stride length of $1$), produces a $4 \times 4$ resultant image as shown below.
89 |
90 | 
91 |
92 | ## Padding
93 |
94 | It follows from the description of the convolution operation above that pixels away from the edge are _more involved_ in the learning process. To help the edge pixels contribute to the process and to prevent the resultant image from being smaller, _padding_ can be applied to the original image. A border of zero values is added all around the image. Where it was $n \times n$ before, it becomes $\left( n + 2 \right) \times \left( n + 2 \right)$ in size. Note that this is the specific case of a kernel with an odd size, i.e. $3 \times 3$. Padding with zero-valued pixels is shown below.
95 |
96 | 
97 |
98 | ## Stride
99 |
100 | The process described so far has the kernel moving along and down one pixel at a time. This is the _stride length_. A higher value for the stride length can be set. A stride length of two is shown below.
101 |
102 | 
103 |
104 | ## Pooling
105 |
106 | _Pooling_ consolidates the resultant image by looking at square pixel grids, e.g. $2 \times 2$. This grid moves along the image as with the convolution operation. Max pooling is most commonly used. In the grid formed by a $2 \times 2$ square pixel area, only the largest value is retained in a new resultant image.
107 |
108 | Average pooling, where the average of the values in the grid is calculated, can also be used. It has not been shown to be of much benefit and was used prior to the current era of deep learning.
109 |
110 | Max pooling for a $2 \times 2$ grid is shown below. The maximum value in the first grid is $78$, which becomes the first pixel value of the resultant image. It remains the maximum value as the grid moves one pixel to the right, and so on.
111 |
112 | 
113 |
114 | ## Flattening
115 |
116 | Before an output layer can be constructed, the last resultant image must be flattened, i.e. turned into a vector. Each pixel is simply taken from the top-left, moving along the current row, before dropping down one row and restarting at the left, until the bottom-right is reached. This vector is then passed through a densely connected layer.
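
Putting the pieces together, a minimal sketch of such a network in `Keras` (the layer sizes, input shape, and number of classes are illustrative assumptions):

```{r}
library(keras)

cnn_sketch <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16,
                kernel_size = c(3, 3),
                activation = "relu",
                input_shape = c(28, 28, 1)) %>%   # a small gray scale image
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%   # max pooling over 2 x 2 grids
  layer_flatten() %>%                             # flatten the last resultant image into a vector
  layer_dense(units = 10,
              activation = "softmax")             # densely connected output layer

summary(cnn_sketch)
```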
117 |
118 | ## Conclusion
119 |
120 | The parameters that make up the filters _learn_ about the shapes, forms, and edges in the image. Together with the final densely connected part, a CNN is well suited to classify images.
--------------------------------------------------------------------------------
/Predicting skin.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Predicting skin lesions"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | While it is a simple task to download images that are part of a predesigned dataset, the ultimate goal is to use your own images. In this file, we take a look at how to classify images of skin lesions as either benign or malignant. These images are available on Kaggle.
31 |
32 | Since images take up a lot of memory, we will also use an image data generator that will load images from a local drive in batches, as and when needed.
33 |
34 | ## Libraries
35 |
36 | We will use the `Keras` library. Note that this markdown file makes use of `Keras` as of October 2019 (`TensorFlow 2.0`).
37 |
38 | ```{r}
39 | setwd(getwd())
40 | library(keras)
41 | ```
42 |
43 | ## Path to data
44 |
45 | It is important that the data is saved in a particular structure on your hard drive. You should have separate directories (folders) for the training, validation, and test sets. In each of these, you should have separate folders for the different classes. Since we have two classes, `benign` and `malignant`, each of our three directories holds both of these as subdirectories. Make sure that the names of the subdirectories are spelled identically in each of the three directories, as sketched in the code chunk below.
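
The chunk below is a hypothetical sketch that creates such a structure (using the same drive and folder names as the rest of this file); it is not required if your folders already exist.

```{r}
# Hypothetical sketch: create the expected directory structure
base_dir <- file.path("d:", "Kaggle", "R", "skin")

for (split in c("train", "validation", "test")) {
  for (lesion_class in c("benign", "malignant")) {
    dir.create(file.path(base_dir, split, lesion_class),
               recursive = TRUE,
               showWarnings = FALSE)
  }
}
```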
46 |
47 | Below, we set a file path to each of the three directories.
48 |
49 | ```{r}
50 | train_dir <- file.path("d:", "Kaggle", "R", "skin", "train")
51 | validation_dir <- file.path("d:", "Kaggle", "R", "skin", "validation")
52 | test_dir <- file.path("d:", "Kaggle", "R", "skin", "test")
53 | ```
54 |
55 | We will use the number of images in each of the subdirectories to calculate the number of steps per epoch in our model.
56 |
57 | ```{r}
58 | train_benign_dir <- file.path(train_dir, "benign")
59 | train_malignant_dir <- file.path(train_dir, "malignant")
60 |
61 | validation_benign_dir <- file.path(validation_dir, "benign")
62 | validation_malignant_dir <- file.path(validation_dir, "malignant")
63 |
64 | test_benign_dir <- file.path(test_dir, "benign")
65 | test_malignant_dir <- file.path(test_dir, "malignant")
66 | ```
67 |
68 | Below, we find the file counts.
69 |
70 | ```{r}
71 | num_train_benign <- length(list.files(train_benign_dir))
72 | num_train_malignant <- length(list.files(train_malignant_dir))
73 |
74 | num_validation_benign <- length(list.files(validation_benign_dir))
75 | num_validation_malignant <- length(list.files(validation_malignant_dir))
76 |
77 | num_test_benign <- length(list.files(test_benign_dir))
78 | num_test_malignant <- length(list.files(test_malignant_dir))
79 |
80 | total_train <- num_train_benign + num_train_malignant
81 | total_validation <- num_validation_benign + num_validation_malignant
82 | total_test <- num_test_benign + num_test_malignant
83 | ```
84 |
85 | ## Data generators
86 |
87 | To take the image in batches from a local drive, we set up a generator with the `image_data_generator()` function. We make use of image augmentation to improve training.
88 |
89 | ```{r}
90 | IMG_HEIGHT <- 112 # Small image sizes for demo purposes only
91 | IMG_WIDTH <- 112
92 | batch_size <- 4 # Small batch size for demo purposes only
93 | ```
94 |
95 | We set up these generators for the training, validation, and test images.
96 |
97 | ```{r}
98 | image_gen_train <- keras::image_data_generator(rescale = 1/255,
99 | rotation_range = 10,
100 | width_shift_range = 0.15,
101 | height_shift_range = 0.15,
102 | horizontal_flip = TRUE,
103 | zoom_range = 0.05)
104 |
105 | image_gen_validation <- keras::image_data_generator(rescale = 1/255) # The validation and test images are not augmented
106 |
107 | image_gen_test <- keras::image_data_generator(rescale = 1/255)
108 | ```
109 |
110 | Now we use the `flow_images_from_directory()` function. It will pull the images from disk in batches as needed during training.
111 |
112 | ```{r}
113 | train_data_gen <- keras::flow_images_from_directory(train_dir,
114 | generator = image_gen_train,
115 | batch_size = batch_size,
116 | target_size = c(IMG_HEIGHT,
117 | IMG_WIDTH),
118 | class_mode = "binary")
119 |
120 | validation_data_gen <- keras::flow_images_from_directory(validation_dir,
121 | generator = image_gen_validation,
122 | batch_size = batch_size,
123 | target_size = c(IMG_HEIGHT,
124 | IMG_WIDTH),
125 | class_mode = "binary")
126 |
127 | test_data_gen <- keras::flow_images_from_directory(test_dir,
128 | generator = image_gen_test,
129 | batch_size = batch_size,
130 | target_size = c(IMG_HEIGHT,
131 | IMG_WIDTH),
132 | class_mode = "binary")
133 | ```
134 |
135 | ## Creating a model
136 |
137 | Our model is a simple network with two convolutional layers.
138 |
139 | ```{r}
140 | model <- keras::keras_model_sequential() %>%
141 | layer_conv_2d(filters = 16,
142 | kernel_size = 3,
143 | padding = "same",
144 | activation = "relu",
145 | input_shape = c(IMG_HEIGHT,
146 | IMG_WIDTH,
147 | 3)) %>%
148 | layer_max_pooling_2d() %>%
149 |
150 | layer_dropout(0.2) %>%
151 | layer_conv_2d(filters = 32,
152 | kernel_size = 3,
153 | padding = "same",
154 | activation = "relu") %>%
155 | layer_max_pooling_2d() %>%
156 | layer_dropout(0.2) %>%
157 | layer_flatten() %>%
158 | layer_dense(512,
159 | activation = "relu") %>%
160 | layer_dense(1,
161 | activation = "sigmoid")
162 | ```
163 |
164 | ```{r}
165 | model %>% summary()
166 | ```
167 |
168 | ```{r}
169 | model %>% compile(loss = "binary_crossentropy",
170 | optimizer = optimizer_adam(),
171 | metrics = c("accuracy"))
172 | ```
173 |
174 | ## Training
175 |
176 | We now train the model over $10$ epochs, with early stopping.
177 |
178 | ```{r}
179 | history <- keras::fit_generator(model,
180 | train_data_gen,
181 | steps_per_epoch = floor(total_train / batch_size),
182 | epochs = 10,
183 | validation_data = validation_data_gen,
184 | validation_steps = floor(total_validation / batch_size),
185 | callbacks = callback_early_stopping(monitor = "val_loss",
186 | min_delta = 0.01,
187 | patience = 4))
188 | ```
189 |
190 | ## Evaluating the model
191 |
192 | We can now use the test image generator to check on our model.
193 |
194 | ```{r}
195 | score <- model %>% evaluate_generator(test_data_gen,
196 | steps = floor(total_test / batch_size))
197 | ```
198 |
199 |
200 | Let's have a look at the accuracy.
201 |
202 | ```{r}
203 | cat('Test accuracy: ', score$acc, "\n")
204 | ```
205 |
206 | ## Saving and reloading the model
207 |
208 | We can save this model in HDF5 format.
209 |
210 | ```{r}
211 | model %>% save_model_hdf5("skin.h5")
212 | ```
213 |
214 | Reloading is simple.
215 |
216 | ```{r}
217 | load_model <- load_model_hdf5("skin.h5")
218 | load_model %>% summary()
219 | ```
220 |
221 | We can use the test set as before.
222 |
223 | ```{r}
224 | new_score <- load_model %>% evaluate_generator(test_data_gen,
225 | steps = floor(total_test / batch_size))
226 | ```
227 |
228 | ```{r}
229 | cat('Test accuracy: ', new_score$acc, "\n")
230 | ```
--------------------------------------------------------------------------------
/Regularization.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Regularization"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | Central to supervised machine learning stands the cost function and the attempt to minimize this function by optimizing the values of its parameters (the unknowns). All the possible solutions together make up what is known as the _hypothesis space_.
31 |
32 | Herein, though, lies a danger. Certain solutions in this hypothesis space are extremely _partial_ to the training set. Unfortunately, they are less successful when given data outside of the training set. These solutions are said not to _generalize_.
33 |
34 | All attempts must be made to select from the hypothesis space only those solutions that will generalize well. This is done by constraining the hypothesis space, i.e. making only certain solutions possible.
35 |
36 | There are many ways to constrain the hypothesis space and one of the most common techniques, _regularization_, is introduced in this chapter.
37 |
38 | ## Complexity
39 |
40 | One way to approach possible attempts at constraining the hypothesis space is to sequence the space. If a hypothesis space is seen as a set and denoted by $\mathbb{H}_i$, then such a sequence is shown in equation (1).
41 |
42 | $$\mathbb{H}_1 \subset \mathbb{H}_2 \subset \ldots \subset \mathbb{H}_n \tag{1}$$
43 |
44 | An example would be polynomials, where all first degree polynomials are a subset of (contained within) all second degree polynomials.
45 |
46 | In the case of simple linear regression, _complexity_ can be sequenced in a similar fashion and in a number of ways. This idea expands naturally to the parameters in neural networks. Some are listed below.
47 |
48 | 1. The dimensionality of the inputs space (how many feature variables)
49 | 2. The number of non-zero coefficients, $w_i$ in $w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$, referred to as $\ell_0$ _complexity_
50 | 3. The sum of the absolute values of the coefficients, $\sum_{i=1}^n \left| w_i \right|$, referred to as $\ell_1$ _complexity_ or _lasso complexity_
51 | 4. The sum of the squares of the coefficients, $\sum_{i=1}^n w_i^2$, referred to as $\ell_2$ _complexity_ or _ridge complexity_ (used as example in the description below)
52 |
53 | If a chosen measure of complexity is symbolized as $\omega$, then for a given value $r \in \omega$, the hypothesis space can be constrained as shown in equation (2).
54 |
55 | $$\mathbb{H}_1 \subset \mathbb{H}_2 \subset \ldots \subset \mathbb{H}_{r \in \omega} \tag{2}$$
56 |
57 | This makes $r$ a hyperparameter that the designer of a neural network must choose and change iteratively until the model generalizes well to unseen and real-world data.
58 |
59 | ## Regularization
60 |
61 | This concept of $r \in \omega$ can be expressed explicitly in a deep neural network by altering the cost function in some way, i.e. by penalizing it. In machine learning, _penalized minimization_, referred to as _Tikhonov regularization_, is used most often. This penalizes the cost function by adding a regularization term according to a specified value for $r \in \omega$, where $\omega$ is determined by the choice of complexity measurement.
62 |
63 | Through the process of gradient descent, backpropagation attempts to minimize a cost function. This idea is expressed in equation (3).
64 |
65 | $$ \mathscr{C} \left(W,b\right) = \frac{1}{m} \sum_{i=1}^{m} \mathscr{L} \left( \hat{y}^{\left( i \right)},y^{\left( i \right)} \right) \tag{3}$$
66 |
67 | Here $\mathscr{C} \left( w , b \right)$ is the cost function, which is a multivariable function of the weight and bias parameters. The number of samples is denoted by $m$ and the loss function is denoted by $\mathscr{L}$, which is in turn a function of the predicted target variable, $\hat{y}^{\left( i \right)}$, and the actual target variable, $y^{\left( i \right)}$, over each of the samples, $\left( i \right)$.
68 |
69 | Regularization adds a term to the cost function (Tikhonov regularization). Equation (4) expresses $L_2$-regularization.
70 |
71 | $$\mathscr{C} \left(W,b\right) = \frac{1}{m} \sum_{i=1}^{m} \mathscr{L} \left( \hat{y}^{\left( i \right)},y^{\left( i \right)} \right) + \frac{\lambda}{2m} \sum_{l=1}^{L} {|| W^{\left[ l \right]} ||}^{2} \tag{4}$$
72 |
73 | Here the $L$ in the second term indicates all of the layers (not to be confused with the $\mathscr{L}$ in the first term denoting the loss function), whereas $\lambda$ is the regularization parameter, a hyperparameter that must be chosen by the designer of the neural network. Note that the $\frac{1}{2}$ is simply a scaling term. This makes the derivative of the cost function a simpler equation.
74 |
75 | Note that $W$ is a matrix with dimension $n^{\left[ l \right]} \times n^{\left[ l - 1 \right]}$, where $l$ refers to the current layer and $l-1$, the previous layer. This allows for the expression in the second term of equation (4) above to be written as in equation (5).
76 |
77 | $$ {|| W^{ \left[ l \right]} ||}^{2} = \sum_{i=1}^{n^{\left[ l \right]}} \sum_{j=1}^{n^{\left[ l - 1 \right]}} {\left( w_{ij} \right)}^{2} \tag{5}$$
78 | Equation (5) is referred to as the square of the _Euclidean_ or _Frobenius_ norm of a matrix. To understand this equation, consider a layer $\left[ l \right]$ in a network, containing $n^{\left[ l \right]}$ nodes. The preceding layer, $\left[ l-1 \right]$, has $n^{\left[ l - 1 \right]}$ nodes. A matrix has to be transposed and multiplied with the column vector of dimension ${n}^{ \left[ l-1 \right] }$ to provide a column vector with dimension $n^{\left[ l \right]}$. Such a matrix (after transposing) must therefore have dimensions $n^{\left[ l \right]} \times n^{\left[ l - 1 \right]}$, written below in shorthand as $l \times \left( l-1 \right)$. This is depicted in equation (6).
79 |
80 | $$W_{l \times \left( l - 1 \right)}^{\left[ l \right]} \cdot x_{\left( l - 1 \right) \times 1 }^{\left[ l-1 \right]} = x_{l \times 1}^{\left[ l \right]} \tag{6}$$
81 |
82 | An example of equation (5) where $l = 3$ and $l-1 = 2$ is shown in equation (7) below.
83 |
84 | $$W_{3 \times 2} = \begin{bmatrix} 3 && 4 \\ 2 && 1 \\ 1 && 1 \end{bmatrix} \\
85 | {|| W^{ \left[ l \right]} ||}^{2} = \left( 3^2 + 4^2 \right) + \left( 2^2 + 1^2 \right) + \left( 1^2 + 1^2 \right) = 32 \tag{7}$$
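
The same calculation in `R` serves as a quick check of equation (7):

```{r}
W <- matrix(c(3, 4,
              2, 1,
              1, 1), nrow = 3, byrow = TRUE)
sum(W^2)  # the squared Frobenius norm, 32
```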
86 |
87 | It should be clear that there is ultimately an addition to the cost function and through this addition, the constraint of the hypothesis space follows. Please note that strictly speaking, the complexity can be added to the cost function in a different way (called _Ivanov complexity_) that truly constrains the hypothesis space instead of penalizing the cost function as is the case in Tikhonov regularization. In the common case of deep neural networks the effect is similar, though, especially when seen from the point of view of the ultimate goal of generalization of the model. (This follows from _Lagrangian duality theory_, which is not covered in this text.)
88 |
89 | The intuition behind how the hypothesis space is reduced can be understood in terms of the larger cost function value (by way of the addition of a positive term). By taking the derivative (see below) and through gradient descent, i.e. minimizing the cost function, the weights are pushed towards zero. Smaller weight values make the regularization term smaller (which is what gradient descent will do). With many weights approaching zero in value, this makes for a much _simpler_ model, hence preventing overfitting (the final new weight values do not give the best performance for the training data). In fact, small values in $W$ (the weight matrix) produce small values during forward propagation. These smaller values tend to fall in the linear part of the activation function (i.e. tanh or sigmoid), turning the network into more of a linear network. A linear network has a much less complex decision boundary, with the result that overfitting is reduced.
90 |
91 | While this new cost function might seem complex, its derivative is fairly simple (as the penalization term is a sum of terms, made even easier by the original scaling term, $\frac{1}{2}$). During the update phase of backpropagation, the weights are updated as shown in equation (8).
92 |
93 | $$\partial W^{\left[ l \right]} = \psi + \frac{\lambda}{m} W^{\left[ l \right]} \\ W^{\left[ l \right]} = W^{\left[ l \right]} - \alpha \partial W^{\left[ l \right]} \tag{8}$$
94 |
95 | Here $\psi$ is the original derivative of the cost function without the regularization term.
96 |
97 | Note also that this form of regularization only describes the weights and not the biases. The latter can be included with the use of the Frobenius norm of a vector. In practice, though, there are far fewer bias parameters than weight parameters and they are excluded from the regularization.
98 |
99 | ## Conclusion
100 |
101 | Regularization, especially $\ell_2$ regularization, is commonly used to decrease high variance in deep neural networks. By adding to the cost function it creates a simpler, more linear model, that may perform better during testing or with real-world data.
102 |
103 | Implementing it in `Keras` and `TensorFlow` requires very little code, as sketched below.
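
As a minimal sketch (the layer sizes, input shape, and value of $\lambda$ are illustrative assumptions), $\ell_2$ regularization is added to a layer through its `kernel_regularizer` argument:

```{r}
library(keras)

model_sketch <- keras_model_sequential() %>%
  layer_dense(units = 64,
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.01),  # lambda = 0.01
              input_shape = c(10)) %>%
  layer_dense(units = 1,
              activation = "sigmoid")
```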
--------------------------------------------------------------------------------
/Poor performance of a deep learning model.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Poor performance of a deep learning model"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | This chapter considers the requirements that exist for training and test sets and introduces the concepts such as the _ground truth_, _bias_, and _variance_.
31 |
32 | These are all very important to understand. A model is not created once. A lot of thought must go into the preparation of the data, the creation of the model (inclusive of the setting of the hyperparameters), the evaluation of the model, and iteratively restarting the process again and again.
33 |
34 | ## Training and test sets
35 |
36 | The preceding chapter showed how to divide a dataset into a training and test set. Supervised machine learning requires the presence of labeled data, i.e. a target (outcome) variable. A dataset for which the target variable exists is an absolute requirement for testing the accuracy (and other performance indicators) of a trained deep neural network. Such a dataset is known as a _test set_.
37 |
38 | A test set must contain data that was not available to the model during training. It must be representative of both the original dataset (if it was taken from the original) and of real-world data. The test set furthermore requires the same set of feature variables as the set that was used during training.
39 |
40 | The situation arises where data is being actively collected while a model is being built, compiled, and used for training. Under such circumstances, it might be wise to use the whole current dataset as the training set and use newly collected data as a test set. This test set still has the same requirements as mentioned before, though.
41 |
42 | The size of the test set is of great importance, especially if it is split from the original dataset. It needs to be large enough to be representative, but not so large as to take away vital data used in training. With very small datasets, the norm used to be a $70$% : $30$% split. As datasets have become increasingly large, this is no longer the case. In the preceding chapter, only $10$% of the data was extracted as a test set, yet it still comprised about $5000$ samples. With datasets approaching, and even far exceeding, a million samples, a $0.5$% or $1$% sub-dataset might be adequate for the test set.
43 |
44 | Another requirement that must be considered is the distribution of the training and test sets. This is of special concern when the datasets are not collected at the same time or in the same way. Examples might include image recognition using convolutional deep neural networks. It might be that the training images are special, selected, high-resolution images, whereas the test set might be more indicative of real-world scenarios. A model trained on such a set, even when well designed, may not generalize to real-world data.
45 |
46 | Distribution also refers to the actual proportions of the elements in the target variable sample space. When one of these elements occurs very infrequently, _class imbalance_ arises (a class here refers to one of the elements in the target variable sample space). It is important that the training and test sets share the same class proportions. When the imbalance is severe, e.g. $0.95:0.05$, simply guessing the majority class will be $95$% accurate. A deep neural network might not even be required in these cases! Data augmentation by simulating the minority class might help solve this problem. Data augmentation will be discussed in a following chapter.
47 |
48 | All of the above also applies to the validation set*, should one be created (which is always a good idea).
49 |
50 | A test set is not absolutely required. In some cases data scientists rely solely on the validation set during model training to indicate problems with the deep neural network, which in turn informs changes for improvement. Remember that key performance indicators such as loss and accuracy are produced for the validation set. This is similar to the loss and accuracy for the test set used in the preceding chapter.
51 |
52 | * _Some texts refer to the validation set as the development set or the hold-out set. Data scientists who only make use of a training and validation set may refer to the development set as a test set._
53 |
54 | ## The ground-truth
55 |
56 | At first glance, this might be an easy concept. The target data might refer to _benign_ or _malignant_ disease. Consider once again an example from computer vision where histology specimens (microscope slides of tissue biopsies) form the dataset. In the case of benign versus malignant disease, each of the samples in the dataset must be noted as such. The question arises as to who labeled the samples. Was it an experienced histopathologist? Did she or he make a mistake, thereby mis-classifying the target variable? Was it consensus by a group of experts? In other examples where the target variable was a measurement from an apparatus, the question again arises as to possible inaccuracies in the measurement, leading to mis-classifications. Such errors might occur in any data capture of the target variable. All of the above creates a question mark around the idea of the ground-truth.
57 |
58 | The concept of the _optimal error_, also called the _Bayes error_, arises. This is the theoretical smallest possible error. In certain cases the human error approaches the optimal error as is seen in cases where the labeling is done by a group of experts or very accurate apparatus. Note, though, that there may be a large difference between the optimal error and the error inherent in a particular dataset.
59 |
60 | In general, the aim of a neural network is to approach optimal error. At the very least, it must outperform human error. In this simple statement lies its promise in the field of healthcare and indeed, in many other fields.
61 |
62 | ## Bias and variance
63 |
64 | These are extremely important terms in machine learning. _Bias_, also called _underfitting_, refers to a model that does not separate the classes well, even on the training set. Such a model is not sophisticated enough and there is room for improvement. _Variance_, also called _overfitting_, refers to a model that is so precise that it actually only fits the training set well. This problem is also referred to as _memorization_, where the model simply learns the training set very accurately. When it comes to new data, the model performs poorly.
65 |
66 | Overfitting can be demonstrated with simple polynomials. The figure below (taken from the `scikit-learn` library website, using Python) shows some data points and a model (a line) that attempts to fit the data. In the context of deep learning, such a line can be viewed as a decision boundary. Samples on one side of the line will be predicted as belonging to one class and samples on the other will be predicted to be in the other class (for binary classes). A straight line might not be the best decision boundary. A non-linear line might be slightly better (as depicted by a higher degree polynomial below). In the extreme, a very convoluted decision boundary can be created (going through all the data points below). This model clearly overfits the data. It will perform well on the training set, but poorly on new data.
67 |
68 | 
69 |
70 | There must be measurements by which bias and variance can be quantified in order to inform changes in the neural network. Two such measurements are _training set error_ and _validation set error_. Various scenarios arise based on the values of these errors. Some of these examples are highlighted below.
71 |
72 | ### Overfitting with a very low training set error
73 |
74 | This is indicated by a large difference between the error rates of the training and validation sets, e.g. an error rate of $1$% for the former, but $10$% for the latter. Such a model is said to have _high variance_.
75 |
76 | ### Underfitting with both high training and validation errors
77 |
78 | Here it is assumed that the optimal error present in the target variable is low, i.e. less than $1$%. In this scenario both the training and the validation sets have high errors, e.g. $15$% and $16$% respectively. Such a model has _high bias_. Note that the differentiating factor here is the relatively large difference between the _optimal error_ and the training error versus the relatively small difference between the training and validation set errors.
79 |
80 | ### Both high variance and bias
81 |
82 | In this scenario there are equally large differences between the optimal error, the training error and the validation error, i.e. $1$% versus $15$% versus $30$%.
83 |
84 | ### The influence of the optimal error
85 |
86 | In _the underfitting with both high training and validation errors_ subsection above, it was mentioned that there is an assumption that the optimal error exists in the target variable. To some extent this is true for the other two subsections. In the case of high bias above, it might be known that the optimal error is $14$%. With the same training set error of $15$% and validation set error of $16$% as above, this becomes a model with both low bias and low variance. Each case must be seen in conjunction with the underlying error inherent in the target variable. In most cases this requires expert domain knowledge. This brings home the point that such experts must be involved in deep learning and that it is not just the playground of mathematicians and computer scientists.
87 |
88 | In summary, it can be said that the difference between the optimal and the training set errors informs bias and the difference between the training set and validation set errors, informs the variance.
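
As a trivial numerical sketch using the illustrative error rates from above:

```{r}
optimal_error <- 0.01
training_error <- 0.15
validation_error <- 0.16

training_error - optimal_error     # a large difference suggests high bias (underfitting)
validation_error - training_error  # a large difference suggests high variance (overfitting)
```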
89 |
90 | Older reports and research documents referred to the concept of a trade-off between bias and variance, the aim being, in essence, to create models that sit in the Goldilocks zone in between. With modern deep learning architectures such a trade-off is no longer the norm. Models can be created with both low bias and low variance. It might require a lot of work, though.
91 |
92 | ## A systematic approach to correcting for bias and variance
93 |
94 | ### Correcting high bias
95 |
96 | Here there is a relatively large error in the training set. The learning phase is performing poorly. Possible solutions (in order) include:
97 |
98 | 1. Create a bigger network, i.e. more layers, more nodes in a layer
99 | 2. Train for longer (more epochs)
100 | 3. Change to a different architecture, i.e. convolutional neural networks for image classification
101 |
102 | ### Correcting high variance
103 |
104 | Here there is a relatively large difference between the error rate of the training and validation sets. Possible solutions** (in order) include:
105 |
106 | 1. Capture more data
107 | 2. Augment the data
108 | 3. Regularization, dropout, batch normalization and other techniques
109 |
110 | ** _Some of these solutions will be discussed in following chapters._
--------------------------------------------------------------------------------
/Linear regression as a simple learner.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Linear regression as a simple learning network"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | library(plotly)
13 | ```
14 |
15 |
26 |
27 | 
28 |
29 | ## Introduction
30 |
31 | This chapter puts the concepts of the preceding chapter to good use. The aim is to expand intuition around deep learning through becoming more intimately familiar with the idea of _learning_ the values of parameters. The values that the parameters ultimately take bring the predicted values as close as possible to the real target values. The actual values of the latter are referred to as the _ground truth_.
32 |
33 | The act of learning is expressed in mathematical form. This means that the ultimate goal is to create a function, for which a minimum value can be calculated. Understandably, and for now, this should make very little sense!
34 |
35 | Linear regression with a single feature variable provides the simplest example to clear up this understanding and forms the basis for this chapter.
36 |
37 | ## Predictor function
38 |
39 | The emphasis is on a single feature variable predicting a target variable. Equation (1) below is taken from the preceding chapter and shows how a single value in the target variable set is predicted (calculated from) the corresponding feature variable value.
40 |
41 | Probably the most difficult concept to understand in equation (1) is to rid the memory of school algebra, where $x$ and $y$ were the variables. In equation (1) they are, in fact, constants. Each pair of values (a feature and target variable value pair) consists of two constants. It is $\beta_0$ and $\beta_1$ that are the variables. Given a value pair of $\left(2, 4 \right)$ and replacing $\beta_0$ and $\beta_1$ with the more familiar school variables (not to be confused with the $x$ and $y$ in equation (1)), the equation would read $4 = x + 2y$. In very common form, and through algebraic manipulation, this is the same as $y = -\frac{1}{2}x + 2$.
42 |
43 | $$ \hat{y}_i \left( x_i \right) = \beta_0 + \beta_1 x_i \tag{1} $$
44 |
45 | Equation (1) is not plucked from the air. It is a linear equation ( a straight line) which aims to draw a straight line through the set of points in a graph (representing the value pairs) that serve as a model. From this model, and given appropriate values for $\beta_0$ and $\beta_1$, a future value of the target variable can be predicted (calculated) given a value for the feature variable.
46 |
47 | Remember too that $\hat{y}_i$ is the predicted value and that $i$ takes on counting values from $1$ to $n$, where $n$ is the number of samples. The actual corresponding ground truth (target) value for pair $i$ is $y_i$.
48 |
49 | ## Loss function
50 |
51 | In real-life, given values for $\beta_0$ and $\beta_1$, every predicted value, $\hat{y}_i$, will be slightly different from the ground truth value, $y_i$. The squared error is a way of quantifying the error (difference between the two values). This error can be calculated for every value pair, $\left( x_i , y_i \right)$. In deep learning, this error is referred to as the _loss function_, $L$, given in equation (2).
52 |
53 | $$ L \left( x_i \right) = {\left[ \hat{y}_i \left( x_i \right) - y_i \right]}^{2} \tag{2} $$
54 |
55 | ## Cost function
56 |
57 | The loss function is calculated for each pair in the $n$-sample dataset. There are many ways to combine this loss function over all of the $n$ samples. One way is to average over all the errors, i.e. summing all the $n$ errors and dividing by $n$. This is shown in equation (3).
58 |
59 | $$ C \left( \beta_0 , \beta_1 \right) = \frac{1}{n} \sum_{i=1}^{n} L \tag{3} $$
60 |
61 | Replacing equations (2) and then (1) into equation (3) shows the complete cost function, given in equation (4) below.
62 |
63 | $$ C \left( \beta_0 , \beta_1 \right) = \frac{1}{n} \sum_{i=1}^{n} {\left[ \beta_0 + \beta_1 x_i - y_i \right]}^{2} \tag{4} $$
64 |
65 | The aim, as mentioned in the introduction, is to minimize the cost function by changing the parameters $\beta_0$ and $\beta_1$.
66 |
67 | ## Creating an example
68 |
69 | The best way to understand how the cost function is minimized, is by example. Below are two computer variables (objects), `feature.var` and `target.var`. This represents a linear regression problem, where the aim is to solve for values of $\beta_0$ and $\beta_1$ so as to use the values in `feature.var` to predict the values in `target.var`.
70 |
71 | There are five pairs of values. The feature variable values are hard-coded and the target variable values are created by adding a random value to each of the five feature variable values.
72 |
73 | ```{r}
74 | set.seed(1234) # For reproducibility
75 | feature.var <- c(1.3, 2.1, 2.9, 3.1, 3.3) # Five hard-coded values
76 | target.var <- feature.var + round(rnorm(5,mean = 0,sd = 0.5),digits = 1) # Adding random noise
77 | ```
78 |
79 | Below is a scatter plot of the five value pairs. The feature variable value of each marker (dot) is on the $x$-axis (independent variable) and the target variable value of each marker is on the $y$-axis (dependent variable).
80 |
81 | ```{r}
82 | p <- plot_ly(type = "scatter",
83 | mode = "markers",
84 | x = ~feature.var,
85 | y = ~target.var,
86 | marker = list(size =14,
87 | color = "rgba(255, 180, 190, 0.8)",
88 | line = list(color = "rgba(150, 0, 0, 0.8)",
89 | width = 2)))%>%
90 | layout(title = "Scatter plot",
91 | xaxis = list(title = "Feature variable", zeroline = FALSE),
92 | yaxis = list(title = "Target variable", zeroline = FALSE))
93 | p
94 | ```
95 |
96 | The code chunk below shows the pair of values as row vectors.
97 |
98 | ```{r}
99 | feature.var
100 | target.var
101 | ```
102 |
103 | Equation (4) can now be used to plug in all five of the pairs. This is shown in equation (5) below.
104 |
105 | $$ C = \frac{1}{5} \times \left\{ { \left[ \beta_0 + \beta_1 \left( 1.3 \right) - 0.7 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 2.1 \right) - 2.2 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 2.9 \right) - 3.4 \right]}^{2} \\ + { \left[ \beta_0 + \beta_1 \left( 3.1 \right) - 1.9 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 3.3 \right) - 3.5 \right]}^{2} \right\} \tag{5} $$
106 |
107 | Simple algebraic manipulation results in equation (6).
108 |
109 | $$ C = 6.55 - 4.68 {\beta}_{0} + {\beta}_{0}^{2} - 13.132 {\beta}_{1} + 5.08 {\beta}_{0} {\beta}_{1} + 7.002 {\beta}_{1}^{2} \tag{6} $$
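
Equation (6) can be checked numerically against the average squared error computed directly in `R` (a sketch using the objects created above; the function name is illustrative):

```{r}
cost <- function(beta_0, beta_1) {
  mean((beta_0 + beta_1 * feature.var - target.var)^2)
}

cost(0, 0)  # equals the constant term 6.55 in equation (6)
```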
110 |
111 | Note that this is an equation in two unknowns and that it can be graphed in 3D space as shown in the figure below.
112 |
113 | 
114 |
115 | ## Minimizing the cost function
116 |
117 | All of this brings us to a very simple conclusion. The mathematical concept of minimizing the error is simply finding values for $\beta_0$ and $\beta_1$ that will show the point in the 3D graph that is the lowest point, called the _global minimum_.
118 |
119 | Since the problem was reduced to that of mathematical function that requires the finding of the global minimum, partial differentiation with respect to each variable allows for the calculation of this minimum.
120 |
121 |
122 | In this extremely simple example of a single feature variable with two unknowns, the global minimum is calculated by the two partial derivatives shown in equation (7) below.
123 |
124 | $$ \frac{\partial C}{\partial \beta_0} = 2 \beta_0 + 5.08 \beta_1 - 4.68 \\ \frac{\partial C}{\partial \beta_1} = 5.08 \beta_0 + 14.004 \beta_1 - 13.132 \tag{7} $$
125 |
126 | Setting both partial derivatives equal to $0$ results in two equations with two unknowns. These two equations are solved very easily through row-reduction of an augmented matrix, shown in equation (8).
127 |
128 | $$ 2 \beta_0 + 5.08 \beta_1 - 4.68 = 0 \\ 5.08 \beta_0 + 14.004 \beta_1 - 13.132 = 0 \\ 2 \beta_0 + 5.08 \beta_1 = 4.68 \\ 5.08 \beta_0 + 14.004 \beta_1 = 13.132 \\ \begin{bmatrix} 2 && 5.08 && 4.68 \\ 5.08 && 14.004 && 13.132 \end{bmatrix} \tag{8} $$
129 |
130 | Equation (9) shows the row-reduced form of the matrix above.
131 |
132 | $$ \begin{bmatrix} 1 && 0 && -0.532267 \\ 0 && 1 && 1.13081 \end{bmatrix} \tag{9} $$
133 |
134 | From this row-reduced augmented matrix the final values for $\beta_0$ and $\beta_1$ are shown in equation (10) below.
135 |
136 | $$ \beta_0 = -0.532267 \\ \beta_1 = 1.13081 \tag{10} $$
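
The same system can be solved directly in `R` as a quick check of the row-reduction:

```{r}
A <- matrix(c(2, 5.08,
              5.08, 14.004), nrow = 2, byrow = TRUE)
b <- c(4.68, 13.132)

solve(A, b)  # approximately -0.532267 and 1.13081
```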
137 |
138 | As before the `lm()` function in `R` shows the results, which are exactly those in equation (10).
139 |
140 | ```{r}
141 | model <- lm(target.var ~ feature.var)
142 | summary(model)
143 | ```
144 |
145 | ## Gradient descent
146 |
147 | For this problem, and for those with more than a single feature variable, an alternative method for finding the global minimum is the process of _gradient descent_.
148 |
149 | This process involves selecting an arbitrary (random) value for the parameters. In an effort to simplify the explanation, as was done above, and thereby maximize the likelihood of intuitive understanding, the problem can be reduced to a single parameter (not $\beta_0$ and $\beta_1$). Instead of a 3D graph, this results in a 2D graph. To simplify matters to the extreme, an example from school will suffice.
150 |
151 | Consider then the equation $y = x^2$. Again, these are not to be confused with $x_i$ and $y_i$ from above. In fact, here $x$ represents only $\beta_1$. The graph of this equation is shown below.
152 |
153 | 
154 |
155 | Clearly, the global minimum is at $x=0$, i.e. $y$ is at its lowest point when $x=0$. The first derivative of $y$ with respect to $x$ is $2x$. This is the equation for a slope of the curve at any given value for $x$. Starting at an arbitrary point, say $x=-2$ shows a slope of $2 \times \left( -2 \right) = -4$.
156 |
157 | This is a rather steep (negative) slope. At the global minimum, the slope will be $0$. Clearly, there is a need to _step_ closer to the point $x=0$. This is done by updating the point $x=-2$ by subtracting a small value times the current slope. If this small value is $0.01$ for the sake of argument, this update becomes $- \left[ 0.01 \times \left( -4 \right) \right] = + 0.04$.
158 |
159 | The new $x$ value is now $-2 + 0.04 = -1.96$. In later chapters this use of the derivative to update the parameter values is known as _backpropagation_. The process is repeated several times until the global minimum is reached. While it is simple to see from this contrived example where the global minimum is, this is not so trivial in multi-dimensional space, with a complicated, convoluted graph. This process of gradient descent is a reliable method of finding the global minimum, thereby minimizing the cost function.
160 |
161 | 
162 |
163 | ## Conclusion
164 |
165 | The problem of predicting target variable values given feature variable values was reduced to the creation of a cost function for which a global minimum could be calculated. The global minimum represents the parameter values that bring the predicted values as close to the ground-truth target values as is possible. This process forms the bedrock of deep learning.
--------------------------------------------------------------------------------
/Regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Regression as a first step to deep learning"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | library(plotly)
13 | ```
14 |
15 | 
16 |
17 | ## Introduction
18 |
19 | Becoming familiar with regression is a very important first step in understanding deep learning. It brings home the concept of a model to predict an outcome and the inherent error in the model. The aim behind a deep neural network is to create a model to predict an outcome and to make as small an error as possible.
20 |
21 | This tutorial uses concepts familiar to anyone who has viewed material on introductory statistics. It begins with the simplest of all models, which uses the very familiar mean (average). It then moves on to viewing variance and standard deviation in a slightly different way before cementing the idea of a model, with its error.
22 |
23 | ## The baseline model
24 |
25 | Statistical modeling implies the use of sample data to model real-world situations. As such, there is a need for the model that is created from the data to accurately reflect the real world.
26 |
27 | In the most common-case scenario, the statistical model has as aim the prediction of an outcome. This outcome should be a measurable variable and is referred to as a _target_. The target data type can be either categorical or numerical. Prediction of the target variable value is made by manipulating a set of _feature variables_. They can likewise be categorical or numerical.
28 |
29 | The simplest way to develop an intuitive understanding of the process is to consider only the target variable.
30 |
31 | As an example, the code chunk below creates a numerical vector with $10$ elements representing the number of sales made by a medical supply company.
32 |
33 | ```{r}
34 | sales <- c(3, 4, 2, 4, 5, 6, 3, 9, 1, 12)
35 | ```
36 |
37 | The mean of these $10$ values is easy to create and can serve as a baseline model. The arithmetic mean is given in equation (1).
38 |
39 | $$ \bar{X} = \frac{\sum_{i=1}^{n}x_i}{n} \tag{1} $$
40 |
41 | Here $\bar{X}$ is the mean, $x_i$ represents each element (value) in the set and $n$ is the sample size. The code chunk below creates an object named `mean.sales` and uses the `mean()` function to calculate the mean of the set of $10$ values.
42 |
43 | ```{r}
44 | mean.sales <- mean(sales)
45 | mean.sales
46 | ```
47 |
48 | The solution, `r mean.sales`, can serve as a predictor of the outcome. That is, given a number of input variables, the sales can always be predicted to be $4.9$. It is obvious that this is a poor model, since some of the values are not very close to $4.9$. In fact, the difference between each value and the mean can be expressed by subtracting the mean from each. The first value was $3$ and subtracting $4.9$ from it shows an error of `r 3 - 4.9`. In the case of the last value, $12$, the error is `r 12 - 4.9`.
49 |
50 | These errors can be totaled (summed), but it should be obvious from the way that the mean is calculated, that the sum total of errors would be $0$.
51 |
52 | ```{r}
53 | round(sum(sales - 4.9),
54 | digits = 2) # Using round() to prevent rounding errors
55 | ```
56 |
57 | The total error should clearly __not__ be $0$. In order to get a better idea of the total error, each difference is squared giving rise to the _sum of squared errors_ (SSE), given in equation (2) below (squaring turns each difference into a positive value).
58 |
59 | $$ \text{SSE} = \sum_{i=1}^{n}{\left( x_i - \bar{X} \right)}^{2} \tag{2} $$
60 |
61 | The $\sum$ symbol is shorthand for summing. It states that each value that follows from the calculation of the expression (which is done over and over again so that all the samples are represented) is added.
62 |
63 | The code chunk below squares each difference and then adds all these squared errors.
64 |
65 | ```{r}
66 | sum((sales - 4.9)^2)
67 | ```
68 |
69 | This gives a much better idea of how poor the baseline model (using the mean as outcome predictor) is. One problem is that the values are squared. This means that the units are also squared. If the sales represented a value in weight, e.g. pounds, then the error is expressed in $\text{pounds}^2$. This makes no sense. A second problem arises when considering the fact that the larger the sample size, the larger the error will be. There are simply more values to be added. Both of these problems are solved by equation (3). The sum total is divided by one less than the sample size (a matter related to degrees of freedom) and the square root of the quotient is taken. This gives rise to the _standard deviation_.
70 |
71 | $$ s = \sqrt{\frac{\sum_{i=1}^{n}{\left( x_i -\bar{X} \right)}^{2}}{n-1}} \tag{3} $$
72 |
73 | Before taking the square root, this quantity is, of course, the _variance_. Note that this is a different way of looking at variance and standard deviation. Instead of considering them as pure measures of dispersion, they are, in fact, measures of the performance of a baseline predictive model.
74 |
75 | The code chunks below demonstrate the calculation of the variance.
76 |
77 | ```{r}
78 | # The variance using long-handed calculation
79 | (sum((sales - 4.9)^2))/(10 - 1)
80 | ```
81 |
82 | ```{r}
83 | # Using the var() function
84 | var(sales)
85 | ```
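
Taking the square root of the variance gives the standard deviation of equation (3).

```{r}
# The standard deviation as the square root of the variance
sqrt(var(sales))
sd(sales)
```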
86 |
87 | If $y_i$ represents the actual target value, then the baseline model predicts each of these values by equation (4).
88 |
89 | $$ y_i = \bar{X} + \varepsilon_i \tag{4} $$
90 |
91 | Here $\varepsilon_i$ is each individual error. Equation (4) forms the basis of a common theme throughout statistics, machine learning, and deep learning. This basis states that the outcome is equal to the model plus an error; given in equation (5). The aim is to create a model that will minimize the error, and as such, must greatly improve on the baseline model.
92 |
93 | $$ \text{target} = \text{model} + \text{error} \tag{5} $$
94 |
95 | ## Improving the model
96 |
97 | The baseline model explained above can be improved through the process of linear regression. The code chunk below creates a feature variable called `input.var` and a target variable called `output.var`. Both of the variables are of continuous numerical type. The feature variable consists of a sample of size $100$ with a mean of $10$ and a standard deviation of $2$, created with the use of the `rnorm()` function. The `round()` function limits the number of decimal values. The target variable adds some random noise (and a constant offset) to each value in the feature variable.
98 |
99 | ```{r}
100 | set.seed(1) # For reproducible random values
101 | input.var = round(rnorm(100,mean = 10,sd = 2),digits = 1)
102 | # Add random noise
103 | output.var = round(input.var + (10 * rnorm(100,mean = 0,sd = 0.2)) + 2,digits = 1)
104 | ```
105 |
106 | A scatter plot of each pair of the variable values is created below. The feature variable is placed on the $x$-axis (independent variable) and the target variable on the $y$-axis (dependent variable).
107 |
108 | ```{r}
109 | p <- plot_ly(type = "scatter",
110 | mode = "markers",
111 | x = ~input.var,
112 | y = ~output.var,
113 | marker = list(size =14,
114 | color = "rgba(255, 180, 190, 0.8)",
115 | line = list(color = "rgba(150, 0, 0, 0.8)",
116 | width = 2)))%>%
117 | layout(title = "Scatter plot",
118 | xaxis = list(title = "Input variable", zeroline = FALSE),
119 | yaxis = list(title = "Output variable", zeroline = FALSE))
120 | p
121 | ```
122 |
123 | A model can now be created that will transform every feature variable value and when an error term is added, will results in the target variable value.
124 |
125 | As with the baseline model, the error can be calculated. The error (when discussing linear regression) is referred to as the _deviation_ in a model. It follows exactly the same concept as the baseline model and is simply the sum of the squared differences between each output variable value and its predicted value. This is shown in equation (6).
126 | $$ \text{deviation} = \sum_{i=1}^{n}{{\left( \text{observed} - \text{model} \right)}^{2}} \tag{6} $$
127 |
128 | The baseline model predicts the output variable as the mean of the output variable. This is calculated as the `sst` in the code chunk below.
129 |
130 | ```{r}
131 | differences <- output.var - mean(output.var) # Difference between mean and actual value
132 | squared.differences <- differences^2 # Square each difference
133 | sst <- sum(squared.differences) # Sum up all the squared errors
134 | sst
135 | ```
136 |
137 | Looking back at the plot of the data, it could be imagined that a line could be drawn slanting upward through the points. Any straight line on a plane has a slope and an intercept. The slope is the rise over run and the intercept is the value at which the line crosses the $y$-axis, i.e. when the independent variable is $0$. Such a line can serve as an improved model, considering that the baseline model is simply a straight horizontal line with height on the $y$-axis equal to the mean of the target variable.
138 |
139 | In the example below, and by pure guesswork, the slope is set to $0.95$ and the intercept to $2$, i.e. the predicted output variable value ($\hat{y}_i$) for a specific subject ($x_i$) will be $\hat{y_i} = 0.95 x_i + 2$. A $\hat{y}$ symbol is often used to indicate the predicted target value (that is, when the error term is not added). The sum of squared errors is calculated and saved in the object `ssm` below.
140 |
141 | ```{r}
142 | new.differences <- output.var - (0.95 * input.var + 2)
143 | new.squared.differences <- new.differences^2
144 | ssm <- sum(new.squared.differences)
145 | ssm
146 | ```
147 |
148 | The improvement over the baseline model can be expressed by comparing the remaining (residual) error of the new model, `ssm`, to the total error of the baseline model, `sst`. One minus this ratio is the familiar $R^2$, given in equation (7).
149 | $$ R^2 = 1 - \frac{\text{ssm}}{\text{sst}} \tag{7} $$
150 |
151 | The ratio $\frac{\text{ssm}}{\text{sst}}$ is the proportion of the total variance (the variance around the baseline model) that the new model leaves unexplained. Subtracting it from $1$ gives the proportion of the variance in the outcome that is explained by the new model (the systematic variance) relative to how much variance there was to begin with (in the baseline model).
152 |
153 | ```{r}
154 | r.squared <- 1 - (ssm / sst)
155 | r.squared
156 | ```
157 |
158 | The actual _best_ model can be calculated using the `lm()` function as seen in the code chunk below. The `summary()` function provides all the required answers.
159 |
160 | ```{r}
161 | lr.model <- lm(output.var ~ input.var)
162 | summary(lr.model)
163 | ```
164 |
165 | From the summary the slope is $0.9982$ and the intercept is $1.9425$, i.e. the predicted target variable value ($\hat{y}_i$) for a specific sample subject, ($x_i$), will be $\hat{y}_i = 0.9982 x_i + 1.9425$. The intercept is the `Estimate` of the `(Intercept)` row and the slope is the `Estimate` of the `input.var` row in the table above.
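
As a quick check of these values, the fitted coefficients and predictions can also be extracted directly from the `lr.model` object (standard base `R` functions).

```{r}
coef(lr.model)          # intercept and slope estimates
head(fitted(lr.model))  # predicted values for the first few subjects
```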
166 |
167 | The $\beta_0 + \beta_1 x_i$ part represents the model part of equation (5) and is shown in equation (8) below.
168 |
169 | $$ \hat{y}_i = \beta_0 + \beta_1 x_i \tag{8} $$
170 |
171 | Here $\beta_0$ is the intercept and $\beta_1$ is the slope. Equation (9) shows the calculation for the actual value.
172 |
173 | $$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{9} $$
174 |
175 | This is, in essence, linear regression. Each individual feature variable value is multiplied by the same constant (the slope) and the intercept is added, with the error term accounting for the difference from the observed target value. If there were more than one feature variable the expression would simply grow, as represented in equation (10).
176 |
177 | $$ y_i = \beta_0 + \beta_1 x_{1_i} + \beta_2 x_{2_i} + \ldots + \beta_n x_{n_i} + \varepsilon_i \tag{10} $$
178 |
179 | This equation represents $n$ feature variables, marked $x_1$ through $x_n$. Each, though, is multiplied by its own constant (coefficient). The result is still a linear model.
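
As a minimal sketch of such a multiple linear regression, the code below simulates two hypothetical feature variables (`x1` and `x2` are illustrative names, not part of the data used above) and fits the model with `lm()`.

```{r}
set.seed(1)
x1 <- rnorm(100, mean = 10, sd = 2)
x2 <- rnorm(100, mean = 5, sd = 1)
y <- 2 + 0.9 * x1 + 1.5 * x2 + rnorm(100, mean = 0, sd = 0.5)  # known coefficients plus noise
multi.model <- lm(y ~ x1 + x2)  # one coefficient is learned per feature variable
coef(multi.model)
```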
180 |
181 | ## Conclusion
182 |
183 | This tutorial explained the concept of a model to use the values in a feature variable to predict the corresponding value in a target variable.
184 |
185 | The idea of a model with coefficients and an error term lays the ground work for understanding the concepts behind a deep neural network.
--------------------------------------------------------------------------------
/Improving training of a neural network.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Improvement techniques in neural network training"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 |
21 |
22 | 
23 |
24 | ```{r setup, include=FALSE}
25 | knitr::opts_chunk$set(echo = TRUE)
26 | ```
27 |
28 | ```{r libraries, message=FALSE, warning=FALSE}
29 | library(plotly)
30 | ```
31 |
32 | ## Introduction
33 |
34 | Creating the best deep neural network model is an empirical and iterative process that works best on big datasets. Being empirical, iterative, and requiring big datasets, unfortunately, takes its toll on computer resources and time.
35 |
36 | It is therefore important to take steps to mitigate this triad of time and resource consumption. This chapter discusses some of the steps that can be implemented to reduce the burden of creating a good model.
37 |
38 | ## Normalizing the input features
39 |
40 | In many cases, the scale of the input features varies greatly. Whereas some variables might have data point values consisting of small integers, others might have values that range up to a hundred or a thousand.
41 |
42 | Scaling the data point values for each feature variable to a similar scale improves training by altering the cost function such that gradient descent becomes easier.
43 |
44 | This can be visualized when considering only two variables in a cost function. If one has very large values and the other has small values, then the resultant 3D graph will be very elongated (in the axis of the large values). Gradient descent must now find a long convoluted way of _traveling_ down this gradient.
45 |
46 | By creating a similar, small scale for each of the feature variables, the graph of the cost function becomes more _uniform_ (closer to a symmetrical bowl), which makes gradient descent faster.
47 |
48 | Scaling is most often achieved through normalization, perhaps more properly referred to as standardization. A mean and a standard deviation are calculated for each feature variable in the training set. Each element of each feature variable then has the mean for that variable subtracted from it, and the result is divided by the standard deviation for that variable, as shown in equation (1).
49 |
50 | $$\frac{x_i - \mu}{\sigma} \tag{1}$$
51 |
52 | The values for $\mu$ and $\sigma$ for each variable in the training set are retained and used to normalize the features in the test set too. This is an important step. It is incorrect to use an overall $\mu$ and $\sigma$ calculated over the combined training and test sets, and it is also incorrect to use a separate $\mu$ and $\sigma$ for the test set.
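
The code below is a small illustration of this idea, using a single hypothetical feature variable split into a training and a test portion (the variable names are purely illustrative).

```{r}
set.seed(1)
train.feature <- rnorm(80, mean = 50, sd = 12)  # hypothetical training feature
test.feature <- rnorm(20, mean = 50, sd = 12)   # hypothetical test feature

mu <- mean(train.feature)   # computed on the training set only
sigma <- sd(train.feature)

train.scaled <- (train.feature - mu) / sigma
test.scaled <- (test.feature - mu) / sigma  # the same mu and sigma are re-used
```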
53 |
54 | ## Vanishing and exploding gradients
55 |
56 | Consider the multiplication of two rational numbers in the domain $\left( 0,1 \right)$. A few examples are shown in equation (2).
57 |
58 | $$\frac{1}{2} \times \frac{2}{3} = \frac{1}{3} \\ \frac{1}{4} \times \frac{9}{10} = \frac{9}{40} \\ \frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}, \forall a < b, c < d,\left\{a,b,c,d\right\}\in\mathbb{Z}^+ \tag{2}$$
59 |
60 | With these constraints it can be shown that the inequalities in equation (3) hold.
61 |
62 | $$\left( \frac{a}{b} > \frac{ac}{bd} \right) \wedge \left( \frac{c}{d}>\frac{ac}{bd} \right) \tag{3}$$
63 |
64 | If biases are omitted for the sake of simplicity and with a linear activation function $g \left( z \right) = z$, $\hat{y}$ can be calculated as shown in equation (4).
65 |
66 | $$\hat{y} = W^{\left[ l \right]} \cdot W^{\left[ l-1 \right]} \cdot W^{\left[ l-2 \right]} \cdot \ldots \cdot W^{\left[ 2 \right]} \cdot W^{\left[ 1 \right]} \cdot x \tag{4}$$
67 |
68 | With appropriate dimensions and with weight values in the domain $\left( 0,1 \right)$, the product of these weight matrices will have element values approaching zero as $l$ increases. In other words, given a sufficiently deep network, there is the threat of the computed values (and, as discussed below, the gradients) approaching zero. This is known as the _vanishing gradient problem_.
69 |
70 | By a similar argument and for all weight values larger than $1$, the parameters will increase in size, known as the _exploding gradient problem_.
71 |
72 | A similar argument yet again holds for the derivatives during gradient descent (backpropagation).
73 |
74 | The problem of small weights is compounded by the relatively slower gradient descent which in turn might take many epochs to converge.
75 |
76 | A _partial_ solution lies in the random selection of the initial weight values. The goal is to set the variance of the weight matrix to the reciprocal of the number of input nodes which is to be multiplied by that matrix. (When using the rectified linear unit activation (ReLU) function, $\frac{2}{n}$, where $n$ is the number of input nodes, is a better choice.)
77 |
78 | Setting the variance of a weight matrix to $\frac{2}{n}$ (for ReLU) is achieved by multiplying each element of a standard-normal random matrix by the square root of $\frac{2}{n}$. For tanh activation, $\frac{1}{n}$ is used instead; this is known as _Xavier initialization_. Note that Xavier initialization is no longer as commonly used, since ReLU has supplanted the hyperbolic tangent function, `tanh`, as the popular activation function.
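
The code below is a minimal sketch of this scaling in plain `R`; the layer sizes are arbitrary and only serve to show the effect on the variance.

```{r}
set.seed(1)
n.in <- 64   # hypothetical number of nodes feeding into the layer
n.out <- 32  # hypothetical number of nodes in the layer

# ReLU-oriented initialization: scale standard-normal values by sqrt(2 / n.in)
W <- matrix(rnorm(n.out * n.in), nrow = n.out, ncol = n.in) * sqrt(2 / n.in)
var(as.vector(W))  # close to 2 / n.in
```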
79 |
80 | ## Mini-batch gradient descent
81 |
82 | Datasets can have many features and millions of samples. To take a single step during gradient descent requires the completion of an entire epoch.
83 |
84 | The process of gradient descent can be _hastened_ if the training sample set can be broken up into parts, called _mini-batches_. During an epoch, gradient descent can take place after each mini-batch so that at the end of the epoch, a lot of progress can potentially be made. The process of forward propagation and backpropagation takes place during each mini-batch process. An epoch still refers to completing all of the mini-batches.
85 |
86 | The term _batch_ refers to the complete training set. Note, though, that in code the argument `batch_size` is used; this actually refers to the mini-batch sample size.
87 |
88 | The extreme form of mini-batch size is $1$. Every sample in the training dataset is its own mini-batch. The gradient descent that results is called _stochastic gradient descent_.
89 |
90 | In practice the ideal mini-batch size lies somewhere in between the extremes of using the whole batch and using a single sample. As a rule of thumb, the whole dataset can be used when it is relatively small and will not penalize the overall time taken to converge to a minimum value for the cost function. For bigger datasets, a power of $2$ such as $16,32,64,128,256,512$ is useful as mini-batch size. In most cases this works well with computer memory architecture and optimizes its use. Irrespective of the size used, it is important that it fits within the memory allocation of the central processing unit (CPU) or graphics processing unit (GPU) of the computer.
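
As a conceptual sketch of how a training set is divided (Keras performs this internally via the `batch_size` argument), the sample indices can be shuffled and split as follows; the sizes used are arbitrary.

```{r}
set.seed(1)
n.samples <- 1000
mini.batch.size <- 128

shuffled <- sample(n.samples)  # shuffle the sample indices at the start of an epoch
mini.batches <- split(shuffled, ceiling(seq_along(shuffled) / mini.batch.size))

length(mini.batches)   # number of mini-batches per epoch
lengths(mini.batches)  # 128 samples each, with the remainder in the last mini-batch
```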
91 |
92 | ## Gradient descent with momentum
93 |
94 | Gradient descent with _momentum_ uses the concept of an exponentially weighted moving average to increase the rate of gradient descent. Instead of updating the parameters during backpropagation with the usual learning rate times the partial derivative of the specific parameter, note is kept of these partial derivatives. An _exponentially weighted moving average_ (EWMA) is calculated which is then instead multiplied by the learning rate.
95 |
96 | An EWMA considers sequential values and calculates a moving average along each of the values such that there is an exponential decay in how previous values in the sequence contribute to the current average.
97 |
98 | The code chunk below creates data point values along the $x$-axis and then calculates the sine of each of these values for the $y$-axis, but adds some random noise and the integer $1$ to each value.
99 |
100 | ```{r Creating values for a sine curve with random noise}
101 | x = seq(from = 0, to = 2*pi, by = pi/180)
102 | y = sin(x) + rnorm(length(x), mean = 0, sd = 0.1) + 1
103 | ```
104 |
105 | __Figure 1__ below shows the sine curve and the data with noise.
106 |
107 | ```{r Sine function, fig.cap="Fig 1 Random noise along sine curve"}
108 | f1 <- plot_ly(x = x,
109 | y = y,
110 | name = "data",
111 | type = "scatter",
112 | mode = "markers") %>%
113 | add_trace(x = x,
114 | y = sin(x) + 1,
115 | name = "sine",
116 | type = "scatter",
117 | mode = "lines") %>%
118 | layout(title = "Data point values along sine curve",
119 | xaxis = list(title = "Input values"),
120 | yaxis = list(title = "Output values"))
121 | f1
122 | ```
123 |
124 | For a given coefficient, $\beta$, a moving average, $v_i$, over each of the output values $y_i$, is given in equation (1).
125 |
126 | $$v_i = \beta \times v_{i-1} + \left( 1 - \beta \right) \times y_i \tag{1}$$
127 |
128 | The code chunk below uses a for loop over the output values. __Figure 2__ shows the EWMA for $\beta = 0.9$.
129 |
130 | ```{r Added EWMA, fig.cap="Added EWMA"}
131 | N <- length(x)
132 | beta <- 0.9
133 | v <- vector(length = N)
134 | for (i in 2:N){
135 | v[i] <- (beta * v[i - 1]) + ((1 - beta) * y[i])
136 | }
137 | f1 <- f1 %>% add_trace(x = x,
138 | y = v,
139 | name = "EWMA",
140 | type = "scatter",
141 | mode = "lines")
142 | f1
143 | ```
144 |
145 | Note the initial start at zero and the time taken to _catch up_. This is usually of no consequence in deep neural network training as training will occur over many epochs.
146 |
147 | Expansion of equation (1) for a specific value of $\beta$ shows that the number of previous data points over which the average is effectively computed is approximately given in equation (2).
148 |
149 | $$\approx \frac{1}{1 - \beta}\tag{2}$$
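
A quick numerical illustration of this approximation is shown below (the values of $\beta$ are arbitrary).

```{r}
beta.values <- c(0.5, 0.9, 0.98)
round(1 / (1 - beta.values))  # roughly 2, 10, and 50 previous points respectively
```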
150 |
151 | The effect of an EWMA is that the components of gradient descent that are orthogonal (in higher dimensional space) to the idealized direction are averaged out over time, while those in the correct direction add up, hence the term _momentum_, i.e. gradient descent _builds up momentum in the right direction_.
152 |
153 | Equation (3) below shows that instead of the partial derivative of the weight being stored, it is updated as an exponential moving average for some coefficient $\beta \in \left[ 0,1 \right]$.
154 |
155 | $$V_{\partial W_{i}} = \beta_v V_{\partial W_{i-1}} + \left( 1 - \beta_v \right) \partial W_{i} \tag{3}$$
156 |
157 | This exponential moving average update of the derivative is then used to update the weight as shown in equation (4).
158 |
159 | $$W_{i} = W_{i-1} - \alpha V_{\partial W_{i}} \tag{4}$$
160 |
161 | ## Root mean square propagation
162 |
163 | _Root mean square propagation_ (RMSprop) also attempts to speed up gradient descent. It differs from momentum by squaring the value of the derivative in each iteration. The change to equations (3) and (4) is reflected in equation (5).
164 |
165 | $$S_{\partial W_{i}} = \beta_s S_{\partial W_{i-1}} + \left( 1 - \beta_s \right) \partial W_{i}^2 \\ W_{i} = W_{i-1} - \alpha \frac{\partial W_{i}}{\sqrt{S_{\partial W_{i}}}} \tag{5}$$
166 |
167 | ## Combining momentum and root mean square propagation
168 |
169 | One of the most widely used optimization algorithms for gradient descent combines momentum and RMSprop into _adaptive moment estimation_ (ADAM).
170 |
171 | One addition to the combination of these two algorithms is the correction of the bias that exists at the start of exponentially weighted moving average calculations. For any iteration, $t$, the current average is simply divided by a correction factor, as shown in equation (6).
172 |
173 | $$\rho_{\partial W}^{\text{corrected}} = \frac{\rho_{\partial W}}{1 - \beta^t}\tag{6}$$
174 |
175 | Here $\rho$ refers to either $V$ as for momentum or $S$ as for RMSprop. This corrects for small values of $t$, but for larger values of $t$ the denominator approaches $1$ and makes very little difference.
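
As a small illustration, the correction can be applied to the EWMA computed in __Figure 2__ above, re-using the `v`, `beta`, and `N` objects from that code chunk.

```{r}
iter <- 1:N
v.corrected <- v / (1 - beta^iter)
head(cbind(v, v.corrected))  # the correction mainly affects the first few values
```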
176 |
177 | The parameter update is shown in equation (7).
178 |
179 | $$W_{i+1} = W_{i} - \alpha \frac{V_{\partial W_i}^{\text{corrected}}}{\sqrt{S_{\partial W_i}^{\text{corrected}}}} \tag{7}$$
180 |
181 | Note that ADAM requires the setting of hyperparameters $\alpha$, $\beta_v$, and $\beta_s$. Typical values include $\beta_v = 0.9$ and $\beta_s = 0.999$.
182 |
183 | ## Learning rate decay
184 |
185 | As the values of the parameters converge to a minimum, a smaller learning rate can prevent _overshoot_. Equation (8) shows how to decrease the value of the learning rate, $\alpha$, at a decay rate, $\eta$, over each epoch, $\mathscr{E}$.
186 |
187 | $$\alpha_{\mathscr{E}+1} = \frac{1}{1 + \eta \left(\mathscr{E}+1\right)} \alpha_{\mathscr{E}}\tag{8}$$
188 |
189 | The decay rate, $\eta$, is another hyperparameter that must be set.
190 |
191 | Note that there are a number of other decay types such as exponential decay and staircase decay.
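
The code below is a short sketch of equation (8), with arbitrary (hypothetical) values for the initial learning rate and the decay rate.

```{r}
alpha <- 0.01  # hypothetical initial learning rate
eta <- 0.05    # hypothetical decay rate
alphas <- numeric(10)
for (epoch in 1:10) {
  alpha <- alpha / (1 + eta * epoch)  # equation (8): the learning rate shrinks after every epoch
  alphas[epoch] <- alpha
}
round(alphas, 5)
```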
192 |
193 | ## Batch normalization
194 |
195 | Just as with normalizing the input variables, the values of each of the nodes in a hidden layer can also be normalized. This is referred to as _batch normalization_.
196 |
197 | It is most common to normalize the values of the nodes in a layer before applying the activation function, although normalization after applying the activation is also possible.
198 |
199 | In standard form, the normalization of every node value $z^{\left( i \right)}$ is shown in equation (9).
200 |
201 | $$z_{\text{norm}}^{\left( i \right)} = \frac{z^{\left( i \right)} - \mu}{\sigma}\tag{9}$$
202 |
203 | Rather than leaving the normalized values fixed, they can be re-scaled and shifted using two learnable parameters, $\gamma$ and $\beta$, as shown in equation (10).
204 |
205 | $$\tilde{z}^{\left( i \right)} = \gamma z_{\text{norm}}^{\left( i \right)} + \beta\tag{10}$$
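
In `Keras` for `R`, batch normalization is available as its own layer, `layer_batch_normalization()`. The code below is a minimal sketch (the layer sizes and input shape are arbitrary, and the `keras` package is assumed to be installed as in the other chapters).

```{r}
library(keras)

bn_model <- keras_model_sequential() %>%
  layer_dense(units = 16, input_shape = 10) %>%
  layer_batch_normalization() %>%  # normalize the node values before the activation
  layer_activation("relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
```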
206 |
207 | ## Conclusion
208 |
209 | There are a variety of methods to improve the training of a network. These unfortunately bring a plethora of hyperparameters and network setups that take time, vigilance, and experience to implement.
210 |
211 | The use of these improvements to a network is mathematically complex, but easy to express in code. TensorFlow and other neural network platforms have built-in functions that execute all the ideas mentioned in this chapter.
212 |
--------------------------------------------------------------------------------
/Implementing regularization and dropout.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Implementing regularization and dropout"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | number_sections: no
7 | toc: yes
8 | ---
9 |
10 | ```{r setup, include = FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | setwd(getwd())
13 | ```
14 |
15 | ```{r libraries, message=FALSE, warning=FALSE}
16 | library(keras)
17 | library(readr)
18 | library(tidyr)
19 | library(tibble)
20 | library(plotly)
21 | ```
22 |
23 |
24 |
35 |
36 | 
37 |
38 | ## Introduction
39 |
40 | The preceding chapters introduced methods to decrease the problem of overfitting or high variance. Overfitting results in a model with trained parameter values that fit the training data very well, but that performs poorly with respect to test or real-world data.
41 |
42 | This chapter shows the implementation of $\ell_2$-regularization and dropout to reduce overfitting. Models will be created to illustrate the problem of overfitting, before showing how to add the mentioned solutions. This will be done with an example of _sentiment analysis_.
43 |
44 | The dataset used in this chapter is built into Keras and contains $50000$ examples of written text. The text is labeled according to a sentiment that serves as target variable and is either _positive_ or _negative_ (encoded as integers).
45 |
46 | Text must be converted into computable data before use in a deep learning network. This is done by selecting a fixed number of words that become the feature variables (one word is one variable). If any of the specific words occur in the text of a specific subject, a $1$ is entered as data point value for that variable. Each of the words that are not contained in the text for that subject, receives a $0$ as data point value.
47 |
48 | ## The dataset
49 |
50 | The `dataset_imdb` dataset can be downloaded with `Keras`. This is not a normal dataset as would exist in a spreadsheet file. Not only does it contain the mentioned $50000$ text samples, but also a list of common words. During the download of the dataset, the number of words that will be used as the feature variables can be specified. In the code chunk below, the `5000` most common words are selected.
51 |
52 | ```{r dataset}
53 | num_words <- 5000
54 | imdb <- dataset_imdb(num_words = num_words)
55 | ```
56 |
57 | The dataset as downloaded contains $25000$ training and $25000$ test subjects. Note that this train-test split is not the norm and should not be used in general. In the code chunk below, each of the two parts are split into feature and target sets.
58 |
59 | ```{r train_test_split}
60 | c(train_data, train_labels) %<-% imdb$train
61 | c(test_data, test_labels) %<-% imdb$test
62 | ```
63 |
64 | ## Multi-hot-encoding
65 |
66 | The introduction to this chapter alluded to the use of _multi-hot-encoding_. Whereas the _one-hot-encoding_ introduced before had a built-in function, `to_categorical()`, a user-defined function must be created for multi-hot-encoding.
67 |
68 | ```{r multi_hot_encoding_function}
69 | multi_hot_sequences <- function(sequences, dimension) {
70 | multi_hot <- matrix(0, nrow = length(sequences), ncol = dimension)
71 | for (i in 1:length(sequences)) {
72 | multi_hot[i, sequences[[i]]] <- 1
73 | }
74 | multi_hot
75 | }
76 | ```
77 |
78 | The `train_data` and `test_data` feature set objects are multi-hot-encoded below.
79 |
80 | ```{r multi_hot_encode_data}
81 | train_data <- multi_hot_sequences(train_data, num_words)
82 | test_data <- multi_hot_sequences(test_data, num_words)
83 | ```
84 |
85 | To illustrate the concept of multi-hot-encoding, the features $1$ through $10$ of the first subject of the `test_data` object are shown.
86 |
87 | ```{r demonstrating multi hot encoding}
88 | test_data[1, 1:10]
89 | ```
90 | This subject had all of the $10$ most common words in it, except for word number three.
91 |
92 | ## Baseline model
93 |
94 | This dataset was chosen because a normal densely connected neural network will demonstrate high variance on it. The code below creates a model called `baseline_model`. It contains two hidden layers with `16` nodes each. Both layers have the rectified linear unit (ReLU) as activation function. The output layer is a single node with the logistic sigmoid function as activation function. This outputs a value in the domain $\left[ 0,1 \right]$, which works well since the target variable is encoded as $0$ and $1$. Note that this is a different way of constructing the output from the one-hot-encoding seen before.
95 |
96 | ADAM is used as the optimizer and binary cross-entropy as the loss function. These concepts will be discussed in a following chapter.
97 |
98 | ```{r baseline model}
99 | baseline_model <-
100 | keras_model_sequential() %>%
101 | layer_dense(units = 16, activation = "relu", input_shape = num_words) %>%
102 | layer_dense(units = 16, activation = "relu") %>%
103 | layer_dense(units = 1, activation = "sigmoid")
104 |
105 | baseline_model %>% compile(
106 | optimizer = "adam",
107 | loss = "binary_crossentropy",
108 | metrics = list("accuracy")
109 | )
110 |
111 | baseline_model %>% summary()
112 | ```
113 |
114 | The training data and training target are now fed through the network. The mini-batch size is `512` and there are `20` epochs. The test set and its target are used as the validation data.
115 |
116 | ```{r fit baseline model, message=FALSE, warning=FALSE}
117 | baseline_history <- baseline_model %>% fit(
118 | train_data,
119 | train_labels,
120 | epochs = 20,
121 | batch_size = 512,
122 | validation_data = list(test_data, test_labels),
123 | verbose = 2
124 | )
125 | ```
126 |
127 | When this code is executed in RStudio, the high variance is clearly seen. In an attempt to lessen this overfitting a smaller model is used below. There are only four nodes in each of the two hidden layers. The rest of the hyperparameters are the same. The two code chunks below create the model and then train it.
128 |
129 | ```{r smaller model}
130 | smaller_model <-
131 | keras_model_sequential() %>%
132 | layer_dense(units = 4, activation = "relu", input_shape = num_words) %>%
133 | layer_dense(units = 4, activation = "relu") %>%
134 | layer_dense(units = 1, activation = "sigmoid")
135 |
136 | smaller_model %>% compile(
137 | optimizer = "adam",
138 | loss = "binary_crossentropy",
139 | metrics = list("accuracy")
140 | )
141 |
142 | smaller_model %>% summary()
143 | ```
144 |
145 | ```{r fit smaller model, message=FALSE, warning=FALSE}
146 | smaller_history <- smaller_model %>% fit(
147 | train_data,
148 | train_labels,
149 | epochs = 20,
150 | batch_size = 512,
151 | validation_data = list(test_data, test_labels),
152 | verbose = 2
153 | )
154 | ```
155 |
156 | A much bigger network with `512` nodes in each of the two hidden layers is created below. This creates more learning capacity, but also more overfitting.
157 |
158 | ```{r bigger model}
159 | bigger_model <-
160 | keras_model_sequential() %>%
161 | layer_dense(units = 512, activation = "relu", input_shape = num_words) %>%
162 | layer_dense(units = 512, activation = "relu") %>%
163 | layer_dense(units = 1, activation = "sigmoid")
164 |
165 | bigger_model %>% compile(
166 | optimizer = "adam",
167 | loss = "binary_crossentropy",
168 | metrics = list("accuracy")
169 | )
170 |
171 | bigger_model %>% summary()
172 | ```
173 |
174 | ```{r fit bigger model, message=FALSE, warning=FALSE}
175 | bigger_history <- bigger_model %>% fit(
176 | train_data,
177 | train_labels,
178 | epochs = 20,
179 | batch_size = 512,
180 | validation_data = list(test_data, test_labels),
181 | verbose = 2
182 | )
183 | ```
184 |
185 | A simple line chart is created using the `plotly` package. __Figure 1__ compares the losses of the training and validation sets for each of the three models. Note the high variance.
186 |
187 | ```{r plotting models}
188 | compare_cx <- data.frame(
189 | baseline_train = baseline_history$metrics$loss,
190 | baseline_val = baseline_history$metrics$val_loss,
191 | smaller_train = smaller_history$metrics$loss,
192 | smaller_val = smaller_history$metrics$val_loss,
193 | bigger_train = bigger_history$metrics$loss,
194 | bigger_val = bigger_history$metrics$val_loss
195 | ) %>%
196 | rownames_to_column() %>%
197 | mutate(rowname = as.integer(rowname)) %>%
198 | gather(key = "type", value = "value", -rowname)
199 |
200 | p <- plot_ly(compare_cx,
201 | x = ~rowname,
202 | y = ~value,
203 | color = ~type,
204 | type = "scatter",
205 | mode = "lines") %>%
206 | layout(title = "Fig 1 Comparing model losses",
207 | xaxis = list(title = "Epochs"),
208 | yaxis = list(title = "Loss"))
209 | p
210 | ```
211 |
212 | With such high variance either $\ell_2$-regularization or dropout can be implemented to try and reduce the overfitting.
213 |
214 | ## $\ell_2$-regularization
215 |
216 | The `l2_model` model created below has regularization implemented in both hidden layers. There are various ways to write the code for this. The simplest is to specify the regularization as an argument to the relevant layer, where the value for $\lambda$ is also specified.
217 |
218 | ```{r l2model}
219 | l2_model <-
220 | keras_model_sequential() %>%
221 | layer_dense(units = 16, activation = "relu", input_shape = num_words,
222 | kernel_regularizer = regularizer_l2(l = 0.001)) %>%
223 | layer_dense(units = 16, activation = "relu",
224 | kernel_regularizer = regularizer_l2(l = 0.001)) %>%
225 | layer_dense(units = 1, activation = "sigmoid")
226 |
227 | l2_model %>% compile(
228 | optimizer = "adam",
229 | loss = "binary_crossentropy",
230 | metrics = list("accuracy")
231 | )
232 |
233 | l2_model %>% summary()
234 | ```
235 |
236 | ```{r fit l2 model, message=FALSE, warning=FALSE}
237 | l2_history <- l2_model %>% fit(
238 | train_data,
239 | train_labels,
240 | epochs = 20,
241 | batch_size = 512,
242 | validation_data = list(test_data, test_labels),
243 | verbose = 2
244 | )
245 | ```
246 |
247 | __Figure 2__ below shows the difference in variance between the baseline and the new model.
248 |
249 | ```{r plotting baseline vs regularization}
250 | compare_cx <- data.frame(
251 | baseline_train = baseline_history$metrics$loss,
252 | baseline_val = baseline_history$metrics$val_loss,
253 | l2_train = l2_history$metrics$loss,
254 | l2_val = l2_history$metrics$val_loss
255 | ) %>%
256 | rownames_to_column() %>%
257 | mutate(rowname = as.integer(rowname)) %>%
258 | gather(key = "type", value = "value", -rowname)
259 |
260 | p <- plot_ly(compare_cx,
261 | x = ~rowname,
262 | y = ~value,
263 | color = ~type,
264 | type = "scatter",
265 | mode = "lines") %>%
266 | layout(title = "Fig 2 Comparing baseline and regularization model losses",
267 | xaxis = list(title = "Epochs"),
268 | yaxis = list(title = "Loss"))
269 | p
270 | ```
271 |
272 | ## Dropout
273 |
274 | Dropout is implemented in the model below. It is added as a separate layer following each of the hidden layers. The value for $\kappa$ is set at `0.6`.
275 |
276 | ```{r dropout model}
277 | dropout_model <-
278 | keras_model_sequential() %>%
279 | layer_dense(units = 16, activation = "relu", input_shape = num_words) %>%
280 | layer_dropout(0.6) %>%
281 | layer_dense(units = 16, activation = "relu") %>%
282 | layer_dropout(0.6) %>%
283 | layer_dense(units = 1, activation = "sigmoid")
284 |
285 | dropout_model %>% compile(
286 | optimizer = "adam",
287 | loss = "binary_crossentropy",
288 | metrics = list("accuracy")
289 | )
290 |
291 | dropout_model %>% summary()
292 | ```
293 |
294 | ```{r fit dropout model, message=FALSE, warning=FALSE}
295 | dropout_history <- dropout_model %>% fit(
296 | train_data,
297 | train_labels,
298 | epochs = 20,
299 | batch_size = 512,
300 | validation_data = list(test_data, test_labels),
301 | verbose = 2
302 | )
303 | ```
304 |
305 | __Figure 3__ shows the difference in variance between the baseline and the dropout models.
306 |
307 | ```{r plotting baseline vs dropout}
308 | compare_cx <- data.frame(
309 | baseline_train = baseline_history$metrics$loss,
310 | baseline_val = baseline_history$metrics$val_loss,
311 | dropout_train = dropout_history$metrics$loss,
312 | dropout_val = dropout_history$metrics$val_loss
313 | ) %>%
314 | rownames_to_column() %>%
315 | mutate(rowname = as.integer(rowname)) %>%
316 | gather(key = "type", value = "value", -rowname)
317 |
318 | p <- plot_ly(compare_cx,
319 | x = ~rowname,
320 | y = ~value,
321 | color = ~type,
322 | type = "scatter",
323 | mode = "lines") %>%
324 | layout(title = "Fig 3 Comparing baseline and dropout model losses",
325 | xaxis = list(title = "Epochs"),
326 | yaxis = list(title = "Loss"))
327 | p
328 | ```
329 |
330 | ## Comparing regularization and dropout
331 |
332 | As a final comparison, __Figure 4__ below shows the difference in loss between $\ell_2$-regularization and dropout.
333 |
334 | ```{r plotting regularization dropout}
335 | compare_rd <- data.frame(
336 | l2_train = l2_history$metrics$loss,
337 | l2_val = l2_history$metrics$val_loss,
338 | dropout_train = dropout_history$metrics$loss,
339 | dropout_val = dropout_history$metrics$val_loss
340 | ) %>%
341 | rownames_to_column() %>%
342 | mutate(rowname = as.integer(rowname)) %>%
343 | gather(key = "type", value = "value", -rowname)
344 |
345 | p <- plot_ly(compare_rd,
346 | x = ~rowname,
347 | y = ~value,
348 | color = ~type,
349 | type = "scatter",
350 | mode = "lines") %>%
351 | layout(title = "Fig 4 Comparing regularization and dropout model losses",
352 | xaxis = list(title = "Epochs"),
353 | yaxis = list(title = "Loss"))
354 | p
355 | ```
356 |
357 | Note that the choices of architecture and hyperparameters shown in this chapter are specific to this dataset. Architecture and hyperparameter choices are not transferable in any meaningful way and the designer of a neural network must work hard at getting these correct for every new problem. Some guidelines and experience do help, but there is no escaping a sometimes long and arduous road to the best-performing model.
358 |
359 | ## Conclusion
360 |
361 | This chapter showed the implementation of $\ell_2$-regularization and dropout and their effect on high variance.
--------------------------------------------------------------------------------
/A brief introduction to R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "A brief introduction to R"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 |
25 |
26 | 
27 |
28 | ## Introduction
29 |
30 | `R` is a programming language primarily designed for statistical computing. As with many things in the world of computer science, it has a fascinating history. More information on the topic is available on Wikipedia at https://en.wikipedia.org/wiki/R_(programming_language)
31 | When installed on a computer `R` consists of certain base and core parts. Over many years the language has been extended by countless packages or libraries. These are downloadable and installable code that greatly enhances the use of `R`.
32 |
33 | To make use of `R`, it has to be downloaded and installed. RStudio is a graphical user interface for `R`. It is also a program that is downloaded and installed. RStudio is a very powerful program which makes the use of `R` a pleasure. It is not only an environment in which `R` programs and scripts can be written, but it can also be used to create books, blog posts, documents, and dissertations, making it an ideal tool for the novice and expert alike.
34 |
35 | ## Downloading and installing
36 |
37 | `R` is available as a downloadable file for most operating systems. The download files are available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/ . Simply follow the instructions for the required operating system.
38 |
39 | RStudio is available from https://www.rstudio.com/ . This website is also a rich resource for information on `R`. Once again, the appropriate file is simply downloaded and installed.
40 |
41 | ## The RStudio environment
42 |
43 | The main menu and icon bar should be somewhat familiar to anyone who has used a word processor or a spreadsheet program.
44 |
45 | The main RStudio program is divided into several panels. The arrangement and even look (colors) for these panels are customizable through the menu (Global Options under Tools).
46 |
47 | The upper-left panel is the Main working area where actual code is written. It has its own set of icons for specific tasks.
48 |
49 | The bottom-left panel provides a Console. This space shows the output of code created in the Main panel. Code can also be written here. The bottom-left panel also houses a tab for the Terminal. This provides access to the operating system and folder structure of the computer.
50 |
51 | The top-right panel has three tabs by default. These show information about the current Environment (information about objects created in the code), about the history of recent code, and about external Connections.
52 |
53 | The bottom-right panel has several tabs. One shows the Files on the local computer. The next shows any Plots created by code. These plots can be saved as separate image files on the local computer for later use. The Packages tab serves as main center for downloading additional packages and libraries. The Help tab provides extensive help on everything `R` has to offer. It has a convenient search bar. Simply enter an `R` command in the search box and the Help tab will show useful information for that command. The Viewer tab can show created documents, websites, book chapters and all objects created in `R`.
54 |
55 | ## Creating a file
56 |
57 | Clicking File > New File (on the main menu) shows the type of files that can be created. This course uses mostly the R Script and R Markdown... selections.
58 |
59 | A script creates a simple file in which code can be written and executed at will. An R Markdown file is a rich environment that allows for the creation of documents such as web pages, documents, textbooks, applications, and much more.
60 |
61 | The introduction to the language that follows assumes that a new R Script has been created.
62 |
63 | ## Simple arithmetic
64 |
65 | A line of code can be entered in a script and executed by hitting the Run button on the top-right of the Main panel. This must be done when the cursor is anywhere on the actual line of code.
66 |
67 | Try typing `2 + 2` (spaces optional) and hitting the Run icon. The Console opens on the bottom-left and displays the line of code, `> 2 + 2`, and below this, the solution, which is `4`. The `[1]` refers to the item number. Many objects in `R` are lists of values. Since there is only a single element in the solution, the `4`, it is the first element and hence labeled `[1]`.
68 |
69 | Try other common arithmetical operations such as `2 - 2`. Multiplication uses the asterisk, `*`. Division is accomplished by the forward slash, `/` symbol. Powers are most conveniently expressed using the caret, `^`, symbol.
70 |
71 | Trigonometric and other transcendental functions are built into `R`. These include `sin()`, `cos()`, and the exponential function, `exp()`. To get Euler's number itself, simply type `exp(1)`, which is $e^1$. The result should be `r exp(1)`.
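
A few of these operations are shown below as a short, purely illustrative code chunk; run each line to see its output in the Console.

```{r}
2 + 2        # addition
2 - 2        # subtraction
2 * 3        # multiplication
10 / 4       # division
2^3          # powers
sin(pi / 2)  # a trigonometric function
exp(1)       # Euler's number
```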
72 |
73 | The parentheses denote the commands above as functions. Functions take an input and produce an output. The input is stated within the parentheses and the values given there are called _arguments_. The arguments are specific to each function and are explicitly required. Try typing the name of a function into the search bar in the Help tab in the bottom-right panel. It gives instructions on the use of the arguments for that function. There is more information on functions later in this chapter.
74 |
75 | ## Computer variables
76 |
77 | Solutions to code can be saved as objects. These are given a name, created by the user, and the content of the object is stored in the computer's memory for later retrieval. The names of these objects follow certain conventions. Their names are first and foremost guided by the actual content of the object. A wisely chosen name indicates what to expect in the object.
78 | There are restrictions on the names of objects. Built-in function names must not be used and no illegal characters such as spaces can be used. Two popular conventions are snake_case and camelCase. In the first instance, proper words are connected by an underscore and in the second case the first word starts with a lowercase letter, but each subsequent word starts with an uppercase letter. Another common practice is to separate the words with periods (full stops).
79 |
80 | The computer variable name is followed by a `<-` symbol. The keyboard shortcut for this is ALT+- (Windows and Linux) or OPT+- (MAC). Both of these are achieved by holding down the ALT or OPT key and hitting the minus key. Follow this up by the value that is to be saved in the object.
81 |
82 | In the code below are some examples. Note the use of the `#` symbol. Any code that follows this symbol (in a line) is ignored by `R` and only serves as a way to leave comments about code. This is of great help when viewing code later or for handing code over to others for use.
83 |
84 | ```{r}
85 | # An object that holds text
86 | myText <- "This is text!" # Note the use of quotation marks
87 |
88 | # An object that holds the solution to an expression
89 | myAnswer <- 4 + 4
90 | ```
91 |
92 |
93 | The content of an object can be retrieved (or even referred to by the computer variable name in other lines of code).
94 |
95 | ```{r}
96 | myText
97 | ```
98 |
99 | ## Lists
100 |
101 | As mentioned, lists are very common objects in `R`. In the example below a computer variable named `temperature` holds five elements (all of the same type, i.e. integers). These values are stored as a `vector`, which is created with the `c()` function.
102 |
103 | ```{r}
104 | temperature <- c(72, 76, 80, 65, 69)
105 | temperature
106 | ```
107 |
108 | Lists can be created as sequences using the `seq()` function. It typically takes a start, stop, and step-size argument. The code below creates an object named `myList` that has elements starting at `1` and ending at `10`, with a step-size of `0.5`.
109 |
110 | ```{r}
111 | myList <- seq(1, 10, 0.5)
112 | myList
113 | ```
114 |
115 | Note the bracket notation at the start of each row (should the list of numbers overflow one line). It gives the number (the address) of the element that starts each new line.
116 |
117 | The number of elements in a list can be expressed using the `length()` function and passing the object name as the argument.
118 |
119 | ```{r}
120 | length(myList)
121 | ```
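
Individual elements can be retrieved with square-bracket indexing, using the element number as its address. The short illustration below uses the `myList` object created above.

```{r}
myList[3]    # the third element
myList[1:4]  # the first four elements
```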
122 |
123 | ## For loops
124 |
125 | `R` code can be made to run iteratively over some sequence. This is done by creating a `for` loop. Such a loop uses a counter to repeat a task a specified number of times.
126 |
127 | The code below creates a sequence of integers ($1$ through $10$) and stores this as a number vector in the object with the computer variable name `my.numbers`. A second object is created named `sum.total` that holds the integer value $0$.
128 |
129 | The loop uses the keyword `for`, which then specifies the parameters of this loop in parentheses. Note the use of a placeholder, `i`, for the iteration. It states that the loop should run through each value in `my.numbers`, i.e. $1,2,3,\ldots,10$. During each loop the content inside the curly braces, `{}`, is executed. In this case it takes the current value in the `sum.total` object and adds the current loop value of the placeholder, `i`, to it.
130 |
131 | During the first loop the current value in `sum.total` is $0$ and in `i` it is $1$. These two values are added (the right-hand side of the assignment) and then passed as the new value, $0 + 1 = 1$, to the `sum.total` object. Note how this differs from a standard algebraic equation in mathematics. In `R`, as in most other computer languages, the right-hand side of an assignment (a line of code with an equal sign) is executed first and the result overwrites what is currently held in the object on the left (or creates it if it does not yet exist).
132 |
133 | During the second loop the value in `i` is $2$ (the next element in `my.numbers`). This is added to the current value in `sum.total`, which is $1$, resulting in $3$. This is now passed as the new value in `sum.total`.
134 |
135 | Adding all the integer values from $1$ through $10$ gives $55$. This is printed in the last line (outside the curly braces).
136 |
137 | ```{r}
138 | my.numbers <- seq(1, 10, 1)
139 | sum.total <- 0
140 | for (i in my.numbers){
141 | sum.total = sum.total + i
142 | }
143 | sum.total
144 | ```
145 |
146 | ## Functions
147 |
148 | There are numerous built-in functions in `R`. These include the keywords followed by a set of parentheses used earlier in this chapter.
149 |
150 | As another example, the code below calculates the average of the sequence created and stored in the object `my.numbers` above. This is achieved using the `mean()` function and passing the object containing the $10$ numbers as argument.
151 |
152 | ```{r}
153 | mean(my.numbers)
154 | ```
155 |
156 | Functions can also be created. The code below performs this task by starting with a name for the new function, `my.mean` (use names that are not part of the built-in set). The `function` keyword stipulates that `my.mean` is not a simple object, but a function. The content of the parentheses that follows is a list of arguments. In this case there is a single argument which is a placeholder for whatever will be passed to the function when it is called.
157 |
158 | Inside the curly braces follows a set of instructions that the new function performs. The first is to create a new object called `number.of.elements` that holds the number of elements in the argument that is passed to the function. This is done through the use of the `length()` function.
159 |
160 | The next set of instructions follow the pattern used in the preceding section describing `for` loops. It starts off by setting the value held in a new object called `cumulative.total` to $0$. The `for` loop iterates through all the elements held in the object passed to the function and iteratively adds them to the `cumulative.total` object.
161 |
162 | The function ends with the `return` keyword that returns the value held inside of the parenthesis that follows. Since the aim of this new function is to return a mean value of the argument passed to the function, it divides the sum total by the number of elements.
163 |
164 | ```{r}
165 | my.mean <- function(vals){
166 | number.of.elements <- length(vals)
167 | cumulative.total <- 0
168 | for (i in vals){
169 | cumulative.total = cumulative.total + i
170 | }
171 | return (cumulative.total / number.of.elements)
172 | }
173 | ```
174 |
175 | The function is now called like any other in `R`. The code below passes the $10$ integers in `my.numbers` to the function. The result is the same as was returned by the built-in `R` function `mean()`.
176 |
177 | ```{r}
178 | my.mean(my.numbers)
179 | ```
180 |
181 |
182 | ## Loading data
183 |
184 | A script file and an R Markdown file can be saved on the computer's disk. Files such as spreadsheet files can be imported from disk. It is useful to keep these in the same folder. The `getwd()` function returns the directory (folder) on the computer where the `R` file is saved. This can be passed as argument to the `setwd()` function to tell the current file where it is saved. All files in this directory can then be accessed by simply using their name. If the `setwd(getwd())` option is not used, then the full address of the file on the computer must be used.
185 |
186 |
187 | ```{r}
188 | setwd(getwd())
189 | ```
190 |
191 | In the code below the `LogisticRegression.csv` spreadsheet file is imported into the current `R` file, giving access to the data in the spreadsheet. This content is saved in an object conveniently named `data`.
192 |
193 | ```{r}
194 | data <- read.csv("LogisticRegression.csv")
195 | ```
196 |
197 | Note that `data` appears in the top-right Environment tab alongside all the other objects created so far. Clicking on the square box at the end of the object opens the content in the Main panel (in a new tab). This can also be done using the `View()` function. Note the uppercase V.
198 |
199 | ```{r}
200 | View(data)
201 | ```
202 |
203 | ## Installing `tensorflow` and `keras`
204 |
205 | TensorFlow is Google's open source framework for tensor calculations used in deep learning. It exists in a package that can be added to `R`.
206 |
207 | Keras is a library of code that uses TensorFlow as a backend. It greatly simplifies writing TensorFlow code that can be laborious. It has become very popular and is even built into the newer version of TensorFlow.
208 |
209 | Adding these packages to `R` requires the addition of the `reticulate` package and the `devtools` package. In Windows, the latter also requires the installation of RTools from https://cran.r-project.org/bin/windows/Rtools/ . Install `reticulate` and `devtools` using the Packages tab in the bottom-right panel.
210 |
211 | To install `tensorflow` and `keras` visit https://tensorflow.rstudio.com/keras/ . A Graphics Processing Unit (GPU) version is also available for systems that have modern NVidia GPUs. In most laptops these can cause problems, though. Once larger datasets, such as those with images, are used in deep learning models, there might not be enough memory in the GPU to train the models. When starting with deep learning it is safe to simply install the central processing unit (CPU) versions of these packages.
212 |
213 | ## Conclusion
214 |
215 | This is by no means a comprehensive introduction to the language, but merely a short primer. The `R` language is easy to learn. Simply start coding. Regularly consult the Help tab and access content (such as the RStudio site) on the web.
216 |
217 | Typing the rest of the content in this course into a new file will greatly facilitate a natural path towards learning to code in `R`. Just do it!
--------------------------------------------------------------------------------
/Deep neural network example using R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Deep neural network example using R"
3 | author: "Dr Juan H Klopper"
4 | output:
5 | html_document:
6 | toc: true
7 | number_sections: false
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | setwd(getwd())
13 | ```
14 |
15 | ```{r}
16 | suppressMessages(library(readr))
17 | suppressMessages(library(keras))
18 | suppressMessages(library(DT))
19 | ```
20 |
21 |
32 |
33 | 
34 |
35 | ## Introduction
36 |
37 | This chapter introduces a densely connected deep neural network with two hidden layers and an output layer. It serves as a first example of the concepts that follow directly from the preceding chapter.
38 |
39 | A densely connected network is the simplest form of a network. Each of the nodes in each of the hidden and output layers is connected to each of the nodes in the layer before it.
40 |
41 | New concepts are introduced as well. The first considers _one-hot-encoding_ of the binary target variable. There is also a new activation function, _softmax_, which will be demonstrated here, but described in more detail in a following chapter.
42 |
43 | One of the most important concepts in machine learning is also shown in this demonstration. It pertains to splitting the data into two sets. The first is a _training set_. This is the data that will be passed to the neural network. The second is the _test set_. This data is kept from the network. Once the network has learned from the training data, it can be tested against the _unseen_ data in the test set so as to determine how well the _learning_ performed.
44 |
45 | ## Data
46 |
47 | With the working directory (folder) set to the directory in which this markdown file is saved, the `read_csv()` function is used below to import a `.csv` file, which resides in the same directory (thereby negating the need to type the full address of the file).
48 |
49 | The file consists of $50000$ observations with $10$ feature variables and a binary target variable.
50 |
51 | The `read_csv()` function is part of the `readr` package that was imported above. It has subtle differences from the standard `read.csv()` function. It creates a _tibble_, which is different from a standard `data.frame`, the latter resulting from importing with `read.csv()`. A tibble displays large datasets better than a data.frame. It also never uses row names and does not store variables as special attributes.
52 |
53 | The code below reads the `.csv` file and then displays a random $1$% of it in a table using the `DT` package. This display is intended for output to a web page. It uses the `datatable()` function. The `data.set` object is passed as argument with square brackets denoting a `row,column` address. The rows are selected at random using the `sample()` function. The first argument of this function states the total number of rows of the full dataset. The second states explicitly not to replace rows once they have been selected. The last argument gives the size, which is `0.01` of the total number of rows. Note the closing parenthesis, `)`, followed by a comma. There is only a space after the comma, which is shorthand for selecting all the columns.
54 |
55 | ```{r}
56 | data.set <- read_csv("SimulatedBinaryClassificationDataset.csv",
57 | col_names = TRUE)
58 | datatable(data.set[sample(nrow(data.set),
59 | replace = FALSE,
60 | size = 0.01 * nrow(data.set)), ])
61 | ```
62 |
63 | The `summary()` function provides descriptive statistics for each of the variables.
64 |
65 | ```{r}
66 | summary(data.set)
67 | ```
68 |
69 |
70 | ## Preparing the data (preprocessing)
71 |
72 | The data structure that exists after importing must be prepared before it can be passed to a neural network. Several steps are involved in this preparation and are given below.
73 |
74 | ### Transformation into a matrix
75 |
76 | The data structure is transformed into a _mathematical_ matrix using the `as.matrix()` function before removing the variable (column) names.
77 |
78 | ```{r}
79 | # Cast dataframe as a matrix
80 | data.set <- as.matrix(data.set)
81 |
82 | # Remove column names
83 | dimnames(data.set) = NULL
84 | ```
85 |
86 | ### Train and test split
87 |
88 | The dataset, which now exists as a matrix, must be split into a training and a test set as mentioned in the introduction. There are various ways in `R` to perform this split. One such method is shown below.
89 |
90 | The `set.seed()` function, with its arbitrary argument `123`, ensures that the random _splitting numbers_ generated to split the data will follow a pattern that is repeated when the code is re-executed later or by others. This simply ensures reproducibility for the sake of this text.
91 |
92 | The code creates an object named `indx`. The `sample()` function creates a random sample. The first argument, `2`, specifies that sampling takes place from the integers $1$ through $2$, i.e. the sample space contains only the two elements, $\left\{1,2\right\}$.
93 |
94 | The next argument stipulates how many samples are required for the `indx` object. In this instance, it is set to the number of rows in the dataset, thereby ensuring that there is a number, either $1$ or $2$, for each row.
95 |
96 | The `replace = TRUE` argument stipulates that the elements $\left\{1,2\right\}$ are replaced after each round of randomly selecting a $1$ or a $2$, thereby ensuring a random sample of more than just two elements.
97 |
98 | The `prob = c()` argument gives the probability that each respective element in the sample space has of being selected during each round.
99 |
100 | ```{r}
101 | # Split for train and test data
102 | set.seed(123)
103 | indx <- sample(2,
104 | nrow(data.set),
105 | replace = TRUE,
106 | prob = c(0.9, 0.1)) # Makes index with values 1 and 2
107 | ```
108 |
109 | The probability of a $1$ being selected is set at $90$% and of a $2$ being selected is set at only $10$%. Note that these values must sum to $100$%. These numbers, $1$ and $2$, are going to be assigned to each row and thereby allow for the split along these two values. The split will therefore create a sub-dataset that contains approximately $90$% of the original dataset and another that will contain the remaining $10$%. This is a choice that the designer of the neural network must make.
110 |
111 | The first sub-dataset is ultimately going to be the training set that is passed to the network from which it will learn the optimum parameters (so as to minimize the cost function). The second will be the test set against which the learned parameters will be tested. Generally, the larger the original dataset, the smaller the second set can be. There are two forces at play. The training set must be as large as possible to maximize the learning phase. The test set, though, must be big enough to be representative of the data as a whole. This ensures generalization to real-world data for which each network is ultimately designed.
112 |
113 | In a tiny dataset containing only $200$ samples, a $10$% test set contains only $20$ samples, which might not be representative. In the case of the dataset used in this chapter, $10$% comprises a massive $5000$ samples (roughly, as the precise number of $2$s is random). This still leaves roughly $45000$ for training. The approximately $5000$ samples should be quite enough to be representative for testing, whilst the $45000$ should be enough to maximize learning.
114 |
115 | The code below is very compact, but achieves a lot. It creates two objects named `x_train` and `x_test`. It is customary to use an `x` when referring to the matrix of feature variables. The `_train` and `_test` suffixes differentiate the two objects for their ultimate roles.
116 |
117 | The square bracket notation references addressing. Each value in a matrix has an address, given by its row number and then its column number, separated by a comma. The code then takes the list of randomly created $1$ and $2$ values in the `indx` object and selects those rows where the `indx` object has a value of `1` to go into the `x_train` object. The column specification `1:10` is shorthand for columns $1$ through $10$, i.e. only the $10$ feature variables.
118 |
119 | ```{r}
120 | # Select only the feature variables
121 | # Take rows with index = 1
122 | x_train <- data.set[indx == 1, 1:10]
123 | x_test <- data.set[indx == 2, 1:10]
124 | ```
125 |
126 | ### Processing the target variable
127 |
128 | The target variable must be split in a similar way. A separate object, `y_test_actual`, is created to hold the ground-truth (actual) target values of the test set for later use. Note the use of indexing (row, column), indicating that these belong to the test set and that only the last column, `11` (the target), is included.
129 |
130 | ```{r}
131 | y_test_actual <- data.set[indx == 2, 11]
132 | ```
133 |
134 | This chapter will use the `softmax` activation function in the output layer (see below). This requires the target variable to be _one-hot-encoded_. The concept is quite simple. Since there are only two elements in the sample space of the target variable in this example, $0$ and $1$, two variables are created by one-hot-encoding. The names of these variables are natural numbers starting at $0$. Consider a target variable that is not a $0$ or a $1$, e.g. _benign_ and _malignant_. The target variable will be a list of the two elements, one for each subject. One-hot-encoding will then create two variables, named $0$ and $1$. The designer of the network might choose to encode benign as $\left\{ 1,0 \right\}$ and malignant as $\left\{ 0,1 \right\}$. In this case the first variable ($0$) references benign and the second ($1$) references malignant. If a particular subject is benign, the first variable contains a $1$ and the second contains a $0$.
135 |
136 | It should be clear then why this encoding is referred to as one-hot-encoding. A number of _dummy_ variables are created, the number being equal to the number of elements in the sample space of the target variable. For any given subject, a $1$ will be introduced for the particular dummy variable and $0$ for the rest.
137 |
138 | The target variable of the training and test sets can be one-hot-encoded using the `Keras` function `to_categorical()`. Note the use of addressing and the column specified as $11$, the target variable.
139 |
140 | ```{r}
141 | # Using similar indices to correspond to the training and test set
142 | y_train <- to_categorical(data.set[indx == 1, 11])
143 | y_test <- to_categorical(data.set[indx == 2, 11])
144 | ```
145 |
146 | The code below shows the first ten actual target values of the test set alongside the corresponding one-hot-encoded equivalents. It uses `cbind()` to bind the data (listed as arguments) as columns. The first column contains the actual first $10$ samples and columns two and three contain the encoded equivalents.
147 |
148 | ```{r}
149 | cbind(y_test_actual[1:10],
150 | y_test[1:10, ])
151 | ```
152 |
153 | ## Creating the model
154 |
155 | With the data prepared, the next step involves the design of the actual deep neural network. The code below saves the network as an object named `model`. Just as the `function()` keyword denotes that an object is a function rather than an ordinary object, `model` is specified to be a `keras_model_sequential()` object.
156 |
157 | `Keras` offers two ways of creating a network. The first, used here, is the sequential model, which adds one hidden layer after the next. There is also a functional `API` that allows for much finer control over the design of the network.
158 |
159 | Once the model has been instantiated (created), the layers can be added. There is more than one way to do this. In this example, the layers are specified by their type, `layer_dense()`, each containing all their specifications.
160 |
161 | Note the use of the pipe symbol, `%>%`. It passes what is on its left as the first argument to what is on its right (the next line in this case).
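
A minimal illustration of the pipe (unrelated to the network itself) is shown below; the two expressions are equivalent.

```{r}
# The value on the left becomes the first argument of the function on the right
round(sqrt(10), 2)
10 %>% sqrt() %>% round(2)
```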
162 |
163 | The first hidden layer is then a densely connected layer. Names can be specified (optional, with no illegal characters such as spaces). It contains `10` nodes and uses the `relu` activation function. In this first hidden layer, the shape of the input vector must be specified. This represents the number of feature variables. Since the forward propagation step involves the inner product of tensors, the dimensions specified must be correct. If not, the tensor multiplication cannot occur.
164 |
165 | The current layer is passed to the next hidden layer, again via the pipe symbol. This second hidden layer also contains `10` nodes and uses the `relu` activation function. The size of its input need not be specified (for the sake of the dimensionality required for the tensor multiplications), as it is inferred from the preceding layer.
166 |
167 | The last layer is the output layer. It contains two nodes since the target was one-hot-encoded. It specifies the `softmax` activation function. This function provides a probability to each of the output nodes ($0$ and $1$), such that the probabilities (of the two in this case) sum to one. Activation functions will be covered in more depth in a later chapter.
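
To show what `softmax` does, the sketch below applies the definition $\text{softmax} \left( z \right)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ to an arbitrary pair of output-node values; the numbers are purely illustrative.

```{r}
# Manual softmax over two arbitrary output-node values
z <- c(1.2, -0.4)
softmax_z <- exp(z) / sum(exp(z))
softmax_z        # two probabilities
sum(softmax_z)   # they sum to one
```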
168 |
169 | ```{r}
170 | # Creating the model
171 | model <- keras_model_sequential()
172 |
173 | model %>%
174 | layer_dense(name = "DeepLayer1",
175 | units = 10,
176 | activation = "relu",
177 | input_shape = c(10)) %>%
178 | layer_dense(name = "DeepLayer2",
179 | units = 10,
180 | activation = "relu") %>%
181 | layer_dense(name = "OutputLayer",
182 | units = 2,
183 | activation = "softmax")
184 |
185 | summary(model)
186 | ```
187 |
188 | The `summary()` function provides a summary of the model. There are three columns in the summary, the first giving the layer name (as optionally specified when the network was created) and its type. All of the layers are densely connected layers in this example. The _Output Shape_ column specifies the output shape (after tensor multiplication, bias addition, and activation, i.e. forward propagation). The _Param #_ column indicates the number of parameters (weights and biases) that the specific layer must _learn_. For the first hidden layer, the $10$ input nodes (feature variables) are connected to $10$ nodes, resulting in $10 \times 10 = 100$ weights plus the column vector of bias values, of which there are also $10$, giving $110$ parameters. The second hidden layer likewise has $10 \times 10 + 10 = 110$ parameters and the output layer has $10 \times 2 + 2 = 22$, for a total of $242$.
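
For completeness, the small chunk below reproduces this parameter arithmetic.

```{r}
# Parameters per layer: (inputs x nodes) weights plus one bias per node
layer1 <- 10 * 10 + 10   # first hidden layer: 110
layer2 <- 10 * 10 + 10   # second hidden layer: 110
output <- 10 * 2 + 2     # output layer: 22
layer1 + layer2 + output # total: 242
```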
189 |
190 | The model for this chapter is depicted below, showing all $242$ parameter (weight and bias) values that are to be optimized (minimizing the cost function) through backpropagation and gradient descent.
191 |
192 | 
193 |
194 | ## Compiling the model
195 |
196 | Before fitting the training data (passing the training data to the model), the model requires _compilation_. The loss function, optimizer, and metrics are specified during this step. In this example, categorical cross-entropy is used as the loss function (appropriate here since the target has been one-hot-encoded and a `softmax` output is used). A standard _ADAM_ optimizer is used for gradient descent and _accuracy_ is used as the metric.
197 |
198 | This loss function is different from the mean-squared-error used in preceding chapters. Gradient descent optimizers will be discussed in a following chapter.
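
As an illustration of the loss itself (with made-up numbers, since `Keras` performs this calculation internally), the chunk below computes the categorical cross-entropy, $- \sum_i y_i \log \left( \hat{y}_i \right)$, for a single sample.

```{r}
# Categorical cross-entropy for one illustrative sample
y_true <- c(0, 1)          # one-hot-encoded ground truth
y_hat  <- c(0.2, 0.8)      # predicted softmax probabilities
-sum(y_true * log(y_hat))  # the loss for this sample
```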
199 |
200 | ```{r}
201 | # Compiling the model
202 | model %>% compile(loss = "categorical_crossentropy",
203 | optimizer = "adam",
204 | metrics = c("accuracy"))
205 | ```
206 |
207 | ## Fitting the data
208 |
209 | The training set can now be fitted (passed) to the compiled model. In addition, a validation set is created during the training, set to comprise a fraction of $0.1$ of the training data. This represents another split in the data, similar to the initial train and test split. It allows the accuracy of the model to be determined as it trains. Discrepancies between the loss and accuracy of the training and validation sets give clues as to how to change the hyperparameters during the re-design phase and will be discussed in a following chapter.
210 |
211 | The fitted model is saved in a computer variable named `history`. Ten epochs are run, with a mini-batch size of $256$, yet more concepts for a following chapter.
212 |
213 | When using `Keras` in RStudio, two live plots are created in the Viewer tab. The top shows the loss values for the training and validation sets. The bottom plot shows the accuracy of the two sets.
214 |
215 | ```{r}
216 | history <- model %>%
217 | fit(x_train,
218 | y_train,
219 |       epochs = 10,
220 | batch_size = 256,
221 | validation_split = 0.1,
222 | verbose = 2)
223 | ```
224 |
225 | A simple plot can be created to show the loss and the accuracy over the epochs.
226 |
227 | ```{r}
228 | plot(history)
229 | ```
230 |
231 | ## Model evaluation
232 |
233 | The test _feature and target sets_ can be used to evaluate the model. The `evaluate()` function returns the overall loss and accuracy. It takes two arguments referencing the feature and target test sets.
234 |
235 | ```{r}
236 | model %>%
237 | evaluate(x_test,
238 | y_test)
239 | ```
240 |
241 | A confusion matrix can be constructed. A computer variable is created to store the predicted classes for the test set, `x_test`. The model passes this dataset through its learned parameters to predict an output, expressed as a probability for each of the two output nodes (and ultimately a choice between a predicted $0$ or $1$, depending on which has the higher probability).
242 |
243 | In the code below, a table is created using the initially saved ground-truth values, `y_test_actual`. The result is a confusion matrix showing how many times $0$ and $1$ were correctly and incorrectly predicted.
244 |
245 | ```{r}
246 | pred <- model %>%
247 | predict_classes(x_test)
248 |
249 | table(Predicted = pred,
250 | Actual = y_test_actual)
251 | ```
252 |
253 | The `predict_proba()` function returns the probability of each of the two classes for each of the test cases. The class with the highest probability is chosen as the predicted target class.
254 |
255 | ```{r}
256 | prob <- model %>%
257 | predict_proba(x_test)
258 | ```
259 |
260 | The code chunk below prints the first $5$ probabilities. Since there are only two classes in the sample space of the target, only the first probability (that of predicting a `0`) is shown. For the sake of simplicity, this value is subtracted from $1$ so as to give the probability of a `1` being predicted. A `0` is predicted when this probability is less than $0.5$ and a `1` is predicted when it is greater than or equal to $0.5$.
261 |
262 | ```{r}
263 | 1 - prob[1:5]
264 | ```
265 |
266 | Since all these values are greater than or equal to $0.5$, all of the first five predictions are for `1`.
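
The same class choice can be reproduced by hand. The sketch below applies the $0.5$ threshold to the first five probabilities and should agree with the output of `predict_classes()` stored in `pred`.

```{r}
# Manual thresholding of the probability of a 1 at 0.5
manual_pred <- ifelse(1 - prob[1:5] >= 0.5, 1, 0)
cbind(manual_pred, pred[1:5])
```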
267 |
268 | The predicted values and the ground-truth values can be printed together by combining them using `cbind()`, which binds data into columns. The first column shows the probability of a $1$, the second column shows the first $10$ predictions, and the last column shows the actual values (saved as an object at the start of this chapter).
269 |
270 | ```{r}
271 | cbind(1 - prob[1:10],
272 | pred[1:10],
273 | y_test_actual[1:10])
274 | ```
275 |
276 | Note that subjects `6,7,9,10` have probabilities for `1` of less than `0.5` and hence a `0` is predicted. All of the first $10$ subjects have correct predictions.
277 |
278 | ## Conclusion
279 |
280 | This chapter introduced very important concepts in machine learning. The first was the preparation of data, a required step before the data can be passed to a network and the network's accuracy tested. It is important to test with data that the network has never seen.
281 |
282 | The `Keras` package allows for the easy construction of a network, with simple, clear syntax.
283 |
--------------------------------------------------------------------------------