├── untitled.txt ├── LICENSE ├── tile_splitter ├── README.md ├── pytorch-cnn-problem ├── the-car-connection-image-scraper ├── Keras CNN Benchmark.py ├── generating-male-faces-with-aae ├── generating-male-faces-with-dcgan ├── the-car-connection-image-scraper.py ├── generating-male-faces-with-vae ├── Rapport Final.ipynb ├── Deep Convolutional GAN.ipynb ├── Pytorch CNN to Test on the Generated Samples.ipynb ├── Rapport_Final (1).ipynb └── Data Cleaning.ipynb /untitled.txt: -------------------------------------------------------------------------------- 1 | Given our non-significant results, we aimed to determine why the generated samples did not improve the classifier. Evidently, the numerous additional samples did not provide any new information; the information in the generated samples was already contained in the real samples. 2 | 3 | To test this hypothesis, we trained a classifier on 8,000 _real_ men and 8,000 _real_ women. As test data, we used 10,000 generated men and 10,000 generated women. The pictures of women were reused from the previous analyses, and 10,000 men were generated with the 5 adversarial networks mentioned previously, without altering hyperparameters. 4 | 5 | As expected, when trained on real samples, the classifier had an outstanding performance on the generated samples. Within two epochs, the accuracy was 100%. We regret to say that the VAEs and GANs did not magically yield more information than what had been fed to them. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Nicolas Gervais 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /tile_splitter: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from glob import glob 3 | import matplotlib.pyplot as plt 4 | import os 5 | os.chdir('c:/users/nicolas/documents/data/faces') 6 | from PIL import Image 7 | 8 | 9 | def split_pics(source_dir): 10 | """ 11 | Splits an array containing a 5x5 grid of 60x60 pictures. 12 | Additionally, it creates a subdirectory called 'split' inside the provided directory. 
13 | """ 14 | 15 | for photo in glob('%s/*.png' % source_dir): 16 | a = plt.imread(photo) 17 | b = np.array(a) 18 | c = np.vsplit(b, np.arange(1, b.shape[0], 62)) 19 | d = c[1:-1] 20 | 21 | pictures = [] 22 | 23 | for i in d: 24 | imgs = np.hsplit(i, np.arange(1, 312, 62)) 25 | imgs = imgs[1:-1] 26 | for i in imgs: 27 | pictures.append(i[1:-1, 1:-1]) 28 | 29 | letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 30 | 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 31 | 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 32 | 'y', 'z'] 33 | 34 | os.mkdir('%s/split' % sourcedir) 35 | 36 | for pic in pictures: 37 | filename = '{}/split/{}.png'.format(source_dir, ''.join(np.random.choice(letters, 15))) 38 | pic *= 255 39 | im = Image.fromarray(pic.astype(np.uint8)) 40 | im.save(filename) 41 | 42 | 43 | split_pics('aae') 44 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # data-augmentation-with-gan-and-vae :100: 2 | 3 | [Vincent Fortin](https://github.com/vincentfortin) and I are using the [UTK Faces dataset](http://aicip.eecs.utk.edu/wiki/UTKFace) to for the project in the [_Machine Learning I_](https://www.hec.ca/en/courses/detail/?cours=MATH80629A) project. 4 | 5 | Unbalanced classes is one of the most frequent struggle when dealing with real data. Is it better to down/upsample, or do nothing at all? Another approach is to generate samples resembling the smallest class. In this project, we are using Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs) to generate samples of the smallest class. Using human faces, we will determine if a convolutional neural network (CNN) will be trained better with generated samples, or without. 6 | 7 | ## PROGRESS 8 | 1. [First we trained a VAE](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Variational%20Auto%20Encoder.ipynb) to generate human faces 9 | 2. [Then we trained a ConvNet with Pytorch](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Pytorch%20ConvNet%20Distinguishing%20Men%20and%20Women.ipynb) but it didn't work. 10 | 3. So we tried with Keras to see if our architecture was the problem. It's not. [We reached 90% accuracy](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Keras%20CNN%20Benchmark.ipynb). 11 | 4. Here is the [Adversarial Auto Encoder](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Adversarial%20Auto%20Encoder.ipynb). The results are very clear. 12 | 5. Here is the [Wasserstein GAN](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Wasserstein%20GAN.ipynb). 13 | 6. The [Softmax GAN](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Softmax%20GAN.ipynb) worked out pretty well. 14 | 7. The [Deep Convolutional GAN](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Deep%20Convolutional%20GAN.ipynb) has worked but its performance is quite low. 15 | 8. Finally fixed the [Pytorch CNN](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Pytorch%20ConvNet%20Distinguishing%20Men%20and%20Women.ipynb), with 92% accuracy! 16 | 9. 
The CNN was able to classify generated samples, when trained on the original samples, with [100% accuracy](https://github.com/nicolas-gervais/data-augmentation-with-gan-and-vae/blob/master/Pytorch%20CNN%20to%20Test%20on%20the%20Generated%20Samples.ipynb). 17 | ## TO DO 18 | - [x] Train a Tensorflow convolutional neural network as classifier 19 | - [x] Create a GAN to generate human faces 20 | - [x] Explore other generative methods 21 | - [ ] Train CNNs to see if the accuracy is better with the generative methods 22 | - [x] Fix the Pytorch CNN 23 | - [ ] Use Keras and Pydot to plot the chosen architecture 24 | - [x] Use generated samples as test set to see if there is untapped information 25 | ## PROJECT PLAN 26 | 1. Create various sample generators 27 | 2. Establish a benchmark CNN classifier, trained with 10% of the female samples (smaller class) 28 | 3. Train classifiers on 10% of the female samples, and add generated samples. Finally, compare performance. 29 | - VAE 30 | - GAN 31 | - other 32 | 4. Compare performance, plot 33 | 5. Determine if the generated samples have information that is not contained in the original pictures 34 | ## Example of the Adversarial Auto Encoder Learning 35 | ![Alt Text](https://media.discordapp.net/attachments/552684049588682752/632967292946350080/sickgif.gif) 36 | 37 | This is the output (generated faces) of the adversarial autoencoder. 38 | -------------------------------------------------------------------------------- /pytorch-cnn-problem: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from PIL import Image 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | import torch.optim as optim 7 | from torch.utils.data import DataLoader 8 | from torch.autograd import Variable 9 | from keras.datasets import mnist 10 | 11 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 12 | 13 | 14 | def resize(pics): 15 | pictures = [] 16 | for image in pics: 17 | image = Image.fromarray(image).resize((dim, dim)) 18 | image = np.array(image) 19 | pictures.append(image) 20 | return np.array(pictures) 21 | 22 | 23 | dim = 60 24 | 25 | x_train, x_test = resize(x_train), resize(x_test) 26 | 27 | x_train = x_train.reshape(-1, 1, dim, dim).astype('float32') / 255 28 | x_test = x_test.reshape(-1, 1, dim, dim).astype('float32') / 255 29 | y_train, y_test = y_train.astype('float32'), y_test.astype('float32') 30 | 31 | if torch.cuda.is_available(): 32 | x_train = torch.from_numpy(x_train)[:10_000] 33 | x_test = torch.from_numpy(x_test)[:4_000] 34 | y_train = torch.from_numpy(y_train)[:10_000] 35 | y_test = torch.from_numpy(y_test)[:4_000] 36 | 37 | 38 | class ConvNet(nn.Module): 39 | 40 | def __init__(self): 41 | super().__init__() 42 | self.conv1 = nn.Conv2d(1, 32, 3) 43 | self.conv2 = nn.Conv2d(32, 64, 3) 44 | self.conv3 = nn.Conv2d(64, 128, 3) 45 | 46 | self.fc1 = nn.Linear(5*5*128, 1024) 47 | self.fc2 = nn.Linear(1024, 2048) 48 | self.fc3 = nn.Linear(2048, 1) 49 | 50 | def forward(self, x): 51 | x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2)) 52 | x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2)) 53 | x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2)) 54 | 55 | x = x.view(x.size(0), -1) 56 | x = F.relu(self.fc1(x)) 57 | x = F.relu(self.fc2(x)) 58 | x = F.dropout(x, 0.5) 59 | x = torch.sigmoid(self.fc3(x)) 60 | return x 61 | 62 | 63 | net = ConvNet() 64 | 65 | optimizer = optim.Adam(net.parameters(), lr=0.03) 66 | 67 | loss_function = nn.BCELoss() 68 | 69 | 70 | class FaceTrain: 71 | 72 | 
def __init__(self): 73 | self.len = x_train.shape[0] 74 | self.x_train = x_train 75 | self.y_train = y_train 76 | 77 | def __getitem__(self, index): 78 | return x_train[index], y_train[index].unsqueeze(0) 79 | 80 | def __len__(self): 81 | return self.len 82 | 83 | 84 | class FaceTest: 85 | 86 | def __init__(self): 87 | self.len = x_test.shape[0] 88 | self.x_test = x_test 89 | self.y_test = y_test 90 | 91 | def __getitem__(self, index): 92 | return x_test[index], y_test[index].unsqueeze(0) 93 | 94 | def __len__(self): 95 | return self.len 96 | 97 | 98 | train = FaceTrain() 99 | test = FaceTest() 100 | 101 | train_loader = DataLoader(dataset=train, batch_size=64, shuffle=True) 102 | test_loader = DataLoader(dataset=test, batch_size=64, shuffle=True) 103 | 104 | epochs = 10 105 | steps = 0 106 | train_losses, test_losses = [], [] 107 | for e in range(epochs): 108 | running_loss = 0 109 | for images, labels in train_loader: 110 | optimizer.zero_grad() 111 | log_ps = net(images) 112 | loss = loss_function(log_ps, labels) 113 | loss.backward() 114 | optimizer.step() 115 | running_loss += loss.item() 116 | else: 117 | test_loss = 0 118 | accuracy = 0 119 | 120 | with torch.no_grad(): 121 | for images, labels in test_loader: 122 | log_ps = net(images) 123 | test_loss += loss_function(log_ps, labels) 124 | ps = torch.exp(log_ps) 125 | top_p, top_class = ps.topk(1, dim=1) 126 | equals = top_class.type('torch.LongTensor') == labels.type('torch.LongTensor').view(*top_class.shape) 127 | accuracy += torch.mean(equals.type('torch.FloatTensor')) 128 | train_losses.append(running_loss/len(train_loader)) 129 | test_losses.append(test_loss/len(test_loader)) 130 | print("[Epoch: {}/{}] ".format(e+1, epochs), 131 | "[Training Loss: {:.3f}] ".format(running_loss/len(train_loader)), 132 | "[Test Loss: {:.3f}] ".format(test_loss/len(test_loader)), 133 | "[Test Accuracy: {:.3f}]".format(accuracy/len(test_loader))) 134 | 135 | -------------------------------------------------------------------------------- /the-car-connection-image-scraper: -------------------------------------------------------------------------------- 1 | from selenium import webdriver 2 | import bs4 as bs 3 | from urllib.request import Request, urlopen 4 | import pandas as pd 5 | import time 6 | import os 7 | 8 | # os.chdir('/data') 9 | 10 | website = 'https://www.thecarconnection.com' 11 | 12 | 13 | def fetch(page, addition=''): 14 | return bs.BeautifulSoup(urlopen(Request(page + addition, 15 | headers={'User-Agent': 'Opera/9.80 (X11; Linux i686; Ub'\ 16 | 'untu/14.10) Presto/2.12.388 Version/12.16'})).read(), 'lxml') 17 | 18 | 19 | def all_makes(): 20 | # Fetches all makes (acura, cadilac, etc) 21 | all_makes_list = [] 22 | for a in fetch(website, "/new-cars").find_all("a", {"class": "add-zip"}): 23 | all_makes_list.append(a['href']) 24 | print(all_makes_list[:10]) 25 | print("All makes fetched") 26 | return all_makes_list 27 | 28 | 29 | def make_menu(listed): 30 | # Fetches all makes + model ? 
(acura_mdx, audi_q3, etc) 31 | make_menu_list = [] 32 | for make in listed: # REMOVE REMOVE REMOVE REMOVE REMOVE REMOVE # 33 | for div in fetch(website, make).find_all("div", {"class": "name"}): 34 | make_menu_list.append(div.find_all("a")[0]['href']) 35 | print(make_menu_list[:10]) 36 | print("Make menu list fetched") 37 | return make_menu_list 38 | 39 | 40 | def model_menu(listed): 41 | # Add year to previous step 42 | model_menu_list = [] 43 | for make in listed: 44 | soup = fetch(website, make) 45 | for div in soup.find_all("a", {"class": "btn avail-now first-item"}): 46 | model_menu_list.append(div['href']) 47 | for div in soup.find_all("a", {"class": "btn 1"})[:8]: 48 | model_menu_list.append(div['href']) 49 | print(model_menu_list[:10]) 50 | print("Model menu list fetched") 51 | return model_menu_list 52 | 53 | 54 | def year_model_overview(listed): 55 | year_model_overview_list = [] 56 | for make in listed: 57 | for id in fetch(website, make).find_all("a", {"id": "ymm-nav-specs-btn"}): 58 | year_model_overview_list.append(id['href']) 59 | try: 60 | year_model_overview_list.remove("/specifications/buick_enclave_2019_fwd-4dr-preferred") 61 | except: 62 | pass 63 | print(year_model_overview_list[:10]) 64 | print("Year model overview list fetched") 65 | return year_model_overview_list 66 | 67 | 68 | def trims(listed): 69 | trim_list = [] 70 | for row in listed: 71 | div = fetch(website, row).find_all("div", {"class": "block-inner"})[-1] 72 | div_a = div.find_all("a") 73 | for i in range(len(div_a)): 74 | trim_list.append(div_a[-i]['href']) 75 | print(trim_list[:10]) 76 | print("Trims list fetched") 77 | return trim_list 78 | 79 | 80 | def timer(start, end, iters, iters_left): 81 | hours, rem = divmod(end-start, 3600) 82 | minutes, seconds = divmod(rem, 60) 83 | 84 | hours_per_iter, rem_per_iter = divmod((end-start)/(iters+1),3600) 85 | minutes_per_iter, seconds_per_iter = divmod(rem_per_iter,60) 86 | 87 | hours_left , rem_left = divmod(((end-start)/(iters+1))*iters_left,3600) 88 | minutes_left, seconds_left = divmod(rem_left,60) 89 | print(" Total elapsed: {:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds)) 90 | print(" Time per page: {:0>2}:{:0>2}:{:05.2f}".format(int(hours_per_iter),int(minutes_per_iter),seconds_per_iter)) 91 | print(" Time left : {:0>2}:{:0>2}:{:05.2f}".format(int(hours_left),int(minutes_left),seconds_left)) 92 | 93 | 94 | def specifications(website, trims, keep_all_images=True): 95 | ''' keep_all_images: True means we create 2 files, one for main (front/read) 96 | And one for all of the pictures. 
97 | ''' 98 | options = webdriver.FirefoxOptions() 99 | options.add_argument('-headless') 100 | driver = webdriver.Firefox(options=options) 101 | # driver = webdriver.Firefox() 102 | 103 | # Timer start 104 | start = time.time() 105 | 106 | if not os.path.isfile('data/pictures_all.csv'): 107 | # Table for all images 108 | specifications_table_all = pd.DataFrame() 109 | # Table for only front and rear images 110 | specifications_table_front_rear = pd.DataFrame() 111 | else: 112 | specifications_table_all = pd.read_csv('data/pictures_all.csv',header=None) 113 | specifications_table_front_rear = pd.read_csv('data/pictures_rear_front.csv',header=None) 114 | 115 | trims_left = len(trims.index) 116 | 117 | for inx, webpage in enumerate(trims.iloc[len(specifications_table_all.columns):, 0]): 118 | soup = fetch(website, webpage.replace('overview', 'specifications')) 119 | # Same splitting as above 120 | specifications_df_all = pd.DataFrame(columns=[soup.find_all("title")[0].text[:-15]]) 121 | specifications_df_front_rear = pd.DataFrame(columns=[soup.find_all("title")[0].text[:-15]]) 122 | for div in soup.find_all("div", {"class": "specs-set-item"})[:9]: 123 | row_name = div.find_all("span")[0].text 124 | row_value = div.find_all("span")[1].text 125 | specifications_df_all.loc[row_name] = row_value 126 | specifications_df_front_rear.loc[row_name] = row_value 127 | try: 128 | driver.get(website + webpage.replace('specifications', 'overview')) 129 | class_img = driver.find_elements_by_class_name('image') 130 | except: 131 | print(f'Problem with {website + webpage}') 132 | list_urls = [] 133 | for ii in class_img: 134 | list_urls.append(ii.get_attribute('data-image-huge')) 135 | 136 | # Keep a count of rear and front images to put them at start of index 137 | rear_front_img_count = 0 138 | for ix, img_url in enumerate(list_urls): 139 | specifications_df_all.loc['Picture_%i' % ix, :] = img_url 140 | if keep_all_images and 'pkg-rear-exterior-view' in img_url: 141 | specifications_df_front_rear.loc['Picture_%i' % rear_front_img_count, :] = img_url 142 | rear_front_img_count += 1 143 | 144 | # If no images, we don't add to the main df 145 | if len(class_img) > 0: 146 | specifications_table_all = pd.concat([specifications_table_all, specifications_df_all], axis=1, sort=False) 147 | specifications_table_front_rear = pd.concat([specifications_table_front_rear, specifications_df_front_rear], axis=1, sort=False) 148 | # Save content every 10 images 149 | if inx % 10 == 0: 150 | print("%d/%d completed."%(inx, trims_left)) 151 | specifications_table_all.to_csv('data/pictures_all.csv',header=None) 152 | specifications_table_front_rear.to_csv('data/pictures_rear_front.csv',header=None) 153 | timer(start,time.time(), inx, trims_left-inx) 154 | 155 | # At the end of loop 156 | specifications_table_all.to_csv('data/pictures_all.csv',header=None) 157 | specifications_table_front_rear.to_csv('data/pictures_rear_front.csv',header=None) 158 | 159 | 160 | if __name__ == '__main__': 161 | # If list of trims has not been fetched 162 | if not os.path.isfile('data/trims_octobre_2019.csv'): 163 | a = all_makes() 164 | b = make_menu(a) 165 | c = model_menu(b) 166 | d = year_model_overview(c) 167 | e = trims(d) 168 | f = pd.DataFrame(e).to_csv('data/trims_octobre_2019.csv', index=False, header=None) 169 | 170 | # Read list of trims 171 | g = pd.read_csv('data/trims_octobre_2019.csv',header=None) 172 | g.drop_duplicates(inplace=True) 173 | h = specifications(website, g) 174 | h.to_csv('data/pictures.csv') 175 | 176 | 
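# This version of the scraper only records the image URLs in data/pictures_all.csv
# and data/pictures_rear_front.csv; the download step itself only appears in the
# second scraper further down (saveImage in the-car-connection-image-scraper.py).
# The sketch below is a minimal, assumed bridge between the two: the CSV layout
# (one column per trim, with 'Picture_i' rows holding URLs), the output directory
# and the function name are illustrative assumptions, not part of the original code.
import os

import pandas as pd
import requests


def download_scraped_pictures(csv_path='data/pictures_all.csv', out_dir='data/all_images'):
    """Download every URL-looking cell of the scraped table into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    table = pd.read_csv(csv_path, index_col=0)
    for trim in table.columns:
        # Keep only the cells that hold an image URL, regardless of the row labels.
        urls = [v for v in table[trim].dropna() if str(v).startswith('http')]
        safe_name = str(trim).strip().replace(' ', '_').replace('/', '.')
        for i, url in enumerate(urls):
            target = os.path.join(out_dir, '%s_%d.jpg' % (safe_name, i))
            if os.path.isfile(target):
                continue  # already saved; lets an interrupted run resume
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
            except requests.RequestException as err:
                print('Problem with %s: %s' % (url, err))
                continue
            with open(target, 'wb') as handler:
                handler.write(response.content)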
-------------------------------------------------------------------------------- /Keras CNN Benchmark.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # # Keras Benchmark 5 | 6 | import numpy as np 7 | import matplotlib.pyplot as plt 8 | from glob import glob 9 | from PIL import Image 10 | from time import time 11 | from sklearn.model_selection import train_test_split 12 | from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D, Dropout, Flatten, Dense 13 | from keras.callbacks import EarlyStopping 14 | from keras.models import Sequential 15 | from keras.optimizers import Adam 16 | from keras.utils import to_categorical 17 | from keras.backend import epsilon 18 | from keras import backend 19 | from keras.metrics import AUC 20 | import os 21 | import csv 22 | import pandas as pd 23 | 24 | 25 | def unison_shuffled_copies(a, b): 26 | # Shuffles two lists keeping orders 27 | assert len(a) == len(b) 28 | p = np.random.permutation(len(a)) 29 | return a[p], b[p] 30 | 31 | def crop(img): 32 | if img.shape[0]'.format(sex[int(labels_train[rand])].capitalize())) 181 | yticks = plt.xticks([]) 182 | yticks = plt.yticks([]) 183 | 184 | plt.show() 185 | 186 | trainsize, testsize = x_train.shape[0], x_test.shape[0] 187 | print(f'The size of the training set is {trainsize:,} and the ' f'size of the test set is {testsize:,}.') 188 | 189 | # ##### Scaling, casting the arrays 190 | print('Scaling...', end='') 191 | image_size = x_train.shape[1] * x_train.shape[1] 192 | x_train = x_train.astype('float32') / 255 193 | x_test = x_test.astype('float32') / 255 194 | print('\rDone. ') 195 | 196 | model = Sequential([ 197 | Conv2D(16*4, (3, 3), input_shape=(60, 60, 1), activation='relu'), 198 | MaxPooling2D(), 199 | 200 | Conv2D(32*4, (3, 3), activation='relu'), 201 | MaxPooling2D(), 202 | 203 | Conv2D(64*4, (3, 3), activation='relu'), 204 | MaxPooling2D(), 205 | 206 | Conv2D(128*4, (3, 3), activation='relu'), 207 | MaxPooling2D(), 208 | 209 | Flatten(), 210 | 211 | Dense(1024, activation='relu'), 212 | Dense(2048, activation='relu'), 213 | Dense(2, activation='sigmoid') 214 | ]) 215 | 216 | # model.summary() 217 | 218 | model.compile(optimizer=Adam(lr=0.001), 219 | loss='binary_crossentropy', 220 | metrics=['accuracy', AUC(),f1_m]) 221 | 222 | e_s = EarlyStopping(monitor='val_loss', patience=10) 223 | 224 | hist = model.fit(x_train, y_train, 225 | epochs=nb_epochs, 226 | validation_data=[x_test, y_test], 227 | batch_size=32, 228 | callbacks=[e_s]) 229 | 230 | 231 | pd.DataFrame(hist.history).to_csv(model_name+'_history.csv') 232 | 233 | test_loss, test_acc, test_AUC, test_f1 = model.evaluate(x_test, y_test) 234 | 235 | print("-------------------") 236 | print(model_name) 237 | print(f'Test loss: {np.round(test_loss, 4)} — Test accuracy: {np.round(test_acc*100,2)}%') 238 | print(f'Test AUC: {np.round(test_AUC, 4)} — Test F1: {np.round(test_f1,4)}%') 239 | 240 | -------------------------------------------------------------------------------- /generating-male-faces-with-aae: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from glob import glob 4 | from PIL import Image 5 | from time import time 6 | import re 7 | import pandas as pd 8 | import os 9 | import argparse 10 | import math 11 | import itertools 12 | import torchvision.transforms as transforms 13 | from torchvision.utils import save_image 14 | from torch.utils.data import 
DataLoader 15 | from torchvision import datasets 16 | from torch.autograd import Variable 17 | import torch.nn as nn 18 | import torch.nn.functional as F 19 | import torch 20 | 21 | os.chdir('C:/Users/Nicolas/Documents/Data/Faces') 22 | 23 | files = glob('combined/*.jpg') 24 | 25 | faces = [i for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) == 2] 26 | y = [i[-34] for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) > 1] 27 | 28 | dim = 60 29 | 30 | start = time() 31 | x = list() 32 | num_to_load = len(faces) 33 | for ix, file in enumerate(faces[:num_to_load]): 34 | image = plt.imread(file, 'jpg') 35 | if image.shape[0] != image.shape[1]: 36 | prob += 1 37 | image = Image.fromarray(image).resize((dim, dim)).convert('L') 38 | image = np.array(image) 39 | x.append(image) 40 | 41 | x = np.array(x, dtype=np.float32).reshape(-1, 1, 60, 60) 42 | 43 | assert x.ndim == 4, 'The input is the wrong shape!' 44 | 45 | files, faces = None, None 46 | 47 | x = x.astype(np.float32) / 127.5 - 1 48 | y = np.array(y, dtype=np.float32) 49 | 50 | if torch.cuda.is_available(): 51 | x = torch.from_numpy(x) 52 | y = torch.from_numpy(y) 53 | print('Tensors successfully flushed to CUDA.') 54 | else: 55 | print('CUDA not available!') 56 | 57 | 58 | class Face: 59 | 60 | def __init__(self): 61 | self.len = x.shape[0] 62 | self.x = x 63 | self.y = y 64 | 65 | def __getitem__(self, index): 66 | return x[index], y[index].unsqueeze(0) 67 | 68 | def __len__(self): 69 | return self.len 70 | 71 | 72 | train = Face() 73 | 74 | parser = argparse.ArgumentParser() 75 | 76 | parser.add_argument("--n_epochs", type=int, default=100, help="number of epochs of training") 77 | parser.add_argument("--batch_size", type=int, default=32, help="size of the batches") 78 | parser.add_argument("--lr", type=float, default=0.005, help="adam: learning rate") 79 | parser.add_argument("--b1", type=float, default=0.3, help="adam: decay of first order momentum of gradient") 80 | parser.add_argument("--b2", type=float, default=0.999, help="adam: decay of first order momentum of gradient") 81 | parser.add_argument("--n_cpu", type=int, default=8, help="number of cpu threads to use during batch generation") 82 | parser.add_argument("--latent_dim", type=int, default=3, help="dimensionality of the latent code") 83 | parser.add_argument("--img_size", type=int, default=60, help="size of each image dimension") 84 | parser.add_argument("--channels", type=int, default=1, help="number of image channels") 85 | parser.add_argument("--sample_interval", type=int, default=50, help="interval between image sampling") 86 | opt, unknown = parser.parse_known_args() 87 | 88 | img_shape = (opt.channels, opt.img_size, opt.img_size) 89 | 90 | cuda = True if torch.cuda.is_available() else False 91 | 92 | 93 | def reparameterization(mu, logvar): 94 | std = torch.exp(logvar / 2) 95 | sampled_z = Variable(Tensor(np.random.normal(0, 1, (mu.size(0), opt.latent_dim)))) 96 | z = sampled_z * std + mu 97 | return z 98 | 99 | 100 | class Encoder(nn.Module): 101 | def __init__(self): 102 | super(Encoder, self).__init__() 103 | 104 | self.model = nn.Sequential( 105 | nn.Linear(int(np.prod(img_shape)), 512), 106 | nn.LeakyReLU(0.2, inplace=True), 107 | nn.Linear(512, 512), 108 | nn.BatchNorm1d(512), 109 | nn.LeakyReLU(0.2, inplace=True), 110 | ) 111 | 112 | self.mu = nn.Linear(512, opt.latent_dim) 113 | self.logvar = nn.Linear(512, opt.latent_dim) 114 | 115 | def forward(self, img): 116 | img_flat = img.view(img.shape[0], -1) 117 | x = 
self.model(img_flat) 118 | mu = self.mu(x) 119 | logvar = self.logvar(x) 120 | z = reparameterization(mu, logvar) 121 | return z 122 | 123 | 124 | class Decoder(nn.Module): 125 | def __init__(self): 126 | super(Decoder, self).__init__() 127 | 128 | self.model = nn.Sequential( 129 | nn.Linear(opt.latent_dim, 512), 130 | nn.LeakyReLU(0.2, inplace=True), 131 | nn.Linear(512, 512), 132 | nn.BatchNorm1d(512), 133 | nn.LeakyReLU(0.2, inplace=True), 134 | nn.Linear(512, int(np.prod(img_shape))), 135 | nn.Tanh(), 136 | ) 137 | 138 | def forward(self, z): 139 | img_flat = self.model(z) 140 | img = img_flat.view(img_flat.shape[0], *img_shape) 141 | return img 142 | 143 | 144 | class Discriminator(nn.Module): 145 | def __init__(self): 146 | super(Discriminator, self).__init__() 147 | 148 | self.model = nn.Sequential( 149 | nn.Linear(opt.latent_dim, 512), 150 | nn.LeakyReLU(0.2, inplace=True), 151 | nn.Linear(512, 256), 152 | nn.LeakyReLU(0.2, inplace=True), 153 | nn.Linear(256, 1), 154 | nn.Sigmoid(), 155 | ) 156 | 157 | def forward(self, z): 158 | validity = self.model(z) 159 | return validity 160 | 161 | 162 | adversarial_loss = torch.nn.BCELoss() 163 | pixelwise_loss = torch.nn.L1Loss() 164 | 165 | encoder = Encoder() 166 | decoder = Decoder() 167 | discriminator = Discriminator() 168 | 169 | decoder.load_state_dict(torch.load('aae_decoder_men')) 170 | encoder.load_state_dict(torch.load('aae_encoder_men')) 171 | discriminator.load_state_dict(torch.load('aae_discriminator_men')) 172 | 173 | if cuda: 174 | encoder.cuda() 175 | decoder.cuda() 176 | discriminator.cuda() 177 | adversarial_loss.cuda() 178 | pixelwise_loss.cuda() 179 | 180 | dataloader = torch.utils.data.DataLoader(train, batch_size=opt.batch_size, shuffle=True) 181 | 182 | optimizer_G = torch.optim.Adam( 183 | itertools.chain(encoder.parameters(), 184 | decoder.parameters()), lr=opt.lr, betas=(opt.b1, opt.b2)) 185 | 186 | optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2)) 187 | 188 | Tensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor 189 | 190 | 191 | def sample_image(n_row, batches_done, directory): 192 | """Saves a grid of generated digits""" 193 | # Sample noise 194 | z = Variable(Tensor(np.random.normal(0, 1, (n_row ** 2, opt.latent_dim)))) 195 | gen_imgs = decoder(z) 196 | save_image(gen_imgs.data, "%s/%d.png" % (directory, batches_done), nrow=n_row, normalize=True) 197 | 198 | 199 | if not os.path.isdir('generated_men_aae'): 200 | os.mkdir('generated_men_aae') 201 | 202 | for epoch in range(1, opt.n_epochs + 1): 203 | 204 | break # model already trained 205 | for i, (imgs, _) in enumerate(dataloader): 206 | 207 | # Adversarial ground truths 208 | valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False) 209 | fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False) 210 | 211 | # Configure input 212 | real_imgs = Variable(imgs.type(Tensor)) 213 | 214 | # ----------------- 215 | # Train Generator 216 | # ----------------- 217 | 218 | optimizer_G.zero_grad() 219 | 220 | encoded_imgs = encoder(real_imgs) 221 | decoded_imgs = decoder(encoded_imgs) 222 | 223 | # Loss measures generator's ability to fool the discriminator 224 | g_loss = 0.001 * adversarial_loss(discriminator(encoded_imgs), valid) + 0.999 * pixelwise_loss( 225 | decoded_imgs, real_imgs 226 | ) 227 | 228 | g_loss.backward() 229 | optimizer_G.step() 230 | 231 | # --------------------- 232 | # Train Discriminator 233 | # --------------------- 234 | 235 | optimizer_D.zero_grad() 236 | 237 
| # Sample noise as discriminator ground truth 238 | z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], opt.latent_dim)))) 239 | 240 | # Measure discriminator's ability to classify real from generated samples 241 | real_loss = adversarial_loss(discriminator(z), valid) 242 | fake_loss = adversarial_loss(discriminator(encoded_imgs.detach()), fake) 243 | d_loss = 0.5 * (real_loss + fake_loss) 244 | 245 | d_loss.backward() 246 | optimizer_D.step() 247 | 248 | batches_done = epoch * len(dataloader) + i 249 | 250 | if epoch >= 25 and epoch % 10 == 0: 251 | val = input("\nContinue training? [y/n]: ") 252 | print() 253 | if val in ('y', 'yes'): 254 | val = True 255 | pass 256 | elif val in ('n', 'no'): 257 | break 258 | else: 259 | pass 260 | 261 | if epoch > 10: 262 | if batches_done % opt.sample_interval == 0: 263 | sample_image(n_row=5, batches_done=batches_done, directory='generated_men_aae') 264 | 265 | if epoch % 5 == 0: 266 | print( 267 | "[Epoch %d/%d] [D loss: %f] [G loss: %f]" 268 | % (epoch, opt.n_epochs, d_loss.item(), g_loss.item()) 269 | ) 270 | 271 | torch.save(decoder.state_dict(), 'aae_decoder_men') 272 | torch.save(encoder.state_dict(), 'aae_encoder_men') 273 | torch.save(discriminator.state_dict(), 'aae_discriminator_men') 274 | 275 | images = 0 276 | stop = False 277 | for epoch in range(1, 4 + 1): 278 | for i, (imgs, _) in enumerate(dataloader): 279 | 280 | with torch.no_grad(): 281 | 282 | # Adversarial ground truths 283 | valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False) 284 | fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False) 285 | 286 | # Configure input 287 | real_imgs = Variable(imgs.type(Tensor)) 288 | 289 | batches_done = epoch * len(dataloader) + i 290 | sample_image(directory='generated_men_aae', n_row=5, batches_done=batches_done) 291 | images += 25 292 | 293 | if len(os.listdir(os.path.join(os.getcwd(), 'generated_men_aae'))) >= 1000: 294 | stop = True 295 | break 296 | 297 | if stop: 298 | break 299 | 300 | -------------------------------------------------------------------------------- /generating-male-faces-with-dcgan: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from glob import glob 4 | from PIL import Image 5 | from time import time 6 | import os 7 | import pandas as pd 8 | import argparse 9 | import math 10 | import re 11 | import itertools 12 | import torchvision.transforms as transforms 13 | from torchvision.utils import save_image 14 | from torch.utils.data import DataLoader 15 | from torchvision import datasets 16 | from torch.autograd import Variable 17 | import torch.nn as nn 18 | import torch.nn.functional as F 19 | import torch 20 | os.chdir('C:/Users/Nicolas/Documents/Data/Faces') 21 | 22 | 23 | def sorted_alphanumeric(data): 24 | convert = lambda text: int(text) if text.isdigit() else text.lower() 25 | alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 26 | return sorted(data, key=alphanum_key) 27 | 28 | files = sorted_alphanumeric(glob('combined/*.jpg')) 29 | 30 | 31 | faces = [i for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) == 2] 32 | y = [i[-34] for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) > 1] 33 | 34 | dim = 60 35 | 36 | 37 | def crop(img): 38 | if img.shape[0]= 10 and epoch % 5 == 0: 259 | val = input("\nContinue training? 
[y/n]: ") 260 | print() 261 | if val in ('y', 'yes'): 262 | val = True 263 | pass 264 | elif val in ('n', 'no'): 265 | break 266 | else: 267 | pass 268 | 269 | if batches_done % opt.sample_interval == 0: 270 | save_image(gen_imgs.data[:25], "generated_men_dcgan/%d.png" % batches_done, nrow=5, normalize=True) 271 | 272 | if epoch % 5 == 0: 273 | print( 274 | "[Epoch %d/%d] [D loss: %f] [G loss: %f]" 275 | % (epoch, opt.n_epochs, d_loss.item(), g_loss.item()) 276 | ) 277 | 278 | torch.save(generator.state_dict(), 'dcgan_generator_men') 279 | torch.save(discriminator.state_dict(), 'dcgan_discriminator_men') 280 | 281 | 282 | def sample_image(n_row, batches_done): 283 | z = Variable(Tensor(np.random.normal(0, 1, (n_row ** 2, opt.latent_dim)))) 284 | gen_imgs = generator(z) 285 | save_image(gen_imgs.data, "generated_men_dcgan/%d.png" % batches_done, nrow=n_row, normalize=True) 286 | 287 | 288 | images = 0 289 | stop =False 290 | for epoch in range(1, 2_50 + 1): 291 | for i, (imgs, _) in enumerate(dataloader, 1): 292 | 293 | with torch.no_grad(): 294 | 295 | # Adversarial ground truths 296 | valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False) 297 | fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False) 298 | 299 | # Configure input 300 | real_imgs = Variable(imgs.type(Tensor)) 301 | 302 | batches_done = epoch * len(dataloader) + i 303 | sample_image(n_row=5, batches_done=batches_done) 304 | images += 25 305 | 306 | if len(os.listdir(os.path.join(os.getcwd(), 'generated_men_dcgan'))) >= 1_000: 307 | print('\n25,000 images successfully generated.') 308 | stop = True 309 | break 310 | if stop: 311 | break 312 | 313 | if images % 5_000 == 0: 314 | print(f'Pictures created: {images:,}') 315 | 316 | -------------------------------------------------------------------------------- /the-car-connection-image-scraper.py: -------------------------------------------------------------------------------- 1 | from selenium import webdriver 2 | import bs4 as bs 3 | from urllib.request import Request, urlopen 4 | import pandas as pd 5 | import time 6 | import os 7 | import requests 8 | from IPython import embed 9 | 10 | # os.chdir('/data') 11 | 12 | website = 'https://www.thecarconnection.com' 13 | 14 | 15 | def fetch(page, addition=''): 16 | return bs.BeautifulSoup(urlopen(Request(page + addition, 17 | headers={'User-Agent': 'Opera/9.80 (X11; Linux i686; Ub'\ 18 | 'untu/14.10) Presto/2.12.388 Version/12.16'})).read(), 'lxml') 19 | 20 | def all_makes(): 21 | # Fetches all makes (acura, cadilac, etc) 22 | all_makes_list = [] 23 | for a in fetch(website, "/new-cars").find_all("a", {"class": "add-zip"}): 24 | all_makes_list.append(a['href']) 25 | print(all_makes_list[:10]) 26 | print("All makes fetched") 27 | return all_makes_list 28 | 29 | 30 | def make_menu(listed): 31 | # Fetches all makes + model ? 
(acura_mdx, audi_q3, etc) 32 | make_menu_list = [] 33 | for make in listed: # REMOVE REMOVE REMOVE REMOVE REMOVE REMOVE # 34 | for div in fetch(website, make).find_all("div", {"class": "name"}): 35 | make_menu_list.append(div.find_all("a")[0]['href']) 36 | print(make_menu_list[:10]) 37 | print("Make menu list fetched") 38 | return make_menu_list 39 | 40 | 41 | def model_menu(listed): 42 | # Add year to previous step 43 | model_menu_list = [] 44 | for make in listed: 45 | soup = fetch(website, make) 46 | for div in soup.find_all("a", {"class": "btn avail-now first-item"}): 47 | model_menu_list.append(div['href']) 48 | for div in soup.find_all("a", {"class": "btn 1"})[:8]: 49 | model_menu_list.append(div['href']) 50 | print(model_menu_list[:10]) 51 | print("Model menu list fetched") 52 | return model_menu_list 53 | 54 | 55 | def year_model_overview(listed): 56 | year_model_overview_list = [] 57 | for make in listed: # REMOVE REMOVE REMOVE REMOVE REMOVE REMOVE REMOVE REMOVE 58 | for id in fetch(website, make).find_all("a", {"id": "ymm-nav-specs-btn"}): 59 | year_model_overview_list.append(id['href']) 60 | try: 61 | year_model_overview_list.remove("/specifications/buick_enclave_2019_fwd-4dr-preferred") 62 | except: 63 | pass 64 | print(year_model_overview_list[:10]) 65 | print("Year model overview list fetched") 66 | return year_model_overview_list 67 | 68 | 69 | def trims(listed): 70 | trim_list = [] 71 | for row in listed: 72 | div = fetch(website, row).find_all("div", {"class": "block-inner"})[-1] 73 | div_a = div.find_all("a") 74 | for i in range(len(div_a)): 75 | trim_list.append(div_a[-i]['href']) 76 | print(trim_list[:10]) 77 | print("Trims list fetched") 78 | return trim_list 79 | 80 | 81 | def timer(start, end, iters, iters_left): 82 | hours, rem = divmod(end-start, 3600) 83 | minutes, seconds = divmod(rem, 60) 84 | 85 | hours_per_iter, rem_per_iter = divmod((end-start)/(iters+1),3600) 86 | minutes_per_iter, seconds_per_iter = divmod(rem_per_iter,60) 87 | 88 | hours_left , rem_left = divmod(((end-start)/(iters+1))*iters_left,3600) 89 | minutes_left, seconds_left = divmod(rem_left,60) 90 | print(" Total elapsed: {:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds)) 91 | print(" Time per page: {:0>2}:{:0>2}:{:05.2f}".format(int(hours_per_iter),int(minutes_per_iter),seconds_per_iter)) 92 | print(" Time left : {:0>2}:{:0>2}:{:05.2f}".format(int(hours_left),int(minutes_left),seconds_left)) 93 | 94 | 95 | def saveImage(imgUrl, imgName, group): 96 | imgData = requests.get(imgUrl).content 97 | with open('data/pictures/'+group+'/'+imgName+'.jpg','wb') as handler: 98 | handler.write(imgData) 99 | 100 | 101 | def create_file_name(row): 102 | ''' 103 | Takes all columns not named pictures and delimits it with -- 104 | Also replaces spaces in individual columns with _ 105 | ''' 106 | return row[0].strip().replace(' ','_').replace('/','.') 107 | 108 | 109 | def specifications(website, trims, keep_all_images=True): 110 | ''' keep_all_images: True means we create 2 files, one for main (front/read) 111 | And one for all of the pictures. 
112 | ''' 113 | options = webdriver.FirefoxOptions() 114 | options.add_argument('-headless') 115 | driver = webdriver.Firefox(options=options) 116 | # driver = webdriver.Firefox() 117 | 118 | # Timer start 119 | start = time.time() 120 | 121 | if not os.path.isfile('data/pictures_all.csv'): 122 | # Table for all images 123 | specifications_table_all = pd.DataFrame() 124 | # Table for only front and rear images 125 | specifications_table_front_rear = pd.DataFrame() 126 | else: 127 | specifications_table_all = pd.read_csv('data/pictures_all.csv',index_col=0) 128 | specifications_table_front_rear = pd.read_csv('data/pictures_rear_front.csv',index_col=0) 129 | 130 | trims_left = len(trims.index) 131 | if trims_left == 0: 132 | return 0 133 | for inx, webpage in enumerate(trims.iloc[:, 0]): 134 | soup = fetch(website, webpage.replace('overview', 'specifications')) 135 | # Same splitting as above 136 | specifications_df_all = pd.DataFrame(columns=[soup.find_all("title")[0].text[:-15]]) 137 | specifications_df_front_rear = pd.DataFrame(columns=[soup.find_all("title")[0].text[:-15]]) 138 | for div in soup.find_all("div", {"class": "specs-set-item"})[:9]: 139 | row_name = div.find_all("span")[0].text 140 | row_value = div.find_all("span")[1].text 141 | specifications_df_all.loc[row_name] = row_value 142 | specifications_df_front_rear.loc[row_name] = row_value 143 | 144 | try: 145 | driver.get(website + webpage.replace('overview', 'photos')) 146 | time.sleep(0.5) 147 | ext_btn = driver.find_element_by_class_name('view-mode.show-ext') 148 | if ext_btn.text == 'Exterior': 149 | ext_btn.click() 150 | time.sleep(0.5) 151 | class_img_ext = driver.find_elements_by_xpath("//div[@class='thumbs-wrapper']/div[starts-with(@class, 'thumbs-slide') and not(contains(@class, 'video'))]/img") 152 | list_urls = [x.get_attribute("src").replace('/tmb/','/sml/').replace('_t.gif','_s.jpg') for x in class_img_ext] 153 | except: 154 | list_urls = [] 155 | print(f'Problem with {website + webpage}') 156 | 157 | # Different layout for older images 158 | # if len(class_img) == 0: 159 | # try: 160 | # driver.get(website + webpage.replace('overview', 'photos')) 161 | # class_img = driver.find_elements_by_class_name('image') 162 | # list_urls = [] 163 | # for ii in class_img: 164 | # list_urls.append(ii.get_attribute('data-image-small')) 165 | # except: 166 | # print(f'Problem with {website + webpage}') 167 | 168 | 169 | 170 | # Keep a count of rear and front images to put them at start of index 171 | rear_front_img_count = 0 172 | for ix, img_url in enumerate(list_urls): # REMOVE REMOVE REMOVE 173 | specifications_df_all.loc['Picture_%i' % ix, :] = img_url 174 | if keep_all_images and 'pkg-rear-exterior-view' in img_url: 175 | specifications_df_front_rear.loc['Picture_%i' % rear_front_img_count, :] = img_url 176 | rear_front_img_count += 1 177 | 178 | # If no images, we don't add to the main df 179 | if len(list_urls) > 0: 180 | specifications_table_all = pd.concat([specifications_table_all, specifications_df_all], axis=1, sort=False) 181 | if rear_front_img_count > 0: 182 | specifications_table_front_rear = pd.concat([specifications_table_front_rear, specifications_df_front_rear], axis=1, sort=False) 183 | else: 184 | print(website + webpage.replace('specifications', 'overview')) 185 | 186 | # Save content every 10 images 187 | if inx % 10 == 0: 188 | print("%d/%d completed."%(inx, trims_left)) 189 | specifications_table_all.to_csv('data/pictures_all.csv') 190 | 
specifications_table_front_rear.to_csv('data/pictures_rear_front.csv') 191 | trims.iloc[inx:].to_csv('data/trims_octobre_2019.csv', header=None) 192 | timer(start,time.time(), inx, trims_left-inx) 193 | 194 | 195 | # At the end of loop 196 | specifications_table_all.to_csv('data/pictures_all.csv') 197 | specifications_table_front_rear.to_csv('data/pictures_rear_front.csv') 198 | specifications_table_all.to_csv('data/img_left_octobre_2019.csv') 199 | specifications_table_front_rear.to_csv('data/img_left_frontrear_octobre_2019.csv') 200 | 201 | fetch_urls = False 202 | download_images = True 203 | 204 | if __name__ == '__main__': 205 | if fetch_urls: 206 | # If list of trims has not been fetched 207 | if not os.path.isfile('data/trims_octobre_2019.csv'): 208 | a = all_makes() 209 | b = make_menu(a) 210 | c = model_menu(b) 211 | d = year_model_overview(c) 212 | e = trims(d) 213 | f = pd.DataFrame(e).to_csv('data/trims_octobre_2019.csv', header=None) 214 | # Previous one will be modified 215 | f = pd.DataFrame(e).to_csv('data/trims_octobre_2019_keep.csv', header=None) 216 | 217 | # Read list of trims 218 | g = pd.read_csv('data/trims_octobre_2019.csv',index_col=0, header=None) 219 | g.drop_duplicates(inplace=True) 220 | h = specifications(website, g) 221 | 222 | if download_images: 223 | i_all = pd.read_csv('data/img_left_octobre_2019.csv',index_col=0) 224 | i_front_rear = pd.read_csv('data/img_left_frontrear_octobre_2019.csv',index_col=0) 225 | 226 | if 'imgName' not in i_all.columns: 227 | i_all = i_all.transpose().reset_index() 228 | i_all['imgName'] = i_all.apply(create_file_name, axis=1) 229 | if 'imgName' not in i_front_rear.columns: 230 | i_front_rear = i_front_rear.transpose().reset_index() 231 | i_front_rear['imgName'] = i_front_rear.apply(create_file_name, axis=1) 232 | 233 | start = time.time() 234 | 235 | for ind, row in i_all.iterrows(): 236 | if ind % 10 == 0: 237 | timer(start, time.time(), ind, len(i_all.index)) 238 | print('%i/%i image pages for all angles completed.' %(ind,len(i_all.index))) 239 | i_all.iloc[ind:].to_csv('data/img_left_octobre_2019.csv') 240 | img_urls = [x for inx, x in row.iteritems() if 'Picture_' in inx and str(x) != 'nan'] 241 | pic_name = row['imgName'] 242 | for ix, url in enumerate(img_urls): 243 | saveImage(url, pic_name+'_'+str(ix), 'all_images') 244 | 245 | start = time.time() 246 | for ind, row_front in i_front_rear.iterrows(): 247 | if ind % 10 == 0: 248 | timer(start, time.time(), ind, len(i_front_rear.index)) 249 | print('%i/%i image pages for front/rear completed.' 
%(ind,len(i_front_rear.index))) 250 | i_front_rear.iloc[ind:].to_csv('data/img_left_frontrear_octobre_2019.csv') 251 | img_urls = [x for inx, x in row_front.iteritems() if 'Picture_' in inx and str(x) != 'nan'] 252 | pic_name = row_front['imgName'] 253 | for ix, url in enumerate(img_urls): 254 | saveImage(url, pic_name+'_'+str(ix), 'front_rear') 255 | 256 | 257 | -------------------------------------------------------------------------------- /generating-male-faces-with-vae: -------------------------------------------------------------------------------- 1 | 2 | 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | import numpy as np 6 | import pandas as pd 7 | import re 8 | import matplotlib.pyplot as plt 9 | # fashion_mnist = keras.datasets.fashion_mnist 10 | from matplotlib.markers import MarkerStyle 11 | from keras import backend as K 12 | from keras.optimizers import Adam 13 | from keras.datasets import mnist 14 | from keras.layers import Lambda, Input, Dense 15 | from keras.losses import binary_crossentropy 16 | from keras.models import Model 17 | from keras.callbacks import EarlyStopping, ModelCheckpoint 18 | from glob import glob 19 | from PIL import Image 20 | from time import time 21 | from sklearn.model_selection import train_test_split 22 | import os 23 | import imageio 24 | from IPython.display import Image as Img 25 | os.chdir('c:/users/nicolas/documents/data/faces') 26 | 27 | 28 | # ##### Function to sort images 29 | 30 | # In[2]: 31 | 32 | 33 | def sorted_alphanumeric(data): 34 | convert = lambda text: int(text) if text.isdigit() else text.lower() 35 | alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 36 | return sorted(data, key=alphanum_key) 37 | 38 | 39 | # ##### Loading all file names 40 | 41 | # In[3]: 42 | 43 | 44 | files = sorted_alphanumeric(glob(r'C:\Users\Nicolas\Documents\Data\faces\combined/*.jpg')) 45 | 46 | 47 | 48 | np.unique([i[-34] for i in files], return_counts=True) 49 | 50 | 51 | # ##### Keeping only men/women (not both) 52 | 53 | # In[6]: 54 | 55 | 56 | faces = [i for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) == 2 ] # or in ('0', ''1'') 57 | 58 | 59 | # In[7]: 60 | 61 | 62 | y = [i[-34] for i in files if (i[-34] == '0') and len(i[-37:-35].strip('\\').strip('d')) > 1 ] 63 | 64 | 65 | assert len(y) == len(faces), 'The X and Y are not of the same length!' 66 | 67 | 68 | dim = 60 69 | 70 | 71 | def crop(img): 72 | if img.shape[0]
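# The body of crop() is cut off at this point in the listing, and the same helper
# is truncated in the Keras benchmark and DCGAN files above. Judging from its name
# and from the 60x60 resize used in those loading loops, it presumably trims the
# image to a centre square before resizing; the reconstruction below is an
# assumption, not the original code.
def crop(img):
    height, width = img.shape[0], img.shape[1]
    side = min(height, width)  # side length of the largest centred square
    top = (height - side) // 2
    left = (width - side) // 2
    return img[top:top + side, left:left + side]
# Example: a 60x80 array becomes 60x60, so the subsequent resize((dim, dim)) call
# no longer distorts the aspect ratio.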