├── .gitattributes ├── .gitignore ├── Acknowledgements.tex ├── AlexNet.pdf ├── Bottleneck.pdf ├── Bottleneck_BN.pdf ├── Bottleneck_BN_2.pdf ├── Bottleneck_BN_backprop.pdf ├── Bottleneck_BN_backprop_2.pdf ├── CNN_MM_pixels.pdf ├── CNN_MM_unpixels.pdf ├── Conclusion.tex ├── Conv_equiv.pdf ├── DEEP_LEARNING.bib ├── ELU.pdf ├── GoogleNet.pdf ├── Inception.pdf ├── Introduction.tex ├── LSTM_structure-peephole.pdf ├── LSTM_structure-tot.pdf ├── LSTM_structure.pdf ├── LeNet.pdf ├── Mediamobile.png ├── Preface.tex ├── README.md ├── RNN_structure-tot.pdf ├── RNN_structure.pdf ├── ReLU.pdf ├── ResNet.pdf ├── S_FNN.pdf ├── ThesisStyle.cls ├── VGG-conv.pdf ├── VGG-fc.pdf ├── VGG-pool-fc.pdf ├── VGG-pool.pdf ├── VGG.pdf ├── White_book-blx.bib ├── White_book.bcf ├── White_book.dvi ├── White_book.pdf ├── White_book.tex ├── chapter1.tex ├── chapter2.tex ├── chapter3.tex ├── conv_2d-crop.pdf ├── conv_4d-crop.pdf ├── cover_page-crop.pdf ├── fc_equiv.pdf ├── fc_resnet.pdf ├── fc_resnet_2.pdf ├── fc_resnet_3.pdf ├── formatAndDefs.tex ├── fully_connected.pdf ├── input_layer.pdf ├── lReLU.pdf ├── mathpazo.sty ├── output_layer.pdf ├── padding.pdf ├── pagedecouv.sty ├── pool_4d-crop.pdf ├── sigmoid.pdf ├── softmax.pdf ├── tanh.pdf └── tanh2.pdf /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.aux 2 | *.bbl 3 | *.blg 4 | *.log 5 | *.maf 6 | *.mtc* 7 | *.run.xml 8 | *.synctex.gz 9 | *.toc 10 | *.tox 11 | *.bib.bak 12 | *.out 13 | -------------------------------------------------------------------------------- /Acknowledgements.tex: -------------------------------------------------------------------------------- 1 | \chapter{Acknowledgements} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}T}his work has no benefit nor added value to the deep learning topic on its own. It is just the reformulation of ideas of brighter researchers to fit a peculiar mindset: the one of preferring formulas with ten indices but where one knows precisely what one is manipulating rather than (in my opinion sometimes opaque) matrix formulations where the dimensions of the objects are rarely if ever specified. 4 | 5 | \vspace{0.2cm} 6 | 7 | Among the brighter people from whom I learned online is Andrew Ng. His Coursera class (\href{https://www.coursera.org/learn/machine-learning}{here}) was the first contact I had with Neural Networks, and this pedagogical introduction allowed me to build on solid ground. 8 | 9 | \vspace{0.2cm} 10 | 11 | I also wish to particularly thank Hugo Larochelle, who not only built a wonderful deep learning class (\href{http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html}{here}), but was also kind enough to answer emails from a complete beginner and stranger! 12 | 13 | \vspace{0.2cm} 14 | 15 | The Stanford class on convolutional networks (\href{http://cs231n.github.io/convolutional-networks/}{here}) proved extremely valuable to me, as did the one on Natural Language Processing (\href{http://web.stanford.edu/class/cs224n/}{here}). 16 | 17 | \vspace{0.2cm} 18 | 19 | I also benefited greatly from Sebastian Ruder's blog (\href{http://ruder.io/#open}{here}), both from the blog pages on gradient descent optimization techniques and from the author himself.
20 | 21 | \vspace{0.2cm} 22 | 23 | I learned more about LSTM on colah's blog (\href{http://colah.github.io/posts/2015-08-Understanding-LSTMs/}{here}), and some of my drawings are inspired from there. 24 | 25 | \vspace{0.2cm} 26 | 27 | I also thank Jonathan Del Hoyo for the great articles that he regularly shares on LinkedIn. 28 | 29 | \vspace{0.2cm} 30 | 31 | Many thanks go to my collaborators at Mediamobile, who let me dig as deep as I wanted on Neural Networks. I am especially indebted to Clément, Nicolas, Jessica, Christine and Céline. 32 | 33 | \vspace{0.2cm} 34 | 35 | Thanks to Jean-Michel Loubes and Fabrice Gamboa, from whom I learned a great deal on probability theory and statistics. 36 | 37 | \vspace{0.2cm} 38 | 39 | I end this list with my employer, Mediamobile, which has been kind enough to let me work on this topic with complete freedom. A special thanks to Philippe, who supervised me with the perfect balance of feedback and freedom! 40 | 41 | -------------------------------------------------------------------------------- /AlexNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/AlexNet.pdf -------------------------------------------------------------------------------- /Bottleneck.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck.pdf -------------------------------------------------------------------------------- /Bottleneck_BN.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_2.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_backprop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_backprop.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_backprop_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_backprop_2.pdf -------------------------------------------------------------------------------- /CNN_MM_pixels.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/CNN_MM_pixels.pdf -------------------------------------------------------------------------------- /CNN_MM_unpixels.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/CNN_MM_unpixels.pdf -------------------------------------------------------------------------------- /Conclusion.tex: 
-------------------------------------------------------------------------------- 1 | \chapter{Conclusion} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}W}e have come to the end of our journey. I hope this note lived up to its promises, and that the reader now understands better how a neural network is designed and how it works under the hood. To wrap it up, we have seen the architecture of the three most common neural networks, as well as the careful mathematical derivation of their training formulas. 4 | 5 | \vspace{0.2cm} 6 | 7 | Deep Learning is a fast-evolving field, and this material might be out of date in the near future, but the index approach adopted will still allow the reader -- as it has helped the writer -- to work out for herself what is behind the next state-of-the-art architectures. 8 | 9 | \vspace{0.2cm} 10 | 11 | Until then, one should have enough material to code from scratch one's own FNN, CNN and RNN-LSTM, as the author did as an empirical proof of his formulas. 12 | -------------------------------------------------------------------------------- /Conv_equiv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Conv_equiv.pdf -------------------------------------------------------------------------------- /DEEP_LEARNING.bib: -------------------------------------------------------------------------------- 1 | % Encoding: UTF-8 2 | 3 | @Article{Rosenblatt58theperceptron:, 4 | author = {F. Rosenblatt}, 5 | title = {The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain}, 6 | journal = {Psychological Review}, 7 | year = {1958}, 8 | pages = {65--386}, 9 | } 10 | 11 | @Article{Hahnloser:2003:PFS:762330.762336, 12 | author = {Hahnloser, Richard H. R. and Seung, H. Sebastian and Slotine, Jean-Jacques}, 13 | title = {Permitted and Forbidden Sets in Symmetric Threshold-linear Networks}, 14 | journal = {Neural Comput.}, 15 | year = {2003}, 16 | volume = {15}, 17 | number = {3}, 18 | pages = {621--638}, 19 | month = mar, 20 | issn = {0899-7667}, 21 | acmid = {762336}, 22 | address = {Cambridge, MA, USA}, 23 | doi = {10.1162/089976603321192103}, 24 | issue_date = {March 2003}, 25 | numpages = {18}, 26 | publisher = {MIT Press}, 27 | url = {http://dx.doi.org/10.1162/089976603321192103}, 28 | } 29 | 30 | @Article{DBLP:journals/corr/DielemanWD15, 31 | author = {Sander Dieleman and Kyle W.
Willett and Joni Dambre}, 32 | title = {Rotation-invariant convolutional neural networks for galaxy morphology prediction}, 33 | journal = {CoRR}, 34 | year = {2015}, 35 | volume = {abs/1503.07077}, 36 | bibsource = {dblp computer science bibliography, http://dblp.org}, 37 | biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/DielemanWD15}, 38 | timestamp = {Wed, 07 Jun 2017 14:41:33 +0200}, 39 | url = {http://arxiv.org/abs/1503.07077}, 40 | } 41 | 42 | @InProceedings{43022, 43 | author = {Christian Szegedy and Wei Liu and Yangqing Jia and Pierre Sermanet and Scott Reed and Dragomir Anguelov and Dumitru Erhan and Vincent Vanhoucke and Andrew Rabinovich}, 44 | title = {Going Deeper with Convolutions}, 45 | booktitle = {Computer Vision and Pattern Recognition (CVPR)}, 46 | year = {2015}, 47 | url = {http://arxiv.org/abs/1409.4842}, 48 | } 49 | 50 | @Article{He2015, 51 | author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian}, 52 | title = {Deep Residual Learning for Image Recognition}, 53 | year = {2015}, 54 | volume = {7}, 55 | month = {12}, 56 | } 57 | 58 | @Article{Hochreiter:1997:LSM:1246443.1246450, 59 | author = {Hochreiter, Sepp and Schmidhuber, J\"{u}rgen}, 60 | title = {Long Short-Term Memory}, 61 | journal = {Neural Comput.}, 62 | year = {1997}, 63 | volume = {9}, 64 | number = {8}, 65 | pages = {1735--1780}, 66 | month = nov, 67 | issn = {0899-7667}, 68 | acmid = {1246450}, 69 | address = {Cambridge, MA, USA}, 70 | doi = {10.1162/neco.1997.9.8.1735}, 71 | issue_date = {November 15, 1997}, 72 | numpages = {46}, 73 | publisher = {MIT Press}, 74 | url = {http://dx.doi.org/10.1162/neco.1997.9.8.1735}, 75 | } 76 | 77 | @Article{Kingma2014, 78 | author = {Kingma, Diederik and Ba, Jimmy}, 79 | title = {Adam: A Method for Stochastic Optimization}, 80 | year = {2014}, 81 | month = {12}, 82 | } 83 | 84 | @Article{Ioffe2015, 85 | author = {Ioffe, Sergey and Szegedy, Christian}, 86 | title = {Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift}, 87 | year = {2015}, 88 | month = {02}, 89 | } 90 | 91 | @Article{Hahnloser2000, 92 | author = {Hahnloser, Richard and Sarpeshkar, R. and Mahowald, Misha and Douglas, Rodney J. 
and Seung, S}, 93 | title = {Digital selection and analog amplification co-exist in an electronic circuit inspired by neocortex}, 94 | journal = {Nature}, 95 | year = {2000}, 96 | volume = {405}, 97 | pages = {947-951,}, 98 | month = {06}, 99 | doi = {10.1038/35016072}, 100 | url = {http://dx.doi.org/10.1038/35016072}, 101 | } 102 | 103 | @InProceedings{Deng:2016:LSM:2939672.2939860, 104 | author = {Deng, Dingxiong and Shahabi, Cyrus and Demiryurek, Ugur and Zhu, Linhong and Yu, Rose and Liu, Yan}, 105 | title = {Latent Space Model for Road Networks to Predict Time-Varying Traffic}, 106 | booktitle = {Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, 107 | year = {2016}, 108 | series = {KDD '16}, 109 | pages = {1525--1534}, 110 | address = {New York, NY, USA}, 111 | publisher = {ACM}, 112 | acmid = {2939860}, 113 | doi = {10.1145/2939672.2939860}, 114 | isbn = {978-1-4503-4232-2}, 115 | keywords = {latent space model, real-time traffic forecasting, road network}, 116 | location = {San Francisco, California, USA}, 117 | numpages = {10}, 118 | url = {http://doi.acm.org/10.1145/2939672.2939860}, 119 | } 120 | 121 | @Article{SunHongyu, 122 | author = {Sun, Hongyu; Liu, Henry X.; Xiao, Heng; \& Ran, Bin}, 123 | title = {Short Term Traffic Forecasting Using the Local Linear Regression Model}, 124 | } 125 | 126 | @Article{MaDaiHe:2017, 127 | author = {Ma Xiaolei and Dai Zhuang and He Zhengbing and Ma Jihui and Wang Yong and Wang Yunpeng}, 128 | title = {Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction}, 129 | year = {2017}, 130 | doi = {10.3390/s17040818}, 131 | } 132 | 133 | @Article{Fouladgar2017ScalableDT, 134 | author = {Mohammadhani Fouladgar and Mostafa Parchami and Ramez Elmasri and Amir Ghaderi}, 135 | title = {Scalable deep traffic flow neural networks for urban traffic congestion prediction}, 136 | journal = {2017 International Joint Conference on Neural Networks (IJCNN)}, 137 | year = {2017}, 138 | pages = {2251-2258}, 139 | } 140 | 141 | @InProceedings{MiwaTYM2004, 142 | author = {T. Miwa and Y. Tawada and T. Yamamoto and T. Morikawa}, 143 | title = {En-Route Updating Methodology of Travel Time Prediction Using Accumulated Probe-Car Data}, 144 | year = {2004}, 145 | publisher = {Proc. of the 11th ITS World Congress}, 146 | } 147 | 148 | @Article{iet:/content/conferences/10.1049/cp_20000103, 149 | author = {S. Turksma}, 150 | title = {The various uses of floating car data}, 151 | journal = {IET Conference Proceedings}, 152 | year = {2000}, 153 | pages = {51-55(4)}, 154 | month = {January}, 155 | abstract = {To a large extent, traffic control and traffic information services depend on accurate information about the situation on the road network. Often rough estimates of queue lengths are no longer sufficient to give a reasonable prediction of travel times. Traditionally the situation on the road network is derived from local measurements (e.g. induction loops). It is difficult to obtain travel time estimates from local speed and flow data. It is especially difficult in urban areas. When the positions of a sufficient number of vehicles can be frequently communicated to a central site, travel times can be directly measured. This is called floating car data (FCD). This paper gives the major results of the Prelude trial and the ways in which FCD can be used. 
The Prelude FCD-trial in the Netherlands had as its primary aim to investigate the technical feasibility of using FCD to accurately measure travel times in a mixed urban and motorway network. FCD has a broad range of applications. The applications range from real time data for traffic management to the compilation of very accurate origin-destination matrices complete with travel times.}, 156 | affiliation = {Peek Traffic BV}, 157 | keywords = {traffic information services;floating car data;the Netherlands;traffic control;mixed urban/motorway network;real-time data;origin-destination matrices;OD matrices;Prelude trial;traffic information systems;traffic IS;}, 158 | language = {English}, 159 | publisher = {Institution of Engineering and Technology}, 160 | url = {http://digital-library.theiet.org/content/conferences/10.1049/cp_20000103}, 161 | } 162 | 163 | @InProceedings{Yoon:2007:SST:1247660.1247686, 164 | author = {Yoon, Jungkeun and Noble, Brian and Liu, Mingyan}, 165 | title = {Surface Street Traffic Estimation}, 166 | booktitle = {Proceedings of the 5th International Conference on Mobile Systems, Applications and Services}, 167 | year = {2007}, 168 | series = {MobiSys '07}, 169 | pages = {220--232}, 170 | address = {New York, NY, USA}, 171 | publisher = {ACM}, 172 | acmid = {1247686}, 173 | doi = {10.1145/1247660.1247686}, 174 | isbn = {978-1-59593-614-1}, 175 | keywords = {GPS, estimation, traffic}, 176 | location = {San Juan, Puerto Rico}, 177 | numpages = {13}, 178 | url = {http://doi.acm.org/10.1145/1247660.1247686}, 179 | } 180 | 181 | @Article{Fazio2014, 182 | author = {Fazio, Joseph and Wiesner, Brady N. and Deardoff, Matthew D.}, 183 | title = {Estimation of free-flow speed}, 184 | journal = {KSCE Journal of Civil Engineering}, 185 | year = {2014}, 186 | volume = {18}, 187 | number = {2}, 188 | pages = {646--650}, 189 | month = {Mar}, 190 | issn = {1976-3808}, 191 | abstract = {In 2010 Highway Capacity Manual, one preferably determines free-flow speed by deriving it from a speed study involving the existing facility or on a comparable facility if the facility is in the planning stage. Many have used a `rule of thumb' by adding 10 km/h (5 mi/h) above the posted limit to obtain free-flow speed without justification. Two team members using a radar gun and manual tally sheets collected 1668 speed observations at ten sites during several weeks. Each site had a unique posted speed limit sign ranging from 30 km/h (20 mi/h) to 120 km/h (75 mi/h). Five sites were on urban streets. Three sites were on multilane highways, and two on freeways. Goodness-of-fit test results revealed that a Gaussian distribution generally fit the speed distributions at each site at a 5{\%} level of significance. The best-fit model had a correlation coefficient of +0.99. The posted speed limit variable was significant at 5{\%} level of significance. 
Examining data by highway type revealed that average free-flow speeds are strongly associated with posted speed limits with correlation coefficients of +0.99, +1.00, and +1.00 for urban streets, multilane highways, and freeways, respectively.}, 192 | day = {01}, 193 | doi = {10.1007/s12205-014-0481-7}, 194 | url = {https://doi.org/10.1007/s12205-014-0481-7}, 195 | } 196 | 197 | @Article{Gu2015RecentAI, 198 | author = {Jiuxiang Gu and Zhenhua Wang and Jason Kuen and Lianyang Ma and Amir Shahroudy and Bing Shuai and Ting Liu and Xingxing Wang and Gang Wang}, 199 | title = {Recent Advances in Convolutional Neural Networks}, 200 | journal = {CoRR}, 201 | year = {2015}, 202 | volume = {abs/1512.07108}, 203 | } 204 | 205 | @InProceedings{lrcn2014, 206 | author = {Jeff Donahue and Lisa Anne Hendricks and Sergio Guadarrama and Marcus Rohrbach and Subhashini Venugopalan and Kate Saenko and Trevor Darrell}, 207 | title = {Long-term Recurrent Convolutional Networks for Visual Recognition and Description}, 208 | booktitle = {CVPR}, 209 | year = {2015}, 210 | } 211 | 212 | @Article{DBLP:journals/corr/Ioffe17, 213 | author = {Sergey Ioffe}, 214 | title = {Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models}, 215 | journal = {CoRR}, 216 | year = {2017}, 217 | volume = {abs/1702.03275}, 218 | bibsource = {dblp computer science bibliography, http://dblp.org}, 219 | biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/Ioffe17}, 220 | timestamp = {Wed, 07 Jun 2017 14:42:44 +0200}, 221 | url = {http://arxiv.org/abs/1702.03275}, 222 | } 223 | 224 | @InCollection{LeCun:1998:CNI:303568.303704, 225 | author = {LeCun, Yann and Bengio, Yoshua}, 226 | title = {The Handbook of Brain Theory and Neural Networks}, 227 | publisher = {MIT Press}, 228 | year = {1998}, 229 | editor = {Arbib, Michael A.}, 230 | chapter = {Convolutional Networks for Images, Speech, and Time Series}, 231 | pages = {255--258}, 232 | address = {Cambridge, MA, USA}, 233 | isbn = {0-262-51102-9}, 234 | acmid = {303704}, 235 | numpages = {4}, 236 | url = {http://dl.acm.org/citation.cfm?id=303568.303704}, 237 | } 238 | 239 | @InProceedings{LeCun:1998:EB:645754.668382, 240 | author = {LeCun, Yann and Bottou, L{\'e}on and Orr, Genevieve B. and M\"{u}ller, Klaus-Robert}, 241 | title = {Effiicient BackProp}, 242 | booktitle = {Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop}, 243 | year = {1998}, 244 | pages = {9--50}, 245 | address = {London, UK, UK}, 246 | publisher = {Springer-Verlag}, 247 | acmid = {668382}, 248 | isbn = {3-540-65311-2}, 249 | numpages = {42}, 250 | url = {http://dl.acm.org/citation.cfm?id=645754.668382}, 251 | } 252 | 253 | @Article{Srivastava:2014:DSW:2627435.2670313, 254 | author = {Srivastava, Nitish and Hinton, Geoffrey and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan}, 255 | title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting}, 256 | journal = {J. Mach. Learn. 
Res.}, 257 | year = {2014}, 258 | volume = {15}, 259 | number = {1}, 260 | pages = {1929--1958}, 261 | month = jan, 262 | issn = {1532-4435}, 263 | acmid = {2670313}, 264 | issue_date = {January 2014}, 265 | keywords = {deep learning, model combination, neural networks, regularization}, 266 | numpages = {30}, 267 | publisher = {JMLR.org}, 268 | url = {http://dl.acm.org/citation.cfm?id=2627435.2670313}, 269 | } 270 | 271 | @Article{DBLP:journals/corr/SimonyanZ14a, 272 | author = {Simonyan, Karen and Zisserman, Andrew}, 273 | title = {Very Deep Convolutional Networks for Large-Scale Image Recognition}, 274 | journal = {CoRR}, 275 | year = {2014}, 276 | volume = {abs/1409.1556}, 277 | bibsource = {dblp computer science bibliography, http://dblp.org}, 278 | interhash = {4e6fa56cb7cf99400d5701543ee228de}, 279 | intrahash = {0ee0434e0a70b329d5518f43f1742f7a}, 280 | url = {http://arxiv.org/abs/1409.1556}, 281 | } 282 | 283 | @InProceedings{Goodfellow13maxoutnetworks, 284 | author = {Ian J. Goodfellow and David Warde-farley and Mehdi Mirza and Aaron Courville and Yoshua Bengio}, 285 | title = {Maxout networks}, 286 | booktitle = {In ICML}, 287 | year = {2013}, 288 | } 289 | 290 | @InProceedings{Gers2000c, 291 | author = {Gers, F. A. and Schmidhuber, J.}, 292 | title = {Recurrent Nets that Time and Count}, 293 | booktitle = {{Proceedings of the IJCNN'2000, Int. Joint Conf. on Neural Networks}}, 294 | year = {2000}, 295 | address = {Como, Italy}, 296 | interhash = {01a07138f9b65e4eccfa440aed281bf5}, 297 | intrahash = {f0560c89bdc9fd511c03ed8d3ace008d}, 298 | owner = {thierry}, 299 | timestamp = {2009.04.18}, 300 | } 301 | 302 | @Article{Gers:2000:LFC:1121912.1121915, 303 | author = {Gers, Felix A. and Schmidhuber, J\"{u}rgen A. and Cummins, Fred A.}, 304 | title = {Learning to Forget: Continual Prediction with LSTM}, 305 | journal = {Neural Comput.}, 306 | year = {2000}, 307 | volume = {12}, 308 | number = {10}, 309 | pages = {2451--2471}, 310 | month = oct, 311 | issn = {0899-7667}, 312 | acmid = {1121915}, 313 | address = {Cambridge, MA, USA}, 314 | doi = {10.1162/089976600300015015}, 315 | issue_date = {October 2000}, 316 | numpages = {21}, 317 | publisher = {MIT Press}, 318 | url = {http://dx.doi.org/10.1162/089976600300015015}, 319 | } 320 | 321 | @Article{citeulike:14070430, 322 | author = {Srivastava, Rupesh K. 
and Greff, Klaus and Schmidhuber, Jurgen}, 323 | title = {{Highway Networks}}, 324 | citeulike-article-id = {14070430}, 325 | citeulike-linkout-0 = {http://arxiv.org/pdf/1505.00387v1.pdf}, 326 | keywords = {deep\_learning\_architectures, deep\_learning\_theory, lstm, multilayer\_networks, networks, neural, schmidhuber\_jurgen}, 327 | posted-at = {2016-06-16 15:29:20}, 328 | url = {http://arxiv.org/pdf/1505.00387v1.pdf}, 329 | } 330 | 331 | @Article{HuangGLZLW, 332 | author = {Huang, Gao and Liu, Zhuang and Van De Maaten, Laurens and Weinberger, Kilian}, 333 | title = {Densely Connected Convolutional Networks}, 334 | year = {2017}, 335 | month = {07}, 336 | } 337 | 338 | @Article{QIAN1999145, 339 | author = {Ning Qian}, 340 | title = {On the momentum term in gradient descent learning algorithms}, 341 | journal = {Neural Networks}, 342 | year = {1999}, 343 | volume = {12}, 344 | number = {1}, 345 | pages = {145 - 151}, 346 | issn = {0893-6080}, 347 | doi = {http://dx.doi.org/10.1016/S0893-6080(98)00116-6}, 348 | keywords = {Momentum, Gradient descent learning algorithm, Damped harmonic oscillator, Critical damping, Learning rate, Speed of convergence}, 349 | url = {http://www.sciencedirect.com/science/article/pii/S0893608098001166}, 350 | } 351 | 352 | @InProceedings{nesterov1983method, 353 | author = {Nesterov, Yurii}, 354 | title = {A method for unconstrained convex minimization problem with the rate of convergence O (1/k2)}, 355 | booktitle = {Doklady an SSSR}, 356 | year = {1983}, 357 | volume = {269}, 358 | number = {3}, 359 | pages = {543--547}, 360 | } 361 | 362 | @Article{Duchi:2011:ASM:1953048.2021068, 363 | author = {Duchi, John and Hazan, Elad and Singer, Yoram}, 364 | title = {Adaptive Subgradient Methods for Online Learning and Stochastic Optimization}, 365 | journal = {J. Mach. Learn. Res.}, 366 | year = {2011}, 367 | volume = {12}, 368 | pages = {2121--2159}, 369 | month = jul, 370 | issn = {1532-4435}, 371 | acmid = {2021068}, 372 | issue_date = {2/1/2011}, 373 | numpages = {39}, 374 | publisher = {JMLR.org}, 375 | url = {http://dl.acm.org/citation.cfm?id=1953048.2021068}, 376 | } 377 | 378 | @Article{journals/corr/abs-1212-5701, 379 | author = {Zeiler, Matthew D.}, 380 | title = {ADADELTA: An Adaptive Learning Rate Method}, 381 | journal = {CoRR}, 382 | year = {2012}, 383 | volume = {abs/1212.5701}, 384 | ee = {http://arxiv.org/abs/1212.5701}, 385 | interhash = {0485dc964af0cd2296b6868b7f97c90d}, 386 | intrahash = {593eceee0e364927f3dd9c85e788bba8}, 387 | url = {http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-5701}, 388 | } 389 | 390 | @InProceedings{Lecun98gradient-basedlearning, 391 | author = {Yann Lecun and Léon Bottou and Yoshua Bengio and Patrick Haffner}, 392 | title = {Gradient-based learning applied to document recognition}, 393 | booktitle = {Proceedings of the IEEE}, 394 | year = {1998}, 395 | pages = {2278--2324}, 396 | } 397 | 398 | @InCollection{NIPS2012_4824, 399 | author = {Alex Krizhevsky and Sutskever, Ilya and Hinton, Geoffrey E}, 400 | title = {ImageNet Classification with Deep Convolutional Neural Networks}, 401 | booktitle = {Advances in Neural Information Processing Systems 25}, 402 | publisher = {Curran Associates, Inc.}, 403 | year = {2012}, 404 | editor = {F. Pereira and C. J. C. Burges and L. Bottou and K. Q. 
Weinberger}, 405 | pages = {1097--1105}, 406 | url = {http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf}, 407 | } 408 | 409 | @Book{GravesA2016, 410 | title = {Supervised Sequence Labelling with Recurrent Neural Networks}, 411 | year = {2011}, 412 | author = {Graves, Alex}, 413 | added-at = {2016-12-07T19:03:54.000+0100}, 414 | biburl = {https://www.bibsonomy.org/bibtex/22fe5732cd8b62f4d09a140d5b40c82ec/hprop}, 415 | interhash = {ce5e3e2888eb4afd21867cfb9639bc23}, 416 | intrahash = {2fe5732cd8b62f4d09a140d5b40c82ec}, 417 | keywords = {books machine-learning rnn}, 418 | timestamp = {2016-12-07T19:03:54.000+0100}, 419 | } 420 | 421 | @Article{Clevert2015FastAA, 422 | author = {Djork-Arn{\'e} Clevert and Thomas Unterthiner and Sepp Hochreiter}, 423 | title = {Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)}, 424 | journal = {CoRR}, 425 | year = {2015}, 426 | volume = {abs/1511.07289}, 427 | } 428 | 429 | @Book{Epelbaum2017, 430 | title = {Deep Learning: Technical Introduction}, 431 | publisher = {https://arxiv.org/abs/1709.xxxxx}, 432 | year = {2017}, 433 | author = {Thomas Epelbaum}, 434 | } 435 | 436 | @Article{citeulike:14069459, 437 | author = {Greff, Klaus and Srivastava, Rupesh K. and Koutn{\i}k, Jan and Steunebrink, Bas R. and Schmidhuber, Jurgen}, 438 | title = {{LSTM}: A Search Space Odyssey}, 439 | citeulike-article-id = {14069459}, 440 | citeulike-linkout-0 = {http://arxiv.org/pdf/1503.04069.pdf}, 441 | keywords = {lstm, recurrent, rnn, schmidhuber\_jurgen}, 442 | posted-at = {2016-06-15 14:04:20}, 443 | url = {http://arxiv.org/pdf/1503.04069.pdf}, 444 | } 445 | 446 | @InProceedings{citeulike:4571969, 447 | author = {Akaike, H.}, 448 | title = {{Information theory and an extension of the maximum likelihood principle}}, 449 | booktitle = {Second International Symposium on Information Theory}, 450 | year = {1973}, 451 | editor = {Petrov, B. N. and Csaki, F.}, 452 | pages = {267--281}, 453 | address = {Budapest}, 454 | publisher = {Akad\'{e}miai Kiado}, 455 | citeulike-article-id = {4571969}, 456 | keywords = {modelselection}, 457 | posted-at = {2009-05-21 19:56:32}, 458 | } 459 | 460 | @Comment{jabref-meta: databaseType:bibtex;} 461 | -------------------------------------------------------------------------------- /ELU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ELU.pdf -------------------------------------------------------------------------------- /GoogleNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/GoogleNet.pdf -------------------------------------------------------------------------------- /Inception.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Inception.pdf -------------------------------------------------------------------------------- /Introduction.tex: -------------------------------------------------------------------------------- 1 | \chapter{Introduction} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}T}his note aims at presenting the three most common forms of neural network architectures. 
It does so in a technical though hopefully pedagogical way, building up in complexity as one progresses through the chapters. 4 | 5 | \vspace{0.2cm} 6 | 7 | Chapter \ref{sec:chapterFNN} starts with the first type of network introduced historically: a regular feedforward neural network, itself an evolution of the original perceptron \cite{Rosenblatt58theperceptron:} algorithm. One should see the latter as a non-linear regression, and feedforward networks schematically stack perceptron layers on top of one another. 8 | 9 | \vspace{0.2cm} 10 | 11 | We will thus introduce in chapter \ref{sec:chapterFNN} the fundamental building blocks of the simplest neural network layers: weight averaging and activation functions (a schematic sketch in index form is given at the end of this introduction). We will also introduce gradient descent, which, coupled with the backpropagation algorithm, is how the network is trained to minimize a loss function adapted to the task at hand (classification or regression). The more technical details of the backpropagation algorithm are found in the appendix of this chapter, alongside an introduction to the state-of-the-art feedforward neural network, the ResNet. One can finally find a short matrix description of the feedforward network. 12 | 13 | \vspace{0.2cm} 14 | 15 | In chapter \ref{sec:chapterCNN}, we present the second type of neural network studied: the convolutional network, particularly suited to processing and labelling images. This implies presenting the mathematical tools related to this network (convolution, pooling, stride...), as well as the modifications of the building blocks introduced in chapter \ref{sec:chapterFNN}. Several convolutional architectures are then presented, and the appendices once again detail the difficult steps of the main text. 16 | 17 | \vspace{0.2cm} 18 | 19 | Chapter \ref{sec:chapterRNN} finally presents the network architecture suited to data with a temporal structure -- such as time series -- the recurrent neural network. There again, the novelties and the modifications of the material introduced in the two previous chapters are detailed in the main text, while the appendices give all that one needs to understand the most cumbersome formula of this kind of network architecture.
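\vspace{0.2cm}

To fix ideas, here is a minimal sketch of the two operations that chapter \ref{sec:chapterFNN} details, written in the index notation favoured throughout. The symbols used here ($h$ for hidden units, $\Theta$ for weights, $b$ for biases, $g$ for the activation function, $J$ for the loss and $\eta$ for the learning rate) are illustrative placeholders, not necessarily the notation adopted in the main text:
\begin{align}
h^{(\nu)}_{f} &= g\left(b^{(\nu)}_{f}+\sum_{f'} \Theta^{(\nu)}_{f f'}\, h^{(\nu-1)}_{f'}\right), &
\Theta^{(\nu)}_{f f'} &\longleftarrow \Theta^{(\nu)}_{f f'} - \eta\, \frac{\partial J}{\partial \Theta^{(\nu)}_{f f'}}\,.
\end{align}
The first relation is the weight averaging of the previous layer's output followed by an activation function; the second is the gradient descent update of a weight, the gradient of the loss being computed by the backpropagation algorithm.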
20 | -------------------------------------------------------------------------------- /LSTM_structure-peephole.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure-peephole.pdf -------------------------------------------------------------------------------- /LSTM_structure-tot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure-tot.pdf -------------------------------------------------------------------------------- /LSTM_structure.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure.pdf -------------------------------------------------------------------------------- /LeNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LeNet.pdf -------------------------------------------------------------------------------- /Mediamobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Mediamobile.png -------------------------------------------------------------------------------- /Preface.tex: -------------------------------------------------------------------------------- 1 | \chapter{Preface} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I} started learning about deep learning fundamentals in February 2017. At that time, I knew nothing about backpropagation, and was completely ignorant about the differences between a Feedforward, a Convolutional and a Recurrent Neural Network. 4 | 5 | \vspace{0.2cm} 6 | 7 | As I navigated through the humongous amount of data available on deep learning online, I found myself quite frustrated when it came to really understanding what deep learning is, rather than just applying it with some available library. 8 | 9 | \vspace{0.2cm} 10 | 11 | In particular, the backpropagation update rules are seldom derived, and never in index form. Unfortunately for me, I have an "index" mind: seeing a 4-dimensional convolution formula in matrix form does not do it for me. Since I am also stupid enough to like recoding the wheel in low-level programming languages, the matrix form cannot be directly converted into working code either. 12 | 13 | 14 | \vspace{0.2cm} 15 | 16 | I therefore started some notes for my personal use, where I tried to rederive everything from scratch in index form. 17 | 18 | \vspace{0.2cm} 19 | 20 | I did so for the vanilla Feedforward network, then learned about L1 and L2 regularization, dropout\cite{Srivastava:2014:DSW:2627435.2670313}, batch normalization\cite{Ioffe2015}, several gradient descent optimization techniques... I then turned to convolutional networks, from conventional conv-pool architectures with a single-digit number of layers\cite{Lecun98gradient-basedlearning} to recent VGG\cite{DBLP:journals/corr/SimonyanZ14a} and ResNet\cite{He2015} ones, from local contrast normalization and rectification to batchnorm...
And finally I studied Recurrent Neural Network structures\cite{GravesA2016}, from the standard formulation to the most recent LSTM one\cite{Gers:2000:LFC:1121912.1121915}. 21 | 22 | \vspace{0.2cm} 23 | 24 | As my work progressed, my notes got bigger and bigger, to the point where I realized I might have enough material to help others starting their own deep learning journey. 25 | 26 | \vspace{0.2cm} 27 | 28 | This work is bottom-up at its core. If you are searching for a working Neural Network in 10 lines of code and 5 minutes of your time, you have come to the wrong place. If you can mentally multiply and convolve 4D tensors, then I have nothing to convey to you either. 29 | 30 | \vspace{0.2cm} 31 | 32 | If on the other hand you like(d) to rederive every tiny calculation of every theorem of every class that you stepped into, then you might be interested in what follows! 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Technical Book on Deep Learning 2 | 3 | This note presents in a technical though hopefully pedagogical way the three most common forms of neural network architectures: Feedforward, Convolutional and Recurrent. 4 | 5 | For each network, the fundamental building blocks are detailed. The forward pass and the update rules for the backpropagation algorithm are then derived in full. 6 | 7 | The pdf of the whole document can be downloaded directly: [White_book.pdf](https://github.com/tomepel/Technical_Book_DL/raw/master/White_book.pdf). 8 | 9 | Otherwise, all the figures contained in the note are included in this repo, as well as the tex files needed for compilation. Just don't forget to cite the source if you use any of this material! :) 10 | 11 | Hope it can help others! 12 | 13 | # Acknowledgement 14 | 15 | This work has no benefit nor added value to the deep learning topic on its own. It is just the reformulation of ideas of brighter researchers to fit a peculiar mindset: the one of preferring formulas with ten indices but where one knows precisely what one is manipulating rather than (in my opinion sometimes opaque) matrix formulations where the dimensions of the objects are rarely if ever specified. 16 | 17 | Among the brighter people from whom I learned online is Andrew Ng. His Coursera class (https://www.coursera.org/learn/machine-learning) was the first contact I had with Neural Networks, and this pedagogical introduction allowed me to build on solid ground. 18 | 19 | I also wish to particularly thank Hugo Larochelle, who not only built a wonderful deep learning class (http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html), but was also kind enough to answer emails from a complete beginner and stranger! 20 | 21 | The Stanford class on convolutional networks (http://cs231n.github.io/convolutional-networks/) proved extremely valuable to me, as did the one on Natural Language Processing (http://web.stanford.edu/class/cs224n/). 22 | 23 | I also benefited greatly from Sebastian Ruder's blog (http://ruder.io/#open), both from the blog pages on gradient descent optimization techniques and from the author himself. 24 | 25 | I learned more about LSTM on colah's blog (http://colah.github.io/posts/2015-08-Understanding-LSTMs/), and some of my drawings are inspired from there. 26 | 27 | I also thank Jonathan Del Hoyo for the great articles that he regularly shares on LinkedIn.
28 | 29 | Many thanks go to my collaborators at Mediamobile, who let me dig as deep as I wanted on Neural Networks. I am especially indebted to Clément, Nicolas, Jessica, Christine and Céline. 30 | 31 | Thanks to Jean-Michel Loubes and Fabrice Gamboa, from whom I learned a great deal on probability theory and statistics. 32 | 33 | I end this list with my employer, Mediamobile, which has been kind enough to let me work on this topic with complete freedom. A special thanks to Philippe, who supervised me with the perfect balance of feedback and freedom! 34 | 35 | # Contact 36 | 37 | If you detect any typo, error (as I am sure that there unfortunately still are), or feel that I forgot to cite an important source, don't hesitate to email me: thomas.epelbaum@shift-technology.com 38 | -------------------------------------------------------------------------------- /RNN_structure-tot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/RNN_structure-tot.pdf -------------------------------------------------------------------------------- /RNN_structure.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/RNN_structure.pdf -------------------------------------------------------------------------------- /ReLU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ReLU.pdf -------------------------------------------------------------------------------- /ResNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ResNet.pdf -------------------------------------------------------------------------------- /S_FNN.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/S_FNN.pdf -------------------------------------------------------------------------------- /ThesisStyle.cls: -------------------------------------------------------------------------------- 1 | %% 2 | %% This is file `book.cls', 3 | %% generated with the docstrip utility. 4 | %% 5 | %% The original source files were: 6 | %% 7 | %% classes.dtx (with options: `book') 8 | %% 9 | %% This is a generated file. 10 | %% 11 | %% Copyright 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 12 | %% The LaTeX3 Project and any individual authors listed elsewhere 13 | %% in this file. 14 | %% 15 | %% This file was generated from file(s) of the LaTeX base system. 16 | %% -------------------------------------------------------------- 17 | %% 18 | %% It may be distributed and/or modified under the 19 | %% conditions of the LaTeX Project Public License, either version 1.3 20 | %% of this license or (at your option) any later version. 21 | %% The latest version of this license is in 22 | %% http://www.latex-project.org/lppl.txt 23 | %% and version 1.3 or later is part of all distributions of LaTeX 24 | %% version 2003/12/01 or later. 25 | %% 26 | %% This file has the LPPL maintenance status "maintained". 
27 | %% 28 | %% This file may only be distributed together with a copy of the LaTeX 29 | %% base system. You may however distribute the LaTeX base system without 30 | %% such generated files. 31 | %% 32 | %% The list of all files belonging to the LaTeX base distribution is 33 | %% given in the file `manifest.txt'. See also `legal.txt' for additional 34 | %% information. 35 | %% 36 | %% The list of derived (unpacked) files belonging to the distribution 37 | %% and covered by LPPL is defined by the unpacking scripts (with 38 | %% extension .ins) which are part of the distribution. 39 | %% \CharacterTable 40 | %% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z 41 | %% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z 42 | %% Digits \0\1\2\3\4\5\6\7\8\9 43 | %% Exclamation \! Double quote \" Hash (number) \# 44 | %% Dollar \$ Percent \% Ampersand \& 45 | %% Acute accent \' Left paren \( Right paren \) 46 | %% Asterisk \* Plus \+ Comma \, 47 | %% Minus \- Point \. Solidus \/ 48 | %% Colon \: Semicolon \; Less than \< 49 | %% Equals \= Greater than \> Question mark \? 50 | %% Commercial at \@ Left bracket \[ Backslash \\ 51 | %% Right bracket \] Circumflex \^ Underscore \_ 52 | %% Grave accent \` Left brace \{ Vertical bar \| 53 | %% Right brace \} Tilde \~} 54 | \NeedsTeXFormat{LaTeX2e}[1995/12/01] 55 | \ProvidesClass{ThesisStyle} 56 | [2004/02/16 v1.4f 57 | Standard LaTeX document class] 58 | \newcommand\@ptsize{} 59 | \newif\if@restonecol 60 | \newif\if@titlepage 61 | \@titlepagetrue 62 | \newif\if@openright 63 | \newif\if@mainmatter \@mainmattertrue 64 | \if@compatibility\else 65 | \DeclareOption{a4paper} 66 | {\setlength\paperheight {297mm}% 67 | \setlength\paperwidth {210mm}} 68 | \DeclareOption{a5paper} 69 | {\setlength\paperheight {210mm}% 70 | \setlength\paperwidth {148mm}} 71 | \DeclareOption{b5paper} 72 | {\setlength\paperheight {250mm}% 73 | \setlength\paperwidth {176mm}} 74 | \DeclareOption{letterpaper} 75 | {\setlength\paperheight {11in}% 76 | \setlength\paperwidth {8.5in}} 77 | \DeclareOption{legalpaper} 78 | {\setlength\paperheight {14in}% 79 | \setlength\paperwidth {8.5in}} 80 | \DeclareOption{executivepaper} 81 | {\setlength\paperheight {10.5in}% 82 | \setlength\paperwidth {7.25in}} 83 | \DeclareOption{landscape} 84 | {\setlength\@tempdima {\paperheight}% 85 | \setlength\paperheight {\paperwidth}% 86 | \setlength\paperwidth {\@tempdima}} 87 | \fi 88 | \if@compatibility 89 | \renewcommand\@ptsize{0} 90 | \else 91 | \DeclareOption{10pt}{\renewcommand\@ptsize{0}} 92 | \fi 93 | \DeclareOption{11pt}{\renewcommand\@ptsize{1}} 94 | \DeclareOption{12pt}{\renewcommand\@ptsize{2}} 95 | \if@compatibility\else 96 | \DeclareOption{oneside}{\@twosidefalse \@mparswitchfalse} 97 | \fi 98 | \DeclareOption{twoside}{\@twosidetrue \@mparswitchtrue} 99 | \DeclareOption{draft}{\setlength\overfullrule{5pt}} 100 | \if@compatibility\else 101 | \DeclareOption{final}{\setlength\overfullrule{0pt}} 102 | \fi 103 | \DeclareOption{titlepage}{\@titlepagetrue} 104 | \if@compatibility\else 105 | \DeclareOption{notitlepage}{\@titlepagefalse} 106 | \fi 107 | \if@compatibility 108 | \@openrighttrue 109 | \else 110 | \DeclareOption{openright}{\@openrighttrue} 111 | \DeclareOption{openany}{\@openrightfalse} 112 | \fi 113 | \if@compatibility\else 114 | \DeclareOption{onecolumn}{\@twocolumnfalse} 115 | \fi 116 | \DeclareOption{twocolumn}{\@twocolumntrue} 117 | \DeclareOption{leqno}{\input{leqno.clo}} 118 | \DeclareOption{fleqn}{\input{fleqn.clo}} 119 | \DeclareOption{openbib}{% 120 | 
\AtEndOfPackage{% 121 | \renewcommand\@openbib@code{% 122 | \advance\leftmargin\bibindent 123 | \itemindent -\bibindent 124 | \listparindent \itemindent 125 | \parsep \z@ 126 | }% 127 | \renewcommand\newblock{\par}}% 128 | } 129 | \ExecuteOptions{letterpaper,12pt,twoside,onecolumn,final,openright} 130 | \ProcessOptions 131 | \input{bk1\@ptsize.clo} 132 | \setlength\lineskip{1\p@} 133 | \setlength\normallineskip{1\p@} 134 | \renewcommand\baselinestretch{} 135 | \setlength\parskip{0\p@ \@plus \p@} 136 | \@lowpenalty 51 137 | \@medpenalty 151 138 | \@highpenalty 301 139 | \setcounter{topnumber}{2} 140 | \renewcommand\topfraction{.7} 141 | \setcounter{bottomnumber}{1} 142 | \renewcommand\bottomfraction{.3} 143 | \setcounter{totalnumber}{3} 144 | \renewcommand\textfraction{.2} 145 | \renewcommand\floatpagefraction{.5} 146 | \setcounter{dbltopnumber}{2} 147 | \renewcommand\dbltopfraction{.7} 148 | \renewcommand\dblfloatpagefraction{.5} 149 | \if@twoside 150 | \def\ps@headings{% 151 | \let\@oddfoot\@empty\let\@evenfoot\@empty 152 | \def\@evenhead{\thepage\hfil\slshape\leftmark}% 153 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 154 | \let\@mkboth\markboth 155 | \def\chaptermark##1{% 156 | \markboth {\MakeUppercase{% 157 | \ifnum \c@secnumdepth >\m@ne 158 | \if@mainmatter 159 | \@chapapp\ \thechapter. \ % 160 | \fi 161 | \fi 162 | ##1}}{}}% 163 | \def\sectionmark##1{% 164 | \markright {\MakeUppercase{% 165 | \ifnum \c@secnumdepth >\z@ 166 | \thesection. \ % 167 | \fi 168 | ##1}}}} 169 | \else 170 | \def\ps@headings{% 171 | \let\@oddfoot\@empty 172 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 173 | \let\@mkboth\markboth 174 | \def\chaptermark##1{% 175 | \markright {\MakeUppercase{% 176 | \ifnum \c@secnumdepth >\m@ne 177 | \if@mainmatter 178 | \@chapapp\ \thechapter. \ % 179 | \fi 180 | \fi 181 | ##1}}}} 182 | \fi 183 | \def\ps@myheadings{% 184 | \let\@oddfoot\@empty\let\@evenfoot\@empty 185 | \def\@evenhead{\thepage\hfil\slshape\leftmark}% 186 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 187 | \let\@mkboth\@gobbletwo 188 | \let\chaptermark\@gobble 189 | \let\sectionmark\@gobble 190 | } 191 | \if@titlepage 192 | \newcommand\maketitle{\begin{titlepage}% 193 | \let\footnotesize\small 194 | \let\footnoterule\relax 195 | \let \footnote \thanks 196 | \null\vfil 197 | \vskip 60\p@ 198 | \begin{center}% 199 | {\LARGE \@title \par}% 200 | \vskip 3em% 201 | {\large 202 | \lineskip .75em% 203 | \begin{tabular}[t]{c}% 204 | \@author 205 | \end{tabular}\par}% 206 | \vskip 1.5em% 207 | {\large \@date \par}% % Set date in \large size. 208 | \end{center}\par 209 | \@thanks 210 | \vfil\null 211 | \end{titlepage}% 212 | \setcounter{footnote}{0}% 213 | \global\let\thanks\relax 214 | \global\let\maketitle\relax 215 | \global\let\@thanks\@empty 216 | \global\let\@author\@empty 217 | \global\let\@date\@empty 218 | \global\let\@title\@empty 219 | \global\let\title\relax 220 | \global\let\author\relax 221 | \global\let\date\relax 222 | \global\let\and\relax 223 | } 224 | \else 225 | \newcommand\maketitle{\par 226 | \begingroup 227 | \renewcommand\thefootnote{\@fnsymbol\c@footnote}% 228 | \def\@makefnmark{\rlap{\@textsuperscript{\normalfont\@thefnmark}}}% 229 | \long\def\@makefntext##1{\parindent 1em\noindent 230 | \hb@xt@1.8em{% 231 | \hss\@textsuperscript{\normalfont\@thefnmark}}##1}% 232 | \if@twocolumn 233 | \ifnum \col@number=\@ne 234 | \@maketitle 235 | \else 236 | \twocolumn[\@maketitle]% 237 | \fi 238 | \else 239 | \newpage 240 | \global\@topnum\z@ % Prevents figures from going at top of page. 
241 | \@maketitle 242 | \fi 243 | \thispagestyle{plain}\@thanks 244 | \endgroup 245 | \setcounter{footnote}{0}% 246 | \global\let\thanks\relax 247 | \global\let\maketitle\relax 248 | \global\let\@maketitle\relax 249 | \global\let\@thanks\@empty 250 | \global\let\@author\@empty 251 | \global\let\@date\@empty 252 | \global\let\@title\@empty 253 | \global\let\title\relax 254 | \global\let\author\relax 255 | \global\let\date\relax 256 | \global\let\and\relax 257 | } 258 | \def\@maketitle{% 259 | \newpage 260 | \null 261 | \vskip 2em% 262 | \begin{center}% 263 | \let \footnote \thanks 264 | {\LARGE \@title \par}% 265 | \vskip 1.5em% 266 | {\large 267 | \lineskip .5em% 268 | \begin{tabular}[t]{c}% 269 | \@author 270 | \end{tabular}\par}% 271 | \vskip 1em% 272 | {\large \@date}% 273 | \end{center}% 274 | \par 275 | \vskip 1.5em} 276 | \fi 277 | \newcommand*\chaptermark[1]{} 278 | \setcounter{secnumdepth}{2} 279 | \newcounter {part} 280 | \newcounter {chapter} 281 | \newcounter {section}[chapter] 282 | \newcounter {subsection}[section] 283 | \newcounter {subsubsection}[subsection] 284 | \newcounter {paragraph}[subsubsection] 285 | \newcounter {subparagraph}[paragraph] 286 | \renewcommand \thepart {\@Roman\c@part} 287 | \renewcommand \thechapter {\@arabic\c@chapter} 288 | \renewcommand \thesection {\thechapter.\@arabic\c@section} 289 | \renewcommand\thesubsection {\thesection.\@arabic\c@subsection} 290 | \renewcommand\thesubsubsection{\thesubsection .\@arabic\c@subsubsection} 291 | \renewcommand\theparagraph {\thesubsubsection.\@arabic\c@paragraph} 292 | \renewcommand\thesubparagraph {\theparagraph.\@arabic\c@subparagraph} 293 | \newcommand\@chapapp{\chaptername} 294 | \newcommand\frontmatter{% 295 | \cleardoublepage 296 | \@mainmatterfalse 297 | \pagenumbering{roman}} 298 | \newcommand\mainmatter{% 299 | \cleardoublepage 300 | \@mainmattertrue 301 | \pagenumbering{arabic}} 302 | \newcommand\backmatter{% 303 | \if@openright 304 | \cleardoublepage 305 | \else 306 | \clearpage 307 | \fi 308 | \@mainmatterfalse} 309 | \newcommand\part{% 310 | \if@openright 311 | \cleardoublepage 312 | \else 313 | \clearpage 314 | \fi 315 | \thispagestyle{plain}% 316 | \if@twocolumn 317 | \onecolumn 318 | \@tempswatrue 319 | \else 320 | \@tempswafalse 321 | \fi 322 | \null\vfil 323 | \secdef\@part\@spart} 324 | 325 | \def\@part[#1]#2{% 326 | \ifnum \c@secnumdepth >-2\relax 327 | \refstepcounter{part}% 328 | \addcontentsline{toc}{part}{\thepart\hspace{1em}#1}% 329 | \else 330 | \addcontentsline{toc}{part}{#1}% 331 | \fi 332 | \markboth{}{}% 333 | {\centering 334 | \interlinepenalty \@M 335 | \normalfont 336 | \ifnum \c@secnumdepth >-2\relax 337 | \huge\bfseries \partname\nobreakspace\thepart 338 | \par 339 | \vskip 20\p@ 340 | \fi 341 | \Huge \bfseries #2\par}% 342 | \@endpart} 343 | \def\@spart#1{% 344 | {\centering 345 | \interlinepenalty \@M 346 | \normalfont 347 | \Huge \bfseries #1\par}% 348 | \@endpart} 349 | \def\@endpart{\vfil\newpage 350 | \if@twoside 351 | \if@openright 352 | \null 353 | \thispagestyle{empty}% 354 | \newpage 355 | \fi 356 | \fi 357 | \if@tempswa 358 | \twocolumn 359 | \fi} 360 | \newcommand\chapter{\if@openright\cleardoublepage\else\clearpage\fi 361 | \thispagestyle{plain}% 362 | \global\@topnum\z@ 363 | \@afterindentfalse 364 | \secdef\@chapter\@schapter} 365 | \def\@chapter[#1]#2{\ifnum \c@secnumdepth >\m@ne 366 | \if@mainmatter 367 | \refstepcounter{chapter}% 368 | \typeout{\@chapapp\space\thechapter.}% 369 | \addcontentsline{toc}{chapter}% 370 | {\protect\numberline{\thechapter}#1}% 371 | 
\else 372 | \addcontentsline{toc}{chapter}{#1}% 373 | \fi 374 | \else 375 | \addcontentsline{toc}{chapter}{#1}% 376 | \fi 377 | \chaptermark{#1}% 378 | \addtocontents{lof}{\protect\addvspace{10\p@}}% 379 | \addtocontents{lot}{\protect\addvspace{10\p@}}% 380 | \if@twocolumn 381 | \@topnewpage[\@makechapterhead{#2}]% 382 | \else 383 | \@makechapterhead{#2}% 384 | \@afterheading 385 | \fi} 386 | 387 | \def\@makechapterhead#1{% 388 | % \vspace*{10\p@}% 389 | {\parindent \z@ \raggedright \normalfont 390 | \begin{flushright} 391 | \ifnum \c@secnumdepth >\m@ne 392 | \if@mainmatter 393 | % \huge\bfseries 394 | {\Large \scshape \@chapapp\space \thechapter} 395 | \par\nobreak 396 | % \vskip 0\p@ 397 | \fi 398 | \fi 399 | \interlinepenalty\@M 400 | \Huge \bfseries #1\par\nobreak 401 | \hrulefill 402 | \end{flushright} 403 | \vskip 20\p@ 404 | }} 405 | \def\@schapter#1{\if@twocolumn 406 | \@topnewpage[\@makeschapterhead{#1}]% 407 | \else 408 | \@makeschapterhead{#1}% 409 | \@afterheading 410 | \fi} 411 | \def\@makeschapterhead#1{% 412 | % \vspace*{10\p@}% 413 | {\parindent \z@ \raggedright 414 | \normalfont 415 | \interlinepenalty\@M 416 | \begin{flushright} 417 | \Huge \bfseries #1\par\nobreak 418 | \end{flushright} 419 | \vskip 20\p@ 420 | }} 421 | 422 | \renewcommand{\@makechapterhead}[1]{% 423 | \vspace*{50\p@}% 424 | {\parindent \z@ \raggedright \normalfont 425 | \hrule % horizontal line 426 | \vspace{5pt}% % add vertical space 427 | \ifnum \c@secnumdepth >\m@ne 428 | \begin{center}\huge\scshape\bfseries \@chapapp\space \thechapter\end{center} % Chapter number 429 | \par\nobreak 430 | \vskip 1\p@ 431 | \fi 432 | \interlinepenalty\@M 433 | \begin{center}\reflectbox{\includegraphics[scale=0.075]{VGG-conv.pdf}}\bfseries\huge\scshape#1\includegraphics[scale=0.075]{VGG-conv.pdf}\end{center}\par % chapter title 434 | \vspace{5pt}% % add vertical space 435 | \hrule % horizontal rule 436 | \nobreak 437 | \vskip 40\p@ 438 | }} 439 | 440 | \newcommand\section{\@startsection {section}{1}{\z@}% 441 | {-3.5ex \@plus -1ex \@minus -.2ex}% 442 | {2.3ex \@plus.2ex}% 443 | {\normalfont\Large\bfseries}} 444 | \newcommand\subsection{\@startsection{subsection}{2}{\z@}% 445 | {-3.25ex\@plus -1ex \@minus -.2ex}% 446 | {1.5ex \@plus .2ex}% 447 | {\normalfont\large\bfseries}} 448 | \newcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}% 449 | {-3.25ex\@plus -1ex \@minus -.2ex}% 450 | {1.5ex \@plus .2ex}% 451 | {\normalfont\normalsize\bfseries}} 452 | \newcommand\paragraph{\@startsection{paragraph}{4}{\z@}% 453 | {3.25ex \@plus1ex \@minus.2ex}% 454 | {-1em}% 455 | {\normalfont\normalsize\bfseries}} 456 | \newcommand\subparagraph{\@startsection{subparagraph}{5}{\parindent}% 457 | {3.25ex \@plus1ex \@minus .2ex}% 458 | {-1em}% 459 | {\normalfont\normalsize\bfseries}} 460 | \if@twocolumn 461 | \setlength\leftmargini {2em} 462 | \else 463 | \setlength\leftmargini {2.5em} 464 | \fi 465 | \leftmargin \leftmargini 466 | \setlength\leftmarginii {2.2em} 467 | \setlength\leftmarginiii {1.87em} 468 | \setlength\leftmarginiv {1.7em} 469 | \if@twocolumn 470 | \setlength\leftmarginv {.5em} 471 | \setlength\leftmarginvi {.5em} 472 | \else 473 | \setlength\leftmarginv {1em} 474 | \setlength\leftmarginvi {1em} 475 | \fi 476 | \setlength \labelsep {.5em} 477 | \setlength \labelwidth{\leftmargini} 478 | \addtolength\labelwidth{-\labelsep} 479 | \@beginparpenalty -\@lowpenalty 480 | \@endparpenalty -\@lowpenalty 481 | \@itempenalty -\@lowpenalty 482 | \renewcommand\theenumi{\@arabic\c@enumi} 483 | 
\renewcommand\theenumii{\@alph\c@enumii} 484 | \renewcommand\theenumiii{\@roman\c@enumiii} 485 | \renewcommand\theenumiv{\@Alph\c@enumiv} 486 | \newcommand\labelenumi{\theenumi.} 487 | \newcommand\labelenumii{(\theenumii)} 488 | \newcommand\labelenumiii{\theenumiii.} 489 | \newcommand\labelenumiv{\theenumiv.} 490 | \renewcommand\p@enumii{\theenumi} 491 | \renewcommand\p@enumiii{\theenumi(\theenumii)} 492 | \renewcommand\p@enumiv{\p@enumiii\theenumiii} 493 | \newcommand\labelitemi{\textbullet} 494 | \newcommand\labelitemii{\normalfont\bfseries \textendash} 495 | \newcommand\labelitemiii{\textasteriskcentered} 496 | \newcommand\labelitemiv{\textperiodcentered} 497 | \newenvironment{description} 498 | {\list{}{\labelwidth\z@ \itemindent-\leftmargin 499 | \let\makelabel\descriptionlabel}} 500 | {\endlist} 501 | \newcommand*\descriptionlabel[1]{\hspace\labelsep 502 | \normalfont\bfseries #1} 503 | \newenvironment{verse} 504 | {\let\\\@centercr 505 | \list{}{\itemsep \z@ 506 | \itemindent -1.5em% 507 | \listparindent\itemindent 508 | \rightmargin \leftmargin 509 | \advance\leftmargin 1.5em}% 510 | \item\relax} 511 | {\endlist} 512 | \newenvironment{quotation} 513 | {\list{}{\listparindent 1.5em% 514 | \itemindent \listparindent 515 | \rightmargin \leftmargin 516 | \parsep \z@ \@plus\p@}% 517 | \item\relax} 518 | {\endlist} 519 | \newenvironment{quote} 520 | {\list{}{\rightmargin\leftmargin}% 521 | \item\relax} 522 | {\endlist} 523 | \if@compatibility 524 | \newenvironment{titlepage} 525 | {% 526 | \cleardoublepage 527 | \if@twocolumn 528 | \@restonecoltrue\onecolumn 529 | \else 530 | \@restonecolfalse\newpage 531 | \fi 532 | \thispagestyle{empty}% 533 | \setcounter{page}\z@ 534 | }% 535 | {\if@restonecol\twocolumn \else \newpage \fi 536 | } 537 | \else 538 | \newenvironment{titlepage} 539 | {% 540 | \cleardoublepage 541 | \if@twocolumn 542 | \@restonecoltrue\onecolumn 543 | \else 544 | \@restonecolfalse\newpage 545 | \fi 546 | \thispagestyle{empty}% 547 | \setcounter{page}\@ne 548 | }% 549 | {\if@restonecol\twocolumn \else \newpage \fi 550 | \if@twoside\else 551 | \setcounter{page}\@ne 552 | \fi 553 | } 554 | \fi 555 | \newcommand\appendix{\par 556 | \setcounter{chapter}{0}% 557 | \setcounter{section}{0}% 558 | \gdef\@chapapp{\appendixname}% 559 | \gdef\thechapter{\@Alph\c@chapter}} 560 | \setlength\arraycolsep{5\p@} 561 | \setlength\tabcolsep{6\p@} 562 | \setlength\arrayrulewidth{.4\p@} 563 | \setlength\doublerulesep{2\p@} 564 | \setlength\tabbingsep{\labelsep} 565 | \skip\@mpfootins = \skip\footins 566 | \setlength\fboxsep{3\p@} 567 | \setlength\fboxrule{.4\p@} 568 | \@addtoreset {equation}{chapter} 569 | \renewcommand\theequation 570 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@equation} 571 | \newcounter{figure}[chapter] 572 | \renewcommand \thefigure 573 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@figure} 574 | \def\fps@figure{tbp} 575 | \def\ftype@figure{1} 576 | \def\ext@figure{lof} 577 | \def\fnum@figure{\figurename\nobreakspace\thefigure} 578 | \newenvironment{figure} 579 | {\@float{figure}} 580 | {\end@float} 581 | \newenvironment{figure*} 582 | {\@dblfloat{figure}} 583 | {\end@dblfloat} 584 | \newcounter{table}[chapter] 585 | \renewcommand \thetable 586 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@table} 587 | \def\fps@table{tbp} 588 | \def\ftype@table{2} 589 | \def\ext@table{lot} 590 | \def\fnum@table{\tablename\nobreakspace\thetable} 591 | \newenvironment{table} 592 | {\@float{table}} 593 | {\end@float} 594 | \newenvironment{table*} 595 | {\@dblfloat{table}} 
596 | {\end@dblfloat} 597 | \newlength\abovecaptionskip 598 | \newlength\belowcaptionskip 599 | \setlength\abovecaptionskip{10\p@} 600 | \setlength\belowcaptionskip{0\p@} 601 | \long\def\@makecaption#1#2{% 602 | \vskip\abovecaptionskip 603 | \sbox\@tempboxa{#1: #2}% 604 | \ifdim \wd\@tempboxa >\hsize 605 | #1: #2\par 606 | \else 607 | \global \@minipagefalse 608 | \hb@xt@\hsize{\hfil\box\@tempboxa\hfil}% 609 | \fi 610 | \vskip\belowcaptionskip} 611 | \DeclareOldFontCommand{\rm}{\normalfont\rmfamily}{\mathrm} 612 | \DeclareOldFontCommand{\sf}{\normalfont\sffamily}{\mathsf} 613 | \DeclareOldFontCommand{\tt}{\normalfont\ttfamily}{\mathtt} 614 | \DeclareOldFontCommand{\bf}{\normalfont\bfseries}{\mathbf} 615 | \DeclareOldFontCommand{\it}{\normalfont\itshape}{\mathit} 616 | \DeclareOldFontCommand{\sl}{\normalfont\slshape}{\@nomath\sl} 617 | \DeclareOldFontCommand{\sc}{\normalfont\scshape}{\@nomath\sc} 618 | \DeclareRobustCommand*\cal{\@fontswitch\relax\mathcal} 619 | \DeclareRobustCommand*\mit{\@fontswitch\relax\mathnormal} 620 | \newcommand\@pnumwidth{1.55em} 621 | \newcommand\@tocrmarg{2.55em} 622 | \newcommand\@dotsep{4.5} 623 | \setcounter{tocdepth}{2} 624 | \newcommand\tableofcontents{% 625 | \if@twocolumn 626 | \@restonecoltrue\onecolumn 627 | \else 628 | \@restonecolfalse 629 | \fi 630 | \chapter*{\contentsname 631 | \@mkboth{% 632 | \MakeUppercase\contentsname}{\MakeUppercase\contentsname}}% 633 | \@starttoc{toc}% 634 | \if@restonecol\twocolumn\fi 635 | } 636 | \newcommand*\l@part[2]{% 637 | \ifnum \c@tocdepth >-2\relax 638 | \addpenalty{-\@highpenalty}% 639 | \addvspace{2.25em \@plus\p@}% 640 | \setlength\@tempdima{3em}% 641 | \begingroup 642 | \parindent \z@ \rightskip \@pnumwidth 643 | \parfillskip -\@pnumwidth 644 | {\leavevmode 645 | \large \bfseries #1\hfil \hb@xt@\@pnumwidth{\hss #2}}\par 646 | \nobreak 647 | \global\@nobreaktrue 648 | \everypar{\global\@nobreakfalse\everypar{}}% 649 | \endgroup 650 | \fi} 651 | \newcommand*\l@chapter[2]{% 652 | \ifnum \c@tocdepth >\m@ne 653 | \addpenalty{-\@highpenalty}% 654 | \vskip 1.0em \@plus\p@ 655 | \setlength\@tempdima{1.5em}% 656 | \begingroup 657 | \parindent \z@ \rightskip \@pnumwidth 658 | \parfillskip -\@pnumwidth 659 | \leavevmode \bfseries 660 | \advance\leftskip\@tempdima 661 | \hskip -\leftskip 662 | #1\nobreak\hfil \nobreak\hb@xt@\@pnumwidth{\hss #2}\par 663 | \penalty\@highpenalty 664 | \endgroup 665 | \fi} 666 | \newcommand*\l@section{\@dottedtocline{1}{1.5em}{2.3em}} 667 | \newcommand*\l@subsection{\@dottedtocline{2}{3.8em}{3.2em}} 668 | \newcommand*\l@subsubsection{\@dottedtocline{3}{7.0em}{4.1em}} 669 | \newcommand*\l@paragraph{\@dottedtocline{4}{10em}{5em}} 670 | \newcommand*\l@subparagraph{\@dottedtocline{5}{12em}{6em}} 671 | \newcommand\listoffigures{% 672 | \if@twocolumn 673 | \@restonecoltrue\onecolumn 674 | \else 675 | \@restonecolfalse 676 | \fi 677 | \chapter*{\listfigurename}% 678 | \@mkboth{\MakeUppercase\listfigurename}% 679 | {\MakeUppercase\listfigurename}% 680 | \@starttoc{lof}% 681 | \if@restonecol\twocolumn\fi 682 | } 683 | \newcommand*\l@figure{\@dottedtocline{1}{1.5em}{2.3em}} 684 | \newcommand\listoftables{% 685 | \if@twocolumn 686 | \@restonecoltrue\onecolumn 687 | \else 688 | \@restonecolfalse 689 | \fi 690 | \chapter*{\listtablename}% 691 | \@mkboth{% 692 | \MakeUppercase\listtablename}% 693 | {\MakeUppercase\listtablename}% 694 | \@starttoc{lot}% 695 | \if@restonecol\twocolumn\fi 696 | } 697 | \let\l@table\l@figure 698 | \newdimen\bibindent 699 | \setlength\bibindent{1.5em} 700 | 
\newenvironment{thebibliography}[1] 701 | {\chapter*{\bibname}% 702 | \@mkboth{\MakeUppercase\bibname}{\MakeUppercase\bibname}% 703 | \list{\@biblabel{\@arabic\c@enumiv}}% 704 | {\settowidth\labelwidth{\@biblabel{#1}}% 705 | \leftmargin\labelwidth 706 | \advance\leftmargin\labelsep 707 | \@openbib@code 708 | \usecounter{enumiv}% 709 | \let\p@enumiv\@empty 710 | \renewcommand\theenumiv{\@arabic\c@enumiv}}% 711 | \sloppy 712 | \clubpenalty4000 713 | \@clubpenalty \clubpenalty 714 | \widowpenalty4000% 715 | \sfcode`\.\@m} 716 | {\def\@noitemerr 717 | {\@latex@warning{Empty `thebibliography' environment}}% 718 | \endlist} 719 | \newcommand\newblock{\hskip .11em\@plus.33em\@minus.07em} 720 | \let\@openbib@code\@empty 721 | \newenvironment{theindex} 722 | {\if@twocolumn 723 | \@restonecolfalse 724 | \else 725 | \@restonecoltrue 726 | \fi 727 | \twocolumn[\@makeschapterhead{\indexname}]% 728 | \@mkboth{\MakeUppercase\indexname}% 729 | {\MakeUppercase\indexname}% 730 | \thispagestyle{plain}\parindent\z@ 731 | \parskip\z@ \@plus .3\p@\relax 732 | \columnseprule \z@ 733 | \columnsep 35\p@ 734 | \let\item\@idxitem} 735 | {\if@restonecol\onecolumn\else\clearpage\fi} 736 | \newcommand\@idxitem{\par\hangindent 40\p@} 737 | \newcommand\subitem{\@idxitem \hspace*{20\p@}} 738 | \newcommand\subsubitem{\@idxitem \hspace*{30\p@}} 739 | \newcommand\indexspace{\par \vskip 10\p@ \@plus5\p@ \@minus3\p@\relax} 740 | \renewcommand\footnoterule{% 741 | \kern-3\p@ 742 | \hrule\@width.4\columnwidth 743 | \kern2.6\p@} 744 | \@addtoreset{footnote}{chapter} 745 | \newcommand\@makefntext[1]{% 746 | \parindent 1em% 747 | \noindent 748 | \hb@xt@1.8em{\hss\@makefnmark}#1} 749 | \newcommand\contentsname{Contents} 750 | \newcommand\listfigurename{List of Figures} 751 | \newcommand\listtablename{List of Tables} 752 | \newcommand\bibname{Bibliography} 753 | \newcommand\indexname{Index} 754 | \newcommand\figurename{Figure} 755 | \newcommand\tablename{Table} 756 | \newcommand\partname{Part} 757 | \newcommand\chaptername{Chapter} 758 | \newcommand\appendixname{Appendix} 759 | \def\today{\ifcase\month\or 760 | January\or February\or March\or April\or May\or June\or 761 | July\or August\or September\or October\or November\or December\fi 762 | \space\number\day, \number\year} 763 | \setlength\columnsep{10\p@} 764 | \setlength\columnseprule{0\p@} 765 | \pagestyle{headings} 766 | \pagenumbering{arabic} 767 | \if@twoside 768 | \else 769 | \raggedbottom 770 | \fi 771 | \if@twocolumn 772 | \twocolumn 773 | \sloppy 774 | \flushbottom 775 | \else 776 | \onecolumn 777 | \fi 778 | \endinput 779 | %% 780 | %% End of file `book.cls'. 
781 | -------------------------------------------------------------------------------- /VGG-conv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-conv.pdf -------------------------------------------------------------------------------- /VGG-fc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-fc.pdf -------------------------------------------------------------------------------- /VGG-pool-fc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-pool-fc.pdf -------------------------------------------------------------------------------- /VGG-pool.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-pool.pdf -------------------------------------------------------------------------------- /VGG.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG.pdf -------------------------------------------------------------------------------- /White_book-blx.bib: -------------------------------------------------------------------------------- 1 | @Comment{$ biblatex control file $} 2 | @Comment{$ biblatex version 2.8 $} 3 | Do not modify this file! 4 | 5 | This is an auxiliary file used by the 'biblatex' package. 6 | This file may safely be deleted. It will be recreated as 7 | required. 
8 | 9 | @Control{biblatex-control, 10 | options = {2.8:0:0:1:0:1:1:0:0:0:0:0:3:1:79:+:none}, 11 | } 12 | -------------------------------------------------------------------------------- /White_book.dvi: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/White_book.dvi -------------------------------------------------------------------------------- /White_book.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/White_book.pdf -------------------------------------------------------------------------------- /White_book.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,12pt]{ThesisStyle} 2 | \include{formatAndDefs} 3 | \addbibresource{DEEP_LEARNING.bib} 4 | \begin{document} 5 | \def\layersep{2.5cm} 6 | \pgfmathsetseed{12} 7 | \newcommand{\midarrow}{\tikz \draw[-stealth] (0,0) -- +(.1,0);} 8 | \newcommand{\midarroww}{\tikz \draw[-stealth] (0,0.1) -- +(.1,0);} 9 | \newcommand{\orient}{\tikz \draw[-stealth] (0,0) -- +(0,.001);} 10 | \newcommand{\orientl}{\tikz \draw[-stealth] (0,0.15) -- +(.005,.01);} 11 | \newcommand{\orientr}{\tikz \draw[-stealth] (0,0.15) -- +(-.005,.01);} 12 | 13 | \thispagestyle{empty} 14 | \newgeometry{hmargin=0cm,vmargin=1cm} 15 | \begin{center} 16 | \begin{tikzpicture} 17 | \node[] at (0,0) {\includegraphics[scale=1]{cover_page-crop.pdf}}; 18 | \end{tikzpicture} 19 | \end{center} 20 | 21 | \restoregeometry 22 | \dominitoc \tableofcontents 23 | 24 | \include{Preface} 25 | \include{Acknowledgements} 26 | \include{Introduction} 27 | 28 | %\part{Theoretical background} \label{Part1} 29 | 30 | 31 | 32 | \include{chapter1} 33 | \include{chapter2} 34 | \include{chapter3} 35 | 36 | \include{Conclusion} 37 | \printbibliography 38 | \end{document} 39 | -------------------------------------------------------------------------------- /chapter1.tex: -------------------------------------------------------------------------------- 1 | \chapter{Feedforward Neural Networks} \label{sec:chapterFNN} 2 | 3 | \minitoc 4 | 5 | \section{Introduction} 6 | 7 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I}n this section we review the first type of neural network that has been developed historically: a regular Feedforward Neural Network (FNN). This network does not take into account any particular structure that the input data might have. Nevertheless, it is already a very powerful machine learning tool, especially when used with the state of the art regularization techniques. These techniques -- that we are going to present as well -- allowed to circumvent the training issues that people experienced when dealing with "deep" architectures: namely the fact that neural networks with an important number of hidden states and hidden layers have proven historically to be very hard to train (vanishing gradient and overfitting issues). 
8 | 9 | \section{FNN architecture} 10 | 11 | \begin{figure}[H] 12 | \begin{center} 13 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 14 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 15 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 16 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 17 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 18 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 19 | \tikzstyle{annot} = [text width=4em, text centered] 20 | 21 | % Draw the input layer nodes 22 | \foreach \name / \y in {1} 23 | \pgfmathtruncatemacro{\m}{int(\y-1)} 24 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 25 | \node[input neuron, pin=left:Bias] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 26 | 27 | 28 | \foreach \name / \y in {2,...,6} 29 | \pgfmathtruncatemacro{\m}{int(\y-1)} 30 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 31 | \node[input neuron, pin=left:Input \#\y] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 32 | 33 | % Draw the hidden layer 1 nodes 34 | \foreach \name / \y in {1,...,7} 35 | \pgfmathtruncatemacro{\m}{int(\y-1)} 36 | \path[yshift=0.5cm] 37 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 38 | 39 | % Draw the hidden layer 1 node 40 | \foreach \name / \y in {1,...,6} 41 | \pgfmathtruncatemacro{\m}{int(\y-1)} 42 | \path[yshift=0.0cm] 43 | node[hidden neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 44 | 45 | % Draw the output layer node 46 | \foreach \name / \y in {1,...,5} 47 | \path[yshift=-0.5cm] 48 | node[output neuron,pin={[pin edge={->}]right:Output \#\y}] (O-\name) at (3*\layersep,-\y cm) {$h_{\y}^{(N)}$}; 49 | 50 | % Connect every node in the input layer with every node in the 51 | % hidden layer. 52 | \foreach \source in {1,...,6} 53 | \foreach \dest in {2,...,7} 54 | \path (I-\source) edge (H1-\dest); 55 | 56 | \foreach \source in {1,...,7} 57 | \foreach \dest in {2,...,6} 58 | \path (H1-\source) edge (H2-\dest); 59 | 60 | % Connect every node in the hidden layer with the output layer 61 | \foreach \source in {1,...,6} 62 | \foreach \dest in {1,...,5} 63 | \path (H2-\source) edge (O-\dest); 64 | 65 | % Annotate the layers 66 | \node[annot,above of=H1-1, node distance=1cm] (hl) {Hidden layer 1}; 67 | \node[annot,left of=hl] {Input layer}; 68 | \node[annot,right of=hl] (hm) {Hidden layer $\nu$}; 69 | \node[annot,right of=hm] {Output layer}; 70 | 71 | \node at ((1.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 72 | \node at ((2.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 73 | \end{tikzpicture} 74 | \caption{\label{fig:1}Neural Network with $N+1$ layers ($N-1$ hidden layers). For simplicity of notations, the index referencing the training set has not been indicated. Shallow architectures use only one hidden layer. Deep learning amounts to take several hidden layers, usually containing the same number of hidden neurons. This number should be on the ballpark of the average of the number of input and output variables.} 75 | \end{center} 76 | \end{figure} 77 | 78 | A FNN is formed by one input layer, one (shallow network) or more (deep network, hence the name deep learning) hidden layers and one output layer. Each layer of the network (except the output one) is connected to a following layer. This connectivity is central to the FNN structure and has two main features in its simplest form: a weight averaging feature and an activation feature. 
We will review these features extensively in the following 79 | 80 | \section{Some notations} 81 | 82 | In the following, we will call 83 | \begin{itemize} 84 | \item[$\bullet$] $N$ the number of layers (not counting the input) in the Neural Network. 85 | \item[$\bullet$] $T_{{\rm train}}$ the number of training examples in the training set. 86 | \item[$\bullet$] $T_{{\rm mb}}$ the number of training examples in a mini-batch (see section \ref{sec:FNNlossfunction}). 87 | \item[$\bullet$] $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$ the mini-batch training instance index. 88 | \item[$\bullet$] $\nu\in\llbracket0,N\rrbracket$ the number of layers in the FNN. 89 | \item[$\bullet$] $F_\nu$ the number of neurons in the $\nu$'th layer. 90 | \item[$\bullet$] $X^{(t)}_f=h_{f}^{(0)(t)}$ where $f\in\llbracket0,F_0-1\rrbracket$ the input variables. 91 | \item[$\bullet$] $y^{(t)}_f$ where $f\in[0,F_N-1]$ the output variables (to be predicted). 92 | \item[$\bullet$] $\hat{y}^{(t)}_f$ where $f\in[0,F_N-1]$ the output of the network. 93 | \item[$\bullet$] $\Theta_{f}^{(\nu)f'}$ for $f\in [0,F_{\nu}-1]$, $f'\in [0,F_{\nu+1}-1]$ and $\nu\in[0,N-1]$ the weights matrices 94 | \item[$\bullet$] A bias term can be included. In practice, we will see when talking about the batch-normalization procedure that we can omit it. 95 | \end{itemize} 96 | 97 | 98 | \section{Weight averaging} 99 | 100 | 101 | One of the two main components of a FNN is a weight averaging procedure, which amounts to average the previous layer with some weight matrix to obtain the next layer. This is illustrated on the figure \ref{fig:3} 102 | 103 | 104 | \begin{figure}[H] 105 | \begin{center} 106 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 107 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 108 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 109 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 110 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 111 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 112 | \tikzstyle{annot} = [text width=4em, text centered] 113 | 114 | % Draw the input layer nodes 115 | \foreach \name / \y in {1} 116 | \pgfmathtruncatemacro{\m}{int(\y-1)} 117 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 118 | \node[input neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 119 | 120 | 121 | \foreach \name / \y in {2,...,6} 122 | \pgfmathtruncatemacro{\m}{int(\y-1)} 123 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 124 | \node[input neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 125 | 126 | % Draw the hidden layer 1 nodes 127 | \foreach \name / \y in {4} 128 | \pgfmathtruncatemacro{\m}{int(\y-1)} 129 | \path[yshift=0.5cm] 130 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$a_{\m}^{(0)}$}; 131 | 132 | 133 | \path (I-1) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_0$} (H1-4); 134 | \path (I-2) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_1$} (H1-4); 135 | \path (I-3) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_2$} (H1-4); 136 | \path (I-4) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_3$} (H1-4); 137 | \path (I-5) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_4$} (H1-4); 138 | \path (I-6) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_5$} (H1-4); 139 | 140 | \end{tikzpicture} 141 | \caption{\label{fig:3}Weight averaging procedure.} 142 | \end{center} 143 | \end{figure} 144 | 145 | 146 | Formally, the weight averaging procedure reads: 147 | 148 | \begin{align} 149 | 
a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1+\epsilon}_{f'=0}\Theta^{(\nu)f}_{\,f'}h^{(t)(\nu)}_{f'}\;, 150 | \end{align} 151 | where $\nu\in\llbracket 0,N-1\rrbracket$, $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$ and $f\in \llbracket 0,F_{\nu+1}-1\rrbracket$. The $\epsilon$ is here to include or exclude a bias term. In practice, as we will be using batch-normalization, we can safely omit it ($\epsilon=0$ in all the following). 152 | 153 | \section{Activation function} 154 | 155 | The hidden neuron of each layer is defined as 156 | \begin{align} 157 | h_{f}^{(t)(\nu+1)}&=g\left(a_{f}^{(t)(\nu)}\right)\;, 158 | \end{align} 159 | where $\nu\in\llbracket 0,N-2\rrbracket$, $f\in \llbracket 0,F_{\nu+1}-1\rrbracket$ and as usual $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$. Here $g$ is an activation function -- the second main ingredient of a FNN -- whose non-linearity allow to predict arbitrary output data. In practice, $g$ is usually taken to be one of the functions described in the following subsections. 160 | 161 | 162 | \subsection{The sigmoid function} 163 | 164 | The sigmoid function takes its value in $]0,1[$ and reads 165 | \begin{align} 166 | g(x)&=\sigma(x)=\frac{1}{1+e^{-x}}\;. 167 | \end{align} 168 | Its derivative is 169 | \begin{align} 170 | \sigma'(x)&=\sigma(x)\left(1-\sigma(x)\right)\;. 171 | \end{align} 172 | This activation function is not much used nowadays (except in RNN-LSTM networks that we will present later in chapter \ref{sec:chapterRNN}). 173 | 174 | \begin{figure}[H] 175 | \begin{center} 176 | \begin{tikzpicture} 177 | \node at (0,0) {\includegraphics[scale=1]{sigmoid}}; 178 | \end{tikzpicture} 179 | \end{center} 180 | \caption{\label{fig:sigmoid} the sigmoid function and its derivative.} 181 | \end{figure} 182 | 183 | \subsection{The tanh function} 184 | 185 | The tanh function takes its value in $]-1,1[$ and reads 186 | \begin{align} 187 | g(x)&=\tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}}\;. 188 | \end{align} 189 | Its derivative is 190 | \begin{align} 191 | \tanh'(x)&=1-\tanh^2(x)\;. 192 | \end{align} 193 | This activation function has seen its popularity drop due to the use of the activation function presented in the next section. 194 | 195 | \begin{figure}[H] 196 | \begin{center} 197 | \begin{tikzpicture} 198 | \node at (0,0) {\includegraphics[scale=1]{tanh2}}; 199 | \end{tikzpicture} 200 | \end{center} 201 | \caption{\label{fig:tanh} the tanh function and its derivative.} 202 | \end{figure} 203 | 204 | It is nevertherless still used in the standard formulation of the RNN-LSTM model (\ref{sec:chapterRNN}). 205 | 206 | \subsection{The ReLU function} 207 | 208 | 209 | The ReLU -- for Rectified Linear Unit -- function takes its value in $[0,+\infty[$ and reads 210 | \begin{align} 211 | g(x)&={\rm ReLU}(x)=\begin{cases} 212 | x & x\geq 0 \\ 213 | 0& x<0 214 | \end{cases}\;. 215 | \end{align} 216 | Its derivative is 217 | \begin{align} 218 | {\rm ReLU}'(x)&=\begin{cases} 219 | 1 & x\geq 0 \\ 220 | 0 & x<0 221 | \end{cases}\;. 222 | \end{align} 223 | 224 | \begin{figure}[H] 225 | \begin{center} 226 | \begin{tikzpicture} 227 | \node at (0,0) {\includegraphics[scale=1]{ReLU}}; 228 | \end{tikzpicture} 229 | \end{center} 230 | \caption{\label{fig:relu} the ReLU function and its derivative.} 231 | \end{figure} 232 | 233 | 234 | This activation function is the most extensively used nowadays. Two of its more common variants can also be found : the leaky ReLU and ELU -- Exponential Linear Unit. 
They have been introduced because the ReLU activation function tends to "kill" certain hidden neurons: once it has been turned off (zero value), it can never be turned on again. 235 | 236 | 237 | 238 | \subsection{The leaky-ReLU function} 239 | 240 | 241 | The leaky-ReLU --for Linear Rectified Linear Unit -- function takes its value in $]-\infty,+\infty[$ and is a slight modification of the ReLU that allows non-zero value for the hidden neuron whatever the $x$ value. It reads 242 | \begin{align} 243 | g(x)&= \text{leaky-ReLU}(x)=\begin{cases} 244 | x & x\geq 0 \\ 245 | 0.01\,x & x<0 246 | \end{cases}\;. 247 | \end{align} 248 | Its derivative is 249 | \begin{align} 250 | \text{leaky-ReLU}'(x)&=\begin{cases} 251 | 1 & x\geq 0 \\ 252 | 0.01 & x<0 253 | \end{cases}\;. 254 | \end{align} 255 | 256 | \begin{figure}[H] 257 | \begin{center} 258 | \begin{tikzpicture} 259 | \node at (0,0) {\includegraphics[scale=1]{lReLU}}; 260 | \end{tikzpicture} 261 | \end{center} 262 | \caption{\label{fig:lrelu} the leaky-ReLU function and its derivative.} 263 | \end{figure} 264 | 265 | A variant of the leaky-ReLU can also be found in the literature : the Parametric-ReLU, where the arbitrary $0.01$ in the definition of the leaky-ReLU is replaced by an $\alpha$ coefficient, that can be 266 | computed via backpropagation. 267 | \begin{align} 268 | g(x)&={\rm Parametric-ReLU}(x)=\begin{cases} 269 | x & x\geq 0 \\ 270 | \alpha\,x & x<0 271 | \end{cases}\;. 272 | \end{align} 273 | Its derivative is 274 | \begin{align} 275 | {\rm Parametric-ReLU}'(x)&=\begin{cases} 276 | 1 & x\geq 0 \\ 277 | \alpha & x<0 278 | \end{cases}\;. 279 | \end{align} 280 | 281 | \subsection{The ELU function} 282 | 283 | The ELU --for Exponential Linear Unit -- function takes its value between $]-1,+\infty[$ and is inspired by the leaky-ReLU philosophy: non-zero values for all $x$'s. But it presents the advantage of being $\mathcal{C}^1$. 284 | \begin{align} 285 | g(x)&={\rm ELU}(x)=\begin{cases} 286 | x & x\geq 0 \\ 287 | e^x-1 & x<0 288 | \end{cases}\;. 289 | \end{align} 290 | Its derivative is 291 | \begin{align} 292 | {\rm ELU}'(x)&=\begin{cases} 293 | 1 & x\geq 0 \\ 294 | e^x & x<0 295 | \end{cases}\;. 296 | \end{align} 297 | 298 | 299 | \begin{figure}[H] 300 | \begin{center} 301 | \begin{tikzpicture} 302 | \node at (0,0) {\includegraphics[scale=1]{ELU}}; 303 | \end{tikzpicture} 304 | \end{center} 305 | \caption{\label{fig:elu} the ELU function and its derivative.} 306 | \end{figure} 307 | 308 | % From my experience, leay-relu is more than enough. 309 | 310 | 311 | \section{FNN layers} 312 | 313 | As illustrated in figure \ref{fig:1}, a regular FNN is composed by several specific layers. Let us explicit them one by one. 314 | 315 | 316 | 317 | \subsection{Input layer} 318 | 319 | The input layer is one of the two places where the data at disposal for the problem at hand come into place. In this chapter, we will be considering a input of size $F_0$, denoted $X^{(t)}_{f}$, with\footnote{ 320 | To train the FNN, we jointly compute the forward and backward pass for $T_{{\rm mb}}$ samples of the training set, with $T_{{\rm mb}}\ll T_{{\rm train}}$. In the following we will thus have $t\in \llbracket 0, T_{{\rm mb}}-1\rrbracket$. 321 | } 322 | $t\in \llbracket 0, T_{{\rm mb}}-1\rrbracket$ (size of the mini-batch, more on that when we will be talking about gradient descent techniques), and $f \in \llbracket 0, F_0-1\rrbracket$. 
Given the problem at hand, a common procedure could be to center the input following the procedure 323 | \begin{align} 324 | \tilde{X}^{(t)}_{f}&=X^{(t)}_{f}-\mu_{f}\;, 325 | \end{align} 326 | with 327 | \begin{align} 328 | \mu_{f}&=\frac{1}{T_{{\rm train}}}\sum^{T_{{\rm train}}-1}_{t=0}X^{(t)}_{f}\;. 329 | \end{align} 330 | This correspond to compute the mean per data types over the training set. Following our notations, let us recall that 331 | \begin{align} 332 | X^{(t)}_{f}&=h^{(t)(0)}_{f}\;. 333 | \end{align} 334 | 335 | \subsection{Fully connected layer} 336 | 337 | The fully connected operation is just the conjunction of the weight averaging and the activation procedure. Namely, $\forall \nu\in \llbracket 0,N-1 \rrbracket$ 338 | \begin{align} 339 | a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\;.\label{eq:Weightavg} 340 | \end{align} 341 | and $\forall \nu\in \llbracket 0,N-2 \rrbracket$ 342 | \begin{align} 343 | h_{f}^{(t)(\nu+1)}&=g\left(a_{f}^{(t)(\nu)}\right)\;. 344 | \end{align} 345 | for the case where $\nu=N-1$, the activation function is replaced by an output function. 346 | 347 | 348 | 349 | \subsection{Output layer} 350 | 351 | The output of the FNN reads 352 | \begin{align} 353 | h_{f}^{(t)(N)}&=o(a_{f}^{(t)(N-1)})\;, 354 | \end{align} 355 | where $o$ is called the output function. In the case of the Euclidean loss function, the output function is just the identity. In a classification task, $o$ is the softmax function. 356 | \begin{align} 357 | o\left(a^{(t)(N-1)}_f\right)&=\frac{e^{a^{(t)(N-1)}_f}}{\sum\limits^{F_{N}-1}_{f'=0}e^{a^{(t)(N-1)}_{f'}}} 358 | \end{align} 359 | 360 | 361 | \section{Loss function} \label{sec:FNNlossfunction} 362 | 363 | The loss function evaluates the error performed by the FNN when it tries to estimate the data to be predicted (second place where the data make their appearance). For a regression problem, this is simply a mean square error (MSE) evaluation 364 | \begin{align} 365 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 366 | % 367 | \left(y_f^{(t)}-h_{f}^{(t)(N)}\right)^2\;, 368 | \end{align} 369 | while for a classification task, the loss function is called the cross-entropy function 370 | \begin{align} 371 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 372 | % 373 | \delta^f_{y^{(t)}}\ln h_{f}^{(t)(N)}\;, 374 | \end{align} 375 | and for a regression problem transformed into a classification one, calling $C$ the number of bins leads to 376 | \begin{align} 377 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1}\sum_{c=0}^{C-1} 378 | % 379 | \delta^c_{y_f^{(t)}}\ln h_{fc}^{(t)(N)}\;. 380 | \end{align} 381 | For reasons that will appear clear when talking about the data sample used at each training step, we denote 382 | \begin{align} 383 | J(\Theta)&=\sum_{t=0}^{T_{{\rm mb}}-1}J_{{\rm mb}}(\Theta)\;. 384 | \end{align} 385 | 386 | \section{Regularization techniques} 387 | 388 | On of the main difficulties when dealing with deep learning techniques is to get the deep neural network to train efficiently. To that end, several regularization techniques have been invented. We will review them in this section 389 | 390 | \subsection{L2 regularization} 391 | 392 | L2 regularization is the most common regularization technique that on can find in the literature. 
It amounts to add a regularizing term to the loss function in the following way 393 | \begin{align} 394 | J_{{\rm L2}}(\Theta)&=\lambda_{{\rm L2}} \sum_{\nu=0}^{N-1}\left\|\Theta^{(\nu)}\right\|^2_{{\rm L2}} 395 | % 396 | =\lambda_{{\rm L2}}\sum_{\nu=0}^{N-1}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_\nu-1} 397 | % 398 | \left(\Theta^{(\nu)f'}_{f}\right)^2\;.\label{eq:l2reg} 399 | \end{align} 400 | This regularization technique is almost always used, but not on its own. A typical value of $\lambda_{{\rm L2}}$ is in the range $10^{-4}-10^{-2}$. Interestingly, this L2 regularization technique has a Bayesian interpretation: it is Bayesian inference with a Gaussian prior on the weights. Indeed, for a given $\nu$, the weight averaging procedure can be considered as 401 | \begin{align} 402 | a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}+\epsilon\;, 403 | \end{align} 404 | where $\epsilon$ is a noise term of mean $0$ and variance $\sigma^2$. Hence the following Gaussian likelihood for all values of $t$ and $f$: 405 | \begin{align} 406 | \mathcal{N}\left(a_{f}^{(t)(i)}\middle|\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'},\sigma^2\right)\;. 407 | \end{align} 408 | Assuming all the weights to have a Gaussian prior of the form $\mathcal{N}\left(\Theta^{(\nu)f}_{f'}\middle|\lambda_{{\rm L2}}^{-1}\right)$ with the same parameter $\lambda_{{\rm L2}}$, we get the following expression 409 | \begin{align} 410 | \mathcal{P}&= 411 | % 412 | \prod_{t=0}^{T_{{\rm mb}}-1}\prod_{f=0}^{F_{\nu+1}-1}\left[\mathcal{N}\left(a_{f}^{(t)(\nu)}\middle| 413 | % 414 | \sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'},\sigma^2\right) 415 | % 416 | \prod_{f'=0}^{F_{\nu}-1}\mathcal{N}\left(\Theta^{(\nu)f}_{f'} 417 | % 418 | \middle|\lambda_{{\rm L2}}^{-1}\right)\right]\notag\\ 419 | % 420 | &=\prod_{t=0}^{T_{{\rm mb}}-1}\prod_{f=0}^{F_{\nu+1}-1}\left[\frac{1}{\sqrt{2\pi \sigma^2}} 421 | % 422 | e^{-\frac{\left(a_{f}^{(t)(\nu)}-\sum^{F_i-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\right)^2}{2\sigma^2}} 423 | % 424 | \prod_{f'=0}^{F_{\nu}-1}\sqrt{\frac{\lambda_{{\rm L2}}}{2\pi}}e^{-\frac{\left(\Theta^{(\nu)f}_{f'}\right)^2\lambda_{{\rm L2}}}{2}}\right] \;. 425 | \end{align} 426 | Taking the log of it and forgetting most of the constant terms leads to 427 | \begin{align} 428 | \mathcal{L}&\propto\frac{1}{T_{{\rm mb}}\sigma^2} 429 | % 430 | \sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_{\nu+1}-1} 431 | % 432 | \left(a_{f}^{(t)(\nu)}-\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\right)^2 433 | % 434 | +\lambda_{{\rm L2}}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_{\nu}-1}\left(\Theta^{(\nu)f}_{f'}\right)^2 \;, 435 | \end{align} 436 | and the last term is exactly the L2 regulator for a given $nu$ value (see formula (\ref{eq:l2reg})). 437 | 438 | \subsection{L1 regularization} 439 | 440 | L1 regularization amounts to replace the L2 norm by the L1 one in the L2 regularization technique 441 | \begin{align} 442 | J_{{\rm L1}}(\Theta)&=\lambda_{{\rm L1}} \sum_{\nu=0}^{N-1}\left\|\Theta^{(\nu)}\right\|_{{\rm L1}} 443 | % 444 | =\lambda_{{\rm L1}}\sum_{\nu=0}^{N-1}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_\nu-1} 445 | % 446 | \left|\Theta^{(\nu)f}_{f'}\right|\;. 447 | \end{align} 448 | It can be used in conjunction with L2 regularization, but again these techniques are not sufficient on their own. A typical value of $\lambda_{{\rm L1}}$ is in the range $10^{-4}-10^{-2}$. 
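To fix ideas, here is a minimal Python/numpy sketch of how the L2 and L1 penalty terms defined above, and their contributions to the weight gradients, could be added to the loss. It is only an illustration of the two formulas, not the implementation used in this note; the names \texttt{Theta}, \texttt{lambda\_l2} and \texttt{lambda\_l1} are hypothetical.
\begin{verbatim}
import numpy as np

def l2_l1_penalty(Theta, lambda_l2=1e-4, lambda_l1=1e-4):
    # Theta: list of weight matrices Theta^(nu); returns J_L2 + J_L1
    j_l2 = lambda_l2 * sum((W ** 2).sum() for W in Theta)
    j_l1 = lambda_l1 * sum(np.abs(W).sum() for W in Theta)
    return j_l2 + j_l1

def l2_l1_gradient(W, lambda_l2=1e-4, lambda_l1=1e-4):
    # contribution of both penalties to the gradient of one weight matrix W
    return 2.0 * lambda_l2 * W + lambda_l1 * np.sign(W)
\end{verbatim}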
Following the same line as in the previous section, one can show that L1 regularization is equivalent to Bayesian inference with a Laplacian prior on the weights 449 | \begin{align} 450 | \mathcal{F}\left(\Theta^{(\nu)f}_{f'}\middle| 0,\lambda_{{\rm L1}}^{-1}\right)&= 451 | % 452 | \frac{\lambda_{{\rm L1}}}{2}e^{-\lambda_{{\rm L1}}\left|\Theta^{(\nu)f}_{f'}\right|}\;. 453 | \end{align} 454 | 455 | \subsection{Clipping} 456 | 457 | Clipping forbids the L2 norm of the weights to go beyond a pre-determined threshold $C$. Namely after having computed the update rules for the weights, if their L2 norm goes above $C$, it is pushed back to $C$ 458 | \begin{align} 459 | {\rm if}\;\left\|\Theta^{(\nu)}\right\|_{{\rm L2}}>C \longrightarrow \Theta^{(\nu)f}_{f'}&= 460 | % 461 | \Theta^{(\nu)f}_{f'} \times \frac{C}{\left\|\Theta^{(\nu)}\right\|_{{\rm L2}}}\;. 462 | \end{align} 463 | 464 | This regularization technique avoids the so-called exploding gradient problem, and is mainly used in RNN-LSTM networks. A typical value of $C$ is in the range $10^{0}-10^{1}$. Let us now turn to the most efficient regularization techniques for a FNN: dropout and Batch-normalization. 465 | 466 | 467 | \subsection{Dropout} 468 | 469 | A simple procedure allows for better backpropagation performance for classification tasks: it amounts to stochastically drop some of the hidden units (and in some instances even some of the input variables) for each training example. 470 | 471 | \begin{figure}[H] 472 | \begin{center} 473 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 474 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 475 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 476 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 477 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 478 | \tikzstyle{dropout neuron}=[neuron, fill=black]; 479 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 480 | \tikzstyle{annot} = [text width=4em, text centered] 481 | 482 | % Draw the input layer nodes 483 | \foreach \name / \y in {1} 484 | \pgfmathtruncatemacro{\m}{int(\y-1)} 485 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 486 | \node[input neuron, pin=left:Bias] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 487 | 488 | 489 | \foreach \name / \y in {2,3,4,6} 490 | \pgfmathtruncatemacro{\m}{int(\y-1)} 491 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 492 | \node[input neuron, pin=left:Input \#\y] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 493 | 494 | \foreach \name / \y in {5} 495 | \pgfmathtruncatemacro{\m}{int(\y-1)} 496 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 497 | \node[dropout neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 498 | 499 | % Draw the hidden layer 1 nodes 500 | \foreach \name / \y in {1,2,3,5} 501 | \pgfmathtruncatemacro{\m}{int(\y-1)} 502 | \path[yshift=0.5cm] 503 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 504 | 505 | % Draw the hidden layer 1 nodes 506 | \foreach \name / \y in {4,6,7} 507 | \pgfmathtruncatemacro{\m}{int(\y-1)} 508 | \path[yshift=0.5cm] 509 | node[dropout neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 510 | 511 | % Draw the hidden layer 1 node 512 | \foreach \name / \y in {1,3,5} 513 | \pgfmathtruncatemacro{\m}{int(\y-1)} 514 | \path[yshift=0.0cm] 515 | node[hidden neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 516 | 517 | % Draw the hidden layer 1 node 518 | \foreach \name / \y in {2,4,6} 519 
| \pgfmathtruncatemacro{\m}{int(\y-1)} 520 | \path[yshift=0.0cm] 521 | node[dropout neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 522 | 523 | % Draw the output layer node 524 | \foreach \name / \y in {1,...,5} 525 | \path[yshift=-0.5cm] 526 | node[output neuron,pin={[pin edge={->}]right:Output \#\y}] (O-\name) at (3*\layersep,-\y cm) {$h_{\y}^{(N)}$}; 527 | 528 | % Connect every node in the input layer with every node in the 529 | % hidden layer. 530 | \foreach \source in {1,2,3,4,6} 531 | \foreach \dest in {2,3,5} 532 | \path (I-\source) edge (H1-\dest); 533 | 534 | \foreach \source in {1,2,3,5} 535 | \foreach \dest in {3,5} 536 | \path (H1-\source) edge (H2-\dest); 537 | 538 | % Connect every node in the hidden layer with the output layer 539 | \foreach \source in {1,3,5} 540 | \foreach \dest in {1,...,5} 541 | \path (H2-\source) edge (O-\dest); 542 | 543 | % Annotate the layers 544 | \node[annot,above of=H1-1, node distance=1cm] (hl) {Hidden layer 1}; 545 | \node[annot,left of=hl] {Input layer}; 546 | \node[annot,right of=hl] (hm) {Hidden layer $\nu$}; 547 | \node[annot,right of=hm] {Output layer}; 548 | 549 | \node at ((1.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 550 | \node at ((2.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 551 | \end{tikzpicture} 552 | \caption{\label{fig:2}The neural network of figure \ref{fig:1} with dropout taken into account for both the hidden layers and the input. Usually, a different (lower) probability for turning off a neuron is adopted for the input than the one adopted for the hidden layers.} 553 | \end{center} 554 | \end{figure} 555 | 556 | 557 | This amounts to do the following change: for $\nu\in \llbracket 1,N-1\rrbracket$ 558 | \begin{align} 559 | h^{(\nu)}_{f}=\null&m_f^{(\nu)} g\left(a_f^{(\nu)}\right) 560 | \end{align} 561 | with $m_f^{(i)}$ following a $p$ Bernoulli distribution with usually $p=\frac15$ for the mask of the input layer and $p=\frac12$ otherwise. Dropout\cite{Srivastava:2014:DSW:2627435.2670313} has been the most successful regularization technique until the appearance of Batch Normalization. 562 | 563 | \subsection{Batch Normalization} 564 | 565 | Batch normalization\cite{Ioffe2015} amounts to jointly normalize the mini-batch set per data types, and does so at each input of a FNN layer. In the original paper, the authors argued that this step should be done after the convolutional layers, but in practice it has been shown to be more efficient after the non-linear step. In our case, we will thus consider $\forall \nu \in \llbracket 0,N-2\rrbracket$ 566 | \begin{align} 567 | \tilde{h}_{f}^{(t)(\nu)}&=\frac{h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}} 568 | % 569 | {\sqrt{\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon}}\;, 570 | \end{align} 571 | with 572 | \begin{align} 573 | \hat{h}_{f}^{(\nu)}&= 574 | % 575 | \frac{1}{T_{{\rm mb}}}\sum^{T_{{\rm mb}}-1}_{t=0}h_{f}^{(t)(\nu+1)}\\ 576 | % 577 | \left(\hat{\sigma}_{f}^{(\nu)}\right)^2&=\frac{1}{T_{{\rm mb}}}\sum^{T_{{\rm mb}}-1}_{t=0} 578 | % 579 | \left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)^2\;. 580 | \end{align} To make sure that this transformation can represent the identity transform, we add two additional parameters $(\gamma_f,\beta_f)$ to the model 581 | \begin{align} 582 | y^{(t)(\nu)}_{f}&=\gamma^{(\nu)}_f\,\tilde{h}_{f}^{(t)(\nu)}+\beta^{(\nu)}_f 583 | % 584 | =\tilde{\gamma}^{(\nu)}_f\,h_{f}^{(t)(\nu)}+\tilde{\beta}^{(\nu)}_f\;. 
585 | \end{align} 586 | The presence of the $\beta^{(\nu)}_f$ coefficient is what pushed us to get rid of the bias term, as it is naturally included in batchnorm. During training, one must compute a running sum for the mean and the variance, that will serve for the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs) 587 | \begin{align} 588 | \mathbb{E}\left[h_{f}^{(t)(\nu+1)}\right]_{e+1} &= 589 | % 590 | \frac{e\mathbb{E}\left[h_{f}^{(t)(\nu)}\right]_{e}+\hat{h}_{f}^{(\nu)}}{e+1}\;,\\ 591 | % 592 | \mathbb{V}ar\left[h_{f}^{(t)(\nu+1)}\right]_{e+1} &= 593 | % 594 | \frac{e\mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]_{e}+\left(\hat{\sigma}_{f}^{(\nu)}\right)^2}{e+1} 595 | \end{align} 596 | and what will be used at test time is 597 | \begin{align} 598 | \mathbb{E}\left[h_{f}^{(t)(\nu)}\right]&=\mathbb{E}\left[h_{f}^{(t)(\nu)}\right]\;,& 599 | % 600 | \mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]&= 601 | % 602 | \frac{T_{{\rm mb}}}{T_{{\rm mb}}-1}\mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]\;. 603 | \end{align} 604 | so that at test time 605 | \begin{align} 606 | y^{(t)(\nu)}_{f}&=\gamma^{(\nu)}_f\,\frac{h_{f}^{(t)(\nu)}-E[h_{f}^{(t)(\nu)}]}{\sqrt{Var\left[h_{f}^{(t)(\nu)}\right]+\epsilon}}+\beta^{(\nu)}_f\;. 607 | \end{align} 608 | 609 | In practice, and as advocated in the original paper, on can get rid of dropout without loss of precision when using batch normalization. We will adopt this convention in the following. 610 | 611 | 612 | \section{Backpropagation} 613 | 614 | Backpropagation\cite{LeCun:1998:EB:645754.668382} is the standard technique to decrease the loss function error so as to correctly predict what one needs. As it name suggests, it amounts to backpropagate through the FNN the error performed at the output layer, so as to update the weights. In practice, on has to compute a bunch of gradient terms, and this can be a tedious computational task. Nevertheless, if performed correctly, this is the most useful and important task that one can do in a FNN. We will therefore detail how to compute each weight (and Batchnorm coefficients) gradients in the following. 615 | 616 | \subsection{Backpropagate through Batch Normalization} \label{sec:Backpropbatchnorm} 617 | 618 | Backpropagation introduces a new gradient 619 | \begin{align} 620 | \delta^f_{f'}J^{(tt')(\nu)}_{f}&=\frac{\partial y^{(t')(\nu)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}\;. 621 | \end{align} 622 | we show in appendix \ref{sec:appenbatchnorm} that 623 | \begin{align} 624 | J^{(tt')(\nu)}_{f}&=\tilde{\gamma}^{(\nu)}_f\ \left[\delta^{t'}_t- 625 | % 626 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 627 | \end{align} 628 | 629 | 630 | \subsection{error updates} 631 | 632 | 633 | To backpropagate the loss error through the FNN, it is very useful to compute a so-called error rate 634 | \begin{align} 635 | \delta^{(t)(\nu)}_f&= \frac{\partial }{\partial a_{f}^{(t)(\nu)}}J(\Theta)\;, 636 | \end{align} 637 | We show in Appendix \ref{sec:appenbplayers} that $\forall \nu \in \llbracket 0,N-2\rrbracket$ 638 | \begin{align} 639 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 640 | % 641 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}J^{(tt')(\nu)}_{f} \delta^{(t')(\nu+1)}_{f'}\;, 642 | \end{align} 643 | the value of $\delta^{(t)(N-1)}_f$ depends on the loss used. 
We also show in appendix \ref{sec:appenbpoutput} that for the MSE loss function 644 | \begin{align} 645 | \delta^{(t)(N-1)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-y_f^{(t)}\right)\;, 646 | \end{align} 647 | and for the cross entropy loss function 648 | \begin{align} 649 | \delta^{(t)(N-1)}_{f}&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-\delta^f_{y^{(t)}}\right)\;. 650 | \end{align} 651 | To unite the notation of chapters \ref{sec:chapterFNN}, \ref{sec:chapterCNN} and \ref{sec:chapterRNN}, we will call 652 | \begin{align} 653 | \mathcal{H}^{(t)(\nu+1)}_{ff'}&=g'\left(a_{f}^{(t)(\nu)}\right)\Theta^{(\nu+1)f'}_{f}\;, 654 | \end{align} 655 | so that the update rule for the error rate reads 656 | \begin{align} 657 | \delta^{(t)(\nu)}_f&= 658 | % 659 | \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu)}_{f}\sum_{f'=0}^{F_{\nu+1}-1}\mathcal{H}^{(t)(\nu+1)}_{ff'} \delta^{(t)(\nu+1)}_{f'}\;. 660 | \end{align} 661 | 662 | \subsection{Weight update} 663 | 664 | Thanks to the computation of the error rates, the derivation of the weight gradients is straightforward. We indeed get $\forall \nu \in \llbracket 1,N-1\rrbracket$ 665 | \begin{align} 666 | \Delta^{\Theta(\nu)f}_{f'}&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1} 667 | % 668 | \sum^{F_{\nu+1}-1}_{f^{''}=0}\sum^{F_\nu}_{f^{'''}=0}\frac{\partial\Theta^{(\nu)f^{''}}_{f^{'''}} 669 | % 670 | }{\partial \Theta^{(\nu)f}_{f'}}y^{(t)(\nu-1)}_{f^{'''}}\delta^{(t)(\nu)}_{f^{''}} 671 | % 672 | =\sum_{t=0}^{T_{{\rm mb}}-1}\delta^{(t)(\nu)}_f y^{(t)(\nu-1)}_{f'}\;. 673 | \end{align} 674 | and 675 | \begin{align} 676 | \Delta^{\Theta(0)f}_{f'}&=\sum_{t=0}^{T_{{\rm mb}}-1}\delta^{(t)(0)}_f h^{(t)(0)}_{f'}\;. 677 | \end{align} 678 | 679 | \subsection{Coefficient update} 680 | 681 | The update rules for the Batchnorm coefficients can easily be computed thanks to the error rate. They read 682 | \begin{align} 683 | \Delta_f^{\gamma(\nu)}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 684 | % 685 | \frac{\partial a^{(t)(\nu+1)}_{f'}}{\partial\gamma_f^{(\nu)}}\delta^{(t)(\nu+1)}_{f'} 686 | % 687 | =\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 688 | % 689 | \Theta^{(\nu+1)f'}_{f}\tilde{h}^{(t)(\nu)}_{f}\delta^{(t)(\nu+1)}_{f'}\;,\\ 690 | % 691 | \Delta_f^{\beta(\nu)}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 692 | % 693 | \frac{\partial a^{(t)(\nu+1)}_{f'}}{\partial\beta_f^{(\nu)}}\delta^{(t)(\nu+1)}_{f'} 694 | % 695 | =\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}\delta^{(t)(\nu+1)}_{f'}\;, 696 | \end{align} 697 | 698 | 699 | \section{Which data sample to use for gradient descent?} 700 | 701 | From the beginning we have denoted $T_{{\rm mb}}$ the size of the data sample with which we train our model. This procedure is repeated a large number of times (each repetition is called an epoch). But in the literature there exist three ways to sample from the data: Full-batch, Stochastic and Mini-batch gradient descent. We make these terms explicit in the following sections. 702 | 703 | \subsection{Full-batch} 704 | 705 | Full-batch takes the whole training set at each epoch, such that the loss function reads 706 | \begin{align} 707 | J(\Theta)&=\sum_{t=0}^{T_{{\rm train}}-1}J_{{\rm train}}(\Theta)\;. 708 | \end{align} 709 | This choice has the advantage of being numerically stable, but it is so costly in computation time that it is rarely if ever used.
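As an illustration of the Full-batch choice, one epoch could be sketched as follows in Python/numpy; the helper \texttt{grad\_J}, which would return the weight gradients $\Delta^{\Theta}$ of the previous section for a given set of training examples, is hypothetical and only assumed here for the sake of the example.
\begin{verbatim}
import numpy as np

def full_batch_epoch(Theta, X_train, y_train, grad_J, eta=1e-2):
    # one epoch = a single update computed on the whole training set
    grads = grad_J(Theta, X_train, y_train)  # gradients over all T_train examples
    return [W - eta * dW for W, dW in zip(Theta, grads)]
\end{verbatim}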
710 | 711 | \subsection{Stochastic Gradient Descent (SGD)} 712 | 713 | SGD amounts to taking only one example of the training set at each epoch 714 | \begin{align} 715 | J(\Theta)&=J_{{\rm SGD}}(\Theta)\;. 716 | \end{align} 717 | This choice leads to faster computations, but is so numerically unstable that the most standard choice by far is Mini-batch gradient descent. 718 | 719 | \subsection{Mini-batch} 720 | 721 | Mini-batch gradient descent is a compromise between stability and time efficiency, and is the middle-ground between Full-batch and Stochastic gradient descent: $1\ll T_{{\rm mb}}\ll T_{{\rm train}}$. Hence 722 | \begin{align} 723 | J(\Theta)&=\sum_{t=0}^{T_{{\rm mb}}-1}J_{{\rm mb}}(\Theta)\;. 724 | \end{align} 725 | All the calculations in this note have been performed using this gradient descent technique. 726 | 727 | \section{Gradient optimization techniques} 728 | 729 | Once the gradients for backpropagation have been computed, the question of how to add them to the existing weights arises. The most natural choice would be to take 730 | \begin{align} 731 | \Theta^{(\nu)f}_{f'}&=\Theta^{(\nu)f}_{f'}-\eta\Delta^{\Theta(\nu)f}_{f'}\;, 732 | \end{align} 733 | where $\eta$ is a free parameter that is generally initialized thanks to cross-validation. It can also be made epoch dependent (usually with a slow exponential decay). When using Mini-batch gradient descent, this update choice for the weights presents the risk of the loss function getting stuck in a local minimum. Several methods have been invented to prevent this risk. We are going to review them in the next sections. 734 | 735 | 736 | \subsection{Momentum} 737 | 738 | Momentum\cite{QIAN1999145} introduces a new vector $v_{{\rm e}}$ and can be seen as keeping a memory of what the previous updates were at prior epochs. Calling $e$ the number of epochs and forgetting the $f,f',\nu$ indices of the gradients to ease the notation, we have 739 | \begin{align} 740 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+\eta \Delta^{\Theta}\;, 741 | \end{align} 742 | and the weights at epoch $e$ are then updated as 743 | \begin{align} 744 | \Theta_e&=\Theta_{e-1}-v_{{\rm e}}\;. 745 | \end{align} 746 | $\gamma$ is a new parameter of the model, usually set to $0.9$, but it could also be fixed thanks to cross-validation. 747 | 748 | \subsection{Nesterov accelerated gradient} 749 | 750 | Nesterov accelerated gradient\cite{nesterov1983method} is a slight modification of the momentum technique that allows the gradients to escape from local minima. It amounts to taking 751 | \begin{align} 752 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+\eta \Delta^{\Theta-\gamma v_{{\rm e-1}}}\;, 753 | \end{align} 754 | and then again 755 | \begin{align} 756 | \Theta_e&=\Theta_{e-1}-v_{{\rm e}}\;. 757 | \end{align} 758 | Until now, the parameter $\eta$ that controls the magnitude of the update has been set globally. It would be nice to have a finer control over it, so that different weights can be updated with different magnitudes. 759 | 760 | \subsection{Adagrad} 761 | 762 | Adagrad\cite{Duchi:2011:ASM:1953048.2021068} allows one to fine-tune the updates by having an individual effective learning rate for each weight. Calling for each value of $f,f',\nu$ 763 | \begin{align} 764 | v_{{\rm e}}&=\sum_{e'=0}^{e-1} \left(\Delta^{\Theta}_{e'}\right)^2\;, 765 | \end{align} 766 | the update rule then reads 767 | \begin{align} 768 | \Theta_e&=\Theta_{e-1}-\frac{\eta}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 769 | \end{align} 770 | One advantage of Adagrad is that the learning rate $\eta$ can be set once and for all (usually to $10^{-2}$) and does not need to be fine-tuned via cross-validation anymore, as it is individually adapted to each weight via the $v_{{\rm e}}$ term. $\epsilon$ is here to avoid division by 0 issues, and is usually set to $10^{-8}$.
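To make the preceding update rules concrete, here is a minimal Python/numpy sketch of the momentum and Adagrad updates, with all indices dropped as above. The function names are hypothetical and the gradient $\Delta^{\Theta}$ is assumed to be given.
\begin{verbatim}
import numpy as np

def momentum_update(Theta, grad, v, eta=1e-2, gamma=0.9):
    # v_e = gamma * v_{e-1} + eta * Delta ; Theta_e = Theta_{e-1} - v_e
    v = gamma * v + eta * grad
    return Theta - v, v

def adagrad_update(Theta, grad, v_sum, eta=1e-2, eps=1e-8):
    # v_e accumulates the squared gradients since the first epoch
    v_sum = v_sum + grad ** 2
    return Theta - eta * grad / np.sqrt(v_sum + eps), v_sum
\end{verbatim}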
771 | 772 | \subsection{RMSprop} 773 | 774 | Since Adagrad accumulates the squared gradients from the very first epoch, the effective learning rate of each weight is forced to monotonically decrease. This behaviour can be smoothed via the RMSprop technique, which takes 775 | \begin{align} 776 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+(1-\gamma )\left(\Delta^{\Theta}_{e}\right)^2\;, 777 | \end{align} 778 | with $\gamma$ a new parameter of the model, usually set to $0.9$. The RMSprop update rule then reads like the Adagrad one 779 | \begin{align} 780 | \Theta_e&=\Theta_{e-1}-\frac{\eta}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 781 | \end{align} 782 | $\eta$ can be set once and for all (usually to $10^{-3}$). 783 | 784 | \subsection{Adadelta} 785 | 786 | Adadelta\cite{journals/corr/abs-1212-5701} is an extension of RMSprop that aims at getting rid of the $\eta$ parameter. To do so, a new vector update is introduced 787 | \begin{align} 788 | m_{{\rm e}}&=\gamma m_{{\rm e-1}}+(1-\gamma ) 789 | % 790 | \left(\frac{\sqrt{m_{{\rm e-1}}+\epsilon}}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\right)^2\;, 791 | \end{align} 792 | and the new update rule for the weights reads 793 | \begin{align} 794 | \Theta_e&=\Theta_{e-1}-\frac{\sqrt{m_{{\rm e-1}}+\epsilon}}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 795 | \end{align} 796 | The learning rate has been completely eliminated from the update rule, but the procedure for doing so is ad hoc. The next and last optimization technique presented seems more natural and is the default choice in a number of deep learning frameworks. 797 | \subsection{Adam} 798 | 799 | Adam\cite{Kingma2014} keeps track of both the gradient and its square via two epoch dependent vectors 800 | \begin{align} 801 | m_{{\rm e}}&= \beta_1 m_{{\rm e-1}}+ (1-\beta_1)\Delta^{\Theta}_{e}\;,& 802 | % 803 | v_{{\rm e}}&= \beta_2 v_{{\rm e-1}}+ (1-\beta_2)\left(\Delta^{\Theta}_{e}\right)^2\;, 804 | \end{align} 805 | with $\beta_1$ and $\beta_2$ parameters usually set to $0.9$ and $0.999$ respectively. The robustness and great strength of Adam is that it makes the whole learning process only weakly dependent on their precise values. To avoid numerical problems during the first steps, these vectors are rescaled 806 | \begin{align} 807 | \hat{m}_{{\rm e}}&= \frac{m_{{\rm e}}}{1-\beta_1^{e}}\;,& 808 | % 809 | \hat{v}_{{\rm e}}&= \frac{v_{{\rm e}}}{1-\beta_2^{e}}\;. 810 | \end{align} 811 | before entering the update rule 812 | \begin{align} 813 | \Theta_e&=\Theta_{e-1}-\frac{\eta }{\sqrt{\hat{v}_{{\rm e}}+\epsilon}}\hat{m}_{{\rm e}}\;. 814 | \end{align} 815 | This is the optimization technique implicitly used throughout this note, alongside a learning rate decay 816 | \begin{align} 817 | \eta_e&=e^{-\alpha_0}\eta_{e-1}\;, 818 | \end{align} 819 | with $\alpha_0$ determined by cross-validation, and $\eta_0$ usually initialized in the range $10^{-3}-10^{-2}$. 820 | 821 | \section{Weight initialization} 822 | 823 | Without any regularization, training a neural network can be a daunting task because of the fine-tuning of the weights' initial conditions. This is one of the reasons why neural networks went through periods of being out of fashion.
Since dropout and Batch normalization, this issue is less pronounced, but one should not initialize the weight in a symmetric fashion (all zero for instance), nor should one initialize them too large. A good heuristic is 824 | \begin{align} 825 | \left[\Theta^{(\nu)f'}_f\right]_{{\rm init}}&=\sqrt{\frac{6}{F_\nu+F_{\nu+1}}}\times\mathcal{N}(0,1)\;. 826 | \end{align} 827 | 828 | \begin{subappendices} 829 | \section{Backprop through the output layer} \label{sec:appenbpoutput} 830 | 831 | Recalling the MSE loss function 832 | \begin{align} 833 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 834 | % 835 | \left(y_f^{(t)}-h_{f}^{(t)(N)}\right)^2\;, 836 | \end{align} 837 | we instantaneously get 838 | \begin{align} 839 | \delta^{(t)(N-1)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-y_f^{(t)}\right)\;. 840 | \end{align} 841 | Things are more complicated for the cross-entropy loss function of a regression problem transformed into a multi-classification task. 842 | Assuming that we have $C$ classes for all the values that we are trying to predict, we get 843 | \begin{align} 844 | \delta^{(t)(N-1)}_{fc}&= \frac{\partial }{\partial a_{fc}^{(t)(N-1)}}J(\Theta) 845 | % 846 | =\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_N-1}\sum_{d=0}^{C-1} 847 | % 848 | \frac{\partial h_{f'd}^{(t')(N)}}{\partial a_{fc}^{(t)(N-1)}} 849 | % 850 | \frac{\partial }{\partial h_{f'd}^{(t')(N)}}J(\Theta)\;. 851 | \end{align} 852 | Now 853 | \begin{align} 854 | \frac{\partial }{\partial h_{f'd}^{(t')(N)}}J(\Theta)&=-\frac{\delta^{d}_{ y_{f'}^{(t')}}}{T_{{\rm mb}} h_{f'd}^{(t')(N)}}\;, 855 | \end{align} 856 | and 857 | \begin{align} 858 | \frac{\partial h_{f'd}^{(t')(N)}}{\partial a_{fc}^{(t)(N-1)}}&= 859 | % 860 | \delta^f_{f'}\delta^{t}_{t'} \left(\delta^c_d h_{fc}^{(t)(N)}- h_{fc}^{(t)(N)} h_{fd}^{(t)(N)}\right)\;, 861 | \end{align} 862 | so that 863 | \begin{align} 864 | \delta^{(t)(N-1)}_{fc}&=-\frac{1}{T_{{\rm mb}}} \sum_{d=0}^{C-1}\frac{\delta^{d}_{ y_f^{(t)}}}{h_{fd}^{(t)(N)}} 865 | % 866 | \left(\delta^c_d h_{fc}^{(t)(N)}- h_{fc}^{(t)(N)} h_{fd}^{(t)(N)}\right)\notag\\ 867 | % 868 | &=\frac{1}{T_{{\rm mb}}}\left( h_{fc}^{(t)(N)}-\delta^{c}_{ y_f^{(t)}}\right)\;. 869 | \end{align} 870 | For a true classification problem, we easily deduce 871 | \begin{align} 872 | \delta^{(t)(N-1)}_{fc}&=\frac{1}{T_{{\rm mb}}}\left( h_{f}^{(t)(N)}-\delta^{f}_{ y^{(t)}}\right)\;. 
873 | \end{align} 874 | 875 | \section{Backprop through hidden layers} \label{sec:appenbplayers} 876 | 877 | To go further we need 878 | \begin{align} 879 | \delta^{(t)(\nu)}_f&= \frac{\partial }{\partial a_{f}^{(t)(\nu)}}J^{(t)}(\Theta)= 880 | % 881 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 882 | % 883 | \frac{\partial a_{f'}^{(t')(\nu+1)}}{\partial a_{f}^{(t)(\nu)}} \delta^{(t')(\nu+1)}_{f'}\notag\\ 884 | % 885 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\sum^{F_\nu}_{f''=0}\Theta^{(\nu+1)f'}_{f''} 886 | % 887 | \frac{\partial y^{(t')(\nu)}_{f''} }{\partial a_{f}^{(t)(\nu)}} \delta^{(t')(\nu+1)}_{f'}\notag\\ 888 | % 889 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\sum^{F_\nu}_{f''=0}\Theta^{(\nu+1)f'}_{f''} 890 | % 891 | \frac{\partial y^{(t')(\nu)}_{f''} }{\partial h_{f}^{(t)(\nu+1)}} 892 | % 893 | g'\left(a_{f}^{(t)(\nu)}\right) \delta^{(t')(\nu+1)}_{f'}\;, 894 | \end{align} 895 | so that 896 | \begin{align} 897 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 898 | % 899 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}J^{(tt')(\nu)}_{f} \delta^{(t)(\nu+1)}_{f'}\;, 900 | \end{align} 901 | 902 | 903 | \section{Backprop through BatchNorm} \label{sec:appenbatchnorm} 904 | 905 | 906 | We saw in section \ref{sec:Backpropbatchnorm} that batch normalization implies among other things to compute the following gradient. 907 | \begin{align} 908 | \frac{\partial y^{(t')(\nu)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}&= 909 | % 910 | \gamma^{(\nu)}_f\frac{\partial \tilde{h}_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}\;. 911 | \end{align} 912 | We propose to do just that in this section. Firstly 913 | \begin{align} 914 | \frac{\partial h^{(t')(\nu+1)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}&=\delta^{t'}_t\delta^{f'}_f\;,& 915 | % 916 | \frac{\partial \hat{h}_{f'}^{(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&=\frac{\delta^{f'}_f}{T_{{\rm mb}}}\;. 917 | \end{align} 918 | Secondly 919 | \begin{align} 920 | \frac{\partial \left(\hat{\sigma}_{f'}^{(\nu)}\right)^2}{\partial h_{f}^{(t)(\nu+1)}}&= 921 | % 922 | \frac{2\delta^{f'}_f}{T_{{\rm mb}}}\left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)\;, 923 | \end{align} 924 | so that we get 925 | \begin{align} 926 | \frac{\partial \tilde{h}_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&= 927 | % 928 | \frac{\delta^{f'}_f}{T_{{\rm mb}}}\left[\frac{T_{{\rm mb}}\delta^{t'}_t-1} 929 | % 930 | {\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12}- 931 | % 932 | \frac{\left(h_{f}^{(t')(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)\left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)} 933 | % 934 | {\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac32}\right]\notag\\ 935 | % 936 | &=\frac{\delta^{f'}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12} 937 | % 938 | \left[\delta^{t'}_t- 939 | % 940 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 941 | \end{align} 942 | To ease the notation recall that we denoted 943 | \begin{align} 944 | \tilde{\gamma}^{(\nu)}_f&= 945 | % 946 | \frac{\gamma^{(\nu)}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12}\;. 947 | \end{align} 948 | % 949 | % 950 | so that 951 | \begin{align} 952 | \frac{\partial y_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&= 953 | % 954 | \tilde{\gamma}^{(\nu)}_f \delta^{f'}_f\left[\delta^{t'}_t- 955 | % 956 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 
957 | \end{align} 958 | 959 | 960 | 961 | \section{FNN ResNet (non standard presentation)} \label{sec:ResnetFNN} 962 | 963 | The state of the art architecture of convolutional neural networks (CNN, to be explained in chapter \ref{sec:chapterCNN}) is called ResNet\cite{He2015}. Its name comes from its philosophy: each hidden layer output $y$ of the network is a small -- hence the term residual -- modification of its input ($y=x+F(x)$), instead of a total modification ($y=H(x)$) of its input $x$. This philosophy can be imported to the FNN case. Representing the operations of weight averaging, activation function and batch normalization in the following way 964 | 965 | \begin{figure}[H] 966 | \begin{center} 967 | \begin{tikzpicture} 968 | \node at (0,0) {\includegraphics[scale=1]{fc_equiv}}; 969 | \end{tikzpicture} 970 | \end{center} 971 | \caption{\label{fig:fc_equiv} Schematic representation of one FNN fully connected layer.} 972 | \end{figure} 973 | 974 | In its non standard form presented in this section, the residual operation amounts to add a skip connection to two consecutive full layers 975 | 976 | 977 | \begin{figure}[H] 978 | \begin{center} 979 | \begin{tikzpicture} 980 | \node at (0,0) {\includegraphics[scale=1]{fc_resnet_2}}; 981 | \end{tikzpicture} 982 | \end{center} 983 | \caption{\label{fig:fc_resnet_2} Residual connection in a FNN.} 984 | \end{figure} 985 | 986 | Mathematically, we had before (calling the input $y^{(t)(\nu-1)}$) 987 | 988 | \begin{align} 989 | y_{f}^{(t)(\nu+1)}&=\gamma_f^{(\nu+1)}\tilde{h}_f^{(t)(\nu+2)}+\beta_f^{(\nu+1)}\;,& 990 | % 991 | a_{f}^{(t)(\nu+1)}&=\sum^{F_{\nu}-1}_{f'=0}\Theta^{(\nu+1)f}_{f'}y_{f}^{(t)(\nu)}\notag\\ 992 | % 993 | y_{f}^{(t)(\nu)}&=\gamma_f^{(\nu)}\tilde{h}_f^{(t)(\nu+1)}+\beta_f^{(\nu)}\;,& 994 | % 995 | a_{f}^{(t)(\nu)}&=\sum^{F_{\nu-1}-1}_{f'=0}\Theta^{(\nu)f}_{f'}y_{f}^{(t)(\nu-1)}\;, 996 | \end{align} 997 | as well as $h^{(t)(\nu+2)}_f=g\left(a_{f}^{(t)(\nu+1)}\right)$ and $h^{(t)(\nu+1)}_f=g\left(a_{f}^{(t)(\nu)}\right)$. In ResNet, we now have the slight modification 998 | \begin{align} 999 | y_{f}^{(t)(\nu+1)}&=\gamma_f^{(\nu+1)}\tilde{h}_f^{\nu+2}+\beta_f^{(\nu+1)}+y^{(t)(\nu-1)}_{f}\;. 1000 | \end{align} 1001 | The choice of skipping two and not just one layer has become a standard for empirical reasons, so as the decision not to weight the two paths (the trivial skip one and the two FNN layer one) by a parameter to be learned by backpropagation 1002 | \begin{align} 1003 | y_{f}^{(t)(\nu+1)}&=\alpha\left(\gamma_f^{(\nu+1)}\tilde{h}_f^{(t)(\nu+2)}+\beta_f^{(\nu+1)}\right) 1004 | % 1005 | +\left( 1-\alpha\right)y^{(t)(\nu-1)}_{f'}\;. 1006 | \end{align} 1007 | This choice is called highway nets\cite{citeulike:14070430}, and it remains to be theoretically understood why it leads to worse performance than ResNet, as the latter is a particular instance of the former. 
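To make the residual block of figure \ref{fig:fc_resnet_2} concrete, here is a minimal NumPy sketch of its forward pass, with batch normalization reduced to the per-feature standardization and $\gamma,\beta$ rescaling used in this chapter (all names are illustrative):

\begin{verbatim}
import numpy as np

def bn(a, gamma, beta, eps=1e-8):
    """Per-feature batch normalization over the mini-batch axis (axis=1)."""
    mean = a.mean(axis=1, keepdims=True)
    var = a.var(axis=1, keepdims=True)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def residual_block(y_in, theta1, theta2, g1, b1, g2, b2):
    """y_in has shape (F, T_mb); two WA-AF-BN blocks plus the skip connection."""
    y1 = bn(np.tanh(theta1 @ y_in), g1, b1)   # first fully connected layer
    y2 = bn(np.tanh(theta2 @ y1), g2, b2)     # second fully connected layer
    return y2 + y_in                          # residual: output = F(input) + input

F, T_mb = 4, 8
rng = np.random.default_rng(0)
y = rng.normal(size=(F, T_mb))
theta1 = rng.normal(size=(F, F)) / np.sqrt(F)
theta2 = rng.normal(size=(F, F)) / np.sqrt(F)
ones, zeros = np.ones((F, 1)), np.zeros((F, 1))
out = residual_block(y, theta1, theta2, ones, zeros, ones, zeros)   # shape (4, 8)
\end{verbatim}

The two paths are simply summed with no learned weighting, in contrast with the highway net variant just mentioned.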
Going back to the ResNet backpropagation algorithm, this changes the gradient through the skip connection in the following way 1008 | \begin{align} 1009 | \delta^{(t)(\nu-1)}_f&= 1010 | % 1011 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1} 1012 | % 1013 | \frac{\partial a_{f'}^{(t')(\nu)}}{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu)}_{f'} 1014 | % 1015 | +\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1} 1016 | % 1017 | \frac{\partial a_{f'}^{(t')(\nu+2)}}{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu+2)}_{f'}\notag\\ 1018 | % 1019 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu)f'}_{f''} 1020 | % 1021 | \frac{\partial y^{(t')(\nu-1)}_{f''} }{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu)}_{f'}\notag\\ 1022 | % 1023 | &+\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu+2)f'}_{f''} 1024 | % 1025 | \frac{\partial y^{(t')(\nu+1)}_{f''} }{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu+2)}_{f'}\notag\\ 1026 | % 1027 | &=g'\left(a_{f}^{(t)(\nu-1)}\right) 1028 | % 1029 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu)f'}_{f''} 1030 | % 1031 | J^{(tt')(\nu)}_{f} \delta^{(t')(\nu)}_{f'}\notag\\ 1032 | % 1033 | &+g'\left(a_{f}^{(t)(\nu-1)}\right) 1034 | % 1035 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu+2)f'}_{f''} 1036 | % 1037 | J^{(tt')(\nu)}_{f} \delta^{(t')(\nu+2)}_{f'}\;, 1038 | \end{align} 1039 | so that 1040 | \begin{align} 1041 | \delta^{(t)(\nu-1)}_f&=g'\left(a_{f}^{(t)(\nu-1)}\right) 1042 | % 1043 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum^{F_{\nu-1}-1}_{f''=0}J^{(tt')(\nu)}_{f}\notag\\ 1044 | % 1045 | &\times\left[\sum_{f'=0}^{F_{\nu}-1}\Theta^{(\nu)f'}_{f''}\delta^{(t')(\nu)}_{f'}+ 1046 | % 1047 | \sum_{f'=0}^{F_{\nu+2}-1}\Theta^{(\nu+2)f'}_{f''}\delta^{(t')(\nu+2)}_{f'}\right]\;. 1048 | \end{align} 1049 | 1050 | This formulation has one advantage: it totally preserves the usual FNN layer structure of a weight averaging (WA) followed by an activation function (AF) and then a batch normalization operation (BN). It nevertheless has one disadvantage: the backpropagation gradient does not really flow smoothly from one error rate to the other. In the following section we will present the standard ResNet formulation of that takes the problem the other way around : it allows the gradient to flow smoothly at the cost of "breaking" the natural FNN building block. 1051 | 1052 | \section{FNN ResNet (more standard presentation)} \label{sec:ResnetFNN2} 1053 | 1054 | \begin{figure}[H] 1055 | \begin{center} 1056 | \begin{tikzpicture} 1057 | \node at (0,0) {\includegraphics[scale=1]{fc_resnet_3}}; 1058 | \end{tikzpicture} 1059 | \end{center} 1060 | \caption{\label{fig:fc_resnet_3} Residual connection in a FNN, trivial gradient flow through error rates.} 1061 | \end{figure} 1062 | 1063 | In the more standard form of ResNet, the skip connections reads 1064 | \begin{align} 1065 | a_{f}^{(t)(\nu+2)}&= a_{f}^{(t)(\nu+2)}+ a_{f}^{(t)(\nu)}\;, 1066 | \end{align} 1067 | and the updated error rate reads 1068 | \begin{align} 1069 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 1070 | % 1071 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum^{F_{\nu}-1}_{f''=0}J^{(tt')(\nu)}_{f} 1072 | % 1073 | \sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f''}\delta^{(t')(\nu+1)}_{f'}+\delta^{(t')(\nu+2)}_{f}\;. 
1074 | \end{align} 1075 | 1076 | 1077 | \section{Matrix formulation} 1078 | 1079 | In all this chapter, we adopted an "index" formulation of the FNN. This has upsides and downsides. On the positive side, one can take the formula as written here and go implement them. On the downside, they can be quite cumbersome to read. 1080 | 1081 | \vspace{0.2cm} 1082 | 1083 | Another FNN formulation is therefore possible: a matrix one. To do so, one has to rewrite 1084 | \begin{align} 1085 | h_f^{(t)(\nu)}\mapsto h^{(\nu)}_{ft}&\mapsto h^{(\nu)}\in \mathcal{M}(F_\nu,T_{\rm mb})\;. 1086 | \end{align} 1087 | In this case the weight averaging procedure (\ref{eq:Weightavg}) can be written as 1088 | \begin{align} 1089 | a_f^{(t)(\nu)}=\sum_{f'=0}^{F_\nu-1}\Theta^{(\nu)f}_{f'}h^{(\nu)}_{f't}&\mapsto a^{(\nu)}=\Theta^{(\nu)}h^{(\nu)}\;. 1090 | \end{align} 1091 | The upsides and downsides of this formulation are the exact opposite of the index one: what we gained in readability, we lost in terms of direct implementation in low level programming languages (C for instance). For FNN, one can use a high level programming language (like python), but this will get quite intractable when we talk about Convolutional networks. Since the whole point of the present work was to introduce the index notation, and as one can easily find numerous derivation of the backpropagation update rules in matrix form, we will stick with the index notation in all the following, and now turn our attention to convolutional networks. 1092 | \end{subappendices} 1093 | -------------------------------------------------------------------------------- /chapter3.tex: -------------------------------------------------------------------------------- 1 | \chapter{Recurrent Neural Networks} \label{sec:chapterRNN} 2 | 3 | \minitoc 4 | 5 | \section{Introduction} 6 | 7 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I}n this chapter, we review a third kind of Neural Network architecture: Recurrent Neural Networks\cite{GravesA2016}. By contrast with the CNN, this kind of network introduces a real architecture novelty : instead of forwarding only in a "spatial" direction, the data are also forwarded in a new -- time dependent -- direction. We will present the first Recurrent Neural Network (RNN) architecture, as well as the current most popular one: the Long Short Term Memory (LSTM) Neural Network. 
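Before detailing the architecture, the "spatial plus temporal" propagation can be visualized with a minimal sketch in which a hidden state indexed by a layer index and a time index is filled by two nested loops (the precise update rules are given in the following sections; the NumPy code below, with made-up names, only illustrates the ordering of the computations):

\begin{verbatim}
import numpy as np

N, T, F, T_mb = 4, 8, 5, 2          # spatial depth, time steps, features, batch size
rng = np.random.default_rng(0)
x = rng.normal(size=(T, F, T_mb))   # one input slice per time step

theta_nu = rng.normal(size=(N, F, F)) / np.sqrt(F)    # "spatial" weights
theta_tau = rng.normal(size=(N, F, F)) / np.sqrt(F)   # "temporal" weights

h = np.zeros((N + 1, T, F, T_mb))   # h[nu, tau]: hidden state of layer nu at time tau
h[0] = x                            # layer 0 holds the input
for tau in range(T):                # propagate in the temporal direction
    for nu in range(1, N + 1):      # propagate in the spatial direction
        h_past = h[nu, tau - 1] if tau > 0 else np.zeros((F, T_mb))
        h[nu, tau] = np.tanh(theta_nu[nu - 1] @ h[nu - 1, tau]
                             + theta_tau[nu - 1] @ h_past)
\end{verbatim}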
8 | 9 | \section{RNN-LSTM architecture} 10 | 11 | \subsection{Forward pass in a RNN-LSTM} 12 | 13 | In figure \ref{fig:1}, we present the RNN architecture in a schematic way 14 | 15 | \begin{figure}[H] 16 | \begin{center} 17 | \begin{tikzpicture} 18 | \node at (0,0) [rectangle,draw,fill=gray!0!white] (h00) {$h^{(00)}$}; 19 | \node at (0,1.5) [rectangle,draw,fill=gray!30!white] (h01) {$h^{(10)}$}; 20 | \node at (0,3) [rectangle,draw,fill=gray!30!white] (h02) {$h^{(20)}$}; 21 | \node at (0,4.5) [rectangle,draw,fill=gray!30!white] (h03) {$h^{(30)}$}; 22 | \node at (0,6) [rectangle,draw,fill=gray!0!white] (h04) {$h^{(40)}$}; 23 | % 24 | \node at (1.8,0) [rectangle,draw,fill=gray!0!white] (h10) {$h^{(01)}$}; 25 | \node at (1.8,1.5) [rectangle,draw,fill=gray!70!white] (h11) {$h^{(11)}$}; 26 | \node at (1.8,3) [rectangle,draw,fill=gray!70!white] (h12) {$h^{(21)}$}; 27 | \node at (1.8,4.5) [rectangle,draw,fill=gray!70!white] (h13) {$h^{(31)}$}; 28 | \node at (1.8,6) [rectangle,draw,fill=gray!0!white] (h14) {$h^{(41)}$}; 29 | % 30 | \node at (3.6,0) [rectangle,draw,fill=gray!0!white] (h20) {$h^{(02)}$}; 31 | \node at (3.6,1.5) [rectangle,draw,fill=gray!70!white] (h21) {$h^{(12)}$}; 32 | \node at (3.6,3) [rectangle,draw,fill=gray!70!white] (h22) {$h^{(22)}$}; 33 | \node at (3.6,4.5) [rectangle,draw,fill=gray!70!white] (h23) {$h^{(32)}$}; 34 | \node at (3.6,6) [rectangle,draw,fill=gray!0!white] (h24) {$h^{(42)}$}; 35 | % 36 | \node at (5.4,0) [rectangle,draw,fill=gray!0!white] (h30) {$h^{(03)}$}; 37 | \node at (5.4,1.5) [rectangle,draw,fill=gray!70!white] (h31) {$h^{(13)}$}; 38 | \node at (5.4,3) [rectangle,draw,fill=gray!70!white] (h32) {$h^{(23)}$}; 39 | \node at (5.4,4.5) [rectangle,draw,fill=gray!70!white] (h33) {$h^{(33)}$}; 40 | \node at (5.4,6) [rectangle,draw,fill=gray!0!white] (h34) {$h^{(43)}$}; 41 | % 42 | \node at (7.2,0) [rectangle,draw,fill=gray!0!white] (h40) {$h^{(04)}$}; 43 | \node at (7.2,1.5) [rectangle,draw,fill=gray!70!white] (h41) {$h^{(14)}$}; 44 | \node at (7.2,3) [rectangle,draw,fill=gray!70!white] (h42) {$h^{(24)}$}; 45 | \node at (7.2,4.5) [rectangle,draw,fill=gray!70!white] (h43) {$h^{(34)}$}; 46 | \node at (7.2,6) [rectangle,draw,fill=gray!0!white] (h44) {$h^{(44)}$}; 47 | % 48 | \node at (9,0) [rectangle,draw,fill=gray!0!white] (h50) {$h^{(05)}$}; 49 | \node at (9,1.5) [rectangle,draw,fill=gray!70!white] (h51) {$h^{(15)}$}; 50 | \node at (9,3) [rectangle,draw,fill=gray!70!white] (h52) {$h^{(25)}$}; 51 | \node at (9,4.5) [rectangle,draw,fill=gray!70!white] (h53) {$h^{(35)}$}; 52 | \node at (9,6) [rectangle,draw,fill=gray!0!white] (h54) {$h^{(45)}$}; 53 | % 54 | \node at (10.8,0) [rectangle,draw,fill=gray!0!white] (h60) {$h^{(06)}$}; 55 | \node at (10.8,1.5) [rectangle,draw,fill=gray!70!white] (h61) {$h^{(16)}$}; 56 | \node at (10.8,3) [rectangle,draw,fill=gray!70!white] (h62) {$h^{(26)}$}; 57 | \node at (10.8,4.5) [rectangle,draw,fill=gray!70!white] (h63) {$h^{(36)}$}; 58 | \node at (10.8,6) [rectangle,draw,fill=gray!0!white] (h64) {$h^{(46)}$}; 59 | % 60 | \node at (12.6,0) [rectangle,draw,fill=gray!0!white] (h70) {$h^{(07)}$}; 61 | \node at (12.6,1.5) [rectangle,draw,fill=gray!70!white] (h71) {$h^{(17)}$}; 62 | \node at (12.6,3) [rectangle,draw,fill=gray!70!white] (h72) {$h^{(27)}$}; 63 | \node at (12.6,4.5) [rectangle,draw,fill=gray!70!white] (h73) {$h^{(37)}$}; 64 | \node at (12.6,6) [rectangle,draw,fill=gray!0!white] (h74) {$h^{(47)}$}; 65 | % 66 | % 67 | \draw[-stealth] (h00) -- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(1)}$} (h01); 68 | \draw[-stealth] (h01) 
-- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(2)}$} (h02); 69 | \draw[-stealth] (h02) -- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(3)}$} (h03); 70 | \draw[dotted,-stealth] (h03) -- node [pos=0.5,anchor = east] {$\Theta$} (h04); 71 | % 72 | \draw[-stealth] (h10) -- (h11); 73 | \draw[-stealth] (h11) -- (h12); 74 | \draw[-stealth] (h12) -- (h13); 75 | \draw[dotted,-stealth] (h13) -- (h14); 76 | % 77 | \draw[-stealth] (h20) -- (h21); 78 | \draw[-stealth] (h21) -- (h22); 79 | \draw[-stealth] (h22) -- (h23); 80 | \draw[dotted,-stealth] (h23) -- (h24); 81 | % 82 | \draw[-stealth] (h30) -- (h31); 83 | \draw[-stealth] (h31) -- (h32); 84 | \draw[-stealth] (h32) -- (h33); 85 | \draw[dotted,-stealth] (h33) -- (h34); 86 | % 87 | \draw[-stealth] (h40) -- (h41); 88 | \draw[-stealth] (h41) -- (h42); 89 | \draw[-stealth] (h42) -- (h43); 90 | \draw[dotted,-stealth] (h43) -- (h44); 91 | % 92 | \draw[-stealth] (h50) -- (h51); 93 | \draw[-stealth] (h51) -- (h52); 94 | \draw[-stealth] (h52) -- (h53); 95 | \draw[dotted,-stealth] (h53) -- (h54); 96 | % 97 | \draw[-stealth] (h60) -- (h61); 98 | \draw[-stealth] (h61) -- (h62); 99 | \draw[-stealth] (h62) -- (h63); 100 | \draw[dotted,-stealth] (h63) -- (h64); 101 | % 102 | \draw[-stealth] (h70) -- (h71); 103 | \draw[-stealth] (h71) -- (h72); 104 | \draw[-stealth] (h72) -- (h73); 105 | \draw[dotted,-stealth] (h73) -- (h74); 106 | % 107 | \draw[-stealth] (h01) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(1)}$} (h11); 108 | \draw[-stealth] (h11) -- (h21); 109 | \draw[-stealth] (h21) -- (h31); 110 | \draw[-stealth] (h31) -- (h41); 111 | \draw[-stealth] (h41) -- (h51); 112 | \draw[-stealth] (h51) -- (h61); 113 | \draw[-stealth] (h61) -- (h71); 114 | % 115 | \draw[-stealth] (h02) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(2)}$} (h12); 116 | \draw[-stealth] (h12) -- (h22); 117 | \draw[-stealth] (h22) -- (h32); 118 | \draw[-stealth] (h32) -- (h42); 119 | \draw[-stealth] (h42) -- (h52); 120 | \draw[-stealth] (h52) -- (h62); 121 | \draw[-stealth] (h62) -- (h72); 122 | % 123 | \draw[-stealth] (h03) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(3)}$} (h13); 124 | \draw[-stealth] (h13) -- (h23); 125 | \draw[-stealth] (h23) -- (h33); 126 | \draw[-stealth] (h33) -- (h43); 127 | \draw[-stealth] (h43) -- (h53); 128 | \draw[-stealth] (h53) -- (h63); 129 | \draw[-stealth] (h63) -- (h73); 130 | % 131 | \draw[very thin,densely dashed,-stealth] (h04) to[out=45,in=225] (h10); 132 | \draw[very thin,densely dashed,-stealth] (h14) to[out=45,in=225] (h20); 133 | \draw[very thin,densely dashed,-stealth] (h24) to[out=45,in=225] (h30); 134 | \draw[very thin,densely dashed,-stealth] (h34) to[out=45,in=225] (h40); 135 | \draw[very thin,densely dashed,-stealth] (h44) to[out=45,in=225] (h50); 136 | \draw[very thin,densely dashed,-stealth] (h54) to[out=45,in=225] (h60); 137 | \draw[very thin,densely dashed,-stealth] (h64) to[out=45,in=225] (h70); 138 | \end{tikzpicture} 139 | \caption{\label{fig:RNN architecture}RNN architecture, with data propagating both in "space" and in "time". In our exemple, the time dimension is of size 8 while the "spatial" one is of size 4.} 140 | \end{center} 141 | \end{figure} 142 | 143 | The real novelty of this type of neural network is that the fact that we are trying to predict a time serie is encoded in the very architecture of the network. RNN have first been introduced mostly to predict the next words in a sentance (classification task), hence the notion of ordering in time of the prediction. 
But this kind of network architecture can also be applied to regression problems. Among others things one can think of stock prices evolution, or temperature forecasting. In contrast to the precedent neural networks that we introduced, where we defined (denoting $\nu$ as in previous chapters the layer index in the spatial direction) 144 | \begin{align} 145 | a^{(t)(\nu)}_{f}&= \text{ Weight Averaging } \left(h^{(t)(\nu)}_{f}\right)\;,\notag\\ 146 | % 147 | h^{(t)(\nu+1)}_{f}&= \text{ Activation function } \left(a^{(t)(\nu)}_{f}\right)\;, 148 | \end{align} 149 | we now have the hidden layers that are indexed by both a "spatial" and a "temporal" index (with $T$ being the network dimension in this new direction), and the general philosophy of the RNN is (now the $a$ is usually characterized by a $c$ for cell state, this denotation, trivial for the basic RNN architecture will make more sense when we talk about LSTM networks) 150 | \begin{align} 151 | c^{(t)(\nu \tau )}_{f}&= \text{ Weight Averaging } \left(h^{(t)(\nu \tau-1)}_{f},h^{(t)(\nu-1\tau)}_{f}\right)\;,\notag\\ 152 | % 153 | h^{(t)(\nu\tau)}_{f}&= \text{ Activation function } \left(c^{(t)(\nu \tau)}_{f}\right)\;, 154 | \end{align} 155 | 156 | \subsection{Backward pass in a RNN-LSTM} 157 | 158 | The backward pass in a RNN-LSTM has to respect a certain time order, as illustrated in the following figure 159 | 160 | \begin{figure}[H] 161 | \begin{center} 162 | \begin{tikzpicture} 163 | \node at (0,0) [rectangle,draw,fill=gray!0!white] (h00) {$h^{(00)}$}; 164 | \node at (0,1.5) [rectangle,draw,fill=gray!70!white] (h01) {$h^{(10)}$}; 165 | \node at (0,3) [rectangle,draw,fill=gray!70!white] (h02) {$h^{(20)}$}; 166 | \node at (0,4.5) [rectangle,draw,fill=gray!70!white] (h03) {$h^{(30)}$}; 167 | \node at (0,6) [rectangle,draw,fill=gray!0!white] (h04) {$h^{(40)}$}; 168 | % 169 | \node at (1.8,0) [rectangle,draw,fill=gray!0!white] (h10) {$h^{(01)}$}; 170 | \node at (1.8,1.5) [rectangle,draw,fill=gray!70!white] (h11) {$h^{(11)}$}; 171 | \node at (1.8,3) [rectangle,draw,fill=gray!70!white] (h12) {$h^{(21)}$}; 172 | \node at (1.8,4.5) [rectangle,draw,fill=gray!70!white] (h13) {$h^{(31)}$}; 173 | \node at (1.8,6) [rectangle,draw,fill=gray!0!white] (h14) {$h^{(41)}$}; 174 | % 175 | \node at (3.6,0) [rectangle,draw,fill=gray!0!white] (h20) {$h^{(02)}$}; 176 | \node at (3.6,1.5) [rectangle,draw,fill=gray!70!white] (h21) {$h^{(12)}$}; 177 | \node at (3.6,3) [rectangle,draw,fill=gray!70!white] (h22) {$h^{(22)}$}; 178 | \node at (3.6,4.5) [rectangle,draw,fill=gray!70!white] (h23) {$h^{(32)}$}; 179 | \node at (3.6,6) [rectangle,draw,fill=gray!0!white] (h24) {$h^{(42)}$}; 180 | % 181 | \node at (5.4,0) [rectangle,draw,fill=gray!0!white] (h30) {$h^{(03)}$}; 182 | \node at (5.4,1.5) [rectangle,draw,fill=gray!70!white] (h31) {$h^{(13)}$}; 183 | \node at (5.4,3) [rectangle,draw,fill=gray!70!white] (h32) {$h^{(23)}$}; 184 | \node at (5.4,4.5) [rectangle,draw,fill=gray!70!white] (h33) {$h^{(33)}$}; 185 | \node at (5.4,6) [rectangle,draw,fill=gray!0!white] (h34) {$h^{(43)}$}; 186 | % 187 | \node at (7.2,0) [rectangle,draw,fill=gray!0!white] (h40) {$h^{(04)}$}; 188 | \node at (7.2,1.5) [rectangle,draw,fill=gray!70!white] (h41) {$h^{(14)}$}; 189 | \node at (7.2,3) [rectangle,draw,fill=gray!70!white] (h42) {$h^{(24)}$}; 190 | \node at (7.2,4.5) [rectangle,draw,fill=gray!70!white] (h43) {$h^{(34)}$}; 191 | \node at (7.2,6) [rectangle,draw,fill=gray!0!white] (h44) {$h^{(44)}$}; 192 | % 193 | \node at (9,0) [rectangle,draw,fill=gray!0!white] (h50) {$h^{(05)}$}; 194 | 
\node at (9,1.5) [rectangle,draw,fill=gray!70!white] (h51) {$h^{(15)}$}; 195 | \node at (9,3) [rectangle,draw,fill=gray!70!white] (h52) {$h^{(25)}$}; 196 | \node at (9,4.5) [rectangle,draw,fill=gray!70!white] (h53) {$h^{(35)}$}; 197 | \node at (9,6) [rectangle,draw,fill=gray!0!white] (h54) {$h^{(45)}$}; 198 | % 199 | \node at (10.8,0) [rectangle,draw,fill=gray!0!white] (h60) {$h^{(06)}$}; 200 | \node at (10.8,1.5) [rectangle,draw,fill=gray!70!white] (h61) {$h^{(16)}$}; 201 | \node at (10.8,3) [rectangle,draw,fill=gray!70!white] (h62) {$h^{(26)}$}; 202 | \node at (10.8,4.5) [rectangle,draw,fill=gray!70!white] (h63) {$h^{(36)}$}; 203 | \node at (10.8,6) [rectangle,draw,fill=gray!0!white] (h64) {$h^{(46)}$}; 204 | % 205 | \node at (12.6,0) [rectangle,draw,fill=gray!0!white] (h70) {$h^{(07)}$}; 206 | \node at (12.6,1.5) [rectangle,draw,fill=gray!30!white] (h71) {$h^{(17)}$}; 207 | \node at (12.6,3) [rectangle,draw,fill=gray!30!white] (h72) {$h^{(27)}$}; 208 | \node at (12.6,4.5) [rectangle,draw,fill=gray!30!white] (h73) {$h^{(37)}$}; 209 | \node at (12.6,6) [rectangle,draw,fill=gray!0!white] (h74) {$h^{(47)}$}; 210 | % 211 | % 212 | \draw[stealth-] (h00) -- (h01); 213 | \draw[stealth-] (h01) -- (h02); 214 | \draw[stealth-] (h02) -- (h03); 215 | \draw[dotted,stealth-] (h03) -- (h04); 216 | % 217 | \draw[stealth-] (h10) -- (h11); 218 | \draw[stealth-] (h11) -- (h12); 219 | \draw[stealth-] (h12) -- (h13); 220 | \draw[dotted,stealth-] (h13) -- (h14); 221 | % 222 | \draw[stealth-] (h20) -- (h21); 223 | \draw[stealth-] (h21) -- (h22); 224 | \draw[stealth-] (h22) -- (h23); 225 | \draw[dotted,stealth-] (h23) -- (h24); 226 | % 227 | \draw[stealth-] (h30) -- (h31); 228 | \draw[stealth-] (h31) -- (h32); 229 | \draw[stealth-] (h32) -- (h33); 230 | \draw[dotted,stealth-] (h33) -- (h34); 231 | % 232 | \draw[stealth-] (h40) -- (h41); 233 | \draw[stealth-] (h41) -- (h42); 234 | \draw[stealth-] (h42) -- (h43); 235 | \draw[dotted,stealth-] (h43) -- (h44); 236 | % 237 | \draw[stealth-] (h50) -- (h51); 238 | \draw[stealth-] (h51) -- (h52); 239 | \draw[stealth-] (h52) -- (h53); 240 | \draw[dotted,stealth-] (h53) -- (h54); 241 | % 242 | \draw[stealth-] (h60) -- (h61); 243 | \draw[stealth-] (h61) -- (h62); 244 | \draw[stealth-] (h62) -- (h63); 245 | \draw[dotted,stealth-] (h63) -- (h64); 246 | % 247 | \draw[stealth-] (h70) -- (h71); 248 | \draw[stealth-] (h71) -- (h72); 249 | \draw[stealth-] (h72) -- (h73); 250 | \draw[dotted,stealth-] (h73) -- (h74); 251 | % 252 | \draw[stealth-] (h01) -- (h11); 253 | \draw[stealth-] (h11) -- (h21); 254 | \draw[stealth-] (h21) -- (h31); 255 | \draw[stealth-] (h31) -- (h41); 256 | \draw[stealth-] (h41) -- (h51); 257 | \draw[stealth-] (h51) -- (h61); 258 | \draw[stealth-] (h61) -- (h71); 259 | % 260 | \draw[stealth-] (h02) -- (h12); 261 | \draw[stealth-] (h12) -- (h22); 262 | \draw[stealth-] (h22) -- (h32); 263 | \draw[stealth-] (h32) -- (h42); 264 | \draw[stealth-] (h42) -- (h52); 265 | \draw[stealth-] (h52) -- (h62); 266 | \draw[stealth-] (h62) -- (h72); 267 | % 268 | \draw[stealth-] (h03) -- (h13); 269 | \draw[stealth-] (h13) -- (h23); 270 | \draw[stealth-] (h23) -- (h33); 271 | \draw[stealth-] (h33) -- (h43); 272 | \draw[stealth-] (h43) -- (h53); 273 | \draw[stealth-] (h53) -- (h63); 274 | \draw[stealth-] (h63) -- (h73); 275 | \end{tikzpicture} 276 | \caption{\label{fig:rnnback}Architecture taken, backward pass. 
One cannot compute the gradient of a layer without having computed the gradients of the layers that flow into it.} 277 | \end{center} 278 | \end{figure} 279 | 280 | 281 | With this in mind, let us now see in detail the implementation of a RNN and its advanced cousin, the Long Short Term Memory (LSTM)-RNN. 282 | 283 | \section{Extreme Layers and loss function} 284 | 285 | These parts of the RNN-LSTM network only experience trivial modifications. Let us review them. 286 | 287 | \subsection{Input layer} 288 | 289 | In a RNN-LSTM, the input layer is recursively defined as 290 | \begin{align} 291 | h^{(t)(0\tau+1)}_{f}&=\left(\tilde{h}^{(t)(0\tau)}_{f},h^{(t)(N-1\tau)}_{f}\right)\;, 292 | \end{align} 293 | where $\tilde{h}^{(t)(0\tau)}_{f}$ is $h^{(t)(0\tau)}_{f}$ with the first time column removed. 294 | 295 | \subsection{Output layer} 296 | 297 | The output layer of a RNN-LSTM reads 298 | \begin{align} 299 | h^{(t)(N\tau)}_{f}&=o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^f_{f'} h^{(t)(N-1\tau)}_{f'}\right)\;, 300 | \end{align} 301 | where the output function $o$ is, as for FNNs and CNNs, either the identity (regression task) or the softmax function (classification task). 302 | 303 | \subsection{Loss function} 304 | 305 | The loss function for a regression task reads 306 | \begin{align} 307 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\tau=0}^{T-1}\sum_{f=0}^{F_N-1} 308 | % 309 | \left(h^{(t)( N\tau)}_f-y^{(t)(\tau)}_f\right)^2\;, 310 | \end{align} 311 | and for a classification task 312 | \begin{align} 313 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\tau=0}^{T-1}\sum_{c=0}^{C-1} 314 | % 315 | \delta^{c}_{y^{(t)(\tau)}} \ln \left(h^{(t)( N\tau)}_c\right)\;. 316 | \end{align} 317 | 318 | 319 | \section{RNN specificities} 320 | 321 | \subsection{RNN structure} \label{sec:rnnstructure} 322 | 323 | The RNN is the most basic architecture that takes into account -- thanks to the way it is built -- the time structure of the data to be predicted. Zooming in on one hidden layer of figure \ref{fig:RNN architecture}, here is what we see for a simple Recurrent Neural Network. 324 | 325 | \begin{figure}[H] 326 | \begin{center} 327 | \begin{tikzpicture} 328 | \node[] at (0,0) {\includegraphics[scale=1.5]{RNN_structure}}; 329 | \end{tikzpicture} 330 | \caption{\label{fig:RNN hidden unit}RNN hidden unit details} 331 | \end{center} 332 | \end{figure} 333 | 334 | And here is how the output of the hidden layer represented in figure \ref{fig:RNN hidden unit} enters into the subsequent hidden units 335 | 336 | \begin{figure}[H] 337 | \begin{center} 338 | \begin{tikzpicture} 339 | \node[] at (0,0) {\includegraphics[scale=0.8]{RNN_structure-tot}}; 340 | \end{tikzpicture} 341 | \caption{\label{fig:RNN interaction}How the RNN hidden units interact with each other} 342 | \end{center} 343 | \end{figure} 344 | 345 | 346 | Let us now mathematically express what is represented in figures \ref{fig:RNN hidden unit} and \ref{fig:RNN interaction}.
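Before doing so, here is a small numerical illustration of the time-summed loss functions of the previous section (a minimal NumPy sketch; shapes and names are purely illustrative):

\begin{verbatim}
import numpy as np

T_mb, T, F_N, C = 4, 6, 3, 5
rng = np.random.default_rng(0)

# regression: outputs and targets for every sample, time step and feature
h_reg = rng.normal(size=(T_mb, T, F_N))
y_reg = rng.normal(size=(T_mb, T, F_N))
J_mse = 0.5 / T_mb * np.sum((h_reg - y_reg) ** 2)   # sum over t, tau and f

# classification: softmax outputs over C classes, integer labels in {0,...,C-1}
logits = rng.normal(size=(T_mb, T, C))
h_clf = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
y_clf = rng.integers(0, C, size=(T_mb, T))
picked = np.take_along_axis(h_clf, y_clf[..., None], axis=-1)[..., 0]
J_xent = -np.sum(np.log(picked)) / T_mb             # sum over t and tau
\end{verbatim}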
347 | 348 | \subsection{Forward pass in a RNN} 349 | 350 | In a RNN, the update rules read for the first time slice (spatial layer at the extreme left of figure \ref{fig:RNN architecture}) 351 | 352 | \begin{align} 353 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 354 | % 355 | h^{(t)(\nu-1\tau)}_{f'}\right)\;, 356 | \end{align} 357 | 358 | and for the other ones 359 | 360 | \begin{align} 361 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 362 | % 363 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{\tau(\nu)f}_{f'} 364 | % 365 | h^{(t)(\nu\tau-1)}_{f'}\right)\;. 366 | \end{align} 367 | 368 | \subsection{Backpropagation in a RNN} 369 | 370 | The backpropagation philosophy will remain unchanged : find the error rate updates, from which one can deduce the weight updates. But as for the hidden layers, the $\delta$ now have both a spatial and a temporal component. We will thus have to compute 371 | \begin{align} 372 | \delta^{(t)( \nu\tau)}_f&=\frac{\delta}{\delta h^{(t)( \nu+1\tau)}_f }J(\Theta)\;, 373 | \end{align} 374 | to deduce 375 | \begin{align} 376 | \Delta^{\Theta{\rm index}f}_{f'}&=\frac{\delta}{\delta \Delta^{\Theta{\rm index}f}_{f'} }J(\Theta)\;, 377 | \end{align} 378 | where the index can either be nothing (weights of the ouput layers), $\nu(\nu)$ (weights between two spatially connected layers) or $\tau(\nu)$ (weights between two temporally connected layers). First, it is easy to compute (in the same way as in chapter 1 for FNN) for the MSE loss function 379 | \begin{align} 380 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;, 381 | \end{align} 382 | and for the cross entropy loss function 383 | \begin{align} 384 | \delta^{(t)(N-1)}_{f}&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-\delta^f_{y^{(t)(\tau)}}\right)\;. 385 | \end{align} 386 | Calling 387 | \begin{align} 388 | \mathcal{T}_{f}^{(t)(\nu\tau)}&=1-\left(h_{f}^{(t)(\nu\tau)}\right)^2\;, 389 | \end{align} 390 | and 391 | \begin{align} 392 | \mathcal{H}^{(t')(\nu\tau)_a}_{ff'}&=\mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{a(\nu+1)f'}_{f}\;, 393 | \end{align} 394 | we show in appendix \ref{sec:rnnappenderrorrate} that (if $\tau+1$ exists, otherwise the second term is absent) 395 | \begin{align} 396 | \delta^{(t)(\nu-1\tau)}_f&= 397 | % 398 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 399 | % 400 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 401 | \end{align} 402 | where $b_0=\nu$ and $b_1=\tau$. 403 | 404 | \subsection{Weight and coefficient updates in a RNN} 405 | 406 | To complete the backpropagation algorithm, we need 407 | 408 | \begin{align} 409 | &\Delta^{\nu(\nu)f}_{f'}\;,& 410 | % 411 | &\Delta^{\tau(\nu)f}_{f'}\;,& 412 | % 413 | &\Delta^{f}_{f'}\;,& 414 | % 415 | &\Delta^{\beta(\nu \tau)}_{f}\;,& 416 | % 417 | &\Delta^{\gamma(\nu \tau)}_{f}\;. 
418 | \end{align} 419 | 420 | We show in appendix \ref{sec:rnncoefficient} that 421 | 422 | \begin{align} 423 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 424 | % 425 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu-1\tau)}_{f'}\;,\\ 426 | % 427 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 428 | % 429 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu\tau-1)}_{f'}\;,\\ 430 | % 431 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;,\\ 432 | % 433 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 434 | % 435 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 436 | % 437 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 438 | % 439 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 440 | \end{align} 441 | 442 | \section{LSTM specificities} 443 | 444 | 445 | \subsection{LSTM structure} 446 | 447 | 448 | In a Long Short Term Memory Neural Network\cite{Gers:2000:LFC:1121912.1121915}, the state of a given unit is not directly determined by its left and bottom neighbours. Instead, a cell state is updated for each hidden unit, and the output of this unit is a probe of the cell state. This formulation might seem puzzling at first, but it is philosophically similar to the ResNet approach that we briefly encountered in the appendix of chapter \ref{sec:chapterFNN}: instead of trying to fit an input with a complicated function, we try to fit a tiny variation of the input, hence allowing the gradient to flow in a smoother manner in the network. In the LSTM network, several gates are thus introduced: the input gate $i^{(t)(\nu\tau)}_f$ determines if we allow new information $g^{(t)(\nu\tau)}_f$ to enter into the cell state. The output gate $o^{(t)(\nu\tau)}_f$ determines whether we set the output hidden value to $0$ or let it probe the current cell state. Finally, the forget gate $f^{(t)(\nu\tau)}_f$ determines whether or not we forget the past cell state. All these concepts are illustrated in figure \ref{fig:Lstm1}, which is the LSTM counterpart of the RNN structure of section \ref{sec:rnnstructure}. This diagram will be explained in detail in the next section.
449 | 450 | \begin{figure}[H] 451 | \begin{center} 452 | \begin{tikzpicture} 453 | \node[] at (0,0) {\includegraphics[scale=1.7]{LSTM_structure}}; 454 | \end{tikzpicture} 455 | \caption{\label{fig:Lstm1}LSTM hidden unit details} 456 | \end{center} 457 | \end{figure} 458 | 459 | In a LSTM, the different hidden units interact in the following way 460 | 461 | \begin{figure}[H] 462 | \begin{center} 463 | \begin{tikzpicture} 464 | \node[] at (0,0) {\includegraphics[scale=0.8]{LSTM_structure-tot}}; 465 | \end{tikzpicture} 466 | \caption{\label{fig:Lstmall}How the LSTM hidden unit interact with each others} 467 | \end{center} 468 | \end{figure} 469 | 470 | 471 | \subsection{Forward pass in LSTM} 472 | 473 | Considering all the $\tau-1$ variable values to be $0$ when $\tau=0$, we get the following formula for the input, forget and output gates 474 | 475 | \begin{align} 476 | i^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_{_\nu}(\nu)f}_{f'} 477 | % 478 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\Theta^{i_{_\tau}(\nu)f}_{f'} 479 | % 480 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 481 | % 482 | f^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_{_\nu}(\nu)f}_{f'} 483 | % 484 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{f_{_\tau}(\nu)f}_{f'} 485 | % 486 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 487 | % 488 | o^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_{_\nu}(\nu)f}_{f'} 489 | % 490 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{o_{_\tau}(\nu)f}_{f'} 491 | % 492 | h^{(t)(\nu\tau-1)}_{f'}\right)\;. 493 | \end{align} 494 | The sigmoid function is the reason why the $i,f,o$ functions are called gates: they take their values between $0$ and $1$, therefore either allowing or forbidding information to pass through the next step. The cell state update is then performed in the following way 495 | \begin{align} 496 | g^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{g_{_\nu}(\nu)f}_{f'} 497 | % 498 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{g_{_\tau}(\nu)f}_{f'} 499 | % 500 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 501 | % 502 | c^{(t)(\nu\tau)}_{f}&= 503 | % 504 | f^{(t)(\nu\tau)}_{f}c^{(t)(\nu\tau-1)}_{f}+i^{(t)(\nu\tau)}_{f}g^{(t)(\nu\tau)}_{f}\;, 505 | \end{align} 506 | and as announced, hidden state update is just a probe of the current cell state 507 | \begin{align} 508 | h^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\tanh \left(c^{(t)(\nu\tau)}_{f}\right)\;. 509 | \end{align} 510 | 511 | These formula singularly complicates the feed forward and especially the backpropagation procedure. For completeness, we will us nevertheless carefully derive it. 
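Before doing so, the forward pass just written down can be condensed into a sketch of a single LSTM unit (a minimal NumPy version with one weight matrix per gate and per direction; the names are illustrative, and biases and batch normalization are omitted):

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(h_below, h_past, c_past, W, U):
    """One unit: h_below = h^(nu-1,tau), h_past = h^(nu,tau-1), c_past = c^(nu,tau-1).
    W[k] and U[k] are the spatial and temporal weight matrices of gate k."""
    i = sigmoid(W['i'] @ h_below + U['i'] @ h_past)   # input gate
    f = sigmoid(W['f'] @ h_below + U['f'] @ h_past)   # forget gate
    o = sigmoid(W['o'] @ h_below + U['o'] @ h_past)   # output gate
    g = np.tanh(W['g'] @ h_below + U['g'] @ h_past)   # candidate cell update
    c = f * c_past + i * g                            # new cell state
    h = o * np.tanh(c)                                # hidden state probes the cell state
    return h, c

F, T_mb = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(F, F)) / np.sqrt(F) for k in 'ifog'}
U = {k: rng.normal(size=(F, F)) / np.sqrt(F) for k in 'ifog'}
h, c = lstm_cell(rng.normal(size=(F, T_mb)),
                 np.zeros((F, T_mb)), np.zeros((F, T_mb)), W, U)
\end{verbatim}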
Let us mention in passing that recent studies tried to replace the tanh activation function of the hidden state $h^{(t)(\nu\tau)}_{f}$ and the cell update $g^{(t)(\nu\tau)}_f$ by Rectified Linear Units, and seems to report better results with a proper initialization of all the weight matrices, argued to be diagonal 512 | \begin{align} 513 | \Theta^f_{f'}(\text{init})&=\frac12\left(\delta^f_{f'}+\sqrt{\frac{6}{F_{\rm in}+F_{\rm out}}}\mathcal{N}(0,1)\right)\;, 514 | \end{align} 515 | with the bracket term here to possibly (or not) include some randomness into the initialization 516 | 517 | \subsection{Batch normalization} 518 | 519 | In batchnorm The update rules for the gates are modified as expected 520 | 521 | \begin{align} 522 | i^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_\nu(\nu-)f}_{f'} 523 | % 524 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\Theta^{i_\tau(-\nu)f}_{f'} 525 | % 526 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 527 | % 528 | f^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_\nu(\nu-)f}_{f'} 529 | % 530 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{f_\tau(-\nu)f}_{f'} 531 | % 532 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 533 | % 534 | o^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_\nu(\nu-)f}_{f'} 535 | % 536 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{o_\tau(-\nu)f}_{f'} 537 | % 538 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 539 | % 540 | g^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{g_\nu(\nu-)f}_{f'} 541 | % 542 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{g_\tau(-\nu)f}_{f'} 543 | % 544 | y^{(t)(\nu\tau-1)}_{f'}\right)\;, 545 | \end{align} 546 | where 547 | \begin{align} 548 | y^{(t)(\nu\tau)}_{f}&=\gamma^{(\nu\tau)}_{f}\tilde{h}^{(t)(\nu\tau)}_{f}+\beta^{(\nu\tau)}_{f}\;, 549 | \end{align} 550 | as well as 551 | \begin{align} 552 | \tilde{h}^{(t)(\nu\tau)}_{f}&=\frac{h^{(t)(\nu\tau)}_{f}- 553 | % 554 | \hat{h}^{(\nu\tau)}_{f}}{\sqrt{\left(\sigma^{(\nu\tau)}_{f}\right)^2+\epsilon}} 555 | \end{align} 556 | and 557 | \begin{align} 558 | \hat{h}^{(\nu\tau)}_{f}&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}h^{(t)(\nu\tau)}_{f}\;,& 559 | % 560 | \left(\sigma^{(\nu\tau)}_{f}\right)^2&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\left(h^{(t)(\nu\tau)}_{f} 561 | % 562 | -\hat{h}^{(\nu\tau)}_{f}\right)^2\;. 563 | \end{align} 564 | It is important to compute a running sum for the mean and the variance, that will serve for the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs) 565 | \begin{align} 566 | \mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]_{e+1} &= 567 | % 568 | \frac{e\mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]_{e}+\hat{h}_{f}^{(\nu\tau)}}{e+1}\;,\\ 569 | % 570 | \mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]_{e+1} &= 571 | % 572 | \frac{e\mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]_{e}+\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2}{e+1} 573 | \end{align} 574 | and what will be used at the end is $\mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]$ and $\frac{T_{{\rm mb}}}{T_{{\rm mb}}-1}\mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]$. 
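In practice this running average can be implemented as follows (a minimal NumPy sketch under the same conventions, with illustrative names):

\begin{verbatim}
import numpy as np

T_mb, F = 8, 4
rng = np.random.default_rng(0)

run_mean = np.zeros(F)                  # E[h] accumulated over the iterations
run_var = np.zeros(F)                   # Var[h] accumulated over the iterations
for e in range(100):                    # loop over mini-batches / iterations
    h = rng.normal(size=(T_mb, F))      # h^(t)(nu tau) for one fixed (nu, tau)
    batch_mean = h.mean(axis=0)         # hat{h}
    batch_var = h.var(axis=0)           # hat{sigma}^2
    run_mean = (e * run_mean + batch_mean) / (e + 1)
    run_var = (e * run_var + batch_var) / (e + 1)

# statistics used on the cross-validation and test sets
eval_mean = run_mean
eval_var = T_mb / (T_mb - 1) * run_var  # unbiased variance correction
\end{verbatim}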
575 | 576 | 577 | 578 | \subsection{Backpropagation in a LSTM} \label{sec:appendbackproplstm} 579 | 580 | 581 | The backpropagation in a LSTM keeps the same structure as in a RNN, namely 582 | 583 | \begin{align} 584 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;, 585 | \end{align} 586 | and (shown in appendix \ref{sec:ARNNLSTMerror_rates}) 587 | \begin{align} 588 | \delta^{(t)(\nu-1\tau)}_f&= 589 | % 590 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 591 | % 592 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 593 | \end{align} 594 | What changes is the form of $\mathcal{H}$, now given by 595 | \begin{align} 596 | \mathcal{O}^{(t)(\nu\tau)}_{f}&=h^{(t)(\nu\tau)}_{f} 597 | % 598 | \left(1-o^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 599 | % 600 | \mathcal{I}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 601 | % 602 | \right)g^{(t)(\nu\tau)}_{f} i^{(t)(\nu\tau)}_{f}\left(1-i^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 603 | % 604 | \mathcal{F}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 605 | % 606 | \right)c^{(t)(\nu\tau-1)}_{f} f^{(t)(\nu\tau)}_{f}\left(1-f^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 607 | % 608 | \mathcal{G}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 609 | % 610 | \right)i^{(t)(\nu\tau)}_{f}\left(1-\left(g^{(t)(\nu\tau)}_{f}\right)^2\right)\;, 611 | \end{align} 612 | 613 | and 614 | 615 | \begin{align} 616 | H^{(t)(\nu\tau)_a}_{ff'}&=\Theta^{o_a(\nu+1)f'}_{f}\mathcal{O}^{(t)(\nu+1\tau)}_{f'} 617 | % 618 | +\Theta^{f_a(\nu+1)f'}_{f}\mathcal{F}^{(t)(\nu+1\tau)}_{f'}\notag\\ 619 | % 620 | &+\Theta^{g_a(\nu+1)f'}_{f}\mathcal{G}^{(t)(\nu+1\tau)}_{f'} 621 | % 622 | +\Theta^{i_a(\nu+1)f'}_{f}\mathcal{I}^{(t)(\nu+1\tau)}_{f'}\;. 623 | \end{align} 624 | 625 | 626 | \subsection{Weight and coefficient updates in a LSTM} 627 | 628 | As for the RNN (but with the $\mathcal{H}$ defined in section \ref{sec:appendbackproplstm}), we get for $\nu=1$ 629 | \begin{align} 630 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 631 | % 632 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}h^{(\nu-1\tau)(t)}_{f'}\;, 633 | \end{align} 634 | and otherwise 635 | \begin{align} 636 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 637 | % 638 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu-1\tau)(t)}_{f'}\;,\\ 639 | % 642 | \Delta^{\rho_\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 643 | % 644 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu\tau-1)(t)}_{f'}\;,\\ 645 | % 646 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 647 | % 648 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 649 | % 650 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 651 | % 652 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;.
653 | \end{align} 654 | 655 | and 656 | 657 | \begin{align} 658 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} y^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 659 | \end{align} 660 | 661 | 662 | \begin{subappendices} 663 | 664 | 665 | \section{Backpropagation trough Batch Normalization} 666 | 667 | For Backpropagation, we will need 668 | 669 | \begin{align} 670 | \frac{\partial y^{(t')(\nu\tau)}_{f'}}{\partial h_{f}^{(t)(\nu\tau)}}&= 671 | % 672 | \gamma^{(\nu\tau)}_f\frac{\partial \tilde{h}_{f'}^{(t)(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}\;. 673 | \end{align} 674 | 675 | Since 676 | 677 | \begin{align} 678 | \frac{\partial h^{(t')(\nu\tau)}_{f'}}{\partial h_{f}^{(t)(\nu\tau)}}&=\delta^{t'}_t\delta^{f'}_f\;,& 679 | % 680 | \frac{\partial \hat{h}_{f'}^{(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&=\frac{\delta^{f'}_f}{T_{{\rm mb}}}\:; 681 | \end{align} 682 | 683 | and 684 | 685 | \begin{align} 686 | \frac{\partial \left(\hat{\sigma}_{f'}^{(\nu\tau)}\right)^2}{\partial h_{f}^{(t)(\nu\tau)}}&= 687 | % 688 | \frac{2\delta^{f'}_f}{T_{{\rm mb}}}\left(h_{f}^{(t)(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)\;, 689 | \end{align} 690 | 691 | we get 692 | 693 | \begin{align} 694 | \frac{\partial \tilde{h}_{f'}^{(t')(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&= 695 | % 696 | \frac{\delta^{f'}_f}{T_{{\rm mb}}}\left[\frac{T_{{\rm mb}}\delta^{t'}_t-1} 697 | % 698 | {\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12}- 699 | % 700 | \frac{\left(h_{f}^{(t')(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)\left(h_{f}^{(t)(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)} 701 | % 702 | {\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac32}\right]\notag\\ 703 | % 704 | &=\frac{\delta^{f'}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12} 705 | % 706 | \left[\delta^{t'}_t- 707 | % 708 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 709 | \end{align} 710 | 711 | To ease the notation we will denote 712 | 713 | \begin{align} 714 | \tilde{\gamma}^{(\nu\tau)}_f&= 715 | % 716 | \frac{\gamma^{(\nu\tau)}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12}\;. 717 | \end{align} 718 | 719 | so that 720 | 721 | \begin{align} 722 | \frac{\partial y_{f'}^{(t')(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&= 723 | % 724 | \tilde{\gamma}^{(\nu\tau)}_f \delta^{f'}_f\left[\delta^{t'}_t- 725 | % 726 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 727 | \end{align} 728 | 729 | This modifies the error rate backpropagation, as well as the formula for the weight update ($y$'s instead of $h$'s). In the following we will use the formula 730 | 731 | \begin{align} 732 | J^{(tt')(\nu\tau)}_{f}&= 733 | % 734 | \tilde{\gamma}^{(\nu\tau)}_f \left[\delta^{t'}_t- 735 | % 736 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 737 | \end{align} 738 | 739 | 740 | \section{RNN Backpropagation} 741 | 742 | \subsection{RNN Error rate updates: details} \label{sec:rnnappenderrorrate} 743 | 744 | Recalling the error rate definition 745 | 746 | \begin{align} 747 | \delta^{(t)( \nu\tau)}_f&=\frac{\delta}{\delta h^{(t)( \nu+1\tau)}_f }J(\Theta)\;, 748 | \end{align} 749 | 750 | we would like to compute it for all existing values of $\nu$ and $\tau$. 
As computed in chapter \ref{sec:chapterFNN}, one has for the maximum $\nu$ value 751 | 752 | \begin{align} 753 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;. 754 | \end{align} 755 | 756 | Now since (taking Batch Normalization into account) 757 | 758 | \begin{align} 759 | h^{(t)(N\tau)}_{f}&=o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^f_{f'} y^{(t)(N-1\tau)}_{f}\right)\;, 760 | \end{align} 761 | 762 | and 763 | 764 | \begin{align} 765 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 766 | % 767 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{\tau(\nu)f}_{f'} 768 | % 769 | y^{(t)(\nu\tau-1)}_{f'}\right)\;, 770 | \end{align} 771 | 772 | we get for 773 | 774 | \begin{align} 775 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{N}-1} 776 | % 777 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 778 | % 779 | &+\left.\sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 780 | \end{align} 781 | 782 | Let us work out explicitly once (for a regression cost function and a trivial identity output function) 783 | 784 | \begin{align} 785 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&=\sum_{f''=0}^{F_{N-1}-1}\Theta^{f'}_{f''}\, 786 | % 787 | \frac{\delta y^{(t')(N-1\tau)}_{f''}}{\delta h^{(t)(N-1\tau)}_f } \notag\\ 788 | % 789 | &=\Theta^{f'}_{f}\,J_f^{(tt')(N-1\tau)}\;. 790 | \end{align} 791 | 792 | as well as 793 | 794 | \begin{align} 795 | \frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&= 796 | % 797 | \left[1-\left(h^{(t')(N-1\tau+1)}_{f'}\right)^2\right]\sum_{f''=0}^{F_{{N-1}}-1}\Theta^{\tau(N-1)f'}_{f''} 798 | % 799 | \frac{\delta y^{(t')(N-1\tau)}_{f''}}{\delta h^{(t)(N-1\tau)}_f }\notag\\ 800 | % 801 | &=\mathcal{T}^{(t')(N-1\tau+1)}_{f'}\Theta^{\tau(N-1)f'}_{f}\,J_f^{(tt')(N-1\tau)}\;. 802 | \end{align} 803 | 804 | Thus 805 | 806 | \begin{align} 807 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}J_f^{(tt')(N-1\tau)}\left[\sum_{f'=0}^{F_{N}-1} 808 | % 809 | \Theta^{f'}_{f}\,\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 810 | % 811 | &\left.+\sum_{f'=0}^{F_{N-1}-1}\mathcal{T}^{(t')(N-1\tau+1)}_{f'}\Theta^{\tau(N-1)f'}_{f}\,\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 812 | \end{align} 813 | 814 | Here we adopted the convention that the $\delta^{(t')(N-2\tau+1)}$'s are $0$ if $\tau=T$. In a similar way, we derive for $\nu\leq N-1$ 815 | 816 | \begin{align} 817 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1} 818 | % 819 | \mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f}\,\delta^{(t')(\nu\tau)}_{f'}\right.\notag\\ 820 | % 821 | &\left.+\sum_{f'=0}^{F_{\nu}-1}\mathcal{T}^{(t')(\nu\tau+1)}_{f'}\Theta^{\tau(\nu)f'}_{f}\,\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;. 822 | \end{align} 823 | 824 | Defining 825 | 826 | \begin{align} 827 | \mathcal{T}^{(t')(N\tau)}_{f'}&=1\;,& 828 | % 829 | \Theta^{\nu(N)f'}_{f}&=\Theta^{f'}_{f}\;, 830 | \end{align} 831 | 832 | the previous $ \delta^{(t)(\nu-1\tau)}_f$ formula extends to the case $\nu =N-1$. 
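Error rate formulas of this kind are easy to get wrong; a finite difference check is a cheap way to validate any hand-derived gradient before trusting it. A generic NumPy sketch (not tied to the precise RNN notation above) could read:

\begin{verbatim}
import numpy as np

def numerical_grad(J, theta, eps=1e-6):
    """Central finite differences of a scalar loss J with respect to theta."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        old = theta[idx]
        theta[idx] = old + eps
        J_plus = J(theta)
        theta[idx] = old - eps
        J_minus = J(theta)
        theta[idx] = old
        grad[idx] = (J_plus - J_minus) / (2 * eps)
    return grad

# toy check: J = 0.5 ||theta x||^2 has analytic gradient (theta x) x^T
rng = np.random.default_rng(0)
theta, x = rng.normal(size=(3, 4)), rng.normal(size=(4,))
J = lambda th: 0.5 * np.sum((th @ x) ** 2)
analytic = np.outer(theta @ x, x)
assert np.max(np.abs(analytic - numerical_grad(J, theta))) < 1e-6
\end{verbatim}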
To unite the RNN and the LSTM formulas, let us finally define (with $a$ either $\tau$ or $\nu$ 833 | 834 | \begin{align} 835 | \mathcal{H}^{(t')(\nu\tau)_a}_{ff'}&=\mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{a(\nu+1)f'}_{f}\;, 836 | \end{align} 837 | 838 | thus (defining $b_0=\nu$ and $b_1=\tau$) 839 | 840 | \begin{align} 841 | \delta^{(t)(\nu-1\tau)}_f&= 842 | % 843 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 844 | % 845 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 846 | \end{align} 847 | 848 | 849 | \subsection{RNN Weight and coefficient updates: details} \label{sec:rnncoefficient} 850 | 851 | We want here to derive 852 | 853 | \begin{align} 854 | \Delta^{\nu(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\nu(\nu)f}_{f'}} J(\Theta)& 855 | % 856 | \Delta^{\tau(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\tau(\nu)f}_{f'}} J(\Theta)\;. 857 | \end{align} 858 | 859 | We first expand 860 | 861 | \begin{align} 862 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 863 | % 864 | \frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial \Theta^{\nu(\nu)f}_{f'}} 865 | % 866 | \delta^{(t)(\nu-1\tau)}_{f''}\;,\notag\\ 867 | % 868 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 869 | % 870 | \frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial \Theta^{\tau(\nu)f}_{f'}} 871 | % 872 | \delta^{(t)(\nu-1\tau)}_{f''}\;, 873 | \end{align} 874 | 875 | so that 876 | 877 | \begin{align} 878 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 879 | % 880 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu-1\tau)}_{f'}\;,\\ 881 | % 882 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 883 | % 884 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu\tau-1)}_{f'}\;. 885 | \end{align} 886 | 887 | We also have to compute 888 | 889 | \begin{align} 890 | \Delta^{f}_{f'}&=\frac{\partial}{\partial \Theta^{f}_{f'}} J(\Theta)\;. 891 | \end{align} 892 | 893 | We first expand 894 | 895 | \begin{align} 896 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_N-1}\sum_{t=0}^{T_{{\rm mb}}-1} 897 | % 898 | \frac{\partial h^{(t)(N\tau)}_{f''}}{\partial \Theta^{f}_{f'}} 899 | % 900 | \delta^{(t)(N-1\tau)}_{f''}\; 901 | \end{align} 902 | 903 | so that 904 | 905 | \begin{align} 906 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 907 | \end{align} 908 | 909 | Finally, we need 910 | 911 | \begin{align} 912 | \Delta^{\beta(\nu \tau)}_{f}&=\frac{\partial}{\partial\beta^{(\nu \tau)}_{f}} J(\Theta)& 913 | % 914 | \Delta^{\gamma(\nu \tau)}_{f}&=\frac{\partial}{\partial \gamma^{(\nu \tau)}_{f}} J(\Theta)\;. 
915 | \end{align} 916 | 917 | First 918 | 919 | \begin{align} 920 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 921 | % 922 | \frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial \beta^{(\nu \tau)}_{f}}\delta^{(t)(\nu\tau)}_{f'}+ 923 | % 924 | \sum_{f'=0}^{F_{\nu}-1} 925 | % 926 | \frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial \beta^{(\nu \tau)}_{f}}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;,\notag\\ 927 | % 928 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 929 | % 930 | \frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial \gamma^{(\nu \tau)}_{f}}\delta^{(t)(\nu\tau)}_{f'}+ 931 | % 932 | \sum_{f'=0}^{F_{\nu}-1} 933 | % 934 | \frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial \gamma^{(\nu \tau)}_{f}}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;. 935 | \end{align} 936 | 937 | So that 938 | 939 | \begin{align} 940 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 941 | % 942 | \mathcal{T}^{(t)(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f}\delta^{(t)(\nu\tau)}_{f'}\right.\notag\\ 943 | % 944 | &\left.+\sum_{f'=0}^{F_{\nu}-1} 945 | % 946 | \mathcal{T}^{(t)(\nu\tau+1)}_{f'}\Theta^{\tau(\nu)f'}_{f}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;,\\ 947 | % 948 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 949 | % 950 | \mathcal{T}^{(t)(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f} 951 | % 952 | \tilde{h}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu\tau)}_{f'}\right.\notag\\ 953 | % 954 | &\left.+\sum_{f'=0}^{F_{\nu}-1}\mathcal{T}^{(t)(\nu\tau+1)}_{f'} 955 | % 956 | \Theta^{\tau(\nu)f'}_{f}\tilde{h}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;, 957 | \end{align} 958 | 959 | which we can rewrite as 960 | 961 | \begin{align} 962 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 963 | % 964 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 965 | % 966 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 967 | % 968 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 969 | \end{align} 970 | 971 | \section{LSTM Backpropagation} 972 | 973 | 974 | 975 | 976 | \subsection{LSTM Error rate updates: details} \label{sec:ARNNLSTMerror_rates} 977 | 978 | As for the RNN 979 | 980 | \begin{align} 981 | \delta^{(t)(N-1\tau)}_{f}& 982 | % 983 | =\frac{1}{T_{{\rm mb}}}\left(h^{(t)(N\tau)}_{f}-y^{(t)(\tau)}_{f}\right)\;. 
984 | \end{align} 985 | 986 | Before going any further, it will be useful to define 987 | 988 | \begin{align} 989 | \mathcal{O}^{(t)(\nu\tau)}_{f}&=h^{(t)(\nu\tau)}_{f} 990 | % 991 | \left(1-o^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 992 | % 993 | \mathcal{I}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 994 | % 995 | \right)g^{(t)(\nu\tau)}_{f} i^{(t)(\nu\tau)}_{f}\left(1-i^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 996 | % 997 | \mathcal{F}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 998 | % 999 | \right)c^{(t)(\nu\tau-1)}_{f} f^{(t)(\nu\tau)}_{f}\left(1-f^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 1000 | % 1001 | \mathcal{G}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 1002 | % 1003 | \right)i^{(t)(\nu\tau)}_{f}\left(1-\left(g^{(t)(\nu\tau)}_{f}\right)^2\right)\;, 1004 | \end{align} 1005 | 1006 | and 1007 | 1008 | \begin{align} 1009 | H^{(t)(\nu\tau)_a}_{ff'}&=\Theta^{o_a(\nu+1)f'}_{f}\mathcal{O}^{(t)(\nu+1\tau)}_{f'} 1010 | % 1011 | +\Theta^{f_a(\nu+1)f'}_{f}\mathcal{F}^{(t)(\nu+1\tau)}_{f'}\notag\\ 1012 | % 1013 | &+\Theta^{g_a(\nu+1)f'}_{f}\mathcal{G}^{(t)(\nu+1\tau)}_{f'} 1014 | % 1015 | +\Theta^{i_a(\nu+1)f'}_{f}\mathcal{I}^{(t)(\nu+1\tau)}_{f'}\;. 1016 | \end{align} 1017 | 1018 | As for RNN, we will start off by looking at 1019 | 1020 | \begin{align} 1021 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{N}-1} 1022 | % 1023 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 1024 | % 1025 | &+\left.\sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 1026 | \end{align} 1027 | 1028 | We will be able to get our hands on the second term with the general formula, so let us first look at 1029 | 1030 | \begin{align} 1031 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&=\Theta^{f'}_{f}\,J_f^{(tt')(N-1\tau)}\;, 1032 | \end{align} 1033 | 1034 | which is is similar to the RNN case. Let us put aside the second term of $\delta^{(t)(N-2\tau)}_f$, and look at the general case 1035 | 1036 | \begin{align} 1037 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{\nu+1}-1} 1038 | % 1039 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\delta^{(t')(\nu\tau)}_{f'} 1040 | % 1041 | +\sum_{f'=0}^{F_{\nu}-1}\frac{\delta h^{(t')(\nu\tau+1)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;, 1042 | \end{align} 1043 | which involves to study in details 1044 | \begin{align} 1045 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1046 | % 1047 | \frac{\delta o^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\tanh c^{(t')(\nu+1\tau)}_{f'}\notag\\ 1048 | % 1049 | &+\frac{\delta c^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f } o^{(t')(\nu+1\tau)}_{f'} 1050 | % 1051 | \left[1-\tanh^2 c^{(t')(\nu+1\tau)}_{f'}\right]\;. 
1052 | \end{align} 1053 | Now 1054 | \begin{align} 1055 | \frac{\delta o^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=o^{(t')(\nu+1\tau)}_{f'} 1056 | % 1057 | \left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\sum_{f''=0}^{F_\nu-1}\Theta^{o_\nu(\nu+1)f'}_{f''} 1058 | % 1059 | \frac{\delta y^{(t')(\nu\tau)}_{f''}}{\delta h^{(t)(\nu\tau)}_f }\notag\\ 1060 | % 1061 | &=o^{(t')(\nu+1\tau)}_{f'} 1062 | % 1063 | \left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{o_\nu(\nu+1)f'}_{f} 1064 | % 1065 | J^{(tt')(\nu\tau)}_f\;, 1066 | \end{align} 1067 | and 1068 | \begin{align} 1069 | \frac{\delta c^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1070 | % 1071 | \frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }g^{(t')(\nu+1\tau)}_{f'} 1072 | % 1073 | +\frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }i^{(t')(\nu+1\tau)}_{f'}\notag\\ 1074 | % 1075 | &+\frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }c^{(t')(\nu\tau)}_{f'}\;. 1076 | \end{align} 1077 | We continue our journey 1078 | \begin{align} 1079 | \frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1080 | % 1081 | i^{(t')(\nu+1\tau)}_{f'}\left[1- i^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{i_\nu(\nu+1)f'}_{f} 1082 | % 1083 | J^{(tt')(\nu\tau)}_f\;,\notag\\ 1084 | % 1085 | \frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1086 | % 1087 | f^{(t')(\nu+1\tau)}_{f'}\left[1- f^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{f_\nu(\nu+1)f'}_{f} 1088 | % 1089 | J^{(tt')(\nu\tau)}_f\;,\notag\\ 1090 | % 1091 | \frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1092 | % 1093 | \left[1- \left(g^{(t')(\nu+1\tau)}_{f'}\right)^2\right]\Theta^{g_\nu(\nu+1)f'}_{f} 1094 | % 1095 | J^{(tt')(\nu\tau)}_f\;, 1096 | \end{align} 1097 | and our notation now comes in handy 1098 | \begin{align} 1099 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=J^{(tt')(\nu\tau)}_fH^{(t')(\nu\tau)_\nu}_{ff'}\;. 1100 | \end{align} 1101 | This formula also allows us to compute the second term for $\delta^{(t)(N-2\tau)}_f$. In a totally similar manner 1102 | \begin{align} 1103 | \frac{\delta h^{(t')(\nu\tau+1)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=J^{(tt')(\nu\tau)}_fH^{(t')(\nu-1\tau+1)_\tau}_{ff'}\;. 1104 | \end{align} 1105 | Going back to our general formula 1106 | \begin{align} 1107 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1} 1108 | % 1109 | H^{(t')(\nu\tau)_\nu}_{ff'}\delta^{(t')(\nu\tau)}_{f'}\right.\notag\\ 1110 | % 1111 | &+\left.\sum_{f'=0}^{F_{\nu}-1}H^{(t')(\nu-1\tau+1)_\tau}_{ff'}\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;, 1112 | \end{align} 1113 | and as in the RNN case, we re-express it as (defining $b_0=\nu$ and $b_1=\tau$) 1114 | \begin{align} 1115 | \delta^{(t)(\nu-1\tau)}_f&= 1116 | % 1117 | \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1118 | % 1119 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;.
1120 | \end{align} 1121 | This formula is also valid for $\nu =N-1$ if, as in the RNN case, we define 1122 | \begin{align} 1123 | \mathcal{H}^{(t')(N\tau)}_{f'}&=1\;,& 1124 | % 1125 | \Theta^{\nu(N)f'}_{f}&=\Theta^{f'}_{f}\;. 1126 | \end{align} 1127 | 1128 | 1129 | 1130 | \subsection{LSTM Weight and coefficient updates: details} 1131 | 1132 | We want to compute 1133 | 1134 | \begin{align} 1135 | \Delta^{\rho_{_\nu}(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} J(\Theta)& 1136 | % 1137 | \Delta^{\rho_{_\tau}(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\rho_{_\tau}(\nu)f}_{f'}} J(\Theta)\;, 1138 | \end{align} 1139 | 1140 | with $\rho = (f,i,g,o)$. First we expand 1141 | 1142 | \begin{align} 1143 | \Delta^{\rho_{_\nu}(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1144 | % 1145 | \frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} 1146 | % 1147 | \frac{\partial}{\partial h^{(\nu\tau)(t)}_{f''}} J(\Theta)\notag\\ 1148 | % 1149 | &=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1150 | % 1151 | \frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} 1152 | % 1153 | \delta^{(\nu\tau)(t)}_{f''}\;, 1154 | \end{align} 1155 | 1156 | so that (with $\rho^{(\nu\tau)}=\left(\mathcal{F},\mathcal{I},\mathcal{G},\mathcal{O}\right)$) if $\nu=1$ 1157 | 1158 | \begin{align} 1159 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1160 | % 1161 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}h^{(\nu-1\tau)(t)}_{f'}\;, 1162 | \end{align} 1163 | 1164 | and otherwise 1165 | 1166 | \begin{align} 1167 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1168 | % 1169 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu-1\tau)(t)}_{f'}\;,\\ 1170 | % 1171 | \Delta^{\rho_\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1172 | % 1173 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu\tau-1)(t)}_{f'}\;. 1174 | \end{align} 1175 | 1176 | We will now need to compute 1177 | 1178 | \begin{align} 1179 | \Delta^{\beta(\nu\tau)}_{f}&=\frac{\partial}{\partial \beta^{(\nu\tau)}_f} J(\Theta)& 1180 | % 1181 | \Delta^{\gamma(\nu\tau)}_{f}&=\frac{\partial}{\partial \gamma^{(\nu\tau)}_f} J(\Theta)\;. 1182 | \end{align} 1183 | 1184 | For that we need to look at 1185 | 1186 | \begin{align} 1187 | \Delta^{\beta(\nu\tau)}_{f}&=\sum_{f'=0}^{F_{\nu+1}-1}\sum_{t'=0}^{T_{{\rm mb}}-1} 1188 | % 1189 | \frac{\partial h^{(\nu+1\tau)(t')}_{f'}}{\partial \beta^{(\nu\tau)}_f}\delta^{(\nu\tau)(t')}_{f'} 1190 | % 1191 | +\sum_{f'=0}^{F_{\nu}-1}\sum_{t'=0}^{T_{{\rm mb}}-1}\frac{\partial h^{(\nu\tau+1)(t')}_{f'}} 1192 | % 1193 | {\partial \beta^{(\nu\tau)}_f}\delta^{(\nu-1\tau+1)(t')}_{f'}\notag\\ 1194 | % 1195 | &=\sum_{t=0}^{T_{{\rm mb}}-1}\left\{\sum_{f'=0}^{F_{\nu+1}-1} 1196 | % 1197 | H^{(t)(\nu\tau)}_{ff'}\delta^{(t)(\nu\tau)}_{f'} 1198 | % 1199 | +\sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)}_{ff'}\delta^{(t)(\nu-1\tau+1)}_{f'}\right\}\;,
1200 | \end{align} 1201 | 1202 | and 1203 | 1204 | \begin{align} 1205 | \Delta^{\gamma(\nu\tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\left\{\sum_{f'=0}^{F_{\nu+1}-1} 1206 | % 1207 | H^{(t)(\nu\tau)}_{ff'}\delta^{(t)(\nu\tau)}_{f'} 1208 | % 1209 | +\sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)}_{ff'}\delta^{(t)(\nu-1\tau+1)}_{f'}\right\}\;, 1210 | \end{align} 1211 | 1212 | which we can rewrite as 1213 | 1214 | \begin{align} 1215 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1216 | % 1217 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 1218 | % 1219 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1220 | % 1221 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 1222 | \end{align} 1223 | 1224 | Finally, as in the RNN case 1225 | 1226 | \begin{align} 1227 | \Delta^{f}_{f'}&=\frac{\partial}{\partial \Theta^{f}_{f'}} J(\Theta)\;. 1228 | \end{align} 1229 | 1230 | We first expand 1231 | 1232 | \begin{align} 1233 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_N-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1234 | % 1235 | \frac{\partial h^{(t)(N\tau)}_{f''}}{\partial \Theta^{f}_{f'}} 1236 | % 1237 | \delta^{(t)(N-1\tau)}_{f''}\;, 1238 | \end{align} 1239 | 1240 | so that 1241 | 1242 | \begin{align} 1243 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 1244 | \end{align} 1245 | 1246 | \newpage 1247 | 1248 | \section{Peephole connections} 1249 | 1250 | Some LSTM variants probe the cell state to update the gates themselves. This is illustrated in figure \ref{fig:peepholeLSTM}. 1251 | 1252 | \begin{figure}[H] 1253 | \begin{center} 1254 | \begin{tikzpicture} 1255 | \node[] at (0,0) {\includegraphics[scale=1.4]{LSTM_structure-peephole}}; 1256 | \end{tikzpicture} 1257 | \caption{\label{fig:peepholeLSTM}LSTM hidden unit with peephole} 1258 | \end{center} 1259 | \end{figure} 1260 | 1261 | Peepholes modify the gate updates in the following way 1262 | 1263 | \begin{align} 1264 | i^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_{_\nu}(\nu)f}_{f'} 1265 | % 1266 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\left[\Theta^{i_{_\tau}(\nu)f}_{f'} 1267 | % 1268 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_i}(\nu)f}_{f'}c^{(\nu\tau-1)(t)}_{f'}\right]\right)\;,\\ 1269 | % 1270 | f^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_{_\nu}(\nu)f}_{f'} 1271 | % 1272 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\left[\Theta^{f_{_\tau}(\nu)f}_{f'} 1273 | % 1274 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_f}(\nu)f}_{f'}c^{(\nu\tau-1)(t)}_{f'}\right]\right)\;,\\ 1275 | % 1276 | o^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_{_\nu}(\nu)f}_{f'} 1277 | % 1278 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\left[\Theta^{o_{_\tau}(\nu)f}_{f'} 1279 | % 1280 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_o}(\nu)f}_{f'}c^{(\nu\tau)(t)}_{f'}\right]\right)\;, 1281 | \end{align} 1282 | which also modifies the LSTM backpropagation algorithm in a non-trivial way. As it has been shown that different LSTM formulations lead to very similar results, we leave the derivation of the backpropagation update rules to the reader as an exercise.
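To make the peephole forward pass concrete, here is a minimal NumPy sketch of the gate updates above, for a single layer $\nu$, time step $\tau$ and mini-batch sample. It is only an illustration under assumptions of our own: the weight names (\texttt{i\_nu}, \texttt{c\_i}, \dots) mirror the corresponding $\Theta$'s of the text, the peephole matrices are taken dense exactly as the equations are written (many implementations restrict them to diagonal matrices), and the batch-normalisation factors used elsewhere in this appendix are omitted.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_step(h_below, h_prev, c_prev, W):
    # h_below : h^{(nu-1 tau)}, shape (F_prev,)
    # h_prev  : h^{(nu tau-1)}, shape (F,)
    # c_prev  : c^{(nu tau-1)}, shape (F,)
    # W       : dict of weight matrices, shapes (F, F_prev) or (F, F)
    # i and f peep at the previous cell state c^{(nu tau-1)}
    i = sigmoid(W['i_nu'] @ h_below + W['i_tau'] @ h_prev + W['c_i'] @ c_prev)
    f = sigmoid(W['f_nu'] @ h_below + W['f_tau'] @ h_prev + W['c_f'] @ c_prev)
    g = np.tanh(W['g_nu'] @ h_below + W['g_tau'] @ h_prev)
    c = f * c_prev + i * g                      # new cell state c^{(nu tau)}
    # o peeps at the freshly computed cell state c^{(nu tau)}
    o = sigmoid(W['o_nu'] @ h_below + W['o_tau'] @ h_prev + W['c_o'] @ c)
    h = o * np.tanh(c)                          # new hidden state h^{(nu tau)}
    return h, c
\end{verbatim}

Note that only the forward pass changes structurally: in the backpropagation, the derivatives of $i$, $f$ and $o$ with respect to the cell state acquire extra $\Theta^{c}$ terms, which is precisely the exercise left above.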
1283 | 1284 | \end{subappendices} 1285 | -------------------------------------------------------------------------------- /conv_2d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/conv_2d-crop.pdf -------------------------------------------------------------------------------- /conv_4d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/conv_4d-crop.pdf -------------------------------------------------------------------------------- /cover_page-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/cover_page-crop.pdf -------------------------------------------------------------------------------- /fc_equiv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_equiv.pdf -------------------------------------------------------------------------------- /fc_resnet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet.pdf -------------------------------------------------------------------------------- /fc_resnet_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet_2.pdf -------------------------------------------------------------------------------- /fc_resnet_3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet_3.pdf -------------------------------------------------------------------------------- /formatAndDefs.tex: -------------------------------------------------------------------------------- 1 | \usepackage[T1]{fontenc} 2 | \usepackage{bm} 3 | \usepackage{bbm} 4 | \usepackage[utf8]{inputenc} 5 | \usepackage{latexsym} 6 | \usepackage[english]{babel} 7 | \usepackage{indentfirst} 8 | %\usepackage{fullpage} 9 | \usepackage{graphicx} 10 | \usepackage{lmodern} 11 | %\usepackage{epsfig} 12 | %\usepackage[math]{anttor} 13 | \usepackage[sc]{mathpazo} 14 | %\usepackage{fouriernc} 15 | %\usepackage[garamond]{mathdesign} 16 | \usepackage{geometry} 17 | \input Kramer.fd 18 | \usepackage{yfonts,color} 19 | \usepackage{minitoc} 20 | \usepackage{titlesec} 21 | \usepackage{subfigure} 22 | \usepackage{amsmath} 23 | \usepackage{amssymb} 24 | \usepackage{stmaryrd} 25 | \usepackage{url} 26 | \usepackage{pgfplots} 27 | \pgfplotsset{compat=1.5} 28 | \usepackage[colorlinks,linkcolor=red!80!black, 29 | citecolor=red!80!black,pdfpagelabels,hyperindex=true]{hyperref} 30 | \hypersetup{ 31 | % bookmarks=true, % show bookmarks bar? 32 | % unicode=false, % non-Latin characters in Acrobat’s bookmarks 33 | % pdftoolbar=true, % show Acrobat’s toolbar? 34 | % pdfmenubar=true, % show Acrobat’s menu? 
35 | % pdffitwindow=false, % window fit to page when opened 36 | % pdfstartview={FitH}, % fits the width of the page to the window 37 | % pdftitle={My title}, % title 38 | % pdfauthor={Author}, % author 39 | % pdfsubject={Subject}, % subject of the document 40 | % pdfcreator={Creator}, % creator of the document 41 | % pdfproducer={Producer}, % producer of the document 42 | % pdfkeywords={keyword1} {key2} {key3}, % list of keywords 43 | % pdfnewwindow=true, % links in new window 44 | colorlinks=true, % false: boxed links; true: colored links 45 | linkcolor=brown!80!black, % color of internal links (change box color with linkbordercolor) 46 | citecolor=green!50!black, % color of links to bibliography 47 | filecolor=magenta, % color of file links 48 | urlcolor=red!80!black % color of external links 49 | } 50 | \usepackage{csquotes} 51 | \usepackage[sorting=none,backend=bibtex,style=numeric-comp]{biblatex} 52 | \usepackage{cancel} 53 | %\usepackage{natbib} 54 | \usepackage{pifont} 55 | %\bibliographystyle{plain} 56 | %\bibliographystyle{unsrt} 57 | \usepackage{tikz} 58 | \usetikzlibrary{matrix,arrows,decorations,backgrounds,shapes,calc,fit} 59 | \usetikzlibrary{fadings} 60 | \usetikzlibrary{decorations.pathmorphing} 61 | \usetikzlibrary{decorations.markings} 62 | \usepackage{xcolor} 63 | \usepackage{amsfonts} 64 | \usepackage{slashed} 65 | \usepackage{fancybox} 66 | \usepackage{fancyhdr} 67 | \newenvironment {abstract}% 68 | {\cleardoublepage \null \vfill \begin{center}% 69 | \bfseries \abstractname\end{center}}% 70 | {\vfill \null} 71 | \newcommand{\intkaha}{\ensuremath{\int\frac{\d^3\ka}{4|\ka||p-k|(2\pi)^3}}} 72 | \newcommand{\slv}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$v$}} 73 | \newcommand{\slF}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$F$}} 74 | \newcommand{\slL}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$L$}} 75 | \newcommand{\slP}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$P$}} 76 | \newcommand{\slp}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$p$}} 77 | \newcommand{\slq}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$q$}} 78 | \newcommand{\slR}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$R$}} 79 | \newcommand{\slQ}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$Q$}} 80 | \newcommand{\slK}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$K$}} 81 | \newcommand{\slk}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$k$}} 82 | \newcommand{\slD}{\raise.15ex\hbox{$/$}\kern-.73em\hbox{$D$}} 83 | \newcommand{\slC}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$C$}} 84 | \newcommand{\slA}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$A$}} 85 | \newcommand{\slSigma}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$\Sigma$}} 86 | \newcommand{\slpartial}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$\partial$}} 87 | \newcommand{\slcalP}{\raise.15ex\hbox{$/$}\kern-.63em\hbox{$\cal P$}} 88 | \definecolor{purp}{RGB}{0,0,0} 89 | \def\p{{\boldsymbol p}} 90 | \def\P{{\boldsymbol P}} 91 | \def\q{{\boldsymbol q}} 92 | \def\Q{{\boldsymbol Q}} 93 | \def\l{{\boldsymbol l}} 94 | \def\k{{\boldsymbol k}} 95 | \def\kp{{\boldsymbol k}_{{\tiny \perp}}} 96 | \def\m{{\boldsymbol m}} 97 | \def\x{{\boldsymbol x}} 98 | \def\xp{{\boldsymbol x}_{{\tiny \perp}}} 99 | \def\yp{{\boldsymbol y}_{{\tiny \perp}}} 100 | \def\pp{{\boldsymbol p}_{{\tiny \perp}}} 101 | \def\y{{\boldsymbol y}} 102 | \def\X{{\boldsymbol X}} 103 | \def\Y{{\boldsymbol Y}} 104 | \def\D{{\boldsymbol D}} 105 | \def\r{{\boldsymbol r}} 106 | \def\z{{\boldsymbol z}} 107 | \def\v{{\boldsymbol v}} 108 | \def\w{{\boldsymbol w}} 109 | \def\b{{\boldsymbol b}} 110 | \def\u{{\boldsymbol u}} 111 | 
\newcommand{\intkk}{\ensuremath{\int\frac{\d^3\ka}{2|\ka|(2\pi)^3}}} 112 | \renewcommand{\d}{\ensuremath{\mathrm{d}}} 113 | \newcommand{\old}{\ensuremath{\text{old}}} 114 | \newcommand{\niou}{\ensuremath{\text{new}}} 115 | \newcommand{\nab}{\ensuremath{\text{\boldmath$\nabla$}}} 116 | \newcommand{\ix}{\ensuremath{\text{\boldmath $x$}}} 117 | \newcommand{\igrec}{\boldsymbol{y}} 118 | \newcommand{\ixl}{\ensuremath{\text{\scriptsize\boldmath $x$}}} 119 | \newcommand{\ka}{\ensuremath{\text{\boldmath $k$}}} 120 | \newcommand{\parti}{\ensuremath{\text{\boldmath $\partial$}}} 121 | \newcommand{\uu}{\ensuremath{\text{\boldmath $u$}}} 122 | \newcommand{\vv}{\ensuremath{\text{\boldmath $v$}}} 123 | \newcommand{\pel}{\ensuremath{\mbox{\scriptsize\boldmath $p$}}} 124 | \newcommand{\uul}{\ensuremath{\text{\scriptsize\boldmath $u$}}} 125 | \newcommand{\vvl}{\ensuremath{\text{\scriptsize\boldmath $v$}}} 126 | \newcommand{\kal}{\ensuremath{\text{\scriptsize\boldmath $k$}}} 127 | \newcommand{\el}{\ensuremath{\text{\scriptsize\boldmath $l$}}} 128 | \newcommand{\cigma}{\ensuremath{\text{\boldmath$\Sigma$}}} 129 | \newcommand{\Sig}{\ensuremath{\mbox{\boldmath $\Sigma$}}} 130 | \newcommand{\Sigl}{\ensuremath{\mbox{\scriptsize\boldmath $\Sigma$}}} 131 | \newcommand{\pe}{\ensuremath{\mbox{\boldmath $p$}}} 132 | \newcommand{\ine}{\ensuremath{\mathrm{in}}} 133 | \newcommand{\oute}{\ensuremath{\mathrm{out}}} 134 | \newcommand{\intk}{\ensuremath{\int\frac{\d^3\ka}{2|\k|(2\pi)^3}}} 135 | \newcommand{\intmod}[1]{\ensuremath{\int\frac{\d^3{\bm #1}}{2|{\bm #1}|(2\pi)^3}}} 136 | \newcommand{\intmodd}[2]{\ensuremath{\iint\frac{\d^3{\bm #1}\,\d^3{\bm #2}}{4|{\bm #1}||{\bm #2}|(2\pi)^6}}} 137 | \newcommand{\ma}[1]{{\mathcal{#1}}} 138 | \newcommand{\SK}{ \textsc{Schwinger-Keldysh} } 139 | \newcommand{\Fnman}{ \textsc{Feynman} } 140 | \newcommand{\pointe}[2]{ 141 | \node[place5] at (#1,#2) {}; 142 | \node[place4,xshift=rand*0.3mm,yshift=rand*0.3mm] at (#1,#2) {}; } 143 | \newcommand{\pointee}[2]{ 144 | \node[place,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 145 | \node[place2,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 146 | \node[place3,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {};} 147 | \newcommand{\pointeee}[2]{ 148 | \node[place2,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 149 | \node[place3,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 150 | \node[place,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; } 151 | \newcommand{\point}[2]{ 152 | \node[place,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 153 | \node[place2,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 154 | \node[place3,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {};} 155 | \newcommand{\pointt}[2]{ 156 | \node[place2,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 157 | \node[place3,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 158 | \node[place,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {};} 159 | \tikzstyle{blueballa} = [circle,shading=ball, ball color=blue!30,inner sep =1.2mm] 160 | \tikzstyle{redballa} = [circle,shading=ball, ball color=red,inner sep =1.2mm] 161 | \tikzstyle{greenballa} = [circle,shading=ball, ball color=green!70!black,inner sep =1.2mm] 162 | \tikzstyle{blueball} = [circle,shading=ball, ball color=blue!30,inner sep =0.3mm] 163 | \tikzstyle{redball} = [circle,shading=ball, ball color=red,inner sep =0.3mm] 164 | 
\tikzstyle{greenball} = [circle,shading=ball, ball color=green!70!black,inner sep =0.3mm] 165 | \tikzstyle{gluon} = [thick, style={decorate,decoration={coil,amplitude=4pt, segment length=4pt}}] 166 | \newcommand{\gluebr}[4]{ 167 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 168 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 169 | \draw[photon] (toto) --(tata);} 170 | \newcommand{\gluerb}[4]{ 171 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 172 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 173 | \draw[photon] (toto) --(tata);} 174 | \newcommand{\gluebg}[4]{ 175 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 176 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 177 | \draw[photon] (toto) --(tata);} 178 | \newcommand{\gluegb}[4]{ 179 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 180 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 181 | \draw[photon] (toto) --(tata);} 182 | \newcommand{\gluegr}[4]{ 183 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 184 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 185 | \draw[photon] (toto) --(tata);} 186 | \newcommand{\gluerg}[4]{ 187 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 188 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 189 | \draw[photon] (toto) --(tata);} 190 | \newcommand{\gluebra}[4]{ 191 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 192 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 193 | \draw[photon] (toto) --(tata);} 194 | \newcommand{\gluerba}[4]{ 195 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 196 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 197 | \draw[photon] (toto) --(tata);} 198 | \newcommand{\gluebga}[4]{ 199 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 200 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 201 | \draw[photon] (toto) --(tata);} 202 | \newcommand{\gluegba}[4]{ 203 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 204 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 205 | \draw[photon] (toto) --(tata);} 206 | \newcommand{\gluegra}[4]{ 207 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 208 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 209 | \draw[photon] (toto) --(tata);} 210 | \newcommand{\gluerga}[4]{ 211 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 212 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 213 | \draw[photon] (toto) --(tata);} 214 | \newcommand{\freezew}[2]{ 215 | \node[wball,xshift=rand*1.7cm,yshift=rand*1.4cm] at (#1,#2) {};} 216 | \newcommand{\freezeb}[2]{ 217 | \node[bball,xshift=rand*1.7cm,yshift=rand*1.4cm] at (#1,#2) {};} 218 | \tikzstyle{photon} = [very thin, style={decorate, decoration={snake,amplitude=0.4pt, segment length=2pt}}] 219 | \tikzfading [name=radialfade, inner color=transparent!40, outer color=transparent!100] 220 | \newcommand{\col}[1]{{\color{black} #1}} 221 | \newcommand{\intp}{\int \frac{\d^2 \p }{(2\pi)^2}} 222 | \newlength{\longueurAdHoc} 223 | \settodepth{\longueurAdHoc}{$\displaystyle\int\limits_{y^->0}$} 
224 | \newcommand{\esss}{\begin{tikzpicture}[] 225 | \draw[red!60!white,thick] (-0.06,-0.06) circle (0.12); 226 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(45:0.12); 227 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(135:0.12); 228 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(315:0.12); 229 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(225:0.12); 230 | \end{tikzpicture}} 231 | \newcommand{\matprodu}{\begin{tikzpicture}[] 232 | \draw[red!60!white,thick] (0,0) circle (0.12); 233 | \draw[red!60!black,thick] (0,0) -- ++(45:0.12); 234 | \draw[red!60!black,thick] (0,0) -- ++(135:0.12); 235 | \draw[red!60!black,thick] (0,0) -- ++(315:0.12); 236 | \draw[red!60!black,thick] (0,0) -- ++(225:0.12); 237 | \end{tikzpicture}} 238 | \newcommand{\matprodp}{\begin{tikzpicture}[] 239 | \draw[green!60!white,thick] (0,0) circle (0.12); 240 | \draw[green!60!black,thick] (0,0) -- ++(45:0.12); 241 | \draw[green!60!black,thick] (0,0) -- ++(135:0.12); 242 | \draw[green!60!black,thick] (0,0) -- ++(315:0.12); 243 | \draw[green!60!black,thick] (0,0) -- ++(225:0.12); 244 | \end{tikzpicture}} 245 | \newcommand{\matprode}{\begin{tikzpicture}[] 246 | \draw[blue!60!white,thick] (0,0) circle (0.12); 247 | \draw[blue!60!black,thick] (0,0) -- ++(45:0.12); 248 | \draw[blue!60!black,thick] (0,0) -- ++(135:0.12); 249 | \draw[blue!60!black,thick] (0,0) -- ++(315:0.12); 250 | \draw[blue!60!black,thick] (0,0) -- ++(225:0.12); 251 | \end{tikzpicture}} 252 | \newcommand{\circa}[1]{\begin{tikzpicture}[baseline=-0.65ex] 253 | \draw[thick] (0,0) node[black] {$#1$} circle (0.17); 254 | \end{tikzpicture}} 255 | \newcommand{\circb}[2]{\begin{tikzpicture}[baseline=(current bounding box.center)] 256 | \draw[thick] (0,0) node[black] {$#1$} circle (0.17); 257 | \node[anchor=north west] at (0.085,0) {\tiny $#2$}; 258 | \end{tikzpicture}} 259 | \newcommand{\ells}{\begin{tikzpicture}[] 260 | \draw[red!60!white,thick] (0,0) circle (0.12); 261 | \draw[red!60!black,thick] (0,0) -- ++(45:0.12); 262 | \draw[red!60!black,thick] (0,0) -- ++(135:0.12); 263 | \draw[red!60!black,thick] (0,0) -- ++(315:0.12); 264 | \draw[red!60!black,thick] (0,0) -- ++(225:0.12); 265 | \end{tikzpicture}} 266 | \newcommand{\etts}{\begin{tikzpicture}[] 267 | \draw[green!60!white,thick] (0,0) circle (0.12); 268 | \draw[green!60!black,thick] (0,0) -- ++(45:0.12); 269 | \draw[green!60!black,thick] (0,0) -- ++(135:0.12); 270 | \draw[green!60!black,thick] (0,0) -- ++(315:0.12); 271 | \draw[green!60!black,thick] (0,0) -- ++(225:0.12); 272 | \end{tikzpicture}} 273 | \newcommand{\umumnud}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 274 | \draw[] (0,0) -- node {\midarrow} (0.5,0) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0,0.5)node[anchor=south east] {$x$} --node {\midarrow} (0,0); 275 | \node[place] at (0,0.5) {}; 276 | \node[anchor=north] at (0.25,0) {\tiny $\hat{#1}$}; 277 | \node[anchor=east] at (0,0.25) {\tiny $\hat{#2}$}; 278 | \end{tikzpicture}} 279 | \newcommand{\umumnu}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 280 | \draw[] (0,0) -- node {\midarrow} (0,0.5) node[anchor=south east] {$x$} --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0.5,0)--node {\midarrow} (0,0); 281 | \node[place] at (0,0.5) {}; 282 | \node[anchor=south] at (0.25,0.5) {\tiny $\hat{#1}$}; 283 | \node[anchor=west] at (0.5,0.25) {\tiny $\hat{#2}$}; 284 | \end{tikzpicture}} 285 | 
\newcommand{\umunud}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 286 | \draw[] (0,0) node[anchor=north east] {$x$} -- node {\midarrow} (0,0.5) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0.5,0)--node {\midarrow} (0,0); 287 | \node[place] at (0,0) {}; 288 | \node[anchor=south] at (0.25,0.5) {\tiny $\hat{#1}$}; 289 | \node[anchor=east] at (0,0.25) {\tiny $\hat{#2}$}; 290 | \end{tikzpicture}} 291 | \newcommand{\umunu}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 292 | \draw[] (0,0) node[anchor=north east] {$x$} -- node {\midarrow} (0.5,0) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0,0.5)--node {\midarrow} (0,0); 293 | \node[place] at (0,0) {}; 294 | \node[anchor=north] at (0.25,0) {\tiny $\hat{#1}$}; 295 | \node[anchor=west] at (0.5,0.25) {\tiny $\hat{#2}$}; 296 | \end{tikzpicture}} 297 | \newcommand{\Emu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=green!70!black},] 298 | \node[place3] at (0,0) {}; 299 | \node[anchor=south] at (0,0) {$x$}; 300 | \node[anchor=north] at (0,0) {\tiny $#1$}; 301 | \end{tikzpicture}} 302 | \newcommand{\emu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=blue!70!white},] 303 | \node[place3] at (0,0) {}; 304 | \node[anchor=south] at (0,0) {$x$}; 305 | \node[anchor=north] at (0,0) {\tiny $#1$}; 306 | \end{tikzpicture}} 307 | \newcommand{\emup}[1]{\begin{tikzpicture}[baseline=-0.5ex] 308 | \shade [ball color=blue!70!white] (0,0) circle [radius=0.075]; 309 | \node[anchor=south] at (0,0) {$x$}; 310 | \node[anchor=north] at (0,0) {\tiny $#1$}; 311 | \end{tikzpicture}} 312 | \newcommand{\Umu}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 313 | \draw[] (0,0) node[anchor=south] {$x$} -- node {\midarrow} (0.5,0); 314 | \node[place] at (0,0) {}; 315 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 316 | \end{tikzpicture}} 317 | \newcommand{\umu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=red!90!black},] 318 | \node[place3] at (0,0) {}; 319 | \node[anchor=south] at (0,0) {$x$}; 320 | \node[anchor=north] at (0,0) {\tiny $#1$}; 321 | \end{tikzpicture}} 322 | \newcommand{\umup}[1]{\begin{tikzpicture}[baseline=-0.5ex] 323 | \shade [ball color=red!90!black] (0,0) circle [radius=0.075]; 324 | \node[anchor=south] at (0,0) {$x$}; 325 | \node[anchor=north] at (0,0) {\tiny $#1$}; 326 | \end{tikzpicture}} 327 | \newcommand{\umuo}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=orange},] 328 | \node[place3] at (0,0) {}; 329 | \node[anchor=south] at (0,0) {$x$}; 330 | \node[anchor=north] at (0,0) {\tiny $#1$}; 331 | \end{tikzpicture}} 332 | \newcommand{\umuop}[1]{\begin{tikzpicture}[baseline=-0.5ex,] 333 | \shade [ball color=orange] (0,0) circle [radius=0.075]; 334 | \node[anchor=south] at (0,0) {$x$}; 335 | \node[anchor=north] at (0,0) {\tiny $#1$}; 336 | \end{tikzpicture}} 337 | \newcommand{\umuoo}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=yellow!80!black},] 338 | \node[place3] at (0,0) {}; 339 | \node[anchor=south] at (0,0) {$x$}; 340 | \node[anchor=north] at (0,0) {\tiny $#1$}; 341 | \end{tikzpicture}} 342 | 
\newcommand{\umuoop}[1]{\begin{tikzpicture}[baseline=-0.5ex] 343 | \shade [ball color=yellow!80!black] (0,0) circle [radius=0.075]; 344 | \node[anchor=south] at (0,0) {$x$}; 345 | \node[anchor=north] at (0,0) {\tiny $#1$}; 346 | \end{tikzpicture}} 347 | \newcommand{\Umud}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 348 | \draw[] (0.5,0) node[anchor=south west] {$x+\hat{#1}$} -- node {\midarrow} (0,0); 349 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 350 | \node[place] at (0.5,0) {}; 351 | \end{tikzpicture}} 352 | \newcommand{\Umudm}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 353 | \draw[] (0.5,0) node[anchor=south west] {$x$} -- node {\midarrow} (0,0); 354 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 355 | \node[place] at (0.5,0) {}; 356 | \end{tikzpicture}} 357 | \newcommand{\EUmu}[2]{\begin{tikzpicture}[baseline=0ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=green!70!black},] 358 | \node[anchor=south] at (0,0) {$x$}; 359 | \draw[] (0,0) -- node {\midarrow} (0.5,0); 360 | \node[anchor=north] at (0,0) {\tiny $#1$}; 361 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#2}$}; 362 | \node[place3] at (0,0) {}; 363 | \end{tikzpicture}} 364 | \newcommand{\tikzcuboid}[4]{% width, height, depth, scale 365 | \begin{tikzpicture}[scale=#4] 366 | \foreach \x in {0,...,#1} 367 | { 368 | \draw (\x,0,#3) -- (\x,#2,#3); 369 | \draw (\x,#2,#3) -- (\x,#2,0); 370 | } 371 | \foreach \x in {0,...,#2} 372 | { 373 | \draw (#1,\x,#3) -- (#1,\x,0); 374 | \draw (0,\x,#3) -- (#1,\x,#3); 375 | } 376 | \foreach \x in {0,...,#3} 377 | { 378 | \draw (#1,0,\x) -- (#1,#2,\x); 379 | \draw (0,#2,\x) -- (#1,#2,\x); 380 | } 381 | \foreach \x in {0,...,#1} 382 | { 383 | \foreach \y in {0,...,#2} 384 | { 385 | \node[redball] at (\x,\y,#3) {\tiny\phantom{a}}; 386 | } 387 | \foreach \y in {0,...,#3} 388 | { 389 | \node[redball] at (\x,#2,\y) {\tiny\phantom{a}}; 390 | } 391 | } 392 | \foreach \x in {0,...,#2} 393 | { 394 | \foreach \y in {0,...,#3} 395 | { 396 | \node[redball] at (#1,\x,\y) {\tiny \phantom{a}}; 397 | } 398 | } 399 | \end{tikzpicture} 400 | } 401 | 402 | \newcommand{\tikzcube}[2]{%lenght, scale 403 | \tikzcuboid{#1}{#1}{#1}{#2} 404 | } 405 | %\geometry{hmargin=2.5cm,vmargin=2cm} 406 | \usepackage{setspace} 407 | %\setstretch{1,5} 408 | \usepackage{pagedecouv} 409 | 410 | \usepackage[boxruled,vlined]{algorithm2e} 411 | \providecommand{\SetAlgoLined}{\SetLine} 412 | \usepackage{float} 413 | \floatstyle{plain} 414 | \newfloat{myalgo}{tbhp}{mya} 415 | \newenvironment{Algorithm}[2][tbh]% 416 | {\begin{myalgo}[#1] 417 | \centering 418 | \begin{minipage}{#2} 419 | \begin{algorithm}[H]}% 420 | {\end{algorithm} 421 | \end{minipage} 422 | \end{myalgo}} 423 | 424 | 425 | \setcounter{secnumdepth}{4} 426 | \setcounter{tocdepth}{1} 427 | \makeatletter 428 | \newcounter {subsubsubsection}[subsubsection] 429 | \renewcommand\thesubsubsubsection{\thesubsubsection .\@alph\c@subsubsubsection} 430 | \newcommand\subsubsubsection{\@startsection{subsubsubsection}{4}{\z@}% 431 | {-3.25ex\@plus -1ex \@minus -.2ex}% 432 | {1.5ex \@plus .2ex}% 433 | {\normalfont\normalsize\bfseries}} 434 | \renewcommand\paragraph{\@startsection{paragraph}{5}{\z@}% 435 | {3.25ex \@plus1ex \@minus.2ex}% 436 | {-1em}% 437 | {\normalfont\normalsize\bfseries}} 438 | \renewcommand\subparagraph{\@startsection{subparagraph}{6}{\parindent}% 439 
| {3.25ex \@plus1ex \@minus .2ex}% 440 | {-1em}% 441 | {\normalfont\normalsize\bfseries}} 442 | \newcommand*\l@subsubsubsection{\@dottedtocline{4}{10.0em}{4.1em}} 443 | \renewcommand*\l@paragraph{\@dottedtocline{5}{10em}{5em}} 444 | \renewcommand*\l@subparagraph{\@dottedtocline{6}{12em}{6em}} 445 | \newcommand*{\subsubsubsectionmark}[1]{} 446 | \makeatother 447 | 448 | \makeatletter 449 | \def\toclevel@subsubsubsection{4} 450 | \def\toclevel@paragraph{5} 451 | \def\toclevel@subparagraph{6} 452 | \makeatother 453 | \tikzset{ 454 | hyperlink node/.style={ 455 | alias=sourcenode, 456 | append after command={ 457 | let \p1 = (sourcenode.north west), 458 | \p2=(sourcenode.south east), 459 | \n1={\x2-\x1}, 460 | \n2={\y1-\y2} in 461 | node [inner sep=0pt, outer sep=0pt,anchor=north west,at=(\p1)] {\hyperlink{#1}{\phantom{\rule{\n1}{\n2}}}} 462 | } 463 | } 464 | } 465 | \usepackage{appendix} 466 | %\usepackage{chngcntr} 467 | %\counterwithout{figure}{chapter} 468 | 469 | 470 | \usepackage{etoolbox} 471 | \usepackage{lipsum} 472 | \AtBeginEnvironment{subappendices}{% 473 | \section*{Appendix} 474 | \addcontentsline{toc}{section}{Appendices} 475 | %\counterwithin{figure}{section} 476 | %\counterwithin{table}{section} 477 | } 478 | 479 | %%% Auteurs en gras %%% 480 | \renewbibmacro*{author}{% 481 | \mkbibbold{% 482 | \ifboolexpr{ 483 | test \ifuseauthor 484 | and 485 | not test {\ifnameundef{author}} 486 | } 487 | {\printnames{author}% 488 | \iffieldundef{authortype} 489 | {} 490 | {\setunit{\addcomma\space}% 491 | \usebibmacro{authorstrg}}} 492 | {}}} 493 | %%% Auteurs en gras %%% 494 | -------------------------------------------------------------------------------- /fully_connected.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fully_connected.pdf -------------------------------------------------------------------------------- /input_layer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/input_layer.pdf -------------------------------------------------------------------------------- /lReLU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/lReLU.pdf -------------------------------------------------------------------------------- /mathpazo.sty: -------------------------------------------------------------------------------- 1 | %% 2 | %% This is file `mathpazo.sty', 3 | %% generated with the docstrip utility. 4 | %% 5 | %% The original source files were: 6 | %% 7 | %% psfonts.dtx (with options: `mathpazo') 8 | %% 9 | %% IMPORTANT NOTICE: 10 | %% 11 | %% For the copyright see the source file. 12 | %% 13 | %% Any modified versions of this file must be renamed 14 | %% with new filenames distinct from mathpazo.sty. 15 | %% 16 | %% For distribution of the original source see the terms 17 | %% for copying and modification in the file psfonts.dtx. 18 | %% 19 | %% This generated file may be distributed as long as the 20 | %% original source files, as listed above, are part of the 21 | %% same distribution. (The sources need not necessarily be 22 | %% in the same archive or directory.) 
23 | \ProvidesPackage{mathpazo}% 24 | [2005/04/12 PSNFSS-v9.2a 25 | Palatino w/ Pazo Math (D.Puga, WaS) 26 | ] 27 | \let\s@ved@info\@font@info 28 | \let\@font@info\@gobble 29 | \newif\ifpazo@osf 30 | \newif\ifpazo@sc 31 | \newif\ifpazo@slGreek 32 | \newif\ifpazo@BB \pazo@BBtrue 33 | \DeclareOption{osf}{\pazo@osftrue} 34 | \DeclareOption{sc}{\pazo@sctrue} 35 | \DeclareOption{slantedGreek}{\pazo@slGreektrue} 36 | \DeclareOption{noBBpl}{\pazo@BBfalse} 37 | \DeclareOption{osfeqnnum}{\OptionNotUsed} 38 | \ProcessOptions\relax 39 | \ifpazo@osf 40 | \renewcommand{\rmdefault}{pplj} 41 | \renewcommand{\oldstylenums}[1]{% 42 | {\fontfamily{pplj}\selectfont #1}} 43 | \else\ifpazo@sc 44 | \renewcommand{\rmdefault}{pplx} 45 | \renewcommand{\oldstylenums}[1]{% 46 | {\fontfamily{pplj}\selectfont #1}} 47 | \else 48 | \renewcommand{\rmdefault}{ppl} 49 | \fi\fi 50 | \newcommand{\ppleuro}{{\fontencoding{U}\fontfamily{fplm}\selectfont \char160}} 51 | \AtBeginDocument{\@ifpackageloaded{europs}{\renewcommand{\EURtm}{\ppleuro}}{}} 52 | \ifpazo@sc 53 | \DeclareSymbolFont{operators} {OT1}{pplx}{m}{n} 54 | \SetSymbolFont{operators}{bold} {OT1}{pplx}{b}{n} 55 | \DeclareMathAlphabet{\mathit} {OT1}{pplx}{m}{it} 56 | \SetMathAlphabet{\mathit}{bold} {OT1}{pplx}{b}{it} 57 | \else 58 | \DeclareSymbolFont{operators} {OT1}{ppl}{m}{n} 59 | \SetSymbolFont{operators}{bold} {OT1}{ppl}{b}{n} 60 | \DeclareMathAlphabet{\mathit} {OT1}{ppl}{m}{it} 61 | \SetMathAlphabet{\mathit}{bold} {OT1}{ppl}{b}{it} 62 | \fi 63 | \DeclareSymbolFont{upright} {OT1}{zplm}{m}{n} 64 | \DeclareSymbolFont{letters} {OML}{zplm}{m}{it} 65 | \DeclareSymbolFont{symbols} {OMS}{zplm}{m}{n} 66 | \DeclareSymbolFont{largesymbols} {OMX}{zplm}{m}{n} 67 | \SetSymbolFont{upright}{bold} {OT1}{zplm}{b}{n} 68 | \SetSymbolFont{letters}{bold} {OML}{zplm}{b}{it} 69 | \SetSymbolFont{symbols}{bold} {OMS}{zplm}{b}{n} 70 | \SetSymbolFont{largesymbols}{bold}{OMX}{zplm}{m}{n} 71 | %\DeclareMathAlphabet{\mathbf} {OT1}{zplm}{b}{n} 72 | %\DeclareMathAlphabet{\mathbold} {OML}{zplm}{b}{it} 73 | \DeclareSymbolFontAlphabet{\mathrm} {operators} 74 | \DeclareSymbolFontAlphabet{\mathnormal}{letters} 75 | \DeclareSymbolFontAlphabet{\mathcal} {symbols} 76 | \DeclareMathSymbol{!}{\mathclose}{upright}{"21} 77 | \DeclareMathSymbol{+}{\mathbin}{upright}{"2B} 78 | \DeclareMathSymbol{:}{\mathrel}{upright}{"3A} 79 | \DeclareMathSymbol{=}{\mathrel}{upright}{"3D} 80 | \DeclareMathSymbol{?}{\mathclose}{upright}{"3F} 81 | \DeclareMathDelimiter{(}{\mathopen} {upright}{"28}{largesymbols}{"00} 82 | \DeclareMathDelimiter{)}{\mathclose}{upright}{"29}{largesymbols}{"01} 83 | \DeclareMathDelimiter{[}{\mathopen} {upright}{"5B}{largesymbols}{"02} 84 | \DeclareMathDelimiter{]}{\mathclose}{upright}{"5D}{largesymbols}{"03} 85 | \DeclareMathDelimiter{/}{\mathord}{upright}{"2F}{largesymbols}{"0E} 86 | \DeclareMathAccent{\acute}{\mathalpha}{upright}{"13} 87 | \DeclareMathAccent{\grave}{\mathalpha}{upright}{"12} 88 | \DeclareMathAccent{\ddot}{\mathalpha}{upright}{"7F} 89 | \DeclareMathAccent{\tilde}{\mathalpha}{upright}{"7E} 90 | \DeclareMathAccent{\bar}{\mathalpha}{upright}{"16} 91 | \DeclareMathAccent{\breve}{\mathalpha}{upright}{"15} 92 | \DeclareMathAccent{\check}{\mathalpha}{upright}{"14} 93 | \DeclareMathAccent{\hat}{\mathalpha}{upright}{"5E} 94 | \DeclareMathAccent{\dot}{\mathalpha}{upright}{"5F} 95 | \DeclareMathAccent{\mathring}{\mathalpha}{upright}{"17} 96 | \DeclareMathSymbol{\mathdollar}{\mathord}{upright}{"24} 97 | \DeclareMathSymbol{,}{\mathpunct}{operators}{44} 98 | 
\DeclareMathSymbol{.}{\mathord}{operators}{46} 99 | \ifpazo@BB 100 | \AtBeginDocument{% 101 | \let\mathbb\relax 102 | \DeclareMathAlphabet\PazoBB{U}{fplmbb}{m}{n} 103 | \newcommand{\mathbb}{\PazoBB} 104 | } 105 | \fi 106 | \medmuskip=3.5mu plus 1mu minus 1mu 107 | \def\joinrel{\mathrel{\mkern-3.45mu}} 108 | \renewcommand{\hbar}{{\mkern0.8mu\mathchar'26\mkern-6.8muh}} 109 | \ifpazo@slGreek 110 | \DeclareMathSymbol{\Gamma} {\mathalpha}{letters}{"00} 111 | \DeclareMathSymbol{\Delta} {\mathalpha}{letters}{"01} 112 | \DeclareMathSymbol{\Theta} {\mathalpha}{letters}{"02} 113 | \DeclareMathSymbol{\Lambda} {\mathalpha}{letters}{"03} 114 | \DeclareMathSymbol{\Xi} {\mathalpha}{letters}{"04} 115 | \DeclareMathSymbol{\Pi} {\mathalpha}{letters}{"05} 116 | \DeclareMathSymbol{\Sigma} {\mathalpha}{letters}{"06} 117 | \DeclareMathSymbol{\Upsilon}{\mathalpha}{letters}{"07} 118 | \DeclareMathSymbol{\Phi} {\mathalpha}{letters}{"08} 119 | \DeclareMathSymbol{\Psi} {\mathalpha}{letters}{"09} 120 | \DeclareMathSymbol{\Omega} {\mathalpha}{letters}{"0A} 121 | \else 122 | \DeclareMathSymbol{\Gamma}{\mathalpha}{upright}{"00} 123 | \DeclareMathSymbol{\Delta}{\mathalpha}{upright}{"01} 124 | \DeclareMathSymbol{\Theta}{\mathalpha}{upright}{"02} 125 | \DeclareMathSymbol{\Lambda}{\mathalpha}{upright}{"03} 126 | \DeclareMathSymbol{\Xi}{\mathalpha}{upright}{"04} 127 | \DeclareMathSymbol{\Pi}{\mathalpha}{upright}{"05} 128 | \DeclareMathSymbol{\Sigma}{\mathalpha}{upright}{"06} 129 | \DeclareMathSymbol{\Upsilon}{\mathalpha}{upright}{"07} 130 | \DeclareMathSymbol{\Phi}{\mathalpha}{upright}{"08} 131 | \DeclareMathSymbol{\Psi}{\mathalpha}{upright}{"09} 132 | \DeclareMathSymbol{\Omega}{\mathalpha}{upright}{"0A} 133 | \fi 134 | \DeclareMathSymbol{\upGamma}{\mathord}{upright}{0} 135 | \DeclareMathSymbol{\upDelta}{\mathord}{upright}{1} 136 | \DeclareMathSymbol{\upTheta}{\mathord}{upright}{2} 137 | \DeclareMathSymbol{\upLambda}{\mathord}{upright}{3} 138 | \DeclareMathSymbol{\upXi}{\mathord}{upright}{4} 139 | \DeclareMathSymbol{\upPi}{\mathord}{upright}{5} 140 | \DeclareMathSymbol{\upSigma}{\mathord}{upright}{6} 141 | \DeclareMathSymbol{\upUpsilon}{\mathord}{upright}{7} 142 | \DeclareMathSymbol{\upPhi}{\mathord}{upright}{8} 143 | \DeclareMathSymbol{\upPsi}{\mathord}{upright}{9} 144 | \DeclareMathSymbol{\upOmega}{\mathord}{upright}{10} 145 | \DeclareMathSymbol{\alpha}{\mathalpha}{letters}{"0B} 146 | \DeclareMathSymbol{\beta}{\mathalpha}{letters}{"0C} 147 | \DeclareMathSymbol{\gamma}{\mathalpha}{letters}{"0D} 148 | \DeclareMathSymbol{\delta}{\mathalpha}{letters}{"0E} 149 | \DeclareMathSymbol{\epsilon}{\mathalpha}{letters}{"0F} 150 | \DeclareMathSymbol{\zeta}{\mathalpha}{letters}{"10} 151 | \DeclareMathSymbol{\eta}{\mathalpha}{letters}{"11} 152 | \DeclareMathSymbol{\theta}{\mathalpha}{letters}{"12} 153 | \DeclareMathSymbol{\iota}{\mathalpha}{letters}{"13} 154 | \DeclareMathSymbol{\kappa}{\mathalpha}{letters}{"14} 155 | \DeclareMathSymbol{\lambda}{\mathalpha}{letters}{"15} 156 | \DeclareMathSymbol{\mu}{\mathalpha}{letters}{"16} 157 | \DeclareMathSymbol{\nu}{\mathalpha}{letters}{"17} 158 | \DeclareMathSymbol{\xi}{\mathalpha}{letters}{"18} 159 | \DeclareMathSymbol{\pi}{\mathalpha}{letters}{"19} 160 | \DeclareMathSymbol{\rho}{\mathalpha}{letters}{"1A} 161 | \DeclareMathSymbol{\sigma}{\mathalpha}{letters}{"1B} 162 | \DeclareMathSymbol{\tau}{\mathalpha}{letters}{"1C} 163 | \DeclareMathSymbol{\upsilon}{\mathalpha}{letters}{"1D} 164 | \DeclareMathSymbol{\phi}{\mathalpha}{letters}{"1E} 165 | \DeclareMathSymbol{\chi}{\mathalpha}{letters}{"1F} 166 | 
\DeclareMathSymbol{\psi}{\mathalpha}{letters}{"20} 167 | \DeclareMathSymbol{\omega}{\mathalpha}{letters}{"21} 168 | \DeclareMathSymbol{\varepsilon}{\mathalpha}{letters}{"22} 169 | \DeclareMathSymbol{\vartheta}{\mathalpha}{letters}{"23} 170 | \DeclareMathSymbol{\varpi}{\mathalpha}{letters}{"24} 171 | \DeclareMathSymbol{\varrho}{\mathalpha}{letters}{"25} 172 | \DeclareMathSymbol{\varsigma}{\mathalpha}{letters}{"26} 173 | \DeclareMathSymbol{\varphi}{\mathalpha}{letters}{"27} 174 | \let\s@vedhbar\hbar 175 | \AtBeginDocument{% 176 | \DeclareFontFamily{U}{msa}{}% 177 | \DeclareFontShape{U}{msa}{m}{n}{<->s*[1.042]msam10}{}% 178 | \DeclareFontFamily{U}{msb}{}% 179 | \DeclareFontShape{U}{msb}{m}{n}{<->s*[1.042]msbm10}{}% 180 | \DeclareFontFamily{U}{euf}{}% 181 | \DeclareFontShape{U}{euf}{m}{n}{<-6>eufm5<6-8>eufm7<8->eufm10}{}% 182 | \DeclareFontShape{U}{euf}{b}{n}{<-6>eufb5<6-8>eufb7<8->eufb10}{}% 183 | \@ifpackageloaded{amsfonts}{\let\hbar\s@vedhbar}{} 184 | \@ifpackageloaded{amsmath}{}{% 185 | \newdimen\big@size 186 | \addto@hook\every@math@size{\setbox\z@\vbox{\hbox{$($}\kern\z@}% 187 | \global\big@size 1.2\ht\z@} 188 | \def\bBigg@#1#2{% 189 | {\hbox{$\left#2\vcenter to#1\big@size{}\right.\n@space$}}} 190 | \def\big{\bBigg@\@ne} 191 | \def\Big{\bBigg@{1.5}} 192 | \def\bigg{\bBigg@\tw@} 193 | \def\Bigg{\bBigg@{2.5}} 194 | } 195 | } 196 | \def\defaultscriptratio{.76} 197 | \def\defaultscriptscriptratio{.6} 198 | \DeclareMathSizes{5} {5} {5} {5} 199 | \DeclareMathSizes{6} {6} {5} {5} 200 | \DeclareMathSizes{7} {7} {5} {5} 201 | \DeclareMathSizes{8} {8} {6} {5} 202 | \DeclareMathSizes{9} {9} {7} {5} 203 | \DeclareMathSizes{10} {10} {7.6} {6} 204 | \DeclareMathSizes{10.95}{10.95}{8} {6} 205 | \DeclareMathSizes{12} {12} {9} {7} 206 | \DeclareMathSizes{14.4} {14.4} {10} {8} 207 | \DeclareMathSizes{17.28}{17.28}{12} {10} 208 | \DeclareMathSizes{20.74}{20.74}{14.4} {12} 209 | \DeclareMathSizes{24.88}{24.88}{20.74}{14.4} 210 | \let\@font@info\s@ved@info 211 | \endinput 212 | %% 213 | %% End of file `mathpazo.sty'. 
214 | -------------------------------------------------------------------------------- /output_layer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/output_layer.pdf -------------------------------------------------------------------------------- /padding.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/padding.pdf -------------------------------------------------------------------------------- /pagedecouv.sty: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/pagedecouv.sty -------------------------------------------------------------------------------- /pool_4d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/pool_4d-crop.pdf -------------------------------------------------------------------------------- /sigmoid.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/sigmoid.pdf -------------------------------------------------------------------------------- /softmax.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/softmax.pdf -------------------------------------------------------------------------------- /tanh.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/tanh.pdf -------------------------------------------------------------------------------- /tanh2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/tanh2.pdf --------------------------------------------------------------------------------