├── .gitattributes ├── .gitignore ├── Acknowledgements.tex ├── AlexNet.pdf ├── Bottleneck.pdf ├── Bottleneck_BN.pdf ├── Bottleneck_BN_2.pdf ├── Bottleneck_BN_backprop.pdf ├── Bottleneck_BN_backprop_2.pdf ├── CNN_MM_pixels.pdf ├── CNN_MM_unpixels.pdf ├── Conclusion.tex ├── Conv_equiv.pdf ├── DEEP_LEARNING.bib ├── ELU.pdf ├── GoogleNet.pdf ├── Inception.pdf ├── Introduction.tex ├── LSTM_structure-peephole.pdf ├── LSTM_structure-tot.pdf ├── LSTM_structure.pdf ├── LeNet.pdf ├── Mediamobile.png ├── Preface.tex ├── README.md ├── RNN_structure-tot.pdf ├── RNN_structure.pdf ├── ReLU.pdf ├── ResNet.pdf ├── S_FNN.pdf ├── ThesisStyle.cls ├── VGG-conv.pdf ├── VGG-fc.pdf ├── VGG-pool-fc.pdf ├── VGG-pool.pdf ├── VGG.pdf ├── White_book-blx.bib ├── White_book.bcf ├── White_book.dvi ├── White_book.pdf ├── White_book.tex ├── chapter1.tex ├── chapter2.tex ├── chapter3.tex ├── conv_2d-crop.pdf ├── conv_4d-crop.pdf ├── cover_page-crop.pdf ├── fc_equiv.pdf ├── fc_resnet.pdf ├── fc_resnet_2.pdf ├── fc_resnet_3.pdf ├── formatAndDefs.tex ├── fully_connected.pdf ├── input_layer.pdf ├── lReLU.pdf ├── mathpazo.sty ├── output_layer.pdf ├── padding.pdf ├── pagedecouv.sty ├── pool_4d-crop.pdf ├── sigmoid.pdf ├── softmax.pdf ├── tanh.pdf └── tanh2.pdf /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.aux 2 | *.bbl 3 | *.blg 4 | *.log 5 | *.maf 6 | *.mtc* 7 | *.run.xml 8 | *.synctex.gz 9 | *.toc 10 | *.tox 11 | *.bib.bak 12 | *.out 13 | -------------------------------------------------------------------------------- /Acknowledgements.tex: -------------------------------------------------------------------------------- 1 | \chapter{Acknowledgements} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}T}his work has no benefit nor added value to the deep learning topic on its own. It is just the reformulation of ideas of brighter researchers to fit a peculiar mindset: the one of preferring formulas with ten indices but where one knows precisely what one is manipulating rather than (in my opinion sometimes opaque) matrix formulations where the dimensions of the objects are rarely if ever specified. 4 | 5 | \vspace{0.2cm} 6 | 7 | Among the brighter people from whom I learned online is Andrew Ng. His Coursera class (\href{https://www.coursera.org/learn/machine-learning}{here}) was the first contact I had with Neural Networks, and this pedagogical introduction allowed me to build on solid ground. 8 | 9 | \vspace{0.2cm} 10 | 11 | I also wish to particularly thank Hugo Larochelle, who not only built a wonderful deep learning class (\href{http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html}{here}), but was also kind enough to answer emails from a complete beginner and stranger! 12 | 13 | \vspace{0.2cm} 14 | 15 | The Stanford class on convolutional networks (\href{http://cs231n.github.io/convolutional-networks/}{here}) proved extremely valuable to me, as did the one on Natural Language Processing (\href{http://web.stanford.edu/class/cs224n/}{here}). 16 | 17 | \vspace{0.2cm} 18 | 19 | I also benefited greatly from Sebastian Ruder's blog (\href{http://ruder.io/#open}{here}), both from the blog pages on gradient descent optimization techniques and from the author himself.
20 | 21 | \vspace{0.2cm} 22 | 23 | I learned more about LSTM on colah's blog (\href{http://colah.github.io/posts/2015-08-Understanding-LSTMs/}{here}), and some of my drawings are inspired from there. 24 | 25 | \vspace{0.2cm} 26 | 27 | I also thank Jonathan Del Hoyo for the great articles that he regularly shares on LinkedIn. 28 | 29 | \vspace{0.2cm} 30 | 31 | Many thanks go to my collaborators at Mediamobile, who let me dig as deep as I wanted on Neural Networks. I am especially indebted to Clément, Nicolas, Jessica, Christine and Céline. 32 | 33 | \vspace{0.2cm} 34 | 35 | Thanks to Jean-Michel Loubes and Fabrice Gamboa, from whom I learned a great deal on probability theory and statistics. 36 | 37 | \vspace{0.2cm} 38 | 39 | I end this list with my employer, Mediamobile, which has been kind enough to let me work on this topic with complete freedom. A special thanks to Philippe, who supervised me with the perfect balance of feedback and freedom! 40 | 41 | -------------------------------------------------------------------------------- /AlexNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/AlexNet.pdf -------------------------------------------------------------------------------- /Bottleneck.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck.pdf -------------------------------------------------------------------------------- /Bottleneck_BN.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_2.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_backprop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_backprop.pdf -------------------------------------------------------------------------------- /Bottleneck_BN_backprop_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Bottleneck_BN_backprop_2.pdf -------------------------------------------------------------------------------- /CNN_MM_pixels.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/CNN_MM_pixels.pdf -------------------------------------------------------------------------------- /CNN_MM_unpixels.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/CNN_MM_unpixels.pdf -------------------------------------------------------------------------------- /Conclusion.tex: 
-------------------------------------------------------------------------------- 1 | \chapter{Conclusion} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}W}e have come to the end of our journey. I hope this note lived up to its promises, and that the reader now understands better how a neural network is designed and how it works under the hood. To wrap it up, we have seen the architecture of the three most common neural networks, as well as the careful mathematical derivation of their training formulas. 4 | 5 | \vspace{0.2cm} 6 | 7 | Deep Learning is a fast-evolving field, and this material might be out of date in the near future, but the index approach adopted will still allow the reader -- as it has helped the writer -- to work out for herself what is behind the next state-of-the-art architectures. 8 | 9 | \vspace{0.2cm} 10 | 11 | Until then, one should have enough material to code from scratch one's own FNN, CNN and RNN-LSTM, as the author did as an empirical proof of his formulas. 12 | -------------------------------------------------------------------------------- /Conv_equiv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Conv_equiv.pdf -------------------------------------------------------------------------------- /DEEP_LEARNING.bib: -------------------------------------------------------------------------------- 1 | % Encoding: UTF-8 2 | 3 | @Article{Rosenblatt58theperceptron:, 4 | author = {F. Rosenblatt}, 5 | title = {The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain}, 6 | journal = {Psychological Review}, 7 | year = {1958}, 8 | pages = {65--386}, 9 | } 10 | 11 | @Article{Hahnloser:2003:PFS:762330.762336, 12 | author = {Hahnloser, Richard H. R. and Seung, H. Sebastian and Slotine, Jean-Jacques}, 13 | title = {Permitted and Forbidden Sets in Symmetric Threshold-linear Networks}, 14 | journal = {Neural Comput.}, 15 | year = {2003}, 16 | volume = {15}, 17 | number = {3}, 18 | pages = {621--638}, 19 | month = mar, 20 | issn = {0899-7667}, 21 | acmid = {762336}, 22 | address = {Cambridge, MA, USA}, 23 | doi = {10.1162/089976603321192103}, 24 | issue_date = {March 2003}, 25 | numpages = {18}, 26 | publisher = {MIT Press}, 27 | url = {http://dx.doi.org/10.1162/089976603321192103}, 28 | } 29 | 30 | @Article{DBLP:journals/corr/DielemanWD15, 31 | author = {Sander Dieleman and Kyle W.
Willett and Joni Dambre}, 32 | title = {Rotation-invariant convolutional neural networks for galaxy morphology prediction}, 33 | journal = {CoRR}, 34 | year = {2015}, 35 | volume = {abs/1503.07077}, 36 | bibsource = {dblp computer science bibliography, http://dblp.org}, 37 | biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/DielemanWD15}, 38 | timestamp = {Wed, 07 Jun 2017 14:41:33 +0200}, 39 | url = {http://arxiv.org/abs/1503.07077}, 40 | } 41 | 42 | @InProceedings{43022, 43 | author = {Christian Szegedy and Wei Liu and Yangqing Jia and Pierre Sermanet and Scott Reed and Dragomir Anguelov and Dumitru Erhan and Vincent Vanhoucke and Andrew Rabinovich}, 44 | title = {Going Deeper with Convolutions}, 45 | booktitle = {Computer Vision and Pattern Recognition (CVPR)}, 46 | year = {2015}, 47 | url = {http://arxiv.org/abs/1409.4842}, 48 | } 49 | 50 | @Article{He2015, 51 | author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian}, 52 | title = {Deep Residual Learning for Image Recognition}, 53 | year = {2015}, 54 | volume = {7}, 55 | month = {12}, 56 | } 57 | 58 | @Article{Hochreiter:1997:LSM:1246443.1246450, 59 | author = {Hochreiter, Sepp and Schmidhuber, J\"{u}rgen}, 60 | title = {Long Short-Term Memory}, 61 | journal = {Neural Comput.}, 62 | year = {1997}, 63 | volume = {9}, 64 | number = {8}, 65 | pages = {1735--1780}, 66 | month = nov, 67 | issn = {0899-7667}, 68 | acmid = {1246450}, 69 | address = {Cambridge, MA, USA}, 70 | doi = {10.1162/neco.1997.9.8.1735}, 71 | issue_date = {November 15, 1997}, 72 | numpages = {46}, 73 | publisher = {MIT Press}, 74 | url = {http://dx.doi.org/10.1162/neco.1997.9.8.1735}, 75 | } 76 | 77 | @Article{Kingma2014, 78 | author = {Kingma, Diederik and Ba, Jimmy}, 79 | title = {Adam: A Method for Stochastic Optimization}, 80 | year = {2014}, 81 | month = {12}, 82 | } 83 | 84 | @Article{Ioffe2015, 85 | author = {Ioffe, Sergey and Szegedy, Christian}, 86 | title = {Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift}, 87 | year = {2015}, 88 | month = {02}, 89 | } 90 | 91 | @Article{Hahnloser2000, 92 | author = {Hahnloser, Richard and Sarpeshkar, R. and Mahowald, Misha and Douglas, Rodney J. 
and Seung, S}, 93 | title = {Digital selection and analog amplification co-exist in an electronic circuit inspired by neocortex}, 94 | journal = {Nature}, 95 | year = {2000}, 96 | volume = {405}, 97 | pages = {947-951,}, 98 | month = {06}, 99 | doi = {10.1038/35016072}, 100 | url = {http://dx.doi.org/10.1038/35016072}, 101 | } 102 | 103 | @InProceedings{Deng:2016:LSM:2939672.2939860, 104 | author = {Deng, Dingxiong and Shahabi, Cyrus and Demiryurek, Ugur and Zhu, Linhong and Yu, Rose and Liu, Yan}, 105 | title = {Latent Space Model for Road Networks to Predict Time-Varying Traffic}, 106 | booktitle = {Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, 107 | year = {2016}, 108 | series = {KDD '16}, 109 | pages = {1525--1534}, 110 | address = {New York, NY, USA}, 111 | publisher = {ACM}, 112 | acmid = {2939860}, 113 | doi = {10.1145/2939672.2939860}, 114 | isbn = {978-1-4503-4232-2}, 115 | keywords = {latent space model, real-time traffic forecasting, road network}, 116 | location = {San Francisco, California, USA}, 117 | numpages = {10}, 118 | url = {http://doi.acm.org/10.1145/2939672.2939860}, 119 | } 120 | 121 | @Article{SunHongyu, 122 | author = {Sun, Hongyu; Liu, Henry X.; Xiao, Heng; \& Ran, Bin}, 123 | title = {Short Term Traffic Forecasting Using the Local Linear Regression Model}, 124 | } 125 | 126 | @Article{MaDaiHe:2017, 127 | author = {Ma Xiaolei and Dai Zhuang and He Zhengbing and Ma Jihui and Wang Yong and Wang Yunpeng}, 128 | title = {Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction}, 129 | year = {2017}, 130 | doi = {10.3390/s17040818}, 131 | } 132 | 133 | @Article{Fouladgar2017ScalableDT, 134 | author = {Mohammadhani Fouladgar and Mostafa Parchami and Ramez Elmasri and Amir Ghaderi}, 135 | title = {Scalable deep traffic flow neural networks for urban traffic congestion prediction}, 136 | journal = {2017 International Joint Conference on Neural Networks (IJCNN)}, 137 | year = {2017}, 138 | pages = {2251-2258}, 139 | } 140 | 141 | @InProceedings{MiwaTYM2004, 142 | author = {T. Miwa and Y. Tawada and T. Yamamoto and T. Morikawa}, 143 | title = {En-Route Updating Methodology of Travel Time Prediction Using Accumulated Probe-Car Data}, 144 | year = {2004}, 145 | publisher = {Proc. of the 11th ITS World Congress}, 146 | } 147 | 148 | @Article{iet:/content/conferences/10.1049/cp_20000103, 149 | author = {S. Turksma}, 150 | title = {The various uses of floating car data}, 151 | journal = {IET Conference Proceedings}, 152 | year = {2000}, 153 | pages = {51-55(4)}, 154 | month = {January}, 155 | abstract = {To a large extent, traffic control and traffic information services depend on accurate information about the situation on the road network. Often rough estimates of queue lengths are no longer sufficient to give a reasonable prediction of travel times. Traditionally the situation on the road network is derived from local measurements (e.g. induction loops). It is difficult to obtain travel time estimates from local speed and flow data. It is especially difficult in urban areas. When the positions of a sufficient number of vehicles can be frequently communicated to a central site, travel times can be directly measured. This is called floating car data (FCD). This paper gives the major results of the Prelude trial and the ways in which FCD can be used. 
The Prelude FCD-trial in the Netherlands had as its primary aim to investigate the technical feasibility of using FCD to accurately measure travel times in a mixed urban and motorway network. FCD has a broad range of applications. The applications range from real time data for traffic management to the compilation of very accurate origin-destination matrices complete with travel times.}, 156 | affiliation = {Peek Traffic BV}, 157 | keywords = {traffic information services;floating car data;the Netherlands;traffic control;mixed urban/motorway network;real-time data;origin-destination matrices;OD matrices;Prelude trial;traffic information systems;traffic IS;}, 158 | language = {English}, 159 | publisher = {Institution of Engineering and Technology}, 160 | url = {http://digital-library.theiet.org/content/conferences/10.1049/cp_20000103}, 161 | } 162 | 163 | @InProceedings{Yoon:2007:SST:1247660.1247686, 164 | author = {Yoon, Jungkeun and Noble, Brian and Liu, Mingyan}, 165 | title = {Surface Street Traffic Estimation}, 166 | booktitle = {Proceedings of the 5th International Conference on Mobile Systems, Applications and Services}, 167 | year = {2007}, 168 | series = {MobiSys '07}, 169 | pages = {220--232}, 170 | address = {New York, NY, USA}, 171 | publisher = {ACM}, 172 | acmid = {1247686}, 173 | doi = {10.1145/1247660.1247686}, 174 | isbn = {978-1-59593-614-1}, 175 | keywords = {GPS, estimation, traffic}, 176 | location = {San Juan, Puerto Rico}, 177 | numpages = {13}, 178 | url = {http://doi.acm.org/10.1145/1247660.1247686}, 179 | } 180 | 181 | @Article{Fazio2014, 182 | author = {Fazio, Joseph and Wiesner, Brady N. and Deardoff, Matthew D.}, 183 | title = {Estimation of free-flow speed}, 184 | journal = {KSCE Journal of Civil Engineering}, 185 | year = {2014}, 186 | volume = {18}, 187 | number = {2}, 188 | pages = {646--650}, 189 | month = {Mar}, 190 | issn = {1976-3808}, 191 | abstract = {In 2010 Highway Capacity Manual, one preferably determines free-flow speed by deriving it from a speed study involving the existing facility or on a comparable facility if the facility is in the planning stage. Many have used a `rule of thumb' by adding 10 km/h (5 mi/h) above the posted limit to obtain free-flow speed without justification. Two team members using a radar gun and manual tally sheets collected 1668 speed observations at ten sites during several weeks. Each site had a unique posted speed limit sign ranging from 30 km/h (20 mi/h) to 120 km/h (75 mi/h). Five sites were on urban streets. Three sites were on multilane highways, and two on freeways. Goodness-of-fit test results revealed that a Gaussian distribution generally fit the speed distributions at each site at a 5{\%} level of significance. The best-fit model had a correlation coefficient of +0.99. The posted speed limit variable was significant at 5{\%} level of significance. 
Examining data by highway type revealed that average free-flow speeds are strongly associated with posted speed limits with correlation coefficients of +0.99, +1.00, and +1.00 for urban streets, multilane highways, and freeways, respectively.}, 192 | day = {01}, 193 | doi = {10.1007/s12205-014-0481-7}, 194 | url = {https://doi.org/10.1007/s12205-014-0481-7}, 195 | } 196 | 197 | @Article{Gu2015RecentAI, 198 | author = {Jiuxiang Gu and Zhenhua Wang and Jason Kuen and Lianyang Ma and Amir Shahroudy and Bing Shuai and Ting Liu and Xingxing Wang and Gang Wang}, 199 | title = {Recent Advances in Convolutional Neural Networks}, 200 | journal = {CoRR}, 201 | year = {2015}, 202 | volume = {abs/1512.07108}, 203 | } 204 | 205 | @InProceedings{lrcn2014, 206 | author = {Jeff Donahue and Lisa Anne Hendricks and Sergio Guadarrama and Marcus Rohrbach and Subhashini Venugopalan and Kate Saenko and Trevor Darrell}, 207 | title = {Long-term Recurrent Convolutional Networks for Visual Recognition and Description}, 208 | booktitle = {CVPR}, 209 | year = {2015}, 210 | } 211 | 212 | @Article{DBLP:journals/corr/Ioffe17, 213 | author = {Sergey Ioffe}, 214 | title = {Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models}, 215 | journal = {CoRR}, 216 | year = {2017}, 217 | volume = {abs/1702.03275}, 218 | bibsource = {dblp computer science bibliography, http://dblp.org}, 219 | biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/Ioffe17}, 220 | timestamp = {Wed, 07 Jun 2017 14:42:44 +0200}, 221 | url = {http://arxiv.org/abs/1702.03275}, 222 | } 223 | 224 | @InCollection{LeCun:1998:CNI:303568.303704, 225 | author = {LeCun, Yann and Bengio, Yoshua}, 226 | title = {The Handbook of Brain Theory and Neural Networks}, 227 | publisher = {MIT Press}, 228 | year = {1998}, 229 | editor = {Arbib, Michael A.}, 230 | chapter = {Convolutional Networks for Images, Speech, and Time Series}, 231 | pages = {255--258}, 232 | address = {Cambridge, MA, USA}, 233 | isbn = {0-262-51102-9}, 234 | acmid = {303704}, 235 | numpages = {4}, 236 | url = {http://dl.acm.org/citation.cfm?id=303568.303704}, 237 | } 238 | 239 | @InProceedings{LeCun:1998:EB:645754.668382, 240 | author = {LeCun, Yann and Bottou, L{\'e}on and Orr, Genevieve B. and M\"{u}ller, Klaus-Robert}, 241 | title = {Effiicient BackProp}, 242 | booktitle = {Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop}, 243 | year = {1998}, 244 | pages = {9--50}, 245 | address = {London, UK, UK}, 246 | publisher = {Springer-Verlag}, 247 | acmid = {668382}, 248 | isbn = {3-540-65311-2}, 249 | numpages = {42}, 250 | url = {http://dl.acm.org/citation.cfm?id=645754.668382}, 251 | } 252 | 253 | @Article{Srivastava:2014:DSW:2627435.2670313, 254 | author = {Srivastava, Nitish and Hinton, Geoffrey and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan}, 255 | title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting}, 256 | journal = {J. Mach. Learn. 
Res.}, 257 | year = {2014}, 258 | volume = {15}, 259 | number = {1}, 260 | pages = {1929--1958}, 261 | month = jan, 262 | issn = {1532-4435}, 263 | acmid = {2670313}, 264 | issue_date = {January 2014}, 265 | keywords = {deep learning, model combination, neural networks, regularization}, 266 | numpages = {30}, 267 | publisher = {JMLR.org}, 268 | url = {http://dl.acm.org/citation.cfm?id=2627435.2670313}, 269 | } 270 | 271 | @Article{DBLP:journals/corr/SimonyanZ14a, 272 | author = {Simonyan, Karen and Zisserman, Andrew}, 273 | title = {Very Deep Convolutional Networks for Large-Scale Image Recognition}, 274 | journal = {CoRR}, 275 | year = {2014}, 276 | volume = {abs/1409.1556}, 277 | bibsource = {dblp computer science bibliography, http://dblp.org}, 278 | interhash = {4e6fa56cb7cf99400d5701543ee228de}, 279 | intrahash = {0ee0434e0a70b329d5518f43f1742f7a}, 280 | url = {http://arxiv.org/abs/1409.1556}, 281 | } 282 | 283 | @InProceedings{Goodfellow13maxoutnetworks, 284 | author = {Ian J. Goodfellow and David Warde-farley and Mehdi Mirza and Aaron Courville and Yoshua Bengio}, 285 | title = {Maxout networks}, 286 | booktitle = {In ICML}, 287 | year = {2013}, 288 | } 289 | 290 | @InProceedings{Gers2000c, 291 | author = {Gers, F. A. and Schmidhuber, J.}, 292 | title = {Recurrent Nets that Time and Count}, 293 | booktitle = {{Proceedings of the IJCNN'2000, Int. Joint Conf. on Neural Networks}}, 294 | year = {2000}, 295 | address = {Como, Italy}, 296 | interhash = {01a07138f9b65e4eccfa440aed281bf5}, 297 | intrahash = {f0560c89bdc9fd511c03ed8d3ace008d}, 298 | owner = {thierry}, 299 | timestamp = {2009.04.18}, 300 | } 301 | 302 | @Article{Gers:2000:LFC:1121912.1121915, 303 | author = {Gers, Felix A. and Schmidhuber, J\"{u}rgen A. and Cummins, Fred A.}, 304 | title = {Learning to Forget: Continual Prediction with LSTM}, 305 | journal = {Neural Comput.}, 306 | year = {2000}, 307 | volume = {12}, 308 | number = {10}, 309 | pages = {2451--2471}, 310 | month = oct, 311 | issn = {0899-7667}, 312 | acmid = {1121915}, 313 | address = {Cambridge, MA, USA}, 314 | doi = {10.1162/089976600300015015}, 315 | issue_date = {October 2000}, 316 | numpages = {21}, 317 | publisher = {MIT Press}, 318 | url = {http://dx.doi.org/10.1162/089976600300015015}, 319 | } 320 | 321 | @Article{citeulike:14070430, 322 | author = {Srivastava, Rupesh K. 
and Greff, Klaus and Schmidhuber, Jurgen}, 323 | title = {{Highway Networks}}, 324 | citeulike-article-id = {14070430}, 325 | citeulike-linkout-0 = {http://arxiv.org/pdf/1505.00387v1.pdf}, 326 | keywords = {deep\_learning\_architectures, deep\_learning\_theory, lstm, multilayer\_networks, networks, neural, schmidhuber\_jurgen}, 327 | posted-at = {2016-06-16 15:29:20}, 328 | url = {http://arxiv.org/pdf/1505.00387v1.pdf}, 329 | } 330 | 331 | @Article{HuangGLZLW, 332 | author = {Huang, Gao and Liu, Zhuang and Van De Maaten, Laurens and Weinberger, Kilian}, 333 | title = {Densely Connected Convolutional Networks}, 334 | year = {2017}, 335 | month = {07}, 336 | } 337 | 338 | @Article{QIAN1999145, 339 | author = {Ning Qian}, 340 | title = {On the momentum term in gradient descent learning algorithms}, 341 | journal = {Neural Networks}, 342 | year = {1999}, 343 | volume = {12}, 344 | number = {1}, 345 | pages = {145 - 151}, 346 | issn = {0893-6080}, 347 | doi = {http://dx.doi.org/10.1016/S0893-6080(98)00116-6}, 348 | keywords = {Momentum, Gradient descent learning algorithm, Damped harmonic oscillator, Critical damping, Learning rate, Speed of convergence}, 349 | url = {http://www.sciencedirect.com/science/article/pii/S0893608098001166}, 350 | } 351 | 352 | @InProceedings{nesterov1983method, 353 | author = {Nesterov, Yurii}, 354 | title = {A method for unconstrained convex minimization problem with the rate of convergence O (1/k2)}, 355 | booktitle = {Doklady an SSSR}, 356 | year = {1983}, 357 | volume = {269}, 358 | number = {3}, 359 | pages = {543--547}, 360 | } 361 | 362 | @Article{Duchi:2011:ASM:1953048.2021068, 363 | author = {Duchi, John and Hazan, Elad and Singer, Yoram}, 364 | title = {Adaptive Subgradient Methods for Online Learning and Stochastic Optimization}, 365 | journal = {J. Mach. Learn. Res.}, 366 | year = {2011}, 367 | volume = {12}, 368 | pages = {2121--2159}, 369 | month = jul, 370 | issn = {1532-4435}, 371 | acmid = {2021068}, 372 | issue_date = {2/1/2011}, 373 | numpages = {39}, 374 | publisher = {JMLR.org}, 375 | url = {http://dl.acm.org/citation.cfm?id=1953048.2021068}, 376 | } 377 | 378 | @Article{journals/corr/abs-1212-5701, 379 | author = {Zeiler, Matthew D.}, 380 | title = {ADADELTA: An Adaptive Learning Rate Method}, 381 | journal = {CoRR}, 382 | year = {2012}, 383 | volume = {abs/1212.5701}, 384 | ee = {http://arxiv.org/abs/1212.5701}, 385 | interhash = {0485dc964af0cd2296b6868b7f97c90d}, 386 | intrahash = {593eceee0e364927f3dd9c85e788bba8}, 387 | url = {http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-5701}, 388 | } 389 | 390 | @InProceedings{Lecun98gradient-basedlearning, 391 | author = {Yann Lecun and Léon Bottou and Yoshua Bengio and Patrick Haffner}, 392 | title = {Gradient-based learning applied to document recognition}, 393 | booktitle = {Proceedings of the IEEE}, 394 | year = {1998}, 395 | pages = {2278--2324}, 396 | } 397 | 398 | @InCollection{NIPS2012_4824, 399 | author = {Alex Krizhevsky and Sutskever, Ilya and Hinton, Geoffrey E}, 400 | title = {ImageNet Classification with Deep Convolutional Neural Networks}, 401 | booktitle = {Advances in Neural Information Processing Systems 25}, 402 | publisher = {Curran Associates, Inc.}, 403 | year = {2012}, 404 | editor = {F. Pereira and C. J. C. Burges and L. Bottou and K. Q. 
Weinberger}, 405 | pages = {1097--1105}, 406 | url = {http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf}, 407 | } 408 | 409 | @Book{GravesA2016, 410 | title = {Supervised Sequence Labelling with Recurrent Neural Networks}, 411 | year = {2011}, 412 | author = {Graves, Alex}, 413 | added-at = {2016-12-07T19:03:54.000+0100}, 414 | biburl = {https://www.bibsonomy.org/bibtex/22fe5732cd8b62f4d09a140d5b40c82ec/hprop}, 415 | interhash = {ce5e3e2888eb4afd21867cfb9639bc23}, 416 | intrahash = {2fe5732cd8b62f4d09a140d5b40c82ec}, 417 | keywords = {books machine-learning rnn}, 418 | timestamp = {2016-12-07T19:03:54.000+0100}, 419 | } 420 | 421 | @Article{Clevert2015FastAA, 422 | author = {Djork-Arn{\'e} Clevert and Thomas Unterthiner and Sepp Hochreiter}, 423 | title = {Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)}, 424 | journal = {CoRR}, 425 | year = {2015}, 426 | volume = {abs/1511.07289}, 427 | } 428 | 429 | @Book{Epelbaum2017, 430 | title = {Deep Learning: Technical Introduction}, 431 | publisher = {https://arxiv.org/abs/1709.xxxxx}, 432 | year = {2017}, 433 | author = {Thomas Epelbaum}, 434 | } 435 | 436 | @Article{citeulike:14069459, 437 | author = {Greff, Klaus and Srivastava, Rupesh K. and Koutn{\i}k, Jan and Steunebrink, Bas R. and Schmidhuber, Jurgen}, 438 | title = {{LSTM}: A Search Space Odyssey}, 439 | citeulike-article-id = {14069459}, 440 | citeulike-linkout-0 = {http://arxiv.org/pdf/1503.04069.pdf}, 441 | keywords = {lstm, recurrent, rnn, schmidhuber\_jurgen}, 442 | posted-at = {2016-06-15 14:04:20}, 443 | url = {http://arxiv.org/pdf/1503.04069.pdf}, 444 | } 445 | 446 | @InProceedings{citeulike:4571969, 447 | author = {Akaike, H.}, 448 | title = {{Information theory and an extension of the maximum likelihood principle}}, 449 | booktitle = {Second International Symposium on Information Theory}, 450 | year = {1973}, 451 | editor = {Petrov, B. N. and Csaki, F.}, 452 | pages = {267--281}, 453 | address = {Budapest}, 454 | publisher = {Akad\'{e}miai Kiado}, 455 | citeulike-article-id = {4571969}, 456 | keywords = {modelselection}, 457 | posted-at = {2009-05-21 19:56:32}, 458 | } 459 | 460 | @Comment{jabref-meta: databaseType:bibtex;} 461 | -------------------------------------------------------------------------------- /ELU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ELU.pdf -------------------------------------------------------------------------------- /GoogleNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/GoogleNet.pdf -------------------------------------------------------------------------------- /Inception.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Inception.pdf -------------------------------------------------------------------------------- /Introduction.tex: -------------------------------------------------------------------------------- 1 | \chapter{Introduction} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}T}his note aims at presenting the three most common forms of neural network architectures. 
It does so in a technical though hopefully pedagogical way, building up in complexity as one progresses through the chapters. 4 | 5 | \vspace{0.2cm} 6 | 7 | Chapter \ref{sec:chapterFNN} starts with the first type of network introduced historically: a regular feedforward neural network, itself an evolution of the original perceptron \cite{Rosenblatt58theperceptron:} algorithm. One should see the latter as a non-linear regression, and feedforward networks schematically stack perceptron layers on top of one another. 8 | 9 | \vspace{0.2cm} 10 | 11 | We will thus introduce in chapter \ref{sec:chapterFNN} the fundamental building blocks of the simplest neural network layers: weight averaging and activation functions (a schematic sketch in index form is given at the end of this introduction). We will also introduce gradient descent, which, coupled with the backpropagation algorithm, is how the network is trained to minimize a loss function adapted to the task at hand (classification or regression). The more technical details of the backpropagation algorithm are found in the appendix of this chapter, alongside an introduction to the state-of-the-art feedforward neural network, the ResNet. One can finally find a short matrix description of the feedforward network. 12 | 13 | \vspace{0.2cm} 14 | 15 | In chapter \ref{sec:chapterCNN}, we present the second type of neural network studied: the convolutional network, particularly suited to processing and labelling images. This implies presenting the mathematical tools related to this network (convolution, pooling, stride...), as well as the modifications of the building blocks introduced in chapter \ref{sec:chapterFNN}. Several convolutional architectures are then presented, and the appendices once again detail the difficult steps of the main text. 16 | 17 | \vspace{0.2cm} 18 | 19 | Chapter \ref{sec:chapterRNN} finally presents the network architecture suited to data with a temporal structure -- such as time series -- the recurrent neural network. There again, the novelties and the modifications of the material introduced in the two previous chapters are detailed in the main text, while the appendices give all that one needs to understand the most cumbersome formula of this kind of network architecture.
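\vspace{0.2cm}

To fix ideas, here is a minimal sketch of the two operations that chapter \ref{sec:chapterFNN} details, written in the index notation favoured throughout. The symbols used here ($h$ for hidden units, $\Theta$ for weights, $b$ for biases, $g$ for the activation function, $J$ for the loss and $\eta$ for the learning rate) are illustrative placeholders, not necessarily the notation adopted in the main text:
\begin{align}
h^{(\nu)}_{f} &= g\left(b^{(\nu)}_{f}+\sum_{f'} \Theta^{(\nu)}_{f f'}\, h^{(\nu-1)}_{f'}\right), &
\Theta^{(\nu)}_{f f'} &\longleftarrow \Theta^{(\nu)}_{f f'} - \eta\, \frac{\partial J}{\partial \Theta^{(\nu)}_{f f'}}\,.
\end{align}
The first relation is the weight averaging of the previous layer's output followed by an activation function; the second is the gradient descent update of a weight, the gradient of the loss being computed by the backpropagation algorithm.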
20 | -------------------------------------------------------------------------------- /LSTM_structure-peephole.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure-peephole.pdf -------------------------------------------------------------------------------- /LSTM_structure-tot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure-tot.pdf -------------------------------------------------------------------------------- /LSTM_structure.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LSTM_structure.pdf -------------------------------------------------------------------------------- /LeNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/LeNet.pdf -------------------------------------------------------------------------------- /Mediamobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/Mediamobile.png -------------------------------------------------------------------------------- /Preface.tex: -------------------------------------------------------------------------------- 1 | \chapter{Preface} 2 | 3 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I} started learning about deep learning fundamentals in February 2017. At that time, I knew nothing about backpropagation, and was completely ignorant about the differences between a Feedforward, a Convolutional and a Recurrent Neural Network. 4 | 5 | \vspace{0.2cm} 6 | 7 | As I navigated through the humongous amount of data available on deep learning online, I found myself quite frustrated when it came to really understanding what deep learning is, rather than just applying it with some available library. 8 | 9 | \vspace{0.2cm} 10 | 11 | In particular, the backpropagation update rules are seldom derived, and never in index form. Unfortunately for me, I have an "index" mind: seeing a 4-dimensional convolution formula in matrix form does not do it for me. Since I am also stupid enough to like recoding the wheel in low-level programming languages, the matrix form cannot be directly converted into working code either. 12 | 13 | 14 | \vspace{0.2cm} 15 | 16 | I therefore started some notes for my personal use, where I tried to rederive everything from scratch in index form. 17 | 18 | \vspace{0.2cm} 19 | 20 | I did so for the vanilla Feedforward network, then learned about L1 and L2 regularization, dropout\cite{Srivastava:2014:DSW:2627435.2670313}, batch normalization\cite{Ioffe2015}, several gradient descent optimization techniques... I then turned to convolutional networks, from conventional conv-pool architectures with a single-digit number of layers\cite{Lecun98gradient-basedlearning} to recent VGG\cite{DBLP:journals/corr/SimonyanZ14a} and ResNet\cite{He2015} ones, from local contrast normalization and rectification to batchnorm...
And finally I studied Recurrent Neural Network structures\cite{GravesA2016}, from the standard formulation to the most recent LSTM one\cite{Gers:2000:LFC:1121912.1121915}. 21 | 22 | \vspace{0.2cm} 23 | 24 | As my work progressed, my notes got bigger and bigger, to the point where I realized I might have enough material to help others starting their own deep learning journey. 25 | 26 | \vspace{0.2cm} 27 | 28 | This work is bottom-up at its core. If you are searching for a working Neural Network in 10 lines of code and 5 minutes of your time, you have come to the wrong place. If you can mentally multiply and convolve 4D tensors, then I have nothing to convey to you either. 29 | 30 | \vspace{0.2cm} 31 | 32 | If on the other hand you like(d) to rederive every tiny calculation of every theorem of every class that you stepped into, then you might be interested in what follows! 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Technical Book on Deep Learning 2 | 3 | This note presents in a technical though hopefully pedagogical way the three most common forms of neural network architectures: Feedforward, Convolutional and Recurrent. 4 | 5 | For each network, the fundamental building blocks are detailed. The forward pass and the update rules for the backpropagation algorithm are then derived in full. 6 | 7 | The pdf of the whole document can be downloaded directly: [White_book.pdf](https://github.com/tomepel/Technical_Book_DL/raw/master/White_book.pdf). 8 | 9 | Otherwise, all the figures contained in the note are included in this repo, as well as the tex files needed for compilation. Just don't forget to cite the source if you use any of this material! :) 10 | 11 | Hope it can help others! 12 | 13 | # Acknowledgement 14 | 15 | This work has no benefit nor added value to the deep learning topic on its own. It is just the reformulation of ideas of brighter researchers to fit a peculiar mindset: the one of preferring formulas with ten indices but where one knows precisely what one is manipulating rather than (in my opinion sometimes opaque) matrix formulations where the dimensions of the objects are rarely if ever specified. 16 | 17 | Among the brighter people from whom I learned online is Andrew Ng. His Coursera class (https://www.coursera.org/learn/machine-learning) was the first contact I had with Neural Networks, and this pedagogical introduction allowed me to build on solid ground. 18 | 19 | I also wish to particularly thank Hugo Larochelle, who not only built a wonderful deep learning class (http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html), but was also kind enough to answer emails from a complete beginner and stranger! 20 | 21 | The Stanford class on convolutional networks (http://cs231n.github.io/convolutional-networks/) proved extremely valuable to me, as did the one on Natural Language Processing (http://web.stanford.edu/class/cs224n/). 22 | 23 | I also benefited greatly from Sebastian Ruder's blog (http://ruder.io/#open), both from the blog pages on gradient descent optimization techniques and from the author himself. 24 | 25 | I learned more about LSTM on colah's blog (http://colah.github.io/posts/2015-08-Understanding-LSTMs/), and some of my drawings are inspired from there. 26 | 27 | I also thank Jonathan Del Hoyo for the great articles that he regularly shares on LinkedIn.
28 | 29 | Many thanks go to my collaborators at Mediamobile, who let me dig as deep as I wanted on Neural Networks. I am especially indebted to Clément, Nicolas, Jessica, Christine and Céline. 30 | 31 | Thanks to Jean-Michel Loubes and Fabrice Gamboa, from whom I learned a great deal on probability theory and statistics. 32 | 33 | I end this list with my employer, Mediamobile, which has been kind enough to let me work on this topic with complete freedom. A special thanks to Philippe, who supervised me with the perfect balance of feedback and freedom! 34 | 35 | # Contact 36 | 37 | If you detect any typo, error (as I am sure that there unfortunately still are), or feel that I forgot to cite an important source, don't hesitate to email me: thomas.epelbaum@shift-technology.com 38 | -------------------------------------------------------------------------------- /RNN_structure-tot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/RNN_structure-tot.pdf -------------------------------------------------------------------------------- /RNN_structure.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/RNN_structure.pdf -------------------------------------------------------------------------------- /ReLU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ReLU.pdf -------------------------------------------------------------------------------- /ResNet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/ResNet.pdf -------------------------------------------------------------------------------- /S_FNN.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/S_FNN.pdf -------------------------------------------------------------------------------- /ThesisStyle.cls: -------------------------------------------------------------------------------- 1 | %% 2 | %% This is file `book.cls', 3 | %% generated with the docstrip utility. 4 | %% 5 | %% The original source files were: 6 | %% 7 | %% classes.dtx (with options: `book') 8 | %% 9 | %% This is a generated file. 10 | %% 11 | %% Copyright 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 12 | %% The LaTeX3 Project and any individual authors listed elsewhere 13 | %% in this file. 14 | %% 15 | %% This file was generated from file(s) of the LaTeX base system. 16 | %% -------------------------------------------------------------- 17 | %% 18 | %% It may be distributed and/or modified under the 19 | %% conditions of the LaTeX Project Public License, either version 1.3 20 | %% of this license or (at your option) any later version. 21 | %% The latest version of this license is in 22 | %% http://www.latex-project.org/lppl.txt 23 | %% and version 1.3 or later is part of all distributions of LaTeX 24 | %% version 2003/12/01 or later. 25 | %% 26 | %% This file has the LPPL maintenance status "maintained". 
27 | %% 28 | %% This file may only be distributed together with a copy of the LaTeX 29 | %% base system. You may however distribute the LaTeX base system without 30 | %% such generated files. 31 | %% 32 | %% The list of all files belonging to the LaTeX base distribution is 33 | %% given in the file `manifest.txt'. See also `legal.txt' for additional 34 | %% information. 35 | %% 36 | %% The list of derived (unpacked) files belonging to the distribution 37 | %% and covered by LPPL is defined by the unpacking scripts (with 38 | %% extension .ins) which are part of the distribution. 39 | %% \CharacterTable 40 | %% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z 41 | %% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z 42 | %% Digits \0\1\2\3\4\5\6\7\8\9 43 | %% Exclamation \! Double quote \" Hash (number) \# 44 | %% Dollar \$ Percent \% Ampersand \& 45 | %% Acute accent \' Left paren \( Right paren \) 46 | %% Asterisk \* Plus \+ Comma \, 47 | %% Minus \- Point \. Solidus \/ 48 | %% Colon \: Semicolon \; Less than \< 49 | %% Equals \= Greater than \> Question mark \? 50 | %% Commercial at \@ Left bracket \[ Backslash \\ 51 | %% Right bracket \] Circumflex \^ Underscore \_ 52 | %% Grave accent \` Left brace \{ Vertical bar \| 53 | %% Right brace \} Tilde \~} 54 | \NeedsTeXFormat{LaTeX2e}[1995/12/01] 55 | \ProvidesClass{ThesisStyle} 56 | [2004/02/16 v1.4f 57 | Standard LaTeX document class] 58 | \newcommand\@ptsize{} 59 | \newif\if@restonecol 60 | \newif\if@titlepage 61 | \@titlepagetrue 62 | \newif\if@openright 63 | \newif\if@mainmatter \@mainmattertrue 64 | \if@compatibility\else 65 | \DeclareOption{a4paper} 66 | {\setlength\paperheight {297mm}% 67 | \setlength\paperwidth {210mm}} 68 | \DeclareOption{a5paper} 69 | {\setlength\paperheight {210mm}% 70 | \setlength\paperwidth {148mm}} 71 | \DeclareOption{b5paper} 72 | {\setlength\paperheight {250mm}% 73 | \setlength\paperwidth {176mm}} 74 | \DeclareOption{letterpaper} 75 | {\setlength\paperheight {11in}% 76 | \setlength\paperwidth {8.5in}} 77 | \DeclareOption{legalpaper} 78 | {\setlength\paperheight {14in}% 79 | \setlength\paperwidth {8.5in}} 80 | \DeclareOption{executivepaper} 81 | {\setlength\paperheight {10.5in}% 82 | \setlength\paperwidth {7.25in}} 83 | \DeclareOption{landscape} 84 | {\setlength\@tempdima {\paperheight}% 85 | \setlength\paperheight {\paperwidth}% 86 | \setlength\paperwidth {\@tempdima}} 87 | \fi 88 | \if@compatibility 89 | \renewcommand\@ptsize{0} 90 | \else 91 | \DeclareOption{10pt}{\renewcommand\@ptsize{0}} 92 | \fi 93 | \DeclareOption{11pt}{\renewcommand\@ptsize{1}} 94 | \DeclareOption{12pt}{\renewcommand\@ptsize{2}} 95 | \if@compatibility\else 96 | \DeclareOption{oneside}{\@twosidefalse \@mparswitchfalse} 97 | \fi 98 | \DeclareOption{twoside}{\@twosidetrue \@mparswitchtrue} 99 | \DeclareOption{draft}{\setlength\overfullrule{5pt}} 100 | \if@compatibility\else 101 | \DeclareOption{final}{\setlength\overfullrule{0pt}} 102 | \fi 103 | \DeclareOption{titlepage}{\@titlepagetrue} 104 | \if@compatibility\else 105 | \DeclareOption{notitlepage}{\@titlepagefalse} 106 | \fi 107 | \if@compatibility 108 | \@openrighttrue 109 | \else 110 | \DeclareOption{openright}{\@openrighttrue} 111 | \DeclareOption{openany}{\@openrightfalse} 112 | \fi 113 | \if@compatibility\else 114 | \DeclareOption{onecolumn}{\@twocolumnfalse} 115 | \fi 116 | \DeclareOption{twocolumn}{\@twocolumntrue} 117 | \DeclareOption{leqno}{\input{leqno.clo}} 118 | \DeclareOption{fleqn}{\input{fleqn.clo}} 119 | \DeclareOption{openbib}{% 120 | 
\AtEndOfPackage{% 121 | \renewcommand\@openbib@code{% 122 | \advance\leftmargin\bibindent 123 | \itemindent -\bibindent 124 | \listparindent \itemindent 125 | \parsep \z@ 126 | }% 127 | \renewcommand\newblock{\par}}% 128 | } 129 | \ExecuteOptions{letterpaper,12pt,twoside,onecolumn,final,openright} 130 | \ProcessOptions 131 | \input{bk1\@ptsize.clo} 132 | \setlength\lineskip{1\p@} 133 | \setlength\normallineskip{1\p@} 134 | \renewcommand\baselinestretch{} 135 | \setlength\parskip{0\p@ \@plus \p@} 136 | \@lowpenalty 51 137 | \@medpenalty 151 138 | \@highpenalty 301 139 | \setcounter{topnumber}{2} 140 | \renewcommand\topfraction{.7} 141 | \setcounter{bottomnumber}{1} 142 | \renewcommand\bottomfraction{.3} 143 | \setcounter{totalnumber}{3} 144 | \renewcommand\textfraction{.2} 145 | \renewcommand\floatpagefraction{.5} 146 | \setcounter{dbltopnumber}{2} 147 | \renewcommand\dbltopfraction{.7} 148 | \renewcommand\dblfloatpagefraction{.5} 149 | \if@twoside 150 | \def\ps@headings{% 151 | \let\@oddfoot\@empty\let\@evenfoot\@empty 152 | \def\@evenhead{\thepage\hfil\slshape\leftmark}% 153 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 154 | \let\@mkboth\markboth 155 | \def\chaptermark##1{% 156 | \markboth {\MakeUppercase{% 157 | \ifnum \c@secnumdepth >\m@ne 158 | \if@mainmatter 159 | \@chapapp\ \thechapter. \ % 160 | \fi 161 | \fi 162 | ##1}}{}}% 163 | \def\sectionmark##1{% 164 | \markright {\MakeUppercase{% 165 | \ifnum \c@secnumdepth >\z@ 166 | \thesection. \ % 167 | \fi 168 | ##1}}}} 169 | \else 170 | \def\ps@headings{% 171 | \let\@oddfoot\@empty 172 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 173 | \let\@mkboth\markboth 174 | \def\chaptermark##1{% 175 | \markright {\MakeUppercase{% 176 | \ifnum \c@secnumdepth >\m@ne 177 | \if@mainmatter 178 | \@chapapp\ \thechapter. \ % 179 | \fi 180 | \fi 181 | ##1}}}} 182 | \fi 183 | \def\ps@myheadings{% 184 | \let\@oddfoot\@empty\let\@evenfoot\@empty 185 | \def\@evenhead{\thepage\hfil\slshape\leftmark}% 186 | \def\@oddhead{{\slshape\rightmark}\hfil\thepage}% 187 | \let\@mkboth\@gobbletwo 188 | \let\chaptermark\@gobble 189 | \let\sectionmark\@gobble 190 | } 191 | \if@titlepage 192 | \newcommand\maketitle{\begin{titlepage}% 193 | \let\footnotesize\small 194 | \let\footnoterule\relax 195 | \let \footnote \thanks 196 | \null\vfil 197 | \vskip 60\p@ 198 | \begin{center}% 199 | {\LARGE \@title \par}% 200 | \vskip 3em% 201 | {\large 202 | \lineskip .75em% 203 | \begin{tabular}[t]{c}% 204 | \@author 205 | \end{tabular}\par}% 206 | \vskip 1.5em% 207 | {\large \@date \par}% % Set date in \large size. 208 | \end{center}\par 209 | \@thanks 210 | \vfil\null 211 | \end{titlepage}% 212 | \setcounter{footnote}{0}% 213 | \global\let\thanks\relax 214 | \global\let\maketitle\relax 215 | \global\let\@thanks\@empty 216 | \global\let\@author\@empty 217 | \global\let\@date\@empty 218 | \global\let\@title\@empty 219 | \global\let\title\relax 220 | \global\let\author\relax 221 | \global\let\date\relax 222 | \global\let\and\relax 223 | } 224 | \else 225 | \newcommand\maketitle{\par 226 | \begingroup 227 | \renewcommand\thefootnote{\@fnsymbol\c@footnote}% 228 | \def\@makefnmark{\rlap{\@textsuperscript{\normalfont\@thefnmark}}}% 229 | \long\def\@makefntext##1{\parindent 1em\noindent 230 | \hb@xt@1.8em{% 231 | \hss\@textsuperscript{\normalfont\@thefnmark}}##1}% 232 | \if@twocolumn 233 | \ifnum \col@number=\@ne 234 | \@maketitle 235 | \else 236 | \twocolumn[\@maketitle]% 237 | \fi 238 | \else 239 | \newpage 240 | \global\@topnum\z@ % Prevents figures from going at top of page. 
241 | \@maketitle 242 | \fi 243 | \thispagestyle{plain}\@thanks 244 | \endgroup 245 | \setcounter{footnote}{0}% 246 | \global\let\thanks\relax 247 | \global\let\maketitle\relax 248 | \global\let\@maketitle\relax 249 | \global\let\@thanks\@empty 250 | \global\let\@author\@empty 251 | \global\let\@date\@empty 252 | \global\let\@title\@empty 253 | \global\let\title\relax 254 | \global\let\author\relax 255 | \global\let\date\relax 256 | \global\let\and\relax 257 | } 258 | \def\@maketitle{% 259 | \newpage 260 | \null 261 | \vskip 2em% 262 | \begin{center}% 263 | \let \footnote \thanks 264 | {\LARGE \@title \par}% 265 | \vskip 1.5em% 266 | {\large 267 | \lineskip .5em% 268 | \begin{tabular}[t]{c}% 269 | \@author 270 | \end{tabular}\par}% 271 | \vskip 1em% 272 | {\large \@date}% 273 | \end{center}% 274 | \par 275 | \vskip 1.5em} 276 | \fi 277 | \newcommand*\chaptermark[1]{} 278 | \setcounter{secnumdepth}{2} 279 | \newcounter {part} 280 | \newcounter {chapter} 281 | \newcounter {section}[chapter] 282 | \newcounter {subsection}[section] 283 | \newcounter {subsubsection}[subsection] 284 | \newcounter {paragraph}[subsubsection] 285 | \newcounter {subparagraph}[paragraph] 286 | \renewcommand \thepart {\@Roman\c@part} 287 | \renewcommand \thechapter {\@arabic\c@chapter} 288 | \renewcommand \thesection {\thechapter.\@arabic\c@section} 289 | \renewcommand\thesubsection {\thesection.\@arabic\c@subsection} 290 | \renewcommand\thesubsubsection{\thesubsection .\@arabic\c@subsubsection} 291 | \renewcommand\theparagraph {\thesubsubsection.\@arabic\c@paragraph} 292 | \renewcommand\thesubparagraph {\theparagraph.\@arabic\c@subparagraph} 293 | \newcommand\@chapapp{\chaptername} 294 | \newcommand\frontmatter{% 295 | \cleardoublepage 296 | \@mainmatterfalse 297 | \pagenumbering{roman}} 298 | \newcommand\mainmatter{% 299 | \cleardoublepage 300 | \@mainmattertrue 301 | \pagenumbering{arabic}} 302 | \newcommand\backmatter{% 303 | \if@openright 304 | \cleardoublepage 305 | \else 306 | \clearpage 307 | \fi 308 | \@mainmatterfalse} 309 | \newcommand\part{% 310 | \if@openright 311 | \cleardoublepage 312 | \else 313 | \clearpage 314 | \fi 315 | \thispagestyle{plain}% 316 | \if@twocolumn 317 | \onecolumn 318 | \@tempswatrue 319 | \else 320 | \@tempswafalse 321 | \fi 322 | \null\vfil 323 | \secdef\@part\@spart} 324 | 325 | \def\@part[#1]#2{% 326 | \ifnum \c@secnumdepth >-2\relax 327 | \refstepcounter{part}% 328 | \addcontentsline{toc}{part}{\thepart\hspace{1em}#1}% 329 | \else 330 | \addcontentsline{toc}{part}{#1}% 331 | \fi 332 | \markboth{}{}% 333 | {\centering 334 | \interlinepenalty \@M 335 | \normalfont 336 | \ifnum \c@secnumdepth >-2\relax 337 | \huge\bfseries \partname\nobreakspace\thepart 338 | \par 339 | \vskip 20\p@ 340 | \fi 341 | \Huge \bfseries #2\par}% 342 | \@endpart} 343 | \def\@spart#1{% 344 | {\centering 345 | \interlinepenalty \@M 346 | \normalfont 347 | \Huge \bfseries #1\par}% 348 | \@endpart} 349 | \def\@endpart{\vfil\newpage 350 | \if@twoside 351 | \if@openright 352 | \null 353 | \thispagestyle{empty}% 354 | \newpage 355 | \fi 356 | \fi 357 | \if@tempswa 358 | \twocolumn 359 | \fi} 360 | \newcommand\chapter{\if@openright\cleardoublepage\else\clearpage\fi 361 | \thispagestyle{plain}% 362 | \global\@topnum\z@ 363 | \@afterindentfalse 364 | \secdef\@chapter\@schapter} 365 | \def\@chapter[#1]#2{\ifnum \c@secnumdepth >\m@ne 366 | \if@mainmatter 367 | \refstepcounter{chapter}% 368 | \typeout{\@chapapp\space\thechapter.}% 369 | \addcontentsline{toc}{chapter}% 370 | {\protect\numberline{\thechapter}#1}% 371 | 
\else 372 | \addcontentsline{toc}{chapter}{#1}% 373 | \fi 374 | \else 375 | \addcontentsline{toc}{chapter}{#1}% 376 | \fi 377 | \chaptermark{#1}% 378 | \addtocontents{lof}{\protect\addvspace{10\p@}}% 379 | \addtocontents{lot}{\protect\addvspace{10\p@}}% 380 | \if@twocolumn 381 | \@topnewpage[\@makechapterhead{#2}]% 382 | \else 383 | \@makechapterhead{#2}% 384 | \@afterheading 385 | \fi} 386 | 387 | \def\@makechapterhead#1{% 388 | % \vspace*{10\p@}% 389 | {\parindent \z@ \raggedright \normalfont 390 | \begin{flushright} 391 | \ifnum \c@secnumdepth >\m@ne 392 | \if@mainmatter 393 | % \huge\bfseries 394 | {\Large \scshape \@chapapp\space \thechapter} 395 | \par\nobreak 396 | % \vskip 0\p@ 397 | \fi 398 | \fi 399 | \interlinepenalty\@M 400 | \Huge \bfseries #1\par\nobreak 401 | \hrulefill 402 | \end{flushright} 403 | \vskip 20\p@ 404 | }} 405 | \def\@schapter#1{\if@twocolumn 406 | \@topnewpage[\@makeschapterhead{#1}]% 407 | \else 408 | \@makeschapterhead{#1}% 409 | \@afterheading 410 | \fi} 411 | \def\@makeschapterhead#1{% 412 | % \vspace*{10\p@}% 413 | {\parindent \z@ \raggedright 414 | \normalfont 415 | \interlinepenalty\@M 416 | \begin{flushright} 417 | \Huge \bfseries #1\par\nobreak 418 | \end{flushright} 419 | \vskip 20\p@ 420 | }} 421 | 422 | \renewcommand{\@makechapterhead}[1]{% 423 | \vspace*{50\p@}% 424 | {\parindent \z@ \raggedright \normalfont 425 | \hrule % horizontal line 426 | \vspace{5pt}% % add vertical space 427 | \ifnum \c@secnumdepth >\m@ne 428 | \begin{center}\huge\scshape\bfseries \@chapapp\space \thechapter\end{center} % Chapter number 429 | \par\nobreak 430 | \vskip 1\p@ 431 | \fi 432 | \interlinepenalty\@M 433 | \begin{center}\reflectbox{\includegraphics[scale=0.075]{VGG-conv.pdf}}\bfseries\huge\scshape#1\includegraphics[scale=0.075]{VGG-conv.pdf}\end{center}\par % chapter title 434 | \vspace{5pt}% % add vertical space 435 | \hrule % horizontal rule 436 | \nobreak 437 | \vskip 40\p@ 438 | }} 439 | 440 | \newcommand\section{\@startsection {section}{1}{\z@}% 441 | {-3.5ex \@plus -1ex \@minus -.2ex}% 442 | {2.3ex \@plus.2ex}% 443 | {\normalfont\Large\bfseries}} 444 | \newcommand\subsection{\@startsection{subsection}{2}{\z@}% 445 | {-3.25ex\@plus -1ex \@minus -.2ex}% 446 | {1.5ex \@plus .2ex}% 447 | {\normalfont\large\bfseries}} 448 | \newcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}% 449 | {-3.25ex\@plus -1ex \@minus -.2ex}% 450 | {1.5ex \@plus .2ex}% 451 | {\normalfont\normalsize\bfseries}} 452 | \newcommand\paragraph{\@startsection{paragraph}{4}{\z@}% 453 | {3.25ex \@plus1ex \@minus.2ex}% 454 | {-1em}% 455 | {\normalfont\normalsize\bfseries}} 456 | \newcommand\subparagraph{\@startsection{subparagraph}{5}{\parindent}% 457 | {3.25ex \@plus1ex \@minus .2ex}% 458 | {-1em}% 459 | {\normalfont\normalsize\bfseries}} 460 | \if@twocolumn 461 | \setlength\leftmargini {2em} 462 | \else 463 | \setlength\leftmargini {2.5em} 464 | \fi 465 | \leftmargin \leftmargini 466 | \setlength\leftmarginii {2.2em} 467 | \setlength\leftmarginiii {1.87em} 468 | \setlength\leftmarginiv {1.7em} 469 | \if@twocolumn 470 | \setlength\leftmarginv {.5em} 471 | \setlength\leftmarginvi {.5em} 472 | \else 473 | \setlength\leftmarginv {1em} 474 | \setlength\leftmarginvi {1em} 475 | \fi 476 | \setlength \labelsep {.5em} 477 | \setlength \labelwidth{\leftmargini} 478 | \addtolength\labelwidth{-\labelsep} 479 | \@beginparpenalty -\@lowpenalty 480 | \@endparpenalty -\@lowpenalty 481 | \@itempenalty -\@lowpenalty 482 | \renewcommand\theenumi{\@arabic\c@enumi} 483 | 
\renewcommand\theenumii{\@alph\c@enumii} 484 | \renewcommand\theenumiii{\@roman\c@enumiii} 485 | \renewcommand\theenumiv{\@Alph\c@enumiv} 486 | \newcommand\labelenumi{\theenumi.} 487 | \newcommand\labelenumii{(\theenumii)} 488 | \newcommand\labelenumiii{\theenumiii.} 489 | \newcommand\labelenumiv{\theenumiv.} 490 | \renewcommand\p@enumii{\theenumi} 491 | \renewcommand\p@enumiii{\theenumi(\theenumii)} 492 | \renewcommand\p@enumiv{\p@enumiii\theenumiii} 493 | \newcommand\labelitemi{\textbullet} 494 | \newcommand\labelitemii{\normalfont\bfseries \textendash} 495 | \newcommand\labelitemiii{\textasteriskcentered} 496 | \newcommand\labelitemiv{\textperiodcentered} 497 | \newenvironment{description} 498 | {\list{}{\labelwidth\z@ \itemindent-\leftmargin 499 | \let\makelabel\descriptionlabel}} 500 | {\endlist} 501 | \newcommand*\descriptionlabel[1]{\hspace\labelsep 502 | \normalfont\bfseries #1} 503 | \newenvironment{verse} 504 | {\let\\\@centercr 505 | \list{}{\itemsep \z@ 506 | \itemindent -1.5em% 507 | \listparindent\itemindent 508 | \rightmargin \leftmargin 509 | \advance\leftmargin 1.5em}% 510 | \item\relax} 511 | {\endlist} 512 | \newenvironment{quotation} 513 | {\list{}{\listparindent 1.5em% 514 | \itemindent \listparindent 515 | \rightmargin \leftmargin 516 | \parsep \z@ \@plus\p@}% 517 | \item\relax} 518 | {\endlist} 519 | \newenvironment{quote} 520 | {\list{}{\rightmargin\leftmargin}% 521 | \item\relax} 522 | {\endlist} 523 | \if@compatibility 524 | \newenvironment{titlepage} 525 | {% 526 | \cleardoublepage 527 | \if@twocolumn 528 | \@restonecoltrue\onecolumn 529 | \else 530 | \@restonecolfalse\newpage 531 | \fi 532 | \thispagestyle{empty}% 533 | \setcounter{page}\z@ 534 | }% 535 | {\if@restonecol\twocolumn \else \newpage \fi 536 | } 537 | \else 538 | \newenvironment{titlepage} 539 | {% 540 | \cleardoublepage 541 | \if@twocolumn 542 | \@restonecoltrue\onecolumn 543 | \else 544 | \@restonecolfalse\newpage 545 | \fi 546 | \thispagestyle{empty}% 547 | \setcounter{page}\@ne 548 | }% 549 | {\if@restonecol\twocolumn \else \newpage \fi 550 | \if@twoside\else 551 | \setcounter{page}\@ne 552 | \fi 553 | } 554 | \fi 555 | \newcommand\appendix{\par 556 | \setcounter{chapter}{0}% 557 | \setcounter{section}{0}% 558 | \gdef\@chapapp{\appendixname}% 559 | \gdef\thechapter{\@Alph\c@chapter}} 560 | \setlength\arraycolsep{5\p@} 561 | \setlength\tabcolsep{6\p@} 562 | \setlength\arrayrulewidth{.4\p@} 563 | \setlength\doublerulesep{2\p@} 564 | \setlength\tabbingsep{\labelsep} 565 | \skip\@mpfootins = \skip\footins 566 | \setlength\fboxsep{3\p@} 567 | \setlength\fboxrule{.4\p@} 568 | \@addtoreset {equation}{chapter} 569 | \renewcommand\theequation 570 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@equation} 571 | \newcounter{figure}[chapter] 572 | \renewcommand \thefigure 573 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@figure} 574 | \def\fps@figure{tbp} 575 | \def\ftype@figure{1} 576 | \def\ext@figure{lof} 577 | \def\fnum@figure{\figurename\nobreakspace\thefigure} 578 | \newenvironment{figure} 579 | {\@float{figure}} 580 | {\end@float} 581 | \newenvironment{figure*} 582 | {\@dblfloat{figure}} 583 | {\end@dblfloat} 584 | \newcounter{table}[chapter] 585 | \renewcommand \thetable 586 | {\ifnum \c@chapter>\z@ \thechapter.\fi \@arabic\c@table} 587 | \def\fps@table{tbp} 588 | \def\ftype@table{2} 589 | \def\ext@table{lot} 590 | \def\fnum@table{\tablename\nobreakspace\thetable} 591 | \newenvironment{table} 592 | {\@float{table}} 593 | {\end@float} 594 | \newenvironment{table*} 595 | {\@dblfloat{table}} 
596 | {\end@dblfloat} 597 | \newlength\abovecaptionskip 598 | \newlength\belowcaptionskip 599 | \setlength\abovecaptionskip{10\p@} 600 | \setlength\belowcaptionskip{0\p@} 601 | \long\def\@makecaption#1#2{% 602 | \vskip\abovecaptionskip 603 | \sbox\@tempboxa{#1: #2}% 604 | \ifdim \wd\@tempboxa >\hsize 605 | #1: #2\par 606 | \else 607 | \global \@minipagefalse 608 | \hb@xt@\hsize{\hfil\box\@tempboxa\hfil}% 609 | \fi 610 | \vskip\belowcaptionskip} 611 | \DeclareOldFontCommand{\rm}{\normalfont\rmfamily}{\mathrm} 612 | \DeclareOldFontCommand{\sf}{\normalfont\sffamily}{\mathsf} 613 | \DeclareOldFontCommand{\tt}{\normalfont\ttfamily}{\mathtt} 614 | \DeclareOldFontCommand{\bf}{\normalfont\bfseries}{\mathbf} 615 | \DeclareOldFontCommand{\it}{\normalfont\itshape}{\mathit} 616 | \DeclareOldFontCommand{\sl}{\normalfont\slshape}{\@nomath\sl} 617 | \DeclareOldFontCommand{\sc}{\normalfont\scshape}{\@nomath\sc} 618 | \DeclareRobustCommand*\cal{\@fontswitch\relax\mathcal} 619 | \DeclareRobustCommand*\mit{\@fontswitch\relax\mathnormal} 620 | \newcommand\@pnumwidth{1.55em} 621 | \newcommand\@tocrmarg{2.55em} 622 | \newcommand\@dotsep{4.5} 623 | \setcounter{tocdepth}{2} 624 | \newcommand\tableofcontents{% 625 | \if@twocolumn 626 | \@restonecoltrue\onecolumn 627 | \else 628 | \@restonecolfalse 629 | \fi 630 | \chapter*{\contentsname 631 | \@mkboth{% 632 | \MakeUppercase\contentsname}{\MakeUppercase\contentsname}}% 633 | \@starttoc{toc}% 634 | \if@restonecol\twocolumn\fi 635 | } 636 | \newcommand*\l@part[2]{% 637 | \ifnum \c@tocdepth >-2\relax 638 | \addpenalty{-\@highpenalty}% 639 | \addvspace{2.25em \@plus\p@}% 640 | \setlength\@tempdima{3em}% 641 | \begingroup 642 | \parindent \z@ \rightskip \@pnumwidth 643 | \parfillskip -\@pnumwidth 644 | {\leavevmode 645 | \large \bfseries #1\hfil \hb@xt@\@pnumwidth{\hss #2}}\par 646 | \nobreak 647 | \global\@nobreaktrue 648 | \everypar{\global\@nobreakfalse\everypar{}}% 649 | \endgroup 650 | \fi} 651 | \newcommand*\l@chapter[2]{% 652 | \ifnum \c@tocdepth >\m@ne 653 | \addpenalty{-\@highpenalty}% 654 | \vskip 1.0em \@plus\p@ 655 | \setlength\@tempdima{1.5em}% 656 | \begingroup 657 | \parindent \z@ \rightskip \@pnumwidth 658 | \parfillskip -\@pnumwidth 659 | \leavevmode \bfseries 660 | \advance\leftskip\@tempdima 661 | \hskip -\leftskip 662 | #1\nobreak\hfil \nobreak\hb@xt@\@pnumwidth{\hss #2}\par 663 | \penalty\@highpenalty 664 | \endgroup 665 | \fi} 666 | \newcommand*\l@section{\@dottedtocline{1}{1.5em}{2.3em}} 667 | \newcommand*\l@subsection{\@dottedtocline{2}{3.8em}{3.2em}} 668 | \newcommand*\l@subsubsection{\@dottedtocline{3}{7.0em}{4.1em}} 669 | \newcommand*\l@paragraph{\@dottedtocline{4}{10em}{5em}} 670 | \newcommand*\l@subparagraph{\@dottedtocline{5}{12em}{6em}} 671 | \newcommand\listoffigures{% 672 | \if@twocolumn 673 | \@restonecoltrue\onecolumn 674 | \else 675 | \@restonecolfalse 676 | \fi 677 | \chapter*{\listfigurename}% 678 | \@mkboth{\MakeUppercase\listfigurename}% 679 | {\MakeUppercase\listfigurename}% 680 | \@starttoc{lof}% 681 | \if@restonecol\twocolumn\fi 682 | } 683 | \newcommand*\l@figure{\@dottedtocline{1}{1.5em}{2.3em}} 684 | \newcommand\listoftables{% 685 | \if@twocolumn 686 | \@restonecoltrue\onecolumn 687 | \else 688 | \@restonecolfalse 689 | \fi 690 | \chapter*{\listtablename}% 691 | \@mkboth{% 692 | \MakeUppercase\listtablename}% 693 | {\MakeUppercase\listtablename}% 694 | \@starttoc{lot}% 695 | \if@restonecol\twocolumn\fi 696 | } 697 | \let\l@table\l@figure 698 | \newdimen\bibindent 699 | \setlength\bibindent{1.5em} 700 | 
\newenvironment{thebibliography}[1] 701 | {\chapter*{\bibname}% 702 | \@mkboth{\MakeUppercase\bibname}{\MakeUppercase\bibname}% 703 | \list{\@biblabel{\@arabic\c@enumiv}}% 704 | {\settowidth\labelwidth{\@biblabel{#1}}% 705 | \leftmargin\labelwidth 706 | \advance\leftmargin\labelsep 707 | \@openbib@code 708 | \usecounter{enumiv}% 709 | \let\p@enumiv\@empty 710 | \renewcommand\theenumiv{\@arabic\c@enumiv}}% 711 | \sloppy 712 | \clubpenalty4000 713 | \@clubpenalty \clubpenalty 714 | \widowpenalty4000% 715 | \sfcode`\.\@m} 716 | {\def\@noitemerr 717 | {\@latex@warning{Empty `thebibliography' environment}}% 718 | \endlist} 719 | \newcommand\newblock{\hskip .11em\@plus.33em\@minus.07em} 720 | \let\@openbib@code\@empty 721 | \newenvironment{theindex} 722 | {\if@twocolumn 723 | \@restonecolfalse 724 | \else 725 | \@restonecoltrue 726 | \fi 727 | \twocolumn[\@makeschapterhead{\indexname}]% 728 | \@mkboth{\MakeUppercase\indexname}% 729 | {\MakeUppercase\indexname}% 730 | \thispagestyle{plain}\parindent\z@ 731 | \parskip\z@ \@plus .3\p@\relax 732 | \columnseprule \z@ 733 | \columnsep 35\p@ 734 | \let\item\@idxitem} 735 | {\if@restonecol\onecolumn\else\clearpage\fi} 736 | \newcommand\@idxitem{\par\hangindent 40\p@} 737 | \newcommand\subitem{\@idxitem \hspace*{20\p@}} 738 | \newcommand\subsubitem{\@idxitem \hspace*{30\p@}} 739 | \newcommand\indexspace{\par \vskip 10\p@ \@plus5\p@ \@minus3\p@\relax} 740 | \renewcommand\footnoterule{% 741 | \kern-3\p@ 742 | \hrule\@width.4\columnwidth 743 | \kern2.6\p@} 744 | \@addtoreset{footnote}{chapter} 745 | \newcommand\@makefntext[1]{% 746 | \parindent 1em% 747 | \noindent 748 | \hb@xt@1.8em{\hss\@makefnmark}#1} 749 | \newcommand\contentsname{Contents} 750 | \newcommand\listfigurename{List of Figures} 751 | \newcommand\listtablename{List of Tables} 752 | \newcommand\bibname{Bibliography} 753 | \newcommand\indexname{Index} 754 | \newcommand\figurename{Figure} 755 | \newcommand\tablename{Table} 756 | \newcommand\partname{Part} 757 | \newcommand\chaptername{Chapter} 758 | \newcommand\appendixname{Appendix} 759 | \def\today{\ifcase\month\or 760 | January\or February\or March\or April\or May\or June\or 761 | July\or August\or September\or October\or November\or December\fi 762 | \space\number\day, \number\year} 763 | \setlength\columnsep{10\p@} 764 | \setlength\columnseprule{0\p@} 765 | \pagestyle{headings} 766 | \pagenumbering{arabic} 767 | \if@twoside 768 | \else 769 | \raggedbottom 770 | \fi 771 | \if@twocolumn 772 | \twocolumn 773 | \sloppy 774 | \flushbottom 775 | \else 776 | \onecolumn 777 | \fi 778 | \endinput 779 | %% 780 | %% End of file `book.cls'. 
781 | -------------------------------------------------------------------------------- /VGG-conv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-conv.pdf -------------------------------------------------------------------------------- /VGG-fc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-fc.pdf -------------------------------------------------------------------------------- /VGG-pool-fc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-pool-fc.pdf -------------------------------------------------------------------------------- /VGG-pool.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG-pool.pdf -------------------------------------------------------------------------------- /VGG.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/VGG.pdf -------------------------------------------------------------------------------- /White_book-blx.bib: -------------------------------------------------------------------------------- 1 | @Comment{$ biblatex control file $} 2 | @Comment{$ biblatex version 2.8 $} 3 | Do not modify this file! 4 | 5 | This is an auxiliary file used by the 'biblatex' package. 6 | This file may safely be deleted. It will be recreated as 7 | required. 
8 | 9 | @Control{biblatex-control, 10 | options = {2.8:0:0:1:0:1:1:0:0:0:0:0:3:1:79:+:none}, 11 | } 12 | -------------------------------------------------------------------------------- /White_book.dvi: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/White_book.dvi -------------------------------------------------------------------------------- /White_book.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/White_book.pdf -------------------------------------------------------------------------------- /White_book.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,12pt]{ThesisStyle} 2 | \include{formatAndDefs} 3 | \addbibresource{DEEP_LEARNING.bib} 4 | \begin{document} 5 | \def\layersep{2.5cm} 6 | \pgfmathsetseed{12} 7 | \newcommand{\midarrow}{\tikz \draw[-stealth] (0,0) -- +(.1,0);} 8 | \newcommand{\midarroww}{\tikz \draw[-stealth] (0,0.1) -- +(.1,0);} 9 | \newcommand{\orient}{\tikz \draw[-stealth] (0,0) -- +(0,.001);} 10 | \newcommand{\orientl}{\tikz \draw[-stealth] (0,0.15) -- +(.005,.01);} 11 | \newcommand{\orientr}{\tikz \draw[-stealth] (0,0.15) -- +(-.005,.01);} 12 | 13 | \thispagestyle{empty} 14 | \newgeometry{hmargin=0cm,vmargin=1cm} 15 | \begin{center} 16 | \begin{tikzpicture} 17 | \node[] at (0,0) {\includegraphics[scale=1]{cover_page-crop.pdf}}; 18 | \end{tikzpicture} 19 | \end{center} 20 | 21 | \restoregeometry 22 | \dominitoc \tableofcontents 23 | 24 | \include{Preface} 25 | \include{Acknowledgements} 26 | \include{Introduction} 27 | 28 | %\part{Theoretical background} \label{Part1} 29 | 30 | 31 | 32 | \include{chapter1} 33 | \include{chapter2} 34 | \include{chapter3} 35 | 36 | \include{Conclusion} 37 | \printbibliography 38 | \end{document} 39 | -------------------------------------------------------------------------------- /chapter1.tex: -------------------------------------------------------------------------------- 1 | \chapter{Feedforward Neural Networks} \label{sec:chapterFNN} 2 | 3 | \minitoc 4 | 5 | \section{Introduction} 6 | 7 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I}n this section we review the first type of neural network that has been developed historically: a regular Feedforward Neural Network (FNN). This network does not take into account any particular structure that the input data might have. Nevertheless, it is already a very powerful machine learning tool, especially when used with the state of the art regularization techniques. These techniques -- that we are going to present as well -- allowed to circumvent the training issues that people experienced when dealing with "deep" architectures: namely the fact that neural networks with an important number of hidden states and hidden layers have proven historically to be very hard to train (vanishing gradient and overfitting issues). 
8 | 9 | \section{FNN architecture} 10 | 11 | \begin{figure}[H] 12 | \begin{center} 13 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 14 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 15 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 16 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 17 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 18 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 19 | \tikzstyle{annot} = [text width=4em, text centered] 20 | 21 | % Draw the input layer nodes 22 | \foreach \name / \y in {1} 23 | \pgfmathtruncatemacro{\m}{int(\y-1)} 24 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 25 | \node[input neuron, pin=left:Bias] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 26 | 27 | 28 | \foreach \name / \y in {2,...,6} 29 | \pgfmathtruncatemacro{\m}{int(\y-1)} 30 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 31 | \node[input neuron, pin=left:Input \#\y] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 32 | 33 | % Draw the hidden layer 1 nodes 34 | \foreach \name / \y in {1,...,7} 35 | \pgfmathtruncatemacro{\m}{int(\y-1)} 36 | \path[yshift=0.5cm] 37 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 38 | 39 | % Draw the hidden layer 1 node 40 | \foreach \name / \y in {1,...,6} 41 | \pgfmathtruncatemacro{\m}{int(\y-1)} 42 | \path[yshift=0.0cm] 43 | node[hidden neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 44 | 45 | % Draw the output layer node 46 | \foreach \name / \y in {1,...,5} 47 | \path[yshift=-0.5cm] 48 | node[output neuron,pin={[pin edge={->}]right:Output \#\y}] (O-\name) at (3*\layersep,-\y cm) {$h_{\y}^{(N)}$}; 49 | 50 | % Connect every node in the input layer with every node in the 51 | % hidden layer. 52 | \foreach \source in {1,...,6} 53 | \foreach \dest in {2,...,7} 54 | \path (I-\source) edge (H1-\dest); 55 | 56 | \foreach \source in {1,...,7} 57 | \foreach \dest in {2,...,6} 58 | \path (H1-\source) edge (H2-\dest); 59 | 60 | % Connect every node in the hidden layer with the output layer 61 | \foreach \source in {1,...,6} 62 | \foreach \dest in {1,...,5} 63 | \path (H2-\source) edge (O-\dest); 64 | 65 | % Annotate the layers 66 | \node[annot,above of=H1-1, node distance=1cm] (hl) {Hidden layer 1}; 67 | \node[annot,left of=hl] {Input layer}; 68 | \node[annot,right of=hl] (hm) {Hidden layer $\nu$}; 69 | \node[annot,right of=hm] {Output layer}; 70 | 71 | \node at ((1.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 72 | \node at ((2.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 73 | \end{tikzpicture} 74 | \caption{\label{fig:1}Neural Network with $N+1$ layers ($N-1$ hidden layers). For simplicity of notations, the index referencing the training set has not been indicated. Shallow architectures use only one hidden layer. Deep learning amounts to take several hidden layers, usually containing the same number of hidden neurons. This number should be on the ballpark of the average of the number of input and output variables.} 75 | \end{center} 76 | \end{figure} 77 | 78 | A FNN is formed by one input layer, one (shallow network) or more (deep network, hence the name deep learning) hidden layers and one output layer. Each layer of the network (except the output one) is connected to a following layer. This connectivity is central to the FNN structure and has two main features in its simplest form: a weight averaging feature and an activation feature. 
We will review these features extensively in the following 79 | 80 | \section{Some notations} 81 | 82 | In the following, we will call 83 | \begin{itemize} 84 | \item[$\bullet$] $N$ the number of layers (not counting the input) in the Neural Network. 85 | \item[$\bullet$] $T_{{\rm train}}$ the number of training examples in the training set. 86 | \item[$\bullet$] $T_{{\rm mb}}$ the number of training examples in a mini-batch (see section \ref{sec:FNNlossfunction}). 87 | \item[$\bullet$] $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$ the mini-batch training instance index. 88 | \item[$\bullet$] $\nu\in\llbracket0,N\rrbracket$ the number of layers in the FNN. 89 | \item[$\bullet$] $F_\nu$ the number of neurons in the $\nu$'th layer. 90 | \item[$\bullet$] $X^{(t)}_f=h_{f}^{(0)(t)}$ where $f\in\llbracket0,F_0-1\rrbracket$ the input variables. 91 | \item[$\bullet$] $y^{(t)}_f$ where $f\in[0,F_N-1]$ the output variables (to be predicted). 92 | \item[$\bullet$] $\hat{y}^{(t)}_f$ where $f\in[0,F_N-1]$ the output of the network. 93 | \item[$\bullet$] $\Theta_{f}^{(\nu)f'}$ for $f\in [0,F_{\nu}-1]$, $f'\in [0,F_{\nu+1}-1]$ and $\nu\in[0,N-1]$ the weights matrices 94 | \item[$\bullet$] A bias term can be included. In practice, we will see when talking about the batch-normalization procedure that we can omit it. 95 | \end{itemize} 96 | 97 | 98 | \section{Weight averaging} 99 | 100 | 101 | One of the two main components of a FNN is a weight averaging procedure, which amounts to average the previous layer with some weight matrix to obtain the next layer. This is illustrated on the figure \ref{fig:3} 102 | 103 | 104 | \begin{figure}[H] 105 | \begin{center} 106 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 107 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 108 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 109 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 110 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 111 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 112 | \tikzstyle{annot} = [text width=4em, text centered] 113 | 114 | % Draw the input layer nodes 115 | \foreach \name / \y in {1} 116 | \pgfmathtruncatemacro{\m}{int(\y-1)} 117 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 118 | \node[input neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 119 | 120 | 121 | \foreach \name / \y in {2,...,6} 122 | \pgfmathtruncatemacro{\m}{int(\y-1)} 123 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 124 | \node[input neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 125 | 126 | % Draw the hidden layer 1 nodes 127 | \foreach \name / \y in {4} 128 | \pgfmathtruncatemacro{\m}{int(\y-1)} 129 | \path[yshift=0.5cm] 130 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$a_{\m}^{(0)}$}; 131 | 132 | 133 | \path (I-1) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_0$} (H1-4); 134 | \path (I-2) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_1$} (H1-4); 135 | \path (I-3) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_2$} (H1-4); 136 | \path (I-4) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_3$} (H1-4); 137 | \path (I-5) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_4$} (H1-4); 138 | \path (I-6) edge node[pos=0.3,scale=0.9] {$\Theta^{(0)3}_5$} (H1-4); 139 | 140 | \end{tikzpicture} 141 | \caption{\label{fig:3}Weight averaging procedure.} 142 | \end{center} 143 | \end{figure} 144 | 145 | 146 | Formally, the weight averaging procedure reads: 147 | 148 | \begin{align} 149 | 
a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1+\epsilon}_{f'=0}\Theta^{(\nu)f}_{\,f'}h^{(t)(\nu)}_{f'}\;, 150 | \end{align} 151 | where $\nu\in\llbracket 0,N-1\rrbracket$, $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$ and $f\in \llbracket 0,F_{\nu+1}-1\rrbracket$. The $\epsilon$ is here to include or exclude a bias term. In practice, as we will be using batch-normalization, we can safely omit it ($\epsilon=0$ in all the following). 152 | 153 | \section{Activation function} 154 | 155 | The hidden neuron of each layer is defined as 156 | \begin{align} 157 | h_{f}^{(t)(\nu+1)}&=g\left(a_{f}^{(t)(\nu)}\right)\;, 158 | \end{align} 159 | where $\nu\in\llbracket 0,N-2\rrbracket$, $f\in \llbracket 0,F_{\nu+1}-1\rrbracket$ and as usual $t \in \llbracket0,T_{{\rm mb}}-1\rrbracket$. Here $g$ is an activation function -- the second main ingredient of a FNN -- whose non-linearity allow to predict arbitrary output data. In practice, $g$ is usually taken to be one of the functions described in the following subsections. 160 | 161 | 162 | \subsection{The sigmoid function} 163 | 164 | The sigmoid function takes its value in $]0,1[$ and reads 165 | \begin{align} 166 | g(x)&=\sigma(x)=\frac{1}{1+e^{-x}}\;. 167 | \end{align} 168 | Its derivative is 169 | \begin{align} 170 | \sigma'(x)&=\sigma(x)\left(1-\sigma(x)\right)\;. 171 | \end{align} 172 | This activation function is not much used nowadays (except in RNN-LSTM networks that we will present later in chapter \ref{sec:chapterRNN}). 173 | 174 | \begin{figure}[H] 175 | \begin{center} 176 | \begin{tikzpicture} 177 | \node at (0,0) {\includegraphics[scale=1]{sigmoid}}; 178 | \end{tikzpicture} 179 | \end{center} 180 | \caption{\label{fig:sigmoid} the sigmoid function and its derivative.} 181 | \end{figure} 182 | 183 | \subsection{The tanh function} 184 | 185 | The tanh function takes its value in $]-1,1[$ and reads 186 | \begin{align} 187 | g(x)&=\tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}}\;. 188 | \end{align} 189 | Its derivative is 190 | \begin{align} 191 | \tanh'(x)&=1-\tanh^2(x)\;. 192 | \end{align} 193 | This activation function has seen its popularity drop due to the use of the activation function presented in the next section. 194 | 195 | \begin{figure}[H] 196 | \begin{center} 197 | \begin{tikzpicture} 198 | \node at (0,0) {\includegraphics[scale=1]{tanh2}}; 199 | \end{tikzpicture} 200 | \end{center} 201 | \caption{\label{fig:tanh} the tanh function and its derivative.} 202 | \end{figure} 203 | 204 | It is nevertherless still used in the standard formulation of the RNN-LSTM model (\ref{sec:chapterRNN}). 205 | 206 | \subsection{The ReLU function} 207 | 208 | 209 | The ReLU -- for Rectified Linear Unit -- function takes its value in $[0,+\infty[$ and reads 210 | \begin{align} 211 | g(x)&={\rm ReLU}(x)=\begin{cases} 212 | x & x\geq 0 \\ 213 | 0& x<0 214 | \end{cases}\;. 215 | \end{align} 216 | Its derivative is 217 | \begin{align} 218 | {\rm ReLU}'(x)&=\begin{cases} 219 | 1 & x\geq 0 \\ 220 | 0 & x<0 221 | \end{cases}\;. 222 | \end{align} 223 | 224 | \begin{figure}[H] 225 | \begin{center} 226 | \begin{tikzpicture} 227 | \node at (0,0) {\includegraphics[scale=1]{ReLU}}; 228 | \end{tikzpicture} 229 | \end{center} 230 | \caption{\label{fig:relu} the ReLU function and its derivative.} 231 | \end{figure} 232 | 233 | 234 | This activation function is the most extensively used nowadays. Two of its more common variants can also be found : the leaky ReLU and ELU -- Exponential Linear Unit. 
They have been introduced because the ReLU activation function tends to "kill" certain hidden neurons: once it has been turned off (zero value), it can never be turned on again. 235 | 236 | 237 | 238 | \subsection{The leaky-ReLU function} 239 | 240 | 241 | The leaky-ReLU --for Linear Rectified Linear Unit -- function takes its value in $]-\infty,+\infty[$ and is a slight modification of the ReLU that allows non-zero value for the hidden neuron whatever the $x$ value. It reads 242 | \begin{align} 243 | g(x)&= \text{leaky-ReLU}(x)=\begin{cases} 244 | x & x\geq 0 \\ 245 | 0.01\,x & x<0 246 | \end{cases}\;. 247 | \end{align} 248 | Its derivative is 249 | \begin{align} 250 | \text{leaky-ReLU}'(x)&=\begin{cases} 251 | 1 & x\geq 0 \\ 252 | 0.01 & x<0 253 | \end{cases}\;. 254 | \end{align} 255 | 256 | \begin{figure}[H] 257 | \begin{center} 258 | \begin{tikzpicture} 259 | \node at (0,0) {\includegraphics[scale=1]{lReLU}}; 260 | \end{tikzpicture} 261 | \end{center} 262 | \caption{\label{fig:lrelu} the leaky-ReLU function and its derivative.} 263 | \end{figure} 264 | 265 | A variant of the leaky-ReLU can also be found in the literature : the Parametric-ReLU, where the arbitrary $0.01$ in the definition of the leaky-ReLU is replaced by an $\alpha$ coefficient, that can be 266 | computed via backpropagation. 267 | \begin{align} 268 | g(x)&={\rm Parametric-ReLU}(x)=\begin{cases} 269 | x & x\geq 0 \\ 270 | \alpha\,x & x<0 271 | \end{cases}\;. 272 | \end{align} 273 | Its derivative is 274 | \begin{align} 275 | {\rm Parametric-ReLU}'(x)&=\begin{cases} 276 | 1 & x\geq 0 \\ 277 | \alpha & x<0 278 | \end{cases}\;. 279 | \end{align} 280 | 281 | \subsection{The ELU function} 282 | 283 | The ELU --for Exponential Linear Unit -- function takes its value between $]-1,+\infty[$ and is inspired by the leaky-ReLU philosophy: non-zero values for all $x$'s. But it presents the advantage of being $\mathcal{C}^1$. 284 | \begin{align} 285 | g(x)&={\rm ELU}(x)=\begin{cases} 286 | x & x\geq 0 \\ 287 | e^x-1 & x<0 288 | \end{cases}\;. 289 | \end{align} 290 | Its derivative is 291 | \begin{align} 292 | {\rm ELU}'(x)&=\begin{cases} 293 | 1 & x\geq 0 \\ 294 | e^x & x<0 295 | \end{cases}\;. 296 | \end{align} 297 | 298 | 299 | \begin{figure}[H] 300 | \begin{center} 301 | \begin{tikzpicture} 302 | \node at (0,0) {\includegraphics[scale=1]{ELU}}; 303 | \end{tikzpicture} 304 | \end{center} 305 | \caption{\label{fig:elu} the ELU function and its derivative.} 306 | \end{figure} 307 | 308 | % From my experience, leay-relu is more than enough. 309 | 310 | 311 | \section{FNN layers} 312 | 313 | As illustrated in figure \ref{fig:1}, a regular FNN is composed by several specific layers. Let us explicit them one by one. 314 | 315 | 316 | 317 | \subsection{Input layer} 318 | 319 | The input layer is one of the two places where the data at disposal for the problem at hand come into place. In this chapter, we will be considering a input of size $F_0$, denoted $X^{(t)}_{f}$, with\footnote{ 320 | To train the FNN, we jointly compute the forward and backward pass for $T_{{\rm mb}}$ samples of the training set, with $T_{{\rm mb}}\ll T_{{\rm train}}$. In the following we will thus have $t\in \llbracket 0, T_{{\rm mb}}-1\rrbracket$. 321 | } 322 | $t\in \llbracket 0, T_{{\rm mb}}-1\rrbracket$ (size of the mini-batch, more on that when we will be talking about gradient descent techniques), and $f \in \llbracket 0, F_0-1\rrbracket$. 
Given the problem at hand, a common procedure could be to center the input following the procedure 323 | \begin{align} 324 | \tilde{X}^{(t)}_{f}&=X^{(t)}_{f}-\mu_{f}\;, 325 | \end{align} 326 | with 327 | \begin{align} 328 | \mu_{f}&=\frac{1}{T_{{\rm train}}}\sum^{T_{{\rm train}}-1}_{t=0}X^{(t)}_{f}\;. 329 | \end{align} 330 | This correspond to compute the mean per data types over the training set. Following our notations, let us recall that 331 | \begin{align} 332 | X^{(t)}_{f}&=h^{(t)(0)}_{f}\;. 333 | \end{align} 334 | 335 | \subsection{Fully connected layer} 336 | 337 | The fully connected operation is just the conjunction of the weight averaging and the activation procedure. Namely, $\forall \nu\in \llbracket 0,N-1 \rrbracket$ 338 | \begin{align} 339 | a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\;.\label{eq:Weightavg} 340 | \end{align} 341 | and $\forall \nu\in \llbracket 0,N-2 \rrbracket$ 342 | \begin{align} 343 | h_{f}^{(t)(\nu+1)}&=g\left(a_{f}^{(t)(\nu)}\right)\;. 344 | \end{align} 345 | for the case where $\nu=N-1$, the activation function is replaced by an output function. 346 | 347 | 348 | 349 | \subsection{Output layer} 350 | 351 | The output of the FNN reads 352 | \begin{align} 353 | h_{f}^{(t)(N)}&=o(a_{f}^{(t)(N-1)})\;, 354 | \end{align} 355 | where $o$ is called the output function. In the case of the Euclidean loss function, the output function is just the identity. In a classification task, $o$ is the softmax function. 356 | \begin{align} 357 | o\left(a^{(t)(N-1)}_f\right)&=\frac{e^{a^{(t)(N-1)}_f}}{\sum\limits^{F_{N}-1}_{f'=0}e^{a^{(t)(N-1)}_{f'}}} 358 | \end{align} 359 | 360 | 361 | \section{Loss function} \label{sec:FNNlossfunction} 362 | 363 | The loss function evaluates the error performed by the FNN when it tries to estimate the data to be predicted (second place where the data make their appearance). For a regression problem, this is simply a mean square error (MSE) evaluation 364 | \begin{align} 365 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 366 | % 367 | \left(y_f^{(t)}-h_{f}^{(t)(N)}\right)^2\;, 368 | \end{align} 369 | while for a classification task, the loss function is called the cross-entropy function 370 | \begin{align} 371 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 372 | % 373 | \delta^f_{y^{(t)}}\ln h_{f}^{(t)(N)}\;, 374 | \end{align} 375 | and for a regression problem transformed into a classification one, calling $C$ the number of bins leads to 376 | \begin{align} 377 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1}\sum_{c=0}^{C-1} 378 | % 379 | \delta^c_{y_f^{(t)}}\ln h_{fc}^{(t)(N)}\;. 380 | \end{align} 381 | For reasons that will appear clear when talking about the data sample used at each training step, we denote 382 | \begin{align} 383 | J(\Theta)&=\sum_{t=0}^{T_{{\rm mb}}-1}J_{{\rm mb}}(\Theta)\;. 384 | \end{align} 385 | 386 | \section{Regularization techniques} 387 | 388 | On of the main difficulties when dealing with deep learning techniques is to get the deep neural network to train efficiently. To that end, several regularization techniques have been invented. We will review them in this section 389 | 390 | \subsection{L2 regularization} 391 | 392 | L2 regularization is the most common regularization technique that on can find in the literature. 
It amounts to add a regularizing term to the loss function in the following way 393 | \begin{align} 394 | J_{{\rm L2}}(\Theta)&=\lambda_{{\rm L2}} \sum_{\nu=0}^{N-1}\left\|\Theta^{(\nu)}\right\|^2_{{\rm L2}} 395 | % 396 | =\lambda_{{\rm L2}}\sum_{\nu=0}^{N-1}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_\nu-1} 397 | % 398 | \left(\Theta^{(\nu)f'}_{f}\right)^2\;.\label{eq:l2reg} 399 | \end{align} 400 | This regularization technique is almost always used, but not on its own. A typical value of $\lambda_{{\rm L2}}$ is in the range $10^{-4}-10^{-2}$. Interestingly, this L2 regularization technique has a Bayesian interpretation: it is Bayesian inference with a Gaussian prior on the weights. Indeed, for a given $\nu$, the weight averaging procedure can be considered as 401 | \begin{align} 402 | a_{f}^{(t)(\nu)}&=\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}+\epsilon\;, 403 | \end{align} 404 | where $\epsilon$ is a noise term of mean $0$ and variance $\sigma^2$. Hence the following Gaussian likelihood for all values of $t$ and $f$: 405 | \begin{align} 406 | \mathcal{N}\left(a_{f}^{(t)(i)}\middle|\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'},\sigma^2\right)\;. 407 | \end{align} 408 | Assuming all the weights to have a Gaussian prior of the form $\mathcal{N}\left(\Theta^{(\nu)f}_{f'}\middle|\lambda_{{\rm L2}}^{-1}\right)$ with the same parameter $\lambda_{{\rm L2}}$, we get the following expression 409 | \begin{align} 410 | \mathcal{P}&= 411 | % 412 | \prod_{t=0}^{T_{{\rm mb}}-1}\prod_{f=0}^{F_{\nu+1}-1}\left[\mathcal{N}\left(a_{f}^{(t)(\nu)}\middle| 413 | % 414 | \sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'},\sigma^2\right) 415 | % 416 | \prod_{f'=0}^{F_{\nu}-1}\mathcal{N}\left(\Theta^{(\nu)f}_{f'} 417 | % 418 | \middle|\lambda_{{\rm L2}}^{-1}\right)\right]\notag\\ 419 | % 420 | &=\prod_{t=0}^{T_{{\rm mb}}-1}\prod_{f=0}^{F_{\nu+1}-1}\left[\frac{1}{\sqrt{2\pi \sigma^2}} 421 | % 422 | e^{-\frac{\left(a_{f}^{(t)(\nu)}-\sum^{F_i-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\right)^2}{2\sigma^2}} 423 | % 424 | \prod_{f'=0}^{F_{\nu}-1}\sqrt{\frac{\lambda_{{\rm L2}}}{2\pi}}e^{-\frac{\left(\Theta^{(\nu)f}_{f'}\right)^2\lambda_{{\rm L2}}}{2}}\right] \;. 425 | \end{align} 426 | Taking the log of it and forgetting most of the constant terms leads to 427 | \begin{align} 428 | \mathcal{L}&\propto\frac{1}{T_{{\rm mb}}\sigma^2} 429 | % 430 | \sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_{\nu+1}-1} 431 | % 432 | \left(a_{f}^{(t)(\nu)}-\sum^{F_\nu-1}_{f'=0}\Theta^{(\nu)f}_{f'}h^{(t)(\nu)}_{f'}\right)^2 433 | % 434 | +\lambda_{{\rm L2}}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_{\nu}-1}\left(\Theta^{(\nu)f}_{f'}\right)^2 \;, 435 | \end{align} 436 | and the last term is exactly the L2 regulator for a given $nu$ value (see formula (\ref{eq:l2reg})). 437 | 438 | \subsection{L1 regularization} 439 | 440 | L1 regularization amounts to replace the L2 norm by the L1 one in the L2 regularization technique 441 | \begin{align} 442 | J_{{\rm L1}}(\Theta)&=\lambda_{{\rm L1}} \sum_{\nu=0}^{N-1}\left\|\Theta^{(\nu)}\right\|_{{\rm L1}} 443 | % 444 | =\lambda_{{\rm L1}}\sum_{\nu=0}^{N-1}\sum_{f=0}^{F_{\nu+1}-1}\sum_{f'=0}^{F_\nu-1} 445 | % 446 | \left|\Theta^{(\nu)f}_{f'}\right|\;. 447 | \end{align} 448 | It can be used in conjunction with L2 regularization, but again these techniques are not sufficient on their own. A typical value of $\lambda_{{\rm L1}}$ is in the range $10^{-4}-10^{-2}$. 
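To fix ideas, here is a minimal Python/numpy sketch of how the L2 and L1 penalty terms defined above, and their contributions to the weight gradients, could be added to the loss. It is only an illustration of the two formulas, not the implementation used in this note; the names \texttt{Theta}, \texttt{lambda\_l2} and \texttt{lambda\_l1} are hypothetical.
\begin{verbatim}
import numpy as np

def l2_l1_penalty(Theta, lambda_l2=1e-4, lambda_l1=1e-4):
    # Theta: list of weight matrices Theta^(nu); returns J_L2 + J_L1
    j_l2 = lambda_l2 * sum((W ** 2).sum() for W in Theta)
    j_l1 = lambda_l1 * sum(np.abs(W).sum() for W in Theta)
    return j_l2 + j_l1

def l2_l1_gradient(W, lambda_l2=1e-4, lambda_l1=1e-4):
    # contribution of both penalties to the gradient of one weight matrix W
    return 2.0 * lambda_l2 * W + lambda_l1 * np.sign(W)
\end{verbatim}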
Following the same line as in the previous section, one can show that L1 regularization is equivalent to Bayesian inference with a Laplacian prior on the weights 449 | \begin{align} 450 | \mathcal{F}\left(\Theta^{(\nu)f}_{f'}\middle| 0,\lambda_{{\rm L1}}^{-1}\right)&= 451 | % 452 | \frac{\lambda_{{\rm L1}}}{2}e^{-\lambda_{{\rm L1}}\left|\Theta^{(\nu)f}_{f'}\right|}\;. 453 | \end{align} 454 | 455 | \subsection{Clipping} 456 | 457 | Clipping forbids the L2 norm of the weights to go beyond a pre-determined threshold $C$. Namely after having computed the update rules for the weights, if their L2 norm goes above $C$, it is pushed back to $C$ 458 | \begin{align} 459 | {\rm if}\;\left\|\Theta^{(\nu)}\right\|_{{\rm L2}}>C \longrightarrow \Theta^{(\nu)f}_{f'}&= 460 | % 461 | \Theta^{(\nu)f}_{f'} \times \frac{C}{\left\|\Theta^{(\nu)}\right\|_{{\rm L2}}}\;. 462 | \end{align} 463 | 464 | This regularization technique avoids the so-called exploding gradient problem, and is mainly used in RNN-LSTM networks. A typical value of $C$ is in the range $10^{0}-10^{1}$. Let us now turn to the most efficient regularization techniques for a FNN: dropout and Batch-normalization. 465 | 466 | 467 | \subsection{Dropout} 468 | 469 | A simple procedure allows for better backpropagation performance for classification tasks: it amounts to stochastically drop some of the hidden units (and in some instances even some of the input variables) for each training example. 470 | 471 | \begin{figure}[H] 472 | \begin{center} 473 | \begin{tikzpicture}[shorten >=1pt,-stealth,draw=black!50, node distance=\layersep] 474 | \tikzstyle{every pin edge}=[stealth-,shorten <=1pt] 475 | \tikzstyle{neuron}=[circle,draw=black,fill=black!25,minimum size=17pt,inner sep=0pt] 476 | \tikzstyle{input neuron}=[neuron, fill=gray!50]; 477 | \tikzstyle{output neuron}=[neuron, fill=gray!50]; 478 | \tikzstyle{dropout neuron}=[neuron, fill=black]; 479 | \tikzstyle{hidden neuron}=[neuron, fill=gray!50]; 480 | \tikzstyle{annot} = [text width=4em, text centered] 481 | 482 | % Draw the input layer nodes 483 | \foreach \name / \y in {1} 484 | \pgfmathtruncatemacro{\m}{int(\y-1)} 485 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 486 | \node[input neuron, pin=left:Bias] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 487 | 488 | 489 | \foreach \name / \y in {2,3,4,6} 490 | \pgfmathtruncatemacro{\m}{int(\y-1)} 491 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 492 | \node[input neuron, pin=left:Input \#\y] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 493 | 494 | \foreach \name / \y in {5} 495 | \pgfmathtruncatemacro{\m}{int(\y-1)} 496 | % This is the same as writing \foreach \name / \y in {1/1,2/2,3/3,4/4} 497 | \node[dropout neuron] (I-\name) at (0,-\y) {$h_{\m}^{(0)}$}; 498 | 499 | % Draw the hidden layer 1 nodes 500 | \foreach \name / \y in {1,2,3,5} 501 | \pgfmathtruncatemacro{\m}{int(\y-1)} 502 | \path[yshift=0.5cm] 503 | node[hidden neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 504 | 505 | % Draw the hidden layer 1 nodes 506 | \foreach \name / \y in {4,6,7} 507 | \pgfmathtruncatemacro{\m}{int(\y-1)} 508 | \path[yshift=0.5cm] 509 | node[dropout neuron] (H1-\name) at (\layersep,-\y cm) {$h_{\m}^{(1)}$}; 510 | 511 | % Draw the hidden layer 1 node 512 | \foreach \name / \y in {1,3,5} 513 | \pgfmathtruncatemacro{\m}{int(\y-1)} 514 | \path[yshift=0.0cm] 515 | node[hidden neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 516 | 517 | % Draw the hidden layer 1 node 518 | \foreach \name / \y in {2,4,6} 519 
| \pgfmathtruncatemacro{\m}{int(\y-1)} 520 | \path[yshift=0.0cm] 521 | node[dropout neuron] (H2-\name) at (2*\layersep,-\y cm) {$h_{\m}^{(\nu)}$}; 522 | 523 | % Draw the output layer node 524 | \foreach \name / \y in {1,...,5} 525 | \path[yshift=-0.5cm] 526 | node[output neuron,pin={[pin edge={->}]right:Output \#\y}] (O-\name) at (3*\layersep,-\y cm) {$h_{\y}^{(N)}$}; 527 | 528 | % Connect every node in the input layer with every node in the 529 | % hidden layer. 530 | \foreach \source in {1,2,3,4,6} 531 | \foreach \dest in {2,3,5} 532 | \path (I-\source) edge (H1-\dest); 533 | 534 | \foreach \source in {1,2,3,5} 535 | \foreach \dest in {3,5} 536 | \path (H1-\source) edge (H2-\dest); 537 | 538 | % Connect every node in the hidden layer with the output layer 539 | \foreach \source in {1,3,5} 540 | \foreach \dest in {1,...,5} 541 | \path (H2-\source) edge (O-\dest); 542 | 543 | % Annotate the layers 544 | \node[annot,above of=H1-1, node distance=1cm] (hl) {Hidden layer 1}; 545 | \node[annot,left of=hl] {Input layer}; 546 | \node[annot,right of=hl] (hm) {Hidden layer $\nu$}; 547 | \node[annot,right of=hm] {Output layer}; 548 | 549 | \node at ((1.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 550 | \node at ((2.5*\layersep,-3.5 cm) {$\bullet\bullet\bullet$}; 551 | \end{tikzpicture} 552 | \caption{\label{fig:2}The neural network of figure \ref{fig:1} with dropout taken into account for both the hidden layers and the input. Usually, a different (lower) probability for turning off a neuron is adopted for the input than the one adopted for the hidden layers.} 553 | \end{center} 554 | \end{figure} 555 | 556 | 557 | This amounts to do the following change: for $\nu\in \llbracket 1,N-1\rrbracket$ 558 | \begin{align} 559 | h^{(\nu)}_{f}=\null&m_f^{(\nu)} g\left(a_f^{(\nu)}\right) 560 | \end{align} 561 | with $m_f^{(i)}$ following a $p$ Bernoulli distribution with usually $p=\frac15$ for the mask of the input layer and $p=\frac12$ otherwise. Dropout\cite{Srivastava:2014:DSW:2627435.2670313} has been the most successful regularization technique until the appearance of Batch Normalization. 562 | 563 | \subsection{Batch Normalization} 564 | 565 | Batch normalization\cite{Ioffe2015} amounts to jointly normalize the mini-batch set per data types, and does so at each input of a FNN layer. In the original paper, the authors argued that this step should be done after the convolutional layers, but in practice it has been shown to be more efficient after the non-linear step. In our case, we will thus consider $\forall \nu \in \llbracket 0,N-2\rrbracket$ 566 | \begin{align} 567 | \tilde{h}_{f}^{(t)(\nu)}&=\frac{h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}} 568 | % 569 | {\sqrt{\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon}}\;, 570 | \end{align} 571 | with 572 | \begin{align} 573 | \hat{h}_{f}^{(\nu)}&= 574 | % 575 | \frac{1}{T_{{\rm mb}}}\sum^{T_{{\rm mb}}-1}_{t=0}h_{f}^{(t)(\nu+1)}\\ 576 | % 577 | \left(\hat{\sigma}_{f}^{(\nu)}\right)^2&=\frac{1}{T_{{\rm mb}}}\sum^{T_{{\rm mb}}-1}_{t=0} 578 | % 579 | \left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)^2\;. 580 | \end{align} To make sure that this transformation can represent the identity transform, we add two additional parameters $(\gamma_f,\beta_f)$ to the model 581 | \begin{align} 582 | y^{(t)(\nu)}_{f}&=\gamma^{(\nu)}_f\,\tilde{h}_{f}^{(t)(\nu)}+\beta^{(\nu)}_f 583 | % 584 | =\tilde{\gamma}^{(\nu)}_f\,h_{f}^{(t)(\nu)}+\tilde{\beta}^{(\nu)}_f\;. 
585 | \end{align} 586 | The presence of the $\beta^{(\nu)}_f$ coefficient is what pushed us to get rid of the bias term, as it is naturally included in batchnorm. During training, one must compute a running sum for the mean and the variance, that will serve for the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs) 587 | \begin{align} 588 | \mathbb{E}\left[h_{f}^{(t)(\nu+1)}\right]_{e+1} &= 589 | % 590 | \frac{e\mathbb{E}\left[h_{f}^{(t)(\nu)}\right]_{e}+\hat{h}_{f}^{(\nu)}}{e+1}\;,\\ 591 | % 592 | \mathbb{V}ar\left[h_{f}^{(t)(\nu+1)}\right]_{e+1} &= 593 | % 594 | \frac{e\mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]_{e}+\left(\hat{\sigma}_{f}^{(\nu)}\right)^2}{e+1} 595 | \end{align} 596 | and what will be used at test time is 597 | \begin{align} 598 | \mathbb{E}\left[h_{f}^{(t)(\nu)}\right]&=\mathbb{E}\left[h_{f}^{(t)(\nu)}\right]\;,& 599 | % 600 | \mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]&= 601 | % 602 | \frac{T_{{\rm mb}}}{T_{{\rm mb}}-1}\mathbb{V}ar\left[h_{f}^{(t)(\nu)}\right]\;. 603 | \end{align} 604 | so that at test time 605 | \begin{align} 606 | y^{(t)(\nu)}_{f}&=\gamma^{(\nu)}_f\,\frac{h_{f}^{(t)(\nu)}-E[h_{f}^{(t)(\nu)}]}{\sqrt{Var\left[h_{f}^{(t)(\nu)}\right]+\epsilon}}+\beta^{(\nu)}_f\;. 607 | \end{align} 608 | 609 | In practice, and as advocated in the original paper, on can get rid of dropout without loss of precision when using batch normalization. We will adopt this convention in the following. 610 | 611 | 612 | \section{Backpropagation} 613 | 614 | Backpropagation\cite{LeCun:1998:EB:645754.668382} is the standard technique to decrease the loss function error so as to correctly predict what one needs. As it name suggests, it amounts to backpropagate through the FNN the error performed at the output layer, so as to update the weights. In practice, on has to compute a bunch of gradient terms, and this can be a tedious computational task. Nevertheless, if performed correctly, this is the most useful and important task that one can do in a FNN. We will therefore detail how to compute each weight (and Batchnorm coefficients) gradients in the following. 615 | 616 | \subsection{Backpropagate through Batch Normalization} \label{sec:Backpropbatchnorm} 617 | 618 | Backpropagation introduces a new gradient 619 | \begin{align} 620 | \delta^f_{f'}J^{(tt')(\nu)}_{f}&=\frac{\partial y^{(t')(\nu)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}\;. 621 | \end{align} 622 | we show in appendix \ref{sec:appenbatchnorm} that 623 | \begin{align} 624 | J^{(tt')(\nu)}_{f}&=\tilde{\gamma}^{(\nu)}_f\ \left[\delta^{t'}_t- 625 | % 626 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 627 | \end{align} 628 | 629 | 630 | \subsection{error updates} 631 | 632 | 633 | To backpropagate the loss error through the FNN, it is very useful to compute a so-called error rate 634 | \begin{align} 635 | \delta^{(t)(\nu)}_f&= \frac{\partial }{\partial a_{f}^{(t)(\nu)}}J(\Theta)\;, 636 | \end{align} 637 | We show in Appendix \ref{sec:appenbplayers} that $\forall \nu \in \llbracket 0,N-2\rrbracket$ 638 | \begin{align} 639 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 640 | % 641 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}J^{(tt')(\nu)}_{f} \delta^{(t')(\nu+1)}_{f'}\;, 642 | \end{align} 643 | the value of $\delta^{(t)(N-1)}_f$ depends on the loss used. 
We also show in appendix \ref{sec:appenbpoutput} that for the MSE loss function 644 | \begin{align} 645 | \delta^{(t)(N-1)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-y_f^{(t)}\right)\;, 646 | \end{align} 647 | and for the cross entropy loss function 648 | \begin{align} 649 | \delta^{(t)(N-1)}_{f}&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-\delta^f_{y^{(t)}}\right)\;. 650 | \end{align} 651 | To unite the notation of chapters \ref{sec:chapterFNN}, \ref{sec:chapterCNN} and \ref{sec:chapterRNN}, we will call 652 | \begin{align} 653 | \mathcal{H}^{(t)(\nu+1)}_{ff'}&=g'\left(a_{f}^{(t)(\nu)}\right)\Theta^{(\nu+1)f'}_{f}\;, 654 | \end{align} 655 | so that the update rule for the error rate reads 656 | \begin{align} 657 | \delta^{(t)(\nu)}_f&= 658 | % 659 | \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu)}_{f}\sum_{f'=0}^{F_{\nu+1}-1}\mathcal{H}^{(t)(\nu+1)}_{ff'} \delta^{(t)(\nu+1)}_{f'}\;. 660 | \end{align} 661 | 662 | \subsection{Weight update} 663 | 664 | Thanks to the computation of the error rates, the derivation of the weight gradients is straightforward. We indeed get $\forall \nu \in \llbracket 1,N-1\rrbracket$ 665 | \begin{align} 666 | \Delta^{\Theta(\nu)f}_{f'}&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1} 667 | % 668 | \sum^{F_{\nu+1}-1}_{f^{''}=0}\sum^{F_\nu}_{f^{'''}=0}\frac{\partial\Theta^{(\nu)f^{''}}_{f^{'''}} 669 | % 670 | }{\partial \Theta^{(\nu)f}_{f'}}y^{(t)(\nu-1)}_{f^{'''}}\delta^{(t)(\nu)}_{f^{''}} 671 | % 672 | =\sum_{t=0}^{T_{{\rm mb}}-1}\delta^{(t)(\nu)}_f y^{(t)(\nu-1)}_{f'}\;. 673 | \end{align} 674 | and 675 | \begin{align} 676 | \Delta^{\Theta(0)f}_{f'}&=\sum_{t=0}^{T_{{\rm mb}}-1}\delta^{(t)(0)}_f h^{(t)(0)}_{f'}\;. 677 | \end{align} 678 | 679 | \subsection{Coefficient update} 680 | 681 | The update rules for the Batchnorm coefficients can easily be computed thanks to the error rate. They read 682 | \begin{align} 683 | \Delta_f^{\gamma(\nu)}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 684 | % 685 | \frac{\partial a^{(t)(\nu+1)}_{f'}}{\partial\gamma_f^{(\nu)}}\delta^{(t)(\nu+1)}_{f'} 686 | % 687 | =\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 688 | % 689 | \Theta^{(\nu+1)f'}_{f}\tilde{h}^{(t)(\nu)}_{f}\delta^{(t)(\nu+1)}_{f'}\;,\\ 690 | % 691 | \Delta_f^{\beta(\nu)}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 692 | % 693 | \frac{\partial a^{(t)(\nu+1)}_{f'}}{\partial\beta_f^{(\nu)}}\delta^{(t)(\nu+1)}_{f'} 694 | % 695 | =\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}\delta^{(t)(\nu+1)}_{f'}\;, 696 | \end{align} 697 | 698 | 699 | \section{Which data sample to use for gradient descent?} 700 | 701 | From the beginning we have denoted $T_{{\rm mb}}$ the size of the data sample with which we train our model. This procedure is repeated a large number of times (each repetition is called an epoch). But in the literature there exist three ways to sample from the data: Full-batch, Stochastic and Mini-batch gradient descent. We make these terms explicit in the following sections. 702 | 703 | \subsection{Full-batch} 704 | 705 | Full-batch takes the whole training set at each epoch, such that the loss function reads 706 | \begin{align} 707 | J(\Theta)&=\sum_{t=0}^{T_{{\rm train}}-1}J_{{\rm train}}(\Theta)\;. 708 | \end{align} 709 | This choice has the advantage of being numerically stable, but it is so costly in computation time that it is rarely if ever used.
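As an illustration of the Full-batch choice, one epoch could be sketched as follows in Python/numpy; the helper \texttt{grad\_J}, which would return the weight gradients $\Delta^{\Theta}$ of the previous section for a given set of training examples, is hypothetical and only assumed here for the sake of the example.
\begin{verbatim}
import numpy as np

def full_batch_epoch(Theta, X_train, y_train, grad_J, eta=1e-2):
    # one epoch = a single update computed on the whole training set
    grads = grad_J(Theta, X_train, y_train)  # gradients over all T_train examples
    return [W - eta * dW for W, dW in zip(Theta, grads)]
\end{verbatim}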
710 | 711 | \subsection{Stochastic Gradient Descent (SGD)} 712 | 713 | SGD amounts to taking only one example of the training set at each epoch 714 | \begin{align} 715 | J(\Theta)&=J_{{\rm SGD}}(\Theta)\;. 716 | \end{align} 717 | This choice leads to faster computations, but is so numerically unstable that the most standard choice by far is Mini-batch gradient descent. 718 | 719 | \subsection{Mini-batch} 720 | 721 | Mini-batch gradient descent is a compromise between stability and time efficiency, and is the middle-ground between Full-batch and Stochastic gradient descent: $1\ll T_{{\rm mb}}\ll T_{{\rm train}}$. Hence 722 | \begin{align} 723 | J(\Theta)&=\sum_{t=0}^{T_{{\rm mb}}-1}J_{{\rm mb}}(\Theta)\;. 724 | \end{align} 725 | All the calculations in this note have been performed using this gradient descent technique. 726 | 727 | \section{Gradient optimization techniques} 728 | 729 | Once the gradients for backpropagation have been computed, the question of how to add them to the existing weights arises. The most natural choice would be to take 730 | \begin{align} 731 | \Theta^{(\nu)f}_{f'}&=\Theta^{(\nu)f}_{f'}-\eta\Delta^{\Theta(\nu)f}_{f'}\;, 732 | \end{align} 733 | where $\eta$ is a free parameter that is generally initialized thanks to cross-validation. It can also be made epoch dependent (usually with a slow exponential decay). When using Mini-batch gradient descent, this update choice for the weights presents the risk of the loss function getting stuck in a local minimum. Several methods have been invented to prevent this risk. We are going to review them in the next sections. 734 | 735 | 736 | \subsection{Momentum} 737 | 738 | Momentum\cite{QIAN1999145} introduces a new vector $v_{{\rm e}}$ and can be seen as keeping a memory of what the previous updates were at prior epochs. Calling $e$ the number of epochs and forgetting the $f,f',\nu$ indices of the gradients to ease the notation, we have 739 | \begin{align} 740 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+\eta \Delta^{\Theta}\;, 741 | \end{align} 742 | and the weights at epoch $e$ are then updated as 743 | \begin{align} 744 | \Theta_e&=\Theta_{e-1}-v_{{\rm e}}\;. 745 | \end{align} 746 | $\gamma$ is a new parameter of the model, usually set to $0.9$, but it could also be fixed thanks to cross-validation. 747 | 748 | \subsection{Nesterov accelerated gradient} 749 | 750 | Nesterov accelerated gradient\cite{nesterov1983method} is a slight modification of the momentum technique that allows the gradients to escape from local minima. It amounts to taking 751 | \begin{align} 752 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+\eta \Delta^{\Theta-\gamma v_{{\rm e-1}}}\;, 753 | \end{align} 754 | and then again 755 | \begin{align} 756 | \Theta_e&=\Theta_{e-1}-v_{{\rm e}}\;. 757 | \end{align} 758 | Until now, the parameter $\eta$ that controls the magnitude of the update has been set globally. It would be nice to have a finer control over it, so that different weights can be updated with different magnitudes. 759 | 760 | \subsection{Adagrad} 761 | 762 | Adagrad\cite{Duchi:2011:ASM:1953048.2021068} allows one to fine-tune the updates by having an individual effective learning rate for each weight. Calling for each value of $f,f',\nu$ 763 | \begin{align} 764 | v_{{\rm e}}&=\sum_{e'=0}^{e-1} \left(\Delta^{\Theta}_{e'}\right)^2\;, 765 | \end{align} 766 | the update rule then reads 767 | \begin{align} 768 | \Theta_e&=\Theta_{e-1}-\frac{\eta}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 769 | \end{align} 770 | One advantage of Adagrad is that the learning rate $\eta$ can be set once and for all (usually to $10^{-2}$) and does not need to be fine-tuned via cross-validation anymore, as it is individually adapted to each weight via the $v_{{\rm e}}$ term. $\epsilon$ is here to avoid division by 0 issues, and is usually set to $10^{-8}$.
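To make the preceding update rules concrete, here is a minimal Python/numpy sketch of the momentum and Adagrad updates, with all indices dropped as above. The function names are hypothetical and the gradient $\Delta^{\Theta}$ is assumed to be given.
\begin{verbatim}
import numpy as np

def momentum_update(Theta, grad, v, eta=1e-2, gamma=0.9):
    # v_e = gamma * v_{e-1} + eta * Delta ; Theta_e = Theta_{e-1} - v_e
    v = gamma * v + eta * grad
    return Theta - v, v

def adagrad_update(Theta, grad, v_sum, eta=1e-2, eps=1e-8):
    # v_e accumulates the squared gradients since the first epoch
    v_sum = v_sum + grad ** 2
    return Theta - eta * grad / np.sqrt(v_sum + eps), v_sum
\end{verbatim}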
771 | 772 | \subsection{RMSprop} 773 | 774 | Since Adagrad accumulates the squared gradients from the very first epoch, the effective learning rate of each weight is forced to monotonically decrease. This behaviour can be smoothed via the RMSprop technique, which takes 775 | \begin{align} 776 | v_{{\rm e}}&=\gamma v_{{\rm e-1}}+(1-\gamma )\left(\Delta^{\Theta}_{e}\right)^2\;, 777 | \end{align} 778 | with $\gamma$ a new parameter of the model, usually set to $0.9$. The RMSprop update rule then reads like the Adagrad one 779 | \begin{align} 780 | \Theta_e&=\Theta_{e-1}-\frac{\eta}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 781 | \end{align} 782 | $\eta$ can be set once and for all (usually to $10^{-3}$). 783 | 784 | \subsection{Adadelta} 785 | 786 | Adadelta\cite{journals/corr/abs-1212-5701} is an extension of RMSprop that aims at getting rid of the $\eta$ parameter. To do so, a new vector update is introduced 787 | \begin{align} 788 | m_{{\rm e}}&=\gamma m_{{\rm e-1}}+(1-\gamma ) 789 | % 790 | \left(\frac{\sqrt{m_{{\rm e-1}}+\epsilon}}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\right)^2\;, 791 | \end{align} 792 | and the new update rule for the weights reads 793 | \begin{align} 794 | \Theta_e&=\Theta_{e-1}-\frac{\sqrt{m_{{\rm e-1}}+\epsilon}}{\sqrt{v_{{\rm e}}+\epsilon}}\Delta^{\Theta}_{e}\;. 795 | \end{align} 796 | The learning rate has been completely eliminated from the update rule, but the procedure for doing so is ad hoc. The next and last optimization technique presented seems more natural and is the default choice in a number of deep learning frameworks. 797 | \subsection{Adam} 798 | 799 | Adam\cite{Kingma2014} keeps track of both the gradient and its square via two epoch dependent vectors 800 | \begin{align} 801 | m_{{\rm e}}&= \beta_1 m_{{\rm e-1}}+ (1-\beta_1)\Delta^{\Theta}_{e}\;,& 802 | % 803 | v_{{\rm e}}&= \beta_2 v_{{\rm e-1}}+ (1-\beta_2)\left(\Delta^{\Theta}_{e}\right)^2\;, 804 | \end{align} 805 | with $\beta_1$ and $\beta_2$ parameters usually set to $0.9$ and $0.999$ respectively. The robustness and great strength of Adam is that it makes the whole learning process only weakly dependent on their precise values. To avoid numerical problems during the first steps, these vectors are rescaled 806 | \begin{align} 807 | \hat{m}_{{\rm e}}&= \frac{m_{{\rm e}}}{1-\beta_1^{e}}\;,& 808 | % 809 | \hat{v}_{{\rm e}}&= \frac{v_{{\rm e}}}{1-\beta_2^{e}}\;. 810 | \end{align} 811 | before entering the update rule 812 | \begin{align} 813 | \Theta_e&=\Theta_{e-1}-\frac{\eta }{\sqrt{\hat{v}_{{\rm e}}+\epsilon}}\hat{m}_{{\rm e}}\;. 814 | \end{align} 815 | This is the optimization technique implicitly used throughout this note, alongside a learning rate decay 816 | \begin{align} 817 | \eta_e&=e^{-\alpha_0}\eta_{e-1}\;, 818 | \end{align} 819 | with $\alpha_0$ determined by cross-validation, and $\eta_0$ usually initialized in the range $10^{-3}-10^{-2}$. 820 | 821 | \section{Weight initialization} 822 | 823 | Without any regularization, training a neural network can be a daunting task because of the fine-tuning of the weights' initial conditions. This is one of the reasons why neural networks went through periods of being out of fashion.
Since dropout and Batch normalization, this issue is less pronounced, but one should not initialize the weight in a symmetric fashion (all zero for instance), nor should one initialize them too large. A good heuristic is 824 | \begin{align} 825 | \left[\Theta^{(\nu)f'}_f\right]_{{\rm init}}&=\sqrt{\frac{6}{F_\nu+F_{\nu+1}}}\times\mathcal{N}(0,1)\;. 826 | \end{align} 827 | 828 | \begin{subappendices} 829 | \section{Backprop through the output layer} \label{sec:appenbpoutput} 830 | 831 | Recalling the MSE loss function 832 | \begin{align} 833 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{f=0}^{F_N-1} 834 | % 835 | \left(y_f^{(t)}-h_{f}^{(t)(N)}\right)^2\;, 836 | \end{align} 837 | we instantaneously get 838 | \begin{align} 839 | \delta^{(t)(N-1)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N)}-y_f^{(t)}\right)\;. 840 | \end{align} 841 | Things are more complicated for the cross-entropy loss function of a regression problem transformed into a multi-classification task. 842 | Assuming that we have $C$ classes for all the values that we are trying to predict, we get 843 | \begin{align} 844 | \delta^{(t)(N-1)}_{fc}&= \frac{\partial }{\partial a_{fc}^{(t)(N-1)}}J(\Theta) 845 | % 846 | =\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_N-1}\sum_{d=0}^{C-1} 847 | % 848 | \frac{\partial h_{f'd}^{(t')(N)}}{\partial a_{fc}^{(t)(N-1)}} 849 | % 850 | \frac{\partial }{\partial h_{f'd}^{(t')(N)}}J(\Theta)\;. 851 | \end{align} 852 | Now 853 | \begin{align} 854 | \frac{\partial }{\partial h_{f'd}^{(t')(N)}}J(\Theta)&=-\frac{\delta^{d}_{ y_{f'}^{(t')}}}{T_{{\rm mb}} h_{f'd}^{(t')(N)}}\;, 855 | \end{align} 856 | and 857 | \begin{align} 858 | \frac{\partial h_{f'd}^{(t')(N)}}{\partial a_{fc}^{(t)(N-1)}}&= 859 | % 860 | \delta^f_{f'}\delta^{t}_{t'} \left(\delta^c_d h_{fc}^{(t)(N)}- h_{fc}^{(t)(N)} h_{fd}^{(t)(N)}\right)\;, 861 | \end{align} 862 | so that 863 | \begin{align} 864 | \delta^{(t)(N-1)}_{fc}&=-\frac{1}{T_{{\rm mb}}} \sum_{d=0}^{C-1}\frac{\delta^{d}_{ y_f^{(t)}}}{h_{fd}^{(t)(N)}} 865 | % 866 | \left(\delta^c_d h_{fc}^{(t)(N)}- h_{fc}^{(t)(N)} h_{fd}^{(t)(N)}\right)\notag\\ 867 | % 868 | &=\frac{1}{T_{{\rm mb}}}\left( h_{fc}^{(t)(N)}-\delta^{c}_{ y_f^{(t)}}\right)\;. 869 | \end{align} 870 | For a true classification problem, we easily deduce 871 | \begin{align} 872 | \delta^{(t)(N-1)}_{fc}&=\frac{1}{T_{{\rm mb}}}\left( h_{f}^{(t)(N)}-\delta^{f}_{ y^{(t)}}\right)\;. 
873 | \end{align} 874 | 875 | \section{Backprop through hidden layers} \label{sec:appenbplayers} 876 | 877 | To go further we need 878 | \begin{align} 879 | \delta^{(t)(\nu)}_f&= \frac{\partial }{\partial a_{f}^{(t)(\nu)}}J^{(t)}(\Theta)= 880 | % 881 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1} 882 | % 883 | \frac{\partial a_{f'}^{(t')(\nu+1)}}{\partial a_{f}^{(t)(\nu)}} \delta^{(t')(\nu+1)}_{f'}\notag\\ 884 | % 885 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\sum^{F_\nu}_{f''=0}\Theta^{(\nu+1)f'}_{f''} 886 | % 887 | \frac{\partial y^{(t')(\nu)}_{f''} }{\partial a_{f}^{(t)(\nu)}} \delta^{(t')(\nu+1)}_{f'}\notag\\ 888 | % 889 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\sum^{F_\nu}_{f''=0}\Theta^{(\nu+1)f'}_{f''} 890 | % 891 | \frac{\partial y^{(t')(\nu)}_{f''} }{\partial h_{f}^{(t)(\nu+1)}} 892 | % 893 | g'\left(a_{f}^{(t)(\nu)}\right) \delta^{(t')(\nu+1)}_{f'}\;, 894 | \end{align} 895 | so that 896 | \begin{align} 897 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 898 | % 899 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f}J^{(tt')(\nu)}_{f} \delta^{(t)(\nu+1)}_{f'}\;, 900 | \end{align} 901 | 902 | 903 | \section{Backprop through BatchNorm} \label{sec:appenbatchnorm} 904 | 905 | 906 | We saw in section \ref{sec:Backpropbatchnorm} that batch normalization implies among other things to compute the following gradient. 907 | \begin{align} 908 | \frac{\partial y^{(t')(\nu)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}&= 909 | % 910 | \gamma^{(\nu)}_f\frac{\partial \tilde{h}_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}\;. 911 | \end{align} 912 | We propose to do just that in this section. Firstly 913 | \begin{align} 914 | \frac{\partial h^{(t')(\nu+1)}_{f'}}{\partial h_{f}^{(t)(\nu+1)}}&=\delta^{t'}_t\delta^{f'}_f\;,& 915 | % 916 | \frac{\partial \hat{h}_{f'}^{(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&=\frac{\delta^{f'}_f}{T_{{\rm mb}}}\;. 917 | \end{align} 918 | Secondly 919 | \begin{align} 920 | \frac{\partial \left(\hat{\sigma}_{f'}^{(\nu)}\right)^2}{\partial h_{f}^{(t)(\nu+1)}}&= 921 | % 922 | \frac{2\delta^{f'}_f}{T_{{\rm mb}}}\left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)\;, 923 | \end{align} 924 | so that we get 925 | \begin{align} 926 | \frac{\partial \tilde{h}_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&= 927 | % 928 | \frac{\delta^{f'}_f}{T_{{\rm mb}}}\left[\frac{T_{{\rm mb}}\delta^{t'}_t-1} 929 | % 930 | {\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12}- 931 | % 932 | \frac{\left(h_{f}^{(t')(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)\left(h_{f}^{(t)(\nu+1)}-\hat{h}_{f}^{(\nu)}\right)} 933 | % 934 | {\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac32}\right]\notag\\ 935 | % 936 | &=\frac{\delta^{f'}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12} 937 | % 938 | \left[\delta^{t'}_t- 939 | % 940 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 941 | \end{align} 942 | To ease the notation recall that we denoted 943 | \begin{align} 944 | \tilde{\gamma}^{(\nu)}_f&= 945 | % 946 | \frac{\gamma^{(\nu)}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu)}\right)^2+\epsilon\right)^\frac12}\;. 947 | \end{align} 948 | % 949 | % 950 | so that 951 | \begin{align} 952 | \frac{\partial y_{f'}^{(t)(\nu)}}{\partial h_{f}^{(t)(\nu+1)}}&= 953 | % 954 | \tilde{\gamma}^{(\nu)}_f \delta^{f'}_f\left[\delta^{t'}_t- 955 | % 956 | \frac{1+\tilde{h}_{f}^{(t')(\nu)}\tilde{h}_{f}^{(t)(\nu)}}{T_{{\rm mb}}}\right]\;. 
957 | \end{align} 958 | 959 | 960 | 961 | \section{FNN ResNet (non standard presentation)} \label{sec:ResnetFNN} 962 | 963 | The state of the art architecture of convolutional neural networks (CNN, to be explained in chapter \ref{sec:chapterCNN}) is called ResNet\cite{He2015}. Its name comes from its philosophy: each hidden layer output $y$ of the network is a small -- hence the term residual -- modification of its input ($y=x+F(x)$), instead of a total modification ($y=H(x)$) of its input $x$. This philosophy can be imported to the FNN case. Representing the operations of weight averaging, activation function and batch normalization in the following way 964 | 965 | \begin{figure}[H] 966 | \begin{center} 967 | \begin{tikzpicture} 968 | \node at (0,0) {\includegraphics[scale=1]{fc_equiv}}; 969 | \end{tikzpicture} 970 | \end{center} 971 | \caption{\label{fig:fc_equiv} Schematic representation of one FNN fully connected layer.} 972 | \end{figure} 973 | 974 | In its non standard form presented in this section, the residual operation amounts to add a skip connection to two consecutive full layers 975 | 976 | 977 | \begin{figure}[H] 978 | \begin{center} 979 | \begin{tikzpicture} 980 | \node at (0,0) {\includegraphics[scale=1]{fc_resnet_2}}; 981 | \end{tikzpicture} 982 | \end{center} 983 | \caption{\label{fig:fc_resnet_2} Residual connection in a FNN.} 984 | \end{figure} 985 | 986 | Mathematically, we had before (calling the input $y^{(t)(\nu-1)}$) 987 | 988 | \begin{align} 989 | y_{f}^{(t)(\nu+1)}&=\gamma_f^{(\nu+1)}\tilde{h}_f^{(t)(\nu+2)}+\beta_f^{(\nu+1)}\;,& 990 | % 991 | a_{f}^{(t)(\nu+1)}&=\sum^{F_{\nu}-1}_{f'=0}\Theta^{(\nu+1)f}_{f'}y_{f}^{(t)(\nu)}\notag\\ 992 | % 993 | y_{f}^{(t)(\nu)}&=\gamma_f^{(\nu)}\tilde{h}_f^{(t)(\nu+1)}+\beta_f^{(\nu)}\;,& 994 | % 995 | a_{f}^{(t)(\nu)}&=\sum^{F_{\nu-1}-1}_{f'=0}\Theta^{(\nu)f}_{f'}y_{f}^{(t)(\nu-1)}\;, 996 | \end{align} 997 | as well as $h^{(t)(\nu+2)}_f=g\left(a_{f}^{(t)(\nu+1)}\right)$ and $h^{(t)(\nu+1)}_f=g\left(a_{f}^{(t)(\nu)}\right)$. In ResNet, we now have the slight modification 998 | \begin{align} 999 | y_{f}^{(t)(\nu+1)}&=\gamma_f^{(\nu+1)}\tilde{h}_f^{\nu+2}+\beta_f^{(\nu+1)}+y^{(t)(\nu-1)}_{f}\;. 1000 | \end{align} 1001 | The choice of skipping two and not just one layer has become a standard for empirical reasons, so as the decision not to weight the two paths (the trivial skip one and the two FNN layer one) by a parameter to be learned by backpropagation 1002 | \begin{align} 1003 | y_{f}^{(t)(\nu+1)}&=\alpha\left(\gamma_f^{(\nu+1)}\tilde{h}_f^{(t)(\nu+2)}+\beta_f^{(\nu+1)}\right) 1004 | % 1005 | +\left( 1-\alpha\right)y^{(t)(\nu-1)}_{f'}\;. 1006 | \end{align} 1007 | This choice is called highway nets\cite{citeulike:14070430}, and it remains to be theoretically understood why it leads to worse performance than ResNet, as the latter is a particular instance of the former. 
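To make the residual block of figure \ref{fig:fc_resnet_2} concrete, here is a minimal NumPy sketch of its forward pass, with batch normalization reduced to the per-feature standardization and $\gamma,\beta$ rescaling used in this chapter (all names are illustrative):

\begin{verbatim}
import numpy as np

def bn(a, gamma, beta, eps=1e-8):
    """Per-feature batch normalization over the mini-batch axis (axis=1)."""
    mean = a.mean(axis=1, keepdims=True)
    var = a.var(axis=1, keepdims=True)
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def residual_block(y_in, theta1, theta2, g1, b1, g2, b2):
    """y_in has shape (F, T_mb); two WA-AF-BN blocks plus the skip connection."""
    y1 = bn(np.tanh(theta1 @ y_in), g1, b1)   # first fully connected layer
    y2 = bn(np.tanh(theta2 @ y1), g2, b2)     # second fully connected layer
    return y2 + y_in                          # residual: output = F(input) + input

F, T_mb = 4, 8
rng = np.random.default_rng(0)
y = rng.normal(size=(F, T_mb))
theta1 = rng.normal(size=(F, F)) / np.sqrt(F)
theta2 = rng.normal(size=(F, F)) / np.sqrt(F)
ones, zeros = np.ones((F, 1)), np.zeros((F, 1))
out = residual_block(y, theta1, theta2, ones, zeros, ones, zeros)   # shape (4, 8)
\end{verbatim}

The two paths are simply summed with no learned weighting, in contrast with the highway net variant just mentioned.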
Going back to the ResNet backpropagation algorithm, this changes the gradient through the skip connection in the following way 1008 | \begin{align} 1009 | \delta^{(t)(\nu-1)}_f&= 1010 | % 1011 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1} 1012 | % 1013 | \frac{\partial a_{f'}^{(t')(\nu)}}{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu)}_{f'} 1014 | % 1015 | +\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1} 1016 | % 1017 | \frac{\partial a_{f'}^{(t')(\nu+2)}}{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu+2)}_{f'}\notag\\ 1018 | % 1019 | &=\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu)f'}_{f''} 1020 | % 1021 | \frac{\partial y^{(t')(\nu-1)}_{f''} }{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu)}_{f'}\notag\\ 1022 | % 1023 | &+\sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu+2)f'}_{f''} 1024 | % 1025 | \frac{\partial y^{(t')(\nu+1)}_{f''} }{\partial a_{f}^{(t)(\nu-1)}} \delta^{(t')(\nu+2)}_{f'}\notag\\ 1026 | % 1027 | &=g'\left(a_{f}^{(t)(\nu-1)}\right) 1028 | % 1029 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu)f'}_{f''} 1030 | % 1031 | J^{(tt')(\nu)}_{f} \delta^{(t')(\nu)}_{f'}\notag\\ 1032 | % 1033 | &+g'\left(a_{f}^{(t)(\nu-1)}\right) 1034 | % 1035 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum_{f'=0}^{F_{\nu+2}-1}\sum^{F_{\nu-1}-1}_{f''=0}\Theta^{(\nu+2)f'}_{f''} 1036 | % 1037 | J^{(tt')(\nu)}_{f} \delta^{(t')(\nu+2)}_{f'}\;, 1038 | \end{align} 1039 | so that 1040 | \begin{align} 1041 | \delta^{(t)(\nu-1)}_f&=g'\left(a_{f}^{(t)(\nu-1)}\right) 1042 | % 1043 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum^{F_{\nu-1}-1}_{f''=0}J^{(tt')(\nu)}_{f}\notag\\ 1044 | % 1045 | &\times\left[\sum_{f'=0}^{F_{\nu}-1}\Theta^{(\nu)f'}_{f''}\delta^{(t')(\nu)}_{f'}+ 1046 | % 1047 | \sum_{f'=0}^{F_{\nu+2}-1}\Theta^{(\nu+2)f'}_{f''}\delta^{(t')(\nu+2)}_{f'}\right]\;. 1048 | \end{align} 1049 | 1050 | This formulation has one advantage: it totally preserves the usual FNN layer structure of a weight averaging (WA) followed by an activation function (AF) and then a batch normalization operation (BN). It nevertheless has one disadvantage: the backpropagation gradient does not really flow smoothly from one error rate to the other. In the following section we will present the standard ResNet formulation of that takes the problem the other way around : it allows the gradient to flow smoothly at the cost of "breaking" the natural FNN building block. 1051 | 1052 | \section{FNN ResNet (more standard presentation)} \label{sec:ResnetFNN2} 1053 | 1054 | \begin{figure}[H] 1055 | \begin{center} 1056 | \begin{tikzpicture} 1057 | \node at (0,0) {\includegraphics[scale=1]{fc_resnet_3}}; 1058 | \end{tikzpicture} 1059 | \end{center} 1060 | \caption{\label{fig:fc_resnet_3} Residual connection in a FNN, trivial gradient flow through error rates.} 1061 | \end{figure} 1062 | 1063 | In the more standard form of ResNet, the skip connections reads 1064 | \begin{align} 1065 | a_{f}^{(t)(\nu+2)}&= a_{f}^{(t)(\nu+2)}+ a_{f}^{(t)(\nu)}\;, 1066 | \end{align} 1067 | and the updated error rate reads 1068 | \begin{align} 1069 | \delta^{(t)(\nu)}_f&=g'\left(a_{f}^{(t)(\nu)}\right) 1070 | % 1071 | \sum_{t'=0}^{T_{{\rm mb}}-1}\sum^{F_{\nu}-1}_{f''=0}J^{(tt')(\nu)}_{f} 1072 | % 1073 | \sum_{f'=0}^{F_{\nu+1}-1}\Theta^{(\nu+1)f'}_{f''}\delta^{(t')(\nu+1)}_{f'}+\delta^{(t')(\nu+2)}_{f}\;. 
1074 | \end{align} 1075 | 1076 | 1077 | \section{Matrix formulation} 1078 | 1079 | In all this chapter, we adopted an "index" formulation of the FNN. This has upsides and downsides. On the positive side, one can take the formula as written here and go implement them. On the downside, they can be quite cumbersome to read. 1080 | 1081 | \vspace{0.2cm} 1082 | 1083 | Another FNN formulation is therefore possible: a matrix one. To do so, one has to rewrite 1084 | \begin{align} 1085 | h_f^{(t)(\nu)}\mapsto h^{(\nu)}_{ft}&\mapsto h^{(\nu)}\in \mathcal{M}(F_\nu,T_{\rm mb})\;. 1086 | \end{align} 1087 | In this case the weight averaging procedure (\ref{eq:Weightavg}) can be written as 1088 | \begin{align} 1089 | a_f^{(t)(\nu)}=\sum_{f'=0}^{F_\nu-1}\Theta^{(\nu)f}_{f'}h^{(\nu)}_{f't}&\mapsto a^{(\nu)}=\Theta^{(\nu)}h^{(\nu)}\;. 1090 | \end{align} 1091 | The upsides and downsides of this formulation are the exact opposite of the index one: what we gained in readability, we lost in terms of direct implementation in low level programming languages (C for instance). For FNN, one can use a high level programming language (like python), but this will get quite intractable when we talk about Convolutional networks. Since the whole point of the present work was to introduce the index notation, and as one can easily find numerous derivation of the backpropagation update rules in matrix form, we will stick with the index notation in all the following, and now turn our attention to convolutional networks. 1092 | \end{subappendices} 1093 | -------------------------------------------------------------------------------- /chapter3.tex: -------------------------------------------------------------------------------- 1 | \chapter{Recurrent Neural Networks} \label{sec:chapterRNN} 2 | 3 | \minitoc 4 | 5 | \section{Introduction} 6 | 7 | \yinipar{\fontsize{60pt}{72pt}\usefont{U}{Kramer}{xl}{n}I}n this chapter, we review a third kind of Neural Network architecture: Recurrent Neural Networks\cite{GravesA2016}. By contrast with the CNN, this kind of network introduces a real architecture novelty : instead of forwarding only in a "spatial" direction, the data are also forwarded in a new -- time dependent -- direction. We will present the first Recurrent Neural Network (RNN) architecture, as well as the current most popular one: the Long Short Term Memory (LSTM) Neural Network. 
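Before detailing the architecture, the "spatial plus temporal" propagation can be visualized with a minimal sketch in which a hidden state indexed by a layer index and a time index is filled by two nested loops (the precise update rules are given in the following sections; the NumPy code below, with made-up names, only illustrates the ordering of the computations):

\begin{verbatim}
import numpy as np

N, T, F, T_mb = 4, 8, 5, 2          # spatial depth, time steps, features, batch size
rng = np.random.default_rng(0)
x = rng.normal(size=(T, F, T_mb))   # one input slice per time step

theta_nu = rng.normal(size=(N, F, F)) / np.sqrt(F)    # "spatial" weights
theta_tau = rng.normal(size=(N, F, F)) / np.sqrt(F)   # "temporal" weights

h = np.zeros((N + 1, T, F, T_mb))   # h[nu, tau]: hidden state of layer nu at time tau
h[0] = x                            # layer 0 holds the input
for tau in range(T):                # propagate in the temporal direction
    for nu in range(1, N + 1):      # propagate in the spatial direction
        h_past = h[nu, tau - 1] if tau > 0 else np.zeros((F, T_mb))
        h[nu, tau] = np.tanh(theta_nu[nu - 1] @ h[nu - 1, tau]
                             + theta_tau[nu - 1] @ h_past)
\end{verbatim}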
8 | 9 | \section{RNN-LSTM architecture} 10 | 11 | \subsection{Forward pass in a RNN-LSTM} 12 | 13 | In figure \ref{fig:1}, we present the RNN architecture in a schematic way 14 | 15 | \begin{figure}[H] 16 | \begin{center} 17 | \begin{tikzpicture} 18 | \node at (0,0) [rectangle,draw,fill=gray!0!white] (h00) {$h^{(00)}$}; 19 | \node at (0,1.5) [rectangle,draw,fill=gray!30!white] (h01) {$h^{(10)}$}; 20 | \node at (0,3) [rectangle,draw,fill=gray!30!white] (h02) {$h^{(20)}$}; 21 | \node at (0,4.5) [rectangle,draw,fill=gray!30!white] (h03) {$h^{(30)}$}; 22 | \node at (0,6) [rectangle,draw,fill=gray!0!white] (h04) {$h^{(40)}$}; 23 | % 24 | \node at (1.8,0) [rectangle,draw,fill=gray!0!white] (h10) {$h^{(01)}$}; 25 | \node at (1.8,1.5) [rectangle,draw,fill=gray!70!white] (h11) {$h^{(11)}$}; 26 | \node at (1.8,3) [rectangle,draw,fill=gray!70!white] (h12) {$h^{(21)}$}; 27 | \node at (1.8,4.5) [rectangle,draw,fill=gray!70!white] (h13) {$h^{(31)}$}; 28 | \node at (1.8,6) [rectangle,draw,fill=gray!0!white] (h14) {$h^{(41)}$}; 29 | % 30 | \node at (3.6,0) [rectangle,draw,fill=gray!0!white] (h20) {$h^{(02)}$}; 31 | \node at (3.6,1.5) [rectangle,draw,fill=gray!70!white] (h21) {$h^{(12)}$}; 32 | \node at (3.6,3) [rectangle,draw,fill=gray!70!white] (h22) {$h^{(22)}$}; 33 | \node at (3.6,4.5) [rectangle,draw,fill=gray!70!white] (h23) {$h^{(32)}$}; 34 | \node at (3.6,6) [rectangle,draw,fill=gray!0!white] (h24) {$h^{(42)}$}; 35 | % 36 | \node at (5.4,0) [rectangle,draw,fill=gray!0!white] (h30) {$h^{(03)}$}; 37 | \node at (5.4,1.5) [rectangle,draw,fill=gray!70!white] (h31) {$h^{(13)}$}; 38 | \node at (5.4,3) [rectangle,draw,fill=gray!70!white] (h32) {$h^{(23)}$}; 39 | \node at (5.4,4.5) [rectangle,draw,fill=gray!70!white] (h33) {$h^{(33)}$}; 40 | \node at (5.4,6) [rectangle,draw,fill=gray!0!white] (h34) {$h^{(43)}$}; 41 | % 42 | \node at (7.2,0) [rectangle,draw,fill=gray!0!white] (h40) {$h^{(04)}$}; 43 | \node at (7.2,1.5) [rectangle,draw,fill=gray!70!white] (h41) {$h^{(14)}$}; 44 | \node at (7.2,3) [rectangle,draw,fill=gray!70!white] (h42) {$h^{(24)}$}; 45 | \node at (7.2,4.5) [rectangle,draw,fill=gray!70!white] (h43) {$h^{(34)}$}; 46 | \node at (7.2,6) [rectangle,draw,fill=gray!0!white] (h44) {$h^{(44)}$}; 47 | % 48 | \node at (9,0) [rectangle,draw,fill=gray!0!white] (h50) {$h^{(05)}$}; 49 | \node at (9,1.5) [rectangle,draw,fill=gray!70!white] (h51) {$h^{(15)}$}; 50 | \node at (9,3) [rectangle,draw,fill=gray!70!white] (h52) {$h^{(25)}$}; 51 | \node at (9,4.5) [rectangle,draw,fill=gray!70!white] (h53) {$h^{(35)}$}; 52 | \node at (9,6) [rectangle,draw,fill=gray!0!white] (h54) {$h^{(45)}$}; 53 | % 54 | \node at (10.8,0) [rectangle,draw,fill=gray!0!white] (h60) {$h^{(06)}$}; 55 | \node at (10.8,1.5) [rectangle,draw,fill=gray!70!white] (h61) {$h^{(16)}$}; 56 | \node at (10.8,3) [rectangle,draw,fill=gray!70!white] (h62) {$h^{(26)}$}; 57 | \node at (10.8,4.5) [rectangle,draw,fill=gray!70!white] (h63) {$h^{(36)}$}; 58 | \node at (10.8,6) [rectangle,draw,fill=gray!0!white] (h64) {$h^{(46)}$}; 59 | % 60 | \node at (12.6,0) [rectangle,draw,fill=gray!0!white] (h70) {$h^{(07)}$}; 61 | \node at (12.6,1.5) [rectangle,draw,fill=gray!70!white] (h71) {$h^{(17)}$}; 62 | \node at (12.6,3) [rectangle,draw,fill=gray!70!white] (h72) {$h^{(27)}$}; 63 | \node at (12.6,4.5) [rectangle,draw,fill=gray!70!white] (h73) {$h^{(37)}$}; 64 | \node at (12.6,6) [rectangle,draw,fill=gray!0!white] (h74) {$h^{(47)}$}; 65 | % 66 | % 67 | \draw[-stealth] (h00) -- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(1)}$} (h01); 68 | \draw[-stealth] (h01) 
-- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(2)}$} (h02); 69 | \draw[-stealth] (h02) -- node[pos=0.5,anchor=east,scale=1] {$\Theta^{\nu(3)}$} (h03); 70 | \draw[dotted,-stealth] (h03) -- node [pos=0.5,anchor = east] {$\Theta$} (h04); 71 | % 72 | \draw[-stealth] (h10) -- (h11); 73 | \draw[-stealth] (h11) -- (h12); 74 | \draw[-stealth] (h12) -- (h13); 75 | \draw[dotted,-stealth] (h13) -- (h14); 76 | % 77 | \draw[-stealth] (h20) -- (h21); 78 | \draw[-stealth] (h21) -- (h22); 79 | \draw[-stealth] (h22) -- (h23); 80 | \draw[dotted,-stealth] (h23) -- (h24); 81 | % 82 | \draw[-stealth] (h30) -- (h31); 83 | \draw[-stealth] (h31) -- (h32); 84 | \draw[-stealth] (h32) -- (h33); 85 | \draw[dotted,-stealth] (h33) -- (h34); 86 | % 87 | \draw[-stealth] (h40) -- (h41); 88 | \draw[-stealth] (h41) -- (h42); 89 | \draw[-stealth] (h42) -- (h43); 90 | \draw[dotted,-stealth] (h43) -- (h44); 91 | % 92 | \draw[-stealth] (h50) -- (h51); 93 | \draw[-stealth] (h51) -- (h52); 94 | \draw[-stealth] (h52) -- (h53); 95 | \draw[dotted,-stealth] (h53) -- (h54); 96 | % 97 | \draw[-stealth] (h60) -- (h61); 98 | \draw[-stealth] (h61) -- (h62); 99 | \draw[-stealth] (h62) -- (h63); 100 | \draw[dotted,-stealth] (h63) -- (h64); 101 | % 102 | \draw[-stealth] (h70) -- (h71); 103 | \draw[-stealth] (h71) -- (h72); 104 | \draw[-stealth] (h72) -- (h73); 105 | \draw[dotted,-stealth] (h73) -- (h74); 106 | % 107 | \draw[-stealth] (h01) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(1)}$} (h11); 108 | \draw[-stealth] (h11) -- (h21); 109 | \draw[-stealth] (h21) -- (h31); 110 | \draw[-stealth] (h31) -- (h41); 111 | \draw[-stealth] (h41) -- (h51); 112 | \draw[-stealth] (h51) -- (h61); 113 | \draw[-stealth] (h61) -- (h71); 114 | % 115 | \draw[-stealth] (h02) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(2)}$} (h12); 116 | \draw[-stealth] (h12) -- (h22); 117 | \draw[-stealth] (h22) -- (h32); 118 | \draw[-stealth] (h32) -- (h42); 119 | \draw[-stealth] (h42) -- (h52); 120 | \draw[-stealth] (h52) -- (h62); 121 | \draw[-stealth] (h62) -- (h72); 122 | % 123 | \draw[-stealth] (h03) -- node[pos=0.5,above=7pt,scale=1] {$\Theta^{\tau(3)}$} (h13); 124 | \draw[-stealth] (h13) -- (h23); 125 | \draw[-stealth] (h23) -- (h33); 126 | \draw[-stealth] (h33) -- (h43); 127 | \draw[-stealth] (h43) -- (h53); 128 | \draw[-stealth] (h53) -- (h63); 129 | \draw[-stealth] (h63) -- (h73); 130 | % 131 | \draw[very thin,densely dashed,-stealth] (h04) to[out=45,in=225] (h10); 132 | \draw[very thin,densely dashed,-stealth] (h14) to[out=45,in=225] (h20); 133 | \draw[very thin,densely dashed,-stealth] (h24) to[out=45,in=225] (h30); 134 | \draw[very thin,densely dashed,-stealth] (h34) to[out=45,in=225] (h40); 135 | \draw[very thin,densely dashed,-stealth] (h44) to[out=45,in=225] (h50); 136 | \draw[very thin,densely dashed,-stealth] (h54) to[out=45,in=225] (h60); 137 | \draw[very thin,densely dashed,-stealth] (h64) to[out=45,in=225] (h70); 138 | \end{tikzpicture} 139 | \caption{\label{fig:RNN architecture}RNN architecture, with data propagating both in "space" and in "time". In our exemple, the time dimension is of size 8 while the "spatial" one is of size 4.} 140 | \end{center} 141 | \end{figure} 142 | 143 | The real novelty of this type of neural network is that the fact that we are trying to predict a time serie is encoded in the very architecture of the network. RNN have first been introduced mostly to predict the next words in a sentance (classification task), hence the notion of ordering in time of the prediction. 
But this kind of network architecture can also be applied to regression problems. Among others things one can think of stock prices evolution, or temperature forecasting. In contrast to the precedent neural networks that we introduced, where we defined (denoting $\nu$ as in previous chapters the layer index in the spatial direction) 144 | \begin{align} 145 | a^{(t)(\nu)}_{f}&= \text{ Weight Averaging } \left(h^{(t)(\nu)}_{f}\right)\;,\notag\\ 146 | % 147 | h^{(t)(\nu+1)}_{f}&= \text{ Activation function } \left(a^{(t)(\nu)}_{f}\right)\;, 148 | \end{align} 149 | we now have the hidden layers that are indexed by both a "spatial" and a "temporal" index (with $T$ being the network dimension in this new direction), and the general philosophy of the RNN is (now the $a$ is usually characterized by a $c$ for cell state, this denotation, trivial for the basic RNN architecture will make more sense when we talk about LSTM networks) 150 | \begin{align} 151 | c^{(t)(\nu \tau )}_{f}&= \text{ Weight Averaging } \left(h^{(t)(\nu \tau-1)}_{f},h^{(t)(\nu-1\tau)}_{f}\right)\;,\notag\\ 152 | % 153 | h^{(t)(\nu\tau)}_{f}&= \text{ Activation function } \left(c^{(t)(\nu \tau)}_{f}\right)\;, 154 | \end{align} 155 | 156 | \subsection{Backward pass in a RNN-LSTM} 157 | 158 | The backward pass in a RNN-LSTM has to respect a certain time order, as illustrated in the following figure 159 | 160 | \begin{figure}[H] 161 | \begin{center} 162 | \begin{tikzpicture} 163 | \node at (0,0) [rectangle,draw,fill=gray!0!white] (h00) {$h^{(00)}$}; 164 | \node at (0,1.5) [rectangle,draw,fill=gray!70!white] (h01) {$h^{(10)}$}; 165 | \node at (0,3) [rectangle,draw,fill=gray!70!white] (h02) {$h^{(20)}$}; 166 | \node at (0,4.5) [rectangle,draw,fill=gray!70!white] (h03) {$h^{(30)}$}; 167 | \node at (0,6) [rectangle,draw,fill=gray!0!white] (h04) {$h^{(40)}$}; 168 | % 169 | \node at (1.8,0) [rectangle,draw,fill=gray!0!white] (h10) {$h^{(01)}$}; 170 | \node at (1.8,1.5) [rectangle,draw,fill=gray!70!white] (h11) {$h^{(11)}$}; 171 | \node at (1.8,3) [rectangle,draw,fill=gray!70!white] (h12) {$h^{(21)}$}; 172 | \node at (1.8,4.5) [rectangle,draw,fill=gray!70!white] (h13) {$h^{(31)}$}; 173 | \node at (1.8,6) [rectangle,draw,fill=gray!0!white] (h14) {$h^{(41)}$}; 174 | % 175 | \node at (3.6,0) [rectangle,draw,fill=gray!0!white] (h20) {$h^{(02)}$}; 176 | \node at (3.6,1.5) [rectangle,draw,fill=gray!70!white] (h21) {$h^{(12)}$}; 177 | \node at (3.6,3) [rectangle,draw,fill=gray!70!white] (h22) {$h^{(22)}$}; 178 | \node at (3.6,4.5) [rectangle,draw,fill=gray!70!white] (h23) {$h^{(32)}$}; 179 | \node at (3.6,6) [rectangle,draw,fill=gray!0!white] (h24) {$h^{(42)}$}; 180 | % 181 | \node at (5.4,0) [rectangle,draw,fill=gray!0!white] (h30) {$h^{(03)}$}; 182 | \node at (5.4,1.5) [rectangle,draw,fill=gray!70!white] (h31) {$h^{(13)}$}; 183 | \node at (5.4,3) [rectangle,draw,fill=gray!70!white] (h32) {$h^{(23)}$}; 184 | \node at (5.4,4.5) [rectangle,draw,fill=gray!70!white] (h33) {$h^{(33)}$}; 185 | \node at (5.4,6) [rectangle,draw,fill=gray!0!white] (h34) {$h^{(43)}$}; 186 | % 187 | \node at (7.2,0) [rectangle,draw,fill=gray!0!white] (h40) {$h^{(04)}$}; 188 | \node at (7.2,1.5) [rectangle,draw,fill=gray!70!white] (h41) {$h^{(14)}$}; 189 | \node at (7.2,3) [rectangle,draw,fill=gray!70!white] (h42) {$h^{(24)}$}; 190 | \node at (7.2,4.5) [rectangle,draw,fill=gray!70!white] (h43) {$h^{(34)}$}; 191 | \node at (7.2,6) [rectangle,draw,fill=gray!0!white] (h44) {$h^{(44)}$}; 192 | % 193 | \node at (9,0) [rectangle,draw,fill=gray!0!white] (h50) {$h^{(05)}$}; 194 | 
\node at (9,1.5) [rectangle,draw,fill=gray!70!white] (h51) {$h^{(15)}$}; 195 | \node at (9,3) [rectangle,draw,fill=gray!70!white] (h52) {$h^{(25)}$}; 196 | \node at (9,4.5) [rectangle,draw,fill=gray!70!white] (h53) {$h^{(35)}$}; 197 | \node at (9,6) [rectangle,draw,fill=gray!0!white] (h54) {$h^{(45)}$}; 198 | % 199 | \node at (10.8,0) [rectangle,draw,fill=gray!0!white] (h60) {$h^{(06)}$}; 200 | \node at (10.8,1.5) [rectangle,draw,fill=gray!70!white] (h61) {$h^{(16)}$}; 201 | \node at (10.8,3) [rectangle,draw,fill=gray!70!white] (h62) {$h^{(26)}$}; 202 | \node at (10.8,4.5) [rectangle,draw,fill=gray!70!white] (h63) {$h^{(36)}$}; 203 | \node at (10.8,6) [rectangle,draw,fill=gray!0!white] (h64) {$h^{(46)}$}; 204 | % 205 | \node at (12.6,0) [rectangle,draw,fill=gray!0!white] (h70) {$h^{(07)}$}; 206 | \node at (12.6,1.5) [rectangle,draw,fill=gray!30!white] (h71) {$h^{(17)}$}; 207 | \node at (12.6,3) [rectangle,draw,fill=gray!30!white] (h72) {$h^{(27)}$}; 208 | \node at (12.6,4.5) [rectangle,draw,fill=gray!30!white] (h73) {$h^{(37)}$}; 209 | \node at (12.6,6) [rectangle,draw,fill=gray!0!white] (h74) {$h^{(47)}$}; 210 | % 211 | % 212 | \draw[stealth-] (h00) -- (h01); 213 | \draw[stealth-] (h01) -- (h02); 214 | \draw[stealth-] (h02) -- (h03); 215 | \draw[dotted,stealth-] (h03) -- (h04); 216 | % 217 | \draw[stealth-] (h10) -- (h11); 218 | \draw[stealth-] (h11) -- (h12); 219 | \draw[stealth-] (h12) -- (h13); 220 | \draw[dotted,stealth-] (h13) -- (h14); 221 | % 222 | \draw[stealth-] (h20) -- (h21); 223 | \draw[stealth-] (h21) -- (h22); 224 | \draw[stealth-] (h22) -- (h23); 225 | \draw[dotted,stealth-] (h23) -- (h24); 226 | % 227 | \draw[stealth-] (h30) -- (h31); 228 | \draw[stealth-] (h31) -- (h32); 229 | \draw[stealth-] (h32) -- (h33); 230 | \draw[dotted,stealth-] (h33) -- (h34); 231 | % 232 | \draw[stealth-] (h40) -- (h41); 233 | \draw[stealth-] (h41) -- (h42); 234 | \draw[stealth-] (h42) -- (h43); 235 | \draw[dotted,stealth-] (h43) -- (h44); 236 | % 237 | \draw[stealth-] (h50) -- (h51); 238 | \draw[stealth-] (h51) -- (h52); 239 | \draw[stealth-] (h52) -- (h53); 240 | \draw[dotted,stealth-] (h53) -- (h54); 241 | % 242 | \draw[stealth-] (h60) -- (h61); 243 | \draw[stealth-] (h61) -- (h62); 244 | \draw[stealth-] (h62) -- (h63); 245 | \draw[dotted,stealth-] (h63) -- (h64); 246 | % 247 | \draw[stealth-] (h70) -- (h71); 248 | \draw[stealth-] (h71) -- (h72); 249 | \draw[stealth-] (h72) -- (h73); 250 | \draw[dotted,stealth-] (h73) -- (h74); 251 | % 252 | \draw[stealth-] (h01) -- (h11); 253 | \draw[stealth-] (h11) -- (h21); 254 | \draw[stealth-] (h21) -- (h31); 255 | \draw[stealth-] (h31) -- (h41); 256 | \draw[stealth-] (h41) -- (h51); 257 | \draw[stealth-] (h51) -- (h61); 258 | \draw[stealth-] (h61) -- (h71); 259 | % 260 | \draw[stealth-] (h02) -- (h12); 261 | \draw[stealth-] (h12) -- (h22); 262 | \draw[stealth-] (h22) -- (h32); 263 | \draw[stealth-] (h32) -- (h42); 264 | \draw[stealth-] (h42) -- (h52); 265 | \draw[stealth-] (h52) -- (h62); 266 | \draw[stealth-] (h62) -- (h72); 267 | % 268 | \draw[stealth-] (h03) -- (h13); 269 | \draw[stealth-] (h13) -- (h23); 270 | \draw[stealth-] (h23) -- (h33); 271 | \draw[stealth-] (h33) -- (h43); 272 | \draw[stealth-] (h43) -- (h53); 273 | \draw[stealth-] (h53) -- (h63); 274 | \draw[stealth-] (h63) -- (h73); 275 | \end{tikzpicture} 276 | \caption{\label{fig:rnnback}Architecture taken, backward pass. 
One cannot compute the gradient of a layer without having computed the gradients of the layers that flow into it.} 277 | \end{center} 278 | \end{figure} 279 | 280 | 281 | With this in mind, let us now see in detail the implementation of a RNN and its advanced cousin, the Long Short Term Memory (LSTM)-RNN. 282 | 283 | \section{Extreme Layers and loss function} 284 | 285 | These parts of the RNN-LSTM network only experience trivial modifications. Let us review them. 286 | 287 | \subsection{Input layer} 288 | 289 | In a RNN-LSTM, the input layer is recursively defined as 290 | \begin{align} 291 | h^{(t)(0\tau+1)}_{f}&=\left(\tilde{h}^{(t)(0\tau)}_{f},h^{(t)(N-1\tau)}_{f}\right)\;, 292 | \end{align} 293 | where $\tilde{h}^{(t)(0\tau)}_{f}$ is $h^{(t)(0\tau)}_{f}$ with the first time column removed. 294 | 295 | \subsection{Output layer} 296 | 297 | The output layer of a RNN-LSTM reads 298 | \begin{align} 299 | h^{(t)(N\tau)}_{f}&=o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^f_{f'} h^{(t)(N-1\tau)}_{f'}\right)\;, 300 | \end{align} 301 | where the output function $o$ is, as for FNNs and CNNs, either the identity (regression task) or the softmax function (classification task). 302 | 303 | \subsection{Loss function} 304 | 305 | The loss function for a regression task reads 306 | \begin{align} 307 | J(\Theta)&=\frac{1}{2T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\tau=0}^{T-1}\sum_{f=0}^{F_N-1} 308 | % 309 | \left(h^{(t)( N\tau)}_f-y^{(t)(\tau)}_f\right)^2\;, 310 | \end{align} 311 | and for a classification task 312 | \begin{align} 313 | J(\Theta)&=-\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\tau=0}^{T-1}\sum_{c=0}^{C-1} 314 | % 315 | \delta^{c}_{y^{(t)(\tau)}} \ln \left(h^{(t)( N\tau)}_c\right)\;. 316 | \end{align} 317 | 318 | 319 | \section{RNN specificities} 320 | 321 | \subsection{RNN structure} \label{sec:rnnstructure} 322 | 323 | The RNN is the most basic architecture that takes into account -- thanks to the way it is built -- the time structure of the data to be predicted. Zooming in on one hidden layer of figure \ref{fig:RNN architecture}, here is what we see for a simple Recurrent Neural Network. 324 | 325 | \begin{figure}[H] 326 | \begin{center} 327 | \begin{tikzpicture} 328 | \node[] at (0,0) {\includegraphics[scale=1.5]{RNN_structure}}; 329 | \end{tikzpicture} 330 | \caption{\label{fig:RNN hidden unit}RNN hidden unit details} 331 | \end{center} 332 | \end{figure} 333 | 334 | And here is how the output of the hidden layer represented in figure \ref{fig:RNN hidden unit} enters into the subsequent hidden units 335 | 336 | \begin{figure}[H] 337 | \begin{center} 338 | \begin{tikzpicture} 339 | \node[] at (0,0) {\includegraphics[scale=0.8]{RNN_structure-tot}}; 340 | \end{tikzpicture} 341 | \caption{\label{fig:RNN interaction}How the RNN hidden units interact with each other} 342 | \end{center} 343 | \end{figure} 344 | 345 | 346 | Let us now mathematically express what is represented in figures \ref{fig:RNN hidden unit} and \ref{fig:RNN interaction}.
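Before doing so, here is a small numerical illustration of the time-summed loss functions of the previous section (a minimal NumPy sketch; shapes and names are purely illustrative):

\begin{verbatim}
import numpy as np

T_mb, T, F_N, C = 4, 6, 3, 5
rng = np.random.default_rng(0)

# regression: outputs and targets for every sample, time step and feature
h_reg = rng.normal(size=(T_mb, T, F_N))
y_reg = rng.normal(size=(T_mb, T, F_N))
J_mse = 0.5 / T_mb * np.sum((h_reg - y_reg) ** 2)   # sum over t, tau and f

# classification: softmax outputs over C classes, integer labels in {0,...,C-1}
logits = rng.normal(size=(T_mb, T, C))
h_clf = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
y_clf = rng.integers(0, C, size=(T_mb, T))
picked = np.take_along_axis(h_clf, y_clf[..., None], axis=-1)[..., 0]
J_xent = -np.sum(np.log(picked)) / T_mb             # sum over t and tau
\end{verbatim}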
347 | 348 | \subsection{Forward pass in a RNN} 349 | 350 | In a RNN, the update rules read for the first time slice (spatial layer at the extreme left of figure \ref{fig:RNN architecture}) 351 | 352 | \begin{align} 353 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 354 | % 355 | h^{(t)(\nu-1\tau)}_{f'}\right)\;, 356 | \end{align} 357 | 358 | and for the other ones 359 | 360 | \begin{align} 361 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 362 | % 363 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{\tau(\nu)f}_{f'} 364 | % 365 | h^{(t)(\nu\tau-1)}_{f'}\right)\;. 366 | \end{align} 367 | 368 | \subsection{Backpropagation in a RNN} 369 | 370 | The backpropagation philosophy will remain unchanged : find the error rate updates, from which one can deduce the weight updates. But as for the hidden layers, the $\delta$ now have both a spatial and a temporal component. We will thus have to compute 371 | \begin{align} 372 | \delta^{(t)( \nu\tau)}_f&=\frac{\delta}{\delta h^{(t)( \nu+1\tau)}_f }J(\Theta)\;, 373 | \end{align} 374 | to deduce 375 | \begin{align} 376 | \Delta^{\Theta{\rm index}f}_{f'}&=\frac{\delta}{\delta \Delta^{\Theta{\rm index}f}_{f'} }J(\Theta)\;, 377 | \end{align} 378 | where the index can either be nothing (weights of the ouput layers), $\nu(\nu)$ (weights between two spatially connected layers) or $\tau(\nu)$ (weights between two temporally connected layers). First, it is easy to compute (in the same way as in chapter 1 for FNN) for the MSE loss function 379 | \begin{align} 380 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;, 381 | \end{align} 382 | and for the cross entropy loss function 383 | \begin{align} 384 | \delta^{(t)(N-1)}_{f}&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-\delta^f_{y^{(t)(\tau)}}\right)\;. 385 | \end{align} 386 | Calling 387 | \begin{align} 388 | \mathcal{T}_{f}^{(t)(\nu\tau)}&=1-\left(h_{f}^{(t)(\nu\tau)}\right)^2\;, 389 | \end{align} 390 | and 391 | \begin{align} 392 | \mathcal{H}^{(t')(\nu\tau)_a}_{ff'}&=\mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{a(\nu+1)f'}_{f}\;, 393 | \end{align} 394 | we show in appendix \ref{sec:rnnappenderrorrate} that (if $\tau+1$ exists, otherwise the second term is absent) 395 | \begin{align} 396 | \delta^{(t)(\nu-1\tau)}_f&= 397 | % 398 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 399 | % 400 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 401 | \end{align} 402 | where $b_0=\nu$ and $b_1=\tau$. 403 | 404 | \subsection{Weight and coefficient updates in a RNN} 405 | 406 | To complete the backpropagation algorithm, we need 407 | 408 | \begin{align} 409 | &\Delta^{\nu(\nu)f}_{f'}\;,& 410 | % 411 | &\Delta^{\tau(\nu)f}_{f'}\;,& 412 | % 413 | &\Delta^{f}_{f'}\;,& 414 | % 415 | &\Delta^{\beta(\nu \tau)}_{f}\;,& 416 | % 417 | &\Delta^{\gamma(\nu \tau)}_{f}\;. 
418 | \end{align} 419 | 420 | We show in appendix \ref{sec:rnncoefficient} that 421 | 422 | \begin{align} 423 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 424 | % 425 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu-1\tau)}_{f'}\;,\\ 426 | % 427 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 428 | % 429 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu\tau-1)}_{f'}\;,\\ 430 | % 431 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;,\\ 432 | % 433 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 434 | % 435 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 436 | % 437 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 438 | % 439 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 440 | \end{align} 441 | 442 | \section{LSTM specificities} 443 | 444 | 445 | \subsection{LSTM structure} 446 | 447 | 448 | In a Long Short Term Memory Neural Network\cite{Gers:2000:LFC:1121912.1121915}, the state of a given unit is not directly determined by its left and bottom neighbours. Instead, a cell state is updated for each hidden unit, and the output of this unit is a probe of the cell state. This formulation might seem puzzling at first, but it is philosophically similar to the ResNet approach that we briefly encountered in the appendix of chapter \ref{sec:chapterFNN}: instead of trying to fit an input with a complicated function, we try to fit a tiny variation of the input, hence allowing the gradient to flow in a smoother manner in the network. In the LSTM network, several gates are thus introduced: the input gate $i^{(t)(\nu\tau)}_f$ determines if we allow new information $g^{(t)(\nu\tau)}_f$ to enter into the cell state. The output gate $o^{(t)(\nu\tau)}_f$ determines whether we set the output hidden value to $0$ or let it probe the current cell state. Finally, the forget gate $f^{(t)(\nu\tau)}_f$ determines whether or not we forget the past cell state. All these concepts are illustrated in figure \ref{fig:Lstm1}, which is the LSTM counterpart of the RNN structure of section \ref{sec:rnnstructure}. This diagram will be explained in detail in the next section.
449 | 450 | \begin{figure}[H] 451 | \begin{center} 452 | \begin{tikzpicture} 453 | \node[] at (0,0) {\includegraphics[scale=1.7]{LSTM_structure}}; 454 | \end{tikzpicture} 455 | \caption{\label{fig:Lstm1}LSTM hidden unit details} 456 | \end{center} 457 | \end{figure} 458 | 459 | In a LSTM, the different hidden units interact in the following way 460 | 461 | \begin{figure}[H] 462 | \begin{center} 463 | \begin{tikzpicture} 464 | \node[] at (0,0) {\includegraphics[scale=0.8]{LSTM_structure-tot}}; 465 | \end{tikzpicture} 466 | \caption{\label{fig:Lstmall}How the LSTM hidden unit interact with each others} 467 | \end{center} 468 | \end{figure} 469 | 470 | 471 | \subsection{Forward pass in LSTM} 472 | 473 | Considering all the $\tau-1$ variable values to be $0$ when $\tau=0$, we get the following formula for the input, forget and output gates 474 | 475 | \begin{align} 476 | i^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_{_\nu}(\nu)f}_{f'} 477 | % 478 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\Theta^{i_{_\tau}(\nu)f}_{f'} 479 | % 480 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 481 | % 482 | f^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_{_\nu}(\nu)f}_{f'} 483 | % 484 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{f_{_\tau}(\nu)f}_{f'} 485 | % 486 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 487 | % 488 | o^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_{_\nu}(\nu)f}_{f'} 489 | % 490 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{o_{_\tau}(\nu)f}_{f'} 491 | % 492 | h^{(t)(\nu\tau-1)}_{f'}\right)\;. 493 | \end{align} 494 | The sigmoid function is the reason why the $i,f,o$ functions are called gates: they take their values between $0$ and $1$, therefore either allowing or forbidding information to pass through the next step. The cell state update is then performed in the following way 495 | \begin{align} 496 | g^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{g_{_\nu}(\nu)f}_{f'} 497 | % 498 | h^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{g_{_\tau}(\nu)f}_{f'} 499 | % 500 | h^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 501 | % 502 | c^{(t)(\nu\tau)}_{f}&= 503 | % 504 | f^{(t)(\nu\tau)}_{f}c^{(t)(\nu\tau-1)}_{f}+i^{(t)(\nu\tau)}_{f}g^{(t)(\nu\tau)}_{f}\;, 505 | \end{align} 506 | and as announced, hidden state update is just a probe of the current cell state 507 | \begin{align} 508 | h^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\tanh \left(c^{(t)(\nu\tau)}_{f}\right)\;. 509 | \end{align} 510 | 511 | These formula singularly complicates the feed forward and especially the backpropagation procedure. For completeness, we will us nevertheless carefully derive it. 
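Before doing so, the forward pass just written down can be condensed into a sketch of a single LSTM unit (a minimal NumPy version with one weight matrix per gate and per direction; the names are illustrative, and biases and batch normalization are omitted):

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(h_below, h_past, c_past, W, U):
    """One unit: h_below = h^(nu-1,tau), h_past = h^(nu,tau-1), c_past = c^(nu,tau-1).
    W[k] and U[k] are the spatial and temporal weight matrices of gate k."""
    i = sigmoid(W['i'] @ h_below + U['i'] @ h_past)   # input gate
    f = sigmoid(W['f'] @ h_below + U['f'] @ h_past)   # forget gate
    o = sigmoid(W['o'] @ h_below + U['o'] @ h_past)   # output gate
    g = np.tanh(W['g'] @ h_below + U['g'] @ h_past)   # candidate cell update
    c = f * c_past + i * g                            # new cell state
    h = o * np.tanh(c)                                # hidden state probes the cell state
    return h, c

F, T_mb = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(F, F)) / np.sqrt(F) for k in 'ifog'}
U = {k: rng.normal(size=(F, F)) / np.sqrt(F) for k in 'ifog'}
h, c = lstm_cell(rng.normal(size=(F, T_mb)),
                 np.zeros((F, T_mb)), np.zeros((F, T_mb)), W, U)
\end{verbatim}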
Let us mention in passing that recent studies tried to replace the tanh activation function of the hidden state $h^{(t)(\nu\tau)}_{f}$ and the cell update $g^{(t)(\nu\tau)}_f$ by Rectified Linear Units, and seems to report better results with a proper initialization of all the weight matrices, argued to be diagonal 512 | \begin{align} 513 | \Theta^f_{f'}(\text{init})&=\frac12\left(\delta^f_{f'}+\sqrt{\frac{6}{F_{\rm in}+F_{\rm out}}}\mathcal{N}(0,1)\right)\;, 514 | \end{align} 515 | with the bracket term here to possibly (or not) include some randomness into the initialization 516 | 517 | \subsection{Batch normalization} 518 | 519 | In batchnorm The update rules for the gates are modified as expected 520 | 521 | \begin{align} 522 | i^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_\nu(\nu-)f}_{f'} 523 | % 524 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\Theta^{i_\tau(-\nu)f}_{f'} 525 | % 526 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 527 | % 528 | f^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_\nu(\nu-)f}_{f'} 529 | % 530 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{f_\tau(-\nu)f}_{f'} 531 | % 532 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 533 | % 534 | o^{(t)(\nu\tau)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_\nu(\nu-)f}_{f'} 535 | % 536 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{o_\tau(-\nu)f}_{f'} 537 | % 538 | y^{(t)(\nu\tau-1)}_{f'}\right)\;,\\ 539 | % 540 | g^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{g_\nu(\nu-)f}_{f'} 541 | % 542 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{g_\tau(-\nu)f}_{f'} 543 | % 544 | y^{(t)(\nu\tau-1)}_{f'}\right)\;, 545 | \end{align} 546 | where 547 | \begin{align} 548 | y^{(t)(\nu\tau)}_{f}&=\gamma^{(\nu\tau)}_{f}\tilde{h}^{(t)(\nu\tau)}_{f}+\beta^{(\nu\tau)}_{f}\;, 549 | \end{align} 550 | as well as 551 | \begin{align} 552 | \tilde{h}^{(t)(\nu\tau)}_{f}&=\frac{h^{(t)(\nu\tau)}_{f}- 553 | % 554 | \hat{h}^{(\nu\tau)}_{f}}{\sqrt{\left(\sigma^{(\nu\tau)}_{f}\right)^2+\epsilon}} 555 | \end{align} 556 | and 557 | \begin{align} 558 | \hat{h}^{(\nu\tau)}_{f}&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}h^{(t)(\nu\tau)}_{f}\;,& 559 | % 560 | \left(\sigma^{(\nu\tau)}_{f}\right)^2&=\frac{1}{T_{{\rm mb}}}\sum_{t=0}^{T_{{\rm mb}}-1}\left(h^{(t)(\nu\tau)}_{f} 561 | % 562 | -\hat{h}^{(\nu\tau)}_{f}\right)^2\;. 563 | \end{align} 564 | It is important to compute a running sum for the mean and the variance, that will serve for the evaluation of the cross-validation and the test set (calling $e$ the number of iterations/epochs) 565 | \begin{align} 566 | \mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]_{e+1} &= 567 | % 568 | \frac{e\mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]_{e}+\hat{h}_{f}^{(\nu\tau)}}{e+1}\;,\\ 569 | % 570 | \mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]_{e+1} &= 571 | % 572 | \frac{e\mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]_{e}+\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2}{e+1} 573 | \end{align} 574 | and what will be used at the end is $\mathbb{E}\left[h_{f}^{(t)(\nu\tau)}\right]$ and $\frac{T_{{\rm mb}}}{T_{{\rm mb}}-1}\mathbb{V}ar\left[h_{f}^{(t)(\nu\tau)}\right]$. 
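In practice this running average can be implemented as follows (a minimal NumPy sketch under the same conventions, with illustrative names):

\begin{verbatim}
import numpy as np

T_mb, F = 8, 4
rng = np.random.default_rng(0)

run_mean = np.zeros(F)                  # E[h] accumulated over the iterations
run_var = np.zeros(F)                   # Var[h] accumulated over the iterations
for e in range(100):                    # loop over mini-batches / iterations
    h = rng.normal(size=(T_mb, F))      # h^(t)(nu tau) for one fixed (nu, tau)
    batch_mean = h.mean(axis=0)         # hat{h}
    batch_var = h.var(axis=0)           # hat{sigma}^2
    run_mean = (e * run_mean + batch_mean) / (e + 1)
    run_var = (e * run_var + batch_var) / (e + 1)

# statistics used on the cross-validation and test sets
eval_mean = run_mean
eval_var = T_mb / (T_mb - 1) * run_var  # unbiased variance correction
\end{verbatim}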
575 | 576 | 577 | 578 | \subsection{Backpropagation in a LSTM} \label{sec:appendbackproplstm} 579 | 580 | 581 | The backpropagation in a LSTM keeps the same structure as in a RNN, namely 582 | 583 | \begin{align} 584 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;, 585 | \end{align} 586 | and (shown in appendix \ref{sec:ARNNLSTMerror_rates}) 587 | \begin{align} 588 | \delta^{(t)(\nu-1\tau)}_f&= 589 | % 590 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 591 | % 592 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 593 | \end{align} 594 | What changes is the form of $\mathcal{H}$, now given by 595 | \begin{align} 596 | \mathcal{O}^{(t)(\nu\tau)}_{f}&=h^{(t)(\nu\tau)}_{f} 597 | % 598 | \left(1-o^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 599 | % 600 | \mathcal{I}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 601 | % 602 | \right)g^{(t)(\nu\tau)}_{f} i^{(t)(\nu\tau)}_{f}\left(1-i^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 603 | % 604 | \mathcal{F}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 605 | % 606 | \right)c^{(t)(\nu\tau-1)}_{f} f^{(t)(\nu\tau)}_{f}\left(1-f^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 607 | % 608 | \mathcal{G}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 609 | % 610 | \right)i^{(t)(\nu\tau)}_{f}\left(1-\left(g^{(t)(\nu\tau)}_{f}\right)^2\right)\;, 611 | \end{align} 612 | 613 | and 614 | 615 | \begin{align} 616 | H^{(t)(\nu\tau)_a}_{ff'}&=\Theta^{o_a(\nu+1)f'}_{f}\mathcal{O}^{(t)(\nu+1\tau)}_{f'} 617 | % 618 | +\Theta^{f_a(\nu+1)f'}_{f}\mathcal{F}^{(t)(\nu+1\tau)}_{f'}\notag\\ 619 | % 620 | &+\Theta^{g_a(\nu+1)f'}_{f}\mathcal{G}^{(t)(\nu+1\tau)}_{f'} 621 | % 622 | +\Theta^{i_a(\nu+1)f'}_{f}\mathcal{I}^{(t)(\nu+1\tau)}_{f'}\;. 623 | \end{align} 624 | 625 | 626 | \subsection{Weight and coefficient updates in a LSTM} 627 | 628 | As for the RNN (but with the $\mathcal{H}$ defined in section \ref{sec:appendbackproplstm}), we get for $\nu=1$ 629 | \begin{align} 630 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 631 | % 632 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}h^{(\nu-1\tau)(t)}_{f'}\;, 633 | \end{align} 634 | and otherwise 635 | \begin{align} 636 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 637 | % 638 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu-1\tau)(t)}_{f'}\;,\\ 639 | % 642 | \Delta^{\rho_\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 643 | % 644 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu\tau-1)(t)}_{f'}\;,\\ 645 | % 646 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 647 | % 648 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 649 | % 650 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 651 | % 652 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;.
653 | \end{align} 654 | 655 | and 656 | 657 | \begin{align} 658 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} y^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 659 | \end{align} 660 | 661 | 662 | \begin{subappendices} 663 | 664 | 665 | \section{Backpropagation trough Batch Normalization} 666 | 667 | For Backpropagation, we will need 668 | 669 | \begin{align} 670 | \frac{\partial y^{(t')(\nu\tau)}_{f'}}{\partial h_{f}^{(t)(\nu\tau)}}&= 671 | % 672 | \gamma^{(\nu\tau)}_f\frac{\partial \tilde{h}_{f'}^{(t)(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}\;. 673 | \end{align} 674 | 675 | Since 676 | 677 | \begin{align} 678 | \frac{\partial h^{(t')(\nu\tau)}_{f'}}{\partial h_{f}^{(t)(\nu\tau)}}&=\delta^{t'}_t\delta^{f'}_f\;,& 679 | % 680 | \frac{\partial \hat{h}_{f'}^{(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&=\frac{\delta^{f'}_f}{T_{{\rm mb}}}\:; 681 | \end{align} 682 | 683 | and 684 | 685 | \begin{align} 686 | \frac{\partial \left(\hat{\sigma}_{f'}^{(\nu\tau)}\right)^2}{\partial h_{f}^{(t)(\nu\tau)}}&= 687 | % 688 | \frac{2\delta^{f'}_f}{T_{{\rm mb}}}\left(h_{f}^{(t)(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)\;, 689 | \end{align} 690 | 691 | we get 692 | 693 | \begin{align} 694 | \frac{\partial \tilde{h}_{f'}^{(t')(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&= 695 | % 696 | \frac{\delta^{f'}_f}{T_{{\rm mb}}}\left[\frac{T_{{\rm mb}}\delta^{t'}_t-1} 697 | % 698 | {\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12}- 699 | % 700 | \frac{\left(h_{f}^{(t')(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)\left(h_{f}^{(t)(\nu\tau)}-\hat{h}_{f}^{(\nu\tau)}\right)} 701 | % 702 | {\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac32}\right]\notag\\ 703 | % 704 | &=\frac{\delta^{f'}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12} 705 | % 706 | \left[\delta^{t'}_t- 707 | % 708 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 709 | \end{align} 710 | 711 | To ease the notation we will denote 712 | 713 | \begin{align} 714 | \tilde{\gamma}^{(\nu\tau)}_f&= 715 | % 716 | \frac{\gamma^{(\nu\tau)}_f}{\left(\left(\hat{\sigma}_{f}^{(\nu\tau)}\right)^2+\epsilon\right)^\frac12}\;. 717 | \end{align} 718 | 719 | so that 720 | 721 | \begin{align} 722 | \frac{\partial y_{f'}^{(t')(\nu\tau)}}{\partial h_{f}^{(t)(\nu\tau)}}&= 723 | % 724 | \tilde{\gamma}^{(\nu\tau)}_f \delta^{f'}_f\left[\delta^{t'}_t- 725 | % 726 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 727 | \end{align} 728 | 729 | This modifies the error rate backpropagation, as well as the formula for the weight update ($y$'s instead of $h$'s). In the following we will use the formula 730 | 731 | \begin{align} 732 | J^{(tt')(\nu\tau)}_{f}&= 733 | % 734 | \tilde{\gamma}^{(\nu\tau)}_f \left[\delta^{t'}_t- 735 | % 736 | \frac{1+\tilde{h}_{f}^{(t')(\nu\tau)}\tilde{h}_{f}^{(t)(\nu\tau)}}{T_{{\rm mb}}}\right]\;. 737 | \end{align} 738 | 739 | 740 | \section{RNN Backpropagation} 741 | 742 | \subsection{RNN Error rate updates: details} \label{sec:rnnappenderrorrate} 743 | 744 | Recalling the error rate definition 745 | 746 | \begin{align} 747 | \delta^{(t)( \nu\tau)}_f&=\frac{\delta}{\delta h^{(t)( \nu+1\tau)}_f }J(\Theta)\;, 748 | \end{align} 749 | 750 | we would like to compute it for all existing values of $\nu$ and $\tau$. 
As computed in chapter \ref{sec:chapterFNN}, one has for the maximum $\nu$ value 751 | 752 | \begin{align} 753 | \delta^{(t)(N-1\tau)}_f&= \frac{1}{T_{{\rm mb}}}\left(h_{f}^{(t)(N\tau)}-y_f^{(t)(\tau)}\right)\;. 754 | \end{align} 755 | 756 | Now since (taking Batch Normalization into account) 757 | 758 | \begin{align} 759 | h^{(t)(N\tau)}_{f}&=o\left(\sum_{f'=0}^{F_{N-1}-1}\Theta^f_{f'} y^{(t)(N-1\tau)}_{f}\right)\;, 760 | \end{align} 761 | 762 | and 763 | 764 | \begin{align} 765 | h^{(t)(\nu\tau)}_f&=\tanh\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{\nu(\nu)f}_{f'} 766 | % 767 | y^{(t)(\nu-1\tau)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\Theta^{\tau(\nu)f}_{f'} 768 | % 769 | y^{(t)(\nu\tau-1)}_{f'}\right)\;, 770 | \end{align} 771 | 772 | we get for 773 | 774 | \begin{align} 775 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{N}-1} 776 | % 777 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 778 | % 779 | &+\left.\sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 780 | \end{align} 781 | 782 | Let us work out explicitly once (for a regression cost function and a trivial identity output function) 783 | 784 | \begin{align} 785 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&=\sum_{f''=0}^{F_{N-1}-1}\Theta^{f'}_{f''}\, 786 | % 787 | \frac{\delta y^{(t')(N-1\tau)}_{f''}}{\delta h^{(t)(N-1\tau)}_f } \notag\\ 788 | % 789 | &=\Theta^{f'}_{f}\,J_f^{(tt')(N-1\tau)}\;. 790 | \end{align} 791 | 792 | as well as 793 | 794 | \begin{align} 795 | \frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&= 796 | % 797 | \left[1-\left(h^{(t')(N-1\tau+1)}_{f'}\right)^2\right]\sum_{f''=0}^{F_{{N-1}}-1}\Theta^{\tau(N-1)f'}_{f''} 798 | % 799 | \frac{\delta y^{(t')(N-1\tau)}_{f''}}{\delta h^{(t)(N-1\tau)}_f }\notag\\ 800 | % 801 | &=\mathcal{T}^{(t')(N-1\tau+1)}_{f'}\Theta^{\tau(N-1)f'}_{f}\,J_f^{(tt')(N-1\tau)}\;. 802 | \end{align} 803 | 804 | Thus 805 | 806 | \begin{align} 807 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}J_f^{(tt')(N-1\tau)}\left[\sum_{f'=0}^{F_{N}-1} 808 | % 809 | \Theta^{f'}_{f}\,\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 810 | % 811 | &\left.+\sum_{f'=0}^{F_{N-1}-1}\mathcal{T}^{(t')(N-1\tau+1)}_{f'}\Theta^{\tau(N-1)f'}_{f}\,\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 812 | \end{align} 813 | 814 | Here we adopted the convention that the $\delta^{(t')(N-2\tau+1)}$'s are $0$ if $\tau=T$. In a similar way, we derive for $\nu\leq N-1$ 815 | 816 | \begin{align} 817 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1} 818 | % 819 | \mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f}\,\delta^{(t')(\nu\tau)}_{f'}\right.\notag\\ 820 | % 821 | &\left.+\sum_{f'=0}^{F_{\nu}-1}\mathcal{T}^{(t')(\nu\tau+1)}_{f'}\Theta^{\tau(\nu)f'}_{f}\,\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;. 822 | \end{align} 823 | 824 | Defining 825 | 826 | \begin{align} 827 | \mathcal{T}^{(t')(N\tau)}_{f'}&=1\;,& 828 | % 829 | \Theta^{\nu(N)f'}_{f}&=\Theta^{f'}_{f}\;, 830 | \end{align} 831 | 832 | the previous $ \delta^{(t)(\nu-1\tau)}_f$ formula extends to the case $\nu =N-1$. 
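Error rate formulas of this kind are easy to get wrong; a finite difference check is a cheap way to validate any hand-derived gradient before trusting it. A generic NumPy sketch (not tied to the precise RNN notation above) could read:

\begin{verbatim}
import numpy as np

def numerical_grad(J, theta, eps=1e-6):
    """Central finite differences of a scalar loss J with respect to theta."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        old = theta[idx]
        theta[idx] = old + eps
        J_plus = J(theta)
        theta[idx] = old - eps
        J_minus = J(theta)
        theta[idx] = old
        grad[idx] = (J_plus - J_minus) / (2 * eps)
    return grad

# toy check: J = 0.5 ||theta x||^2 has analytic gradient (theta x) x^T
rng = np.random.default_rng(0)
theta, x = rng.normal(size=(3, 4)), rng.normal(size=(4,))
J = lambda th: 0.5 * np.sum((th @ x) ** 2)
analytic = np.outer(theta @ x, x)
assert np.max(np.abs(analytic - numerical_grad(J, theta))) < 1e-6
\end{verbatim}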
To unite the RNN and the LSTM formulas, let us finally define (with $a$ either $\tau$ or $\nu$ 833 | 834 | \begin{align} 835 | \mathcal{H}^{(t')(\nu\tau)_a}_{ff'}&=\mathcal{T}^{(t')(\nu+1\tau)}_{f'}\Theta^{a(\nu+1)f'}_{f}\;, 836 | \end{align} 837 | 838 | thus (defining $b_0=\nu$ and $b_1=\tau$) 839 | 840 | \begin{align} 841 | \delta^{(t)(\nu-1\tau)}_f&= 842 | % 843 | \sum_{t'=0}^{T_{{\rm mb}}}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 844 | % 845 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 846 | \end{align} 847 | 848 | 849 | \subsection{RNN Weight and coefficient updates: details} \label{sec:rnncoefficient} 850 | 851 | We want here to derive 852 | 853 | \begin{align} 854 | \Delta^{\nu(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\nu(\nu)f}_{f'}} J(\Theta)& 855 | % 856 | \Delta^{\tau(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\tau(\nu)f}_{f'}} J(\Theta)\;. 857 | \end{align} 858 | 859 | We first expand 860 | 861 | \begin{align} 862 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 863 | % 864 | \frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial \Theta^{\nu(\nu)f}_{f'}} 865 | % 866 | \delta^{(t)(\nu-1\tau)}_{f''}\;,\notag\\ 867 | % 868 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 869 | % 870 | \frac{\partial h^{(t)(\nu\tau)}_{f''}}{\partial \Theta^{\tau(\nu)f}_{f'}} 871 | % 872 | \delta^{(t)(\nu-1\tau)}_{f''}\;, 873 | \end{align} 874 | 875 | so that 876 | 877 | \begin{align} 878 | \Delta^{\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 879 | % 880 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu-1\tau)}_{f'}\;,\\ 881 | % 882 | \Delta^{\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 883 | % 884 | \mathcal{T}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau)}_{f}h^{(t)(\nu\tau-1)}_{f'}\;. 885 | \end{align} 886 | 887 | We also have to compute 888 | 889 | \begin{align} 890 | \Delta^{f}_{f'}&=\frac{\partial}{\partial \Theta^{f}_{f'}} J(\Theta)\;. 891 | \end{align} 892 | 893 | We first expand 894 | 895 | \begin{align} 896 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_N-1}\sum_{t=0}^{T_{{\rm mb}}-1} 897 | % 898 | \frac{\partial h^{(t)(N\tau)}_{f''}}{\partial \Theta^{f}_{f'}} 899 | % 900 | \delta^{(t)(N-1\tau)}_{f''}\; 901 | \end{align} 902 | 903 | so that 904 | 905 | \begin{align} 906 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 907 | \end{align} 908 | 909 | Finally, we need 910 | 911 | \begin{align} 912 | \Delta^{\beta(\nu \tau)}_{f}&=\frac{\partial}{\partial\beta^{(\nu \tau)}_{f}} J(\Theta)& 913 | % 914 | \Delta^{\gamma(\nu \tau)}_{f}&=\frac{\partial}{\partial \gamma^{(\nu \tau)}_{f}} J(\Theta)\;. 
915 | \end{align} 916 | 917 | First 918 | 919 | \begin{align} 920 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 921 | % 922 | \frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial \beta^{(\nu \tau)}_{f}}\delta^{(t)(\nu\tau)}_{f'}+ 923 | % 924 | \sum_{f'=0}^{F_{\nu}-1} 925 | % 926 | \frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial \beta^{(\nu \tau)}_{f}}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;,\notag\\ 927 | % 928 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 929 | % 930 | \frac{\partial h^{(t)(\nu+1\tau)}_{f'}}{\partial \gamma^{(\nu \tau)}_{f}}\delta^{(t)(\nu\tau)}_{f'}+ 931 | % 932 | \sum_{f'=0}^{F_{\nu}-1} 933 | % 934 | \frac{\partial h^{(t)(\nu\tau+1)}_{f'}}{\partial \gamma^{(\nu \tau)}_{f}}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;. 935 | \end{align} 936 | 937 | So that 938 | 939 | \begin{align} 940 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 941 | % 942 | \mathcal{T}^{(t)(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f}\delta^{(t)(\nu\tau)}_{f'}\right.\notag\\ 943 | % 944 | &\left.+\sum_{f'=0}^{F_{\nu}-1} 945 | % 946 | \mathcal{T}^{(t)(\nu\tau+1)}_{f'}\Theta^{\tau(\nu)f'}_{f}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;,\\ 947 | % 948 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\left[\sum_{f'=0}^{F_{\nu+1}-1} 949 | % 950 | \mathcal{T}^{(t)(\nu+1\tau)}_{f'}\Theta^{\nu(\nu+1)f'}_{f} 951 | % 952 | \tilde{h}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu\tau)}_{f'}\right.\notag\\ 953 | % 954 | &\left.+\sum_{f'=0}^{F_{\nu}-1}\mathcal{T}^{(t)(\nu\tau+1)}_{f'} 955 | % 956 | \Theta^{\tau(\nu)f'}_{f}\tilde{h}^{(t)(\nu\tau)}_{f}\delta^{(t)(\nu-1\tau+1)}_{f'}\right]\;, 957 | \end{align} 958 | 959 | which we can rewrite as 960 | 961 | \begin{align} 962 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 963 | % 964 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 965 | % 966 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 967 | % 968 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 969 | \end{align} 970 | 971 | \section{LSTM Backpropagation} 972 | 973 | 974 | 975 | 976 | \subsection{LSTM Error rate updates: details} \label{sec:ARNNLSTMerror_rates} 977 | 978 | As for the RNN 979 | 980 | \begin{align} 981 | \delta^{(t)(N-1\tau)}_{f}& 982 | % 983 | =\frac{1}{T_{{\rm mb}}}\left(h^{(t)(N\tau)}_{f}-y^{(t)(\tau)}_{f}\right)\;. 
984 | \end{align} 985 | 986 | Before going any further, it will be useful to define 987 | 988 | \begin{align} 989 | \mathcal{O}^{(t)(\nu\tau)}_{f}&=h^{(t)(\nu\tau)}_{f} 990 | % 991 | \left(1-o^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 992 | % 993 | \mathcal{I}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 994 | % 995 | \right)g^{(t)(\nu\tau)}_{f} i^{(t)(\nu\tau)}_{f}\left(1-i^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 996 | % 997 | \mathcal{F}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 998 | % 999 | \right)c^{(t)(\nu\tau-1)}_{f} f^{(t)(\nu\tau)}_{f}\left(1-f^{(t)(\nu\tau)}_{f}\right)\;,\notag\\ 1000 | % 1001 | \mathcal{G}^{(t)(\nu\tau)}_{f}&=o^{(t)(\nu\tau)}_{f}\left(1-\tanh^2\left(c^{(t)(\nu\tau)}_{f}\right) 1002 | % 1003 | \right)i^{(t)(\nu\tau)}_{f}\left(1-\left(g^{(t)(\nu\tau)}_{f}\right)^2\right)\;, 1004 | \end{align} 1005 | 1006 | and 1007 | 1008 | \begin{align} 1009 | H^{(t)(\nu\tau)_a}_{ff'}&=\Theta^{o_a(\nu+1)f'}_{f}\mathcal{O}^{(t)(\nu+1\tau)}_{f'} 1010 | % 1011 | +\Theta^{f_a(\nu+1)f'}_{f}\mathcal{F}^{(t)(\nu+1\tau)}_{f'}\notag\\ 1012 | % 1013 | &+\Theta^{g_a(\nu+1)f'}_{f}\mathcal{G}^{(t)(\nu+1\tau)}_{f'} 1014 | % 1015 | +\Theta^{i_a(\nu+1)f'}_{f}\mathcal{I}^{(t)(\nu+1\tau)}_{f'}\;. 1016 | \end{align} 1017 | 1018 | As for RNN, we will start off by looking at 1019 | 1020 | \begin{align} 1021 | \delta^{(t)(N-2\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{N}-1} 1022 | % 1023 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-1\tau)}_{f'}\right.\notag\\ 1024 | % 1025 | &+\left.\sum_{f'=0}^{F_{N-1}-1}\frac{\delta h^{(t')(N-1\tau+1)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }\delta^{(t')(N-2\tau+1)}_{f'}\right]\;. 1026 | \end{align} 1027 | 1028 | We will be able to get our hands on the second term with the general formula, so let us first look at 1029 | 1030 | \begin{align} 1031 | \frac{\delta h^{(t')(N\tau)}_{f'}}{\delta h^{(t)(N-1\tau)}_f }&=\Theta^{f'}_{f}\,J_f^{(tt')(N-1\tau)}\;, 1032 | \end{align} 1033 | 1034 | which is is similar to the RNN case. Let us put aside the second term of $\delta^{(t)(N-2\tau)}_f$, and look at the general case 1035 | 1036 | \begin{align} 1037 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}}\left[\sum_{f'=0}^{F_{\nu+1}-1} 1038 | % 1039 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\delta^{(t')(\nu\tau)}_{f'} 1040 | % 1041 | +\sum_{f'=0}^{F_{\nu}-1}\frac{\delta h^{(t')(\nu\tau+1)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;, 1042 | \end{align} 1043 | which involves to study in details 1044 | \begin{align} 1045 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1046 | % 1047 | \frac{\delta o^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }\tanh c^{(t')(\nu+1\tau)}_{f'}\notag\\ 1048 | % 1049 | &+\frac{\delta c^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f } o^{(t')(\nu+1\tau)}_{f'} 1050 | % 1051 | \left[1-\tanh^2 c^{(t')(\nu+1\tau)}_{f'}\right]\;. 
1052 | \end{align} 1053 | Now 1054 | \begin{align} 1055 | \frac{\delta o^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=o^{(t')(\nu+1\tau)}_{f'} 1056 | % 1057 | \left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\sum_{f''=0}^{F_\nu-1}\Theta^{o_\nu(\nu+1)f'}_{f''} 1058 | % 1059 | \frac{\delta y^{(t')(\nu\tau)}_{f''}}{\delta h^{(t)(\nu\tau)}_f }\notag\\ 1060 | % 1061 | &=o^{(t')(\nu+1\tau)}_{f'} 1062 | % 1063 | \left[1-o^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{o_\nu(\nu+1)f'}_{f} 1064 | % 1065 | J^{(tt')(\nu\tau)}_f\;, 1066 | \end{align} 1067 | and 1068 | \begin{align} 1069 | \frac{\delta c^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1070 | % 1071 | \frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }g^{(t')(\nu+1\tau)}_{f'} 1072 | % 1073 | +\frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }i^{(t')(\nu+1\tau)}_{f'}\notag\\ 1074 | % 1075 | &+\frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }c^{(t')(\nu\tau)}_{f'}\;. 1076 | \end{align} 1077 | We continue our journey 1078 | \begin{align} 1079 | \frac{\delta i^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1080 | % 1081 | i^{(t')(\nu+1\tau)}_{f'}\left[1- i^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{i_\nu(\nu+1)f'}_{f} 1082 | % 1083 | J^{(tt')(\nu\tau)}_f\;,\notag\\ 1084 | % 1085 | \frac{\delta f^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1086 | % 1087 | f^{(t')(\nu+1\tau)}_{f'}\left[1- f^{(t')(\nu+1\tau)}_{f'}\right]\Theta^{f_\nu(\nu+1)f'}_{f} 1088 | % 1089 | J^{(tt')(\nu\tau)}_f\;,\notag\\ 1090 | % 1091 | \frac{\delta g^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&= 1092 | % 1093 | \left[1- \left(g^{(t')(\nu+1\tau)}_{f'}\right)^2\right]\Theta^{g_\nu(\nu+1)f'}_{f} 1094 | % 1095 | J^{(tt')(\nu\tau)}_f\;, 1096 | \end{align} 1097 | and our notation now comes in handy 1098 | \begin{align} 1099 | \frac{\delta h^{(t')(\nu+1\tau)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=J^{(tt')(\nu\tau)}_fH^{(t')(\nu\tau)_\nu}_{ff'}\;. 1100 | \end{align} 1101 | This formula also allows us to compute the second term for $\delta^{(t)(N-2\tau)}_f$. In a totally similar manner 1102 | \begin{align} 1103 | \frac{\delta h^{(t')(\nu\tau+1)}_{f'}}{\delta h^{(t)(\nu\tau)}_f }&=J^{(tt')(\nu\tau)}_fH^{(t')(\nu-1\tau+1)_\tau}_{ff'}\;. 1104 | \end{align} 1105 | Going back to our general formula 1106 | \begin{align} 1107 | \delta^{(t)(\nu-1\tau)}_f&= \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu\tau)}_f\left[\sum_{f'=0}^{F_{\nu+1}-1} 1108 | % 1109 | H^{(t')(\nu\tau)_\nu}_{ff'}\delta^{(t')(\nu\tau)}_{f'}\right.\notag\\ 1110 | % 1111 | &+\left.\sum_{f'=0}^{F_{\nu}-1}H^{(t')(\nu-1\tau+1)_\tau}_{ff'}\delta^{(t')(\nu-1\tau+1)}_{f'}\right]\;, 1112 | \end{align} 1113 | and as in the RNN case, we re-express it as (defining $b_0=\nu$ and $b_1=\tau$) 1114 | \begin{align} 1115 | \delta^{(t)(\nu-1\tau)}_f&= 1116 | % 1117 | \sum_{t'=0}^{T_{{\rm mb}}-1}J^{(tt')(\nu\tau)}_f\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1118 | % 1119 | \mathcal{H}^{(t')(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t')(\nu-\epsilon\tau+\epsilon)}_{f'}\;.
1120 | \end{align} 1121 | This formula is also valid for $\nu =N-1$ if, as in the RNN case, we define 1122 | \begin{align} 1123 | \mathcal{H}^{(t')(N\tau)}_{f'}&=1\;,& 1124 | % 1125 | \Theta^{\nu(N)f'}_{f}&=\Theta^{f'}_{f}\;. 1126 | \end{align} 1127 | 1128 | 1129 | 1130 | \subsection{LSTM Weight and coefficient updates: details} 1131 | 1132 | We want to compute 1133 | 1134 | \begin{align} 1135 | \Delta^{\rho_{_\nu}(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} J(\Theta)& 1136 | % 1137 | \Delta^{\rho_{_\tau}(\nu)f}_{f'}&=\frac{\partial}{\partial \Theta^{\rho_{_\tau}(\nu)f}_{f'}} J(\Theta)\;, 1138 | \end{align} 1139 | 1140 | with $\rho = (f,i,g,o)$. First we expand 1141 | 1142 | \begin{align} 1143 | \Delta^{\rho_{_\nu}(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1144 | % 1145 | \frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} 1146 | % 1147 | \frac{\partial}{\partial h^{(\nu\tau)(t)}_{f''}} J(\Theta)\notag\\ 1148 | % 1149 | &=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_\nu-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1150 | % 1151 | \frac{\partial h^{(\nu\tau)(t)}_{f''}}{\partial \Theta^{\rho_{_\nu}(\nu)f}_{f'}} 1152 | % 1153 | \delta^{(\nu\tau)(t)}_{f''}\;, 1154 | \end{align} 1155 | 1156 | so that (with $\rho^{(\nu\tau)}=\left(\mathcal{F},\mathcal{I},\mathcal{G},\mathcal{O}\right)$) if $\nu=1$ 1157 | 1158 | \begin{align} 1159 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1160 | % 1161 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}h^{(\nu-1\tau)(t)}_{f'}\;, 1162 | \end{align} 1163 | 1164 | and otherwise 1165 | 1166 | \begin{align} 1167 | \Delta^{\rho_\nu(\nu)f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1168 | % 1169 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu-1\tau)(t)}_{f'}\;,\\ 1170 | % 1171 | \Delta^{\rho_\tau(\nu)f}_{f'}&=\sum_{\tau=1}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1172 | % 1173 | \rho^{(\nu\tau)(t)}_{f}\delta^{(\nu\tau)(t)}_{f}y^{(\nu\tau-1)(t)}_{f'}\;. 1174 | \end{align} 1175 | 1176 | We will now need to compute 1177 | 1178 | \begin{align} 1179 | \Delta^{\beta(\nu\tau)}_{f}&=\frac{\partial}{\partial \beta^{(\nu\tau)}_f} J(\Theta)& 1180 | % 1181 | \Delta^{\gamma(\nu\tau)}_{f}&=\frac{\partial}{\partial \gamma^{(\nu\tau)}_f} J(\Theta)\;. 1182 | \end{align} 1183 | 1184 | For that we need to look at 1185 | 1186 | \begin{align} 1187 | \Delta^{\beta(\nu\tau)}_{f}&=\sum_{f'=0}^{F_{\nu+1}-1}\sum_{t'=0}^{T_{{\rm mb}}-1} 1188 | % 1189 | \frac{\partial h^{(\nu+1\tau)(t')}_{f'}}{\partial \beta^{(\nu\tau)}_f}\delta^{(\nu\tau)(t')}_{f'} 1190 | % 1191 | +\sum_{f'=0}^{F_{\nu}-1}\sum_{t'=0}^{T_{{\rm mb}}-1}\frac{\partial h^{(\nu\tau+1)(t')}_{f'}} 1192 | % 1193 | {\partial \beta^{(\nu\tau)}_f}\delta^{(\nu-1\tau+1)(t')}_{f'}\notag\\ 1194 | % 1195 | &=\sum_{t=0}^{T_{{\rm mb}}-1}\left\{\sum_{f'=0}^{F_{\nu+1}-1} 1196 | % 1197 | H^{(t)(\nu\tau)}_{ff'}\delta^{(t)(\nu\tau)}_{f'} 1198 | % 1199 | +\sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)}_{ff'}\delta^{(t)(\nu-1\tau+1)}_{f'}\right\}\;,
1200 | \end{align} 1201 | 1202 | and 1203 | 1204 | \begin{align} 1205 | \Delta^{\gamma(\nu\tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\left\{\sum_{f'=0}^{F_{\nu+1}-1} 1206 | % 1207 | H^{(t)(\nu\tau)}_{ff'}\delta^{(t)(\nu\tau)}_{f'} 1208 | % 1209 | +\sum_{f'=0}^{F_{\nu}-1}H^{(t)(\nu-1\tau+1)}_{ff'}\delta^{(t)(\nu-1\tau+1)}_{f'}\right\}\;, 1210 | \end{align} 1211 | 1212 | which we can rewrite as 1213 | 1214 | \begin{align} 1215 | \Delta^{\beta(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1216 | % 1217 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;,\\ 1218 | % 1219 | \Delta^{\gamma(\nu \tau)}_{f}&=\sum_{t=0}^{T_{{\rm mb}}-1}\tilde{h}^{(t)(\nu\tau)}_{f}\sum_{\epsilon=0}^{1}\sum_{f'=0}^{F_{\nu+1-\epsilon}-1} 1220 | % 1221 | \mathcal{H}^{(t)(\nu-\epsilon\tau+\epsilon)_{b_\epsilon}}_{ff'}\delta^{(t)(\nu-\epsilon\tau+\epsilon)}_{f'}\;. 1222 | \end{align} 1223 | 1224 | Finally, as in the RNN case 1225 | 1226 | \begin{align} 1227 | \Delta^{f}_{f'}&=\frac{\partial}{\partial \Theta^{f}_{f'}} J(\Theta)\;. 1228 | \end{align} 1229 | 1230 | We first expand 1231 | 1232 | \begin{align} 1233 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{f''=0}^{F_N-1}\sum_{t=0}^{T_{{\rm mb}}-1} 1234 | % 1235 | \frac{\partial h^{(t)(N\tau)}_{f''}}{\partial \Theta^{f}_{f'}} 1236 | % 1237 | \delta^{(t)(N-1\tau)}_{f''}\;, 1238 | \end{align} 1239 | 1240 | so that 1241 | 1242 | \begin{align} 1243 | \Delta^{f}_{f'}&=\sum_{\tau=0}^{T-1}\sum_{t=0}^{T_{{\rm mb}}-1} h^{(t)(N-1\tau)}_{f'}\delta^{(t)(N-1\tau)}_{f}\;. 1244 | \end{align} 1245 | 1246 | \newpage 1247 | 1248 | \section{Peephole connections} 1249 | 1250 | Some LSTM variants probe the cell state to update the gates themselves. This is illustrated in figure \ref{fig:peepholeLSTM}. 1251 | 1252 | \begin{figure}[H] 1253 | \begin{center} 1254 | \begin{tikzpicture} 1255 | \node[] at (0,0) {\includegraphics[scale=1.4]{LSTM_structure-peephole}}; 1256 | \end{tikzpicture} 1257 | \caption{\label{fig:peepholeLSTM}LSTM hidden unit with peephole} 1258 | \end{center} 1259 | \end{figure} 1260 | 1261 | Peepholes modify the gate updates in the following way 1262 | 1263 | \begin{align} 1264 | i^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{i_{_\nu}(\nu)f}_{f'} 1265 | % 1266 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F{_{\nu}}-1}\left[\Theta^{i_{_\tau}(\nu)f}_{f'} 1267 | % 1268 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_i}(\nu)f}_{f'}c^{(\nu\tau-1)(t)}_{f'}\right]\right)\;,\\ 1269 | % 1270 | f^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F{_{\nu-1}}-1}\Theta^{f_{_\nu}(\nu)f}_{f'} 1271 | % 1272 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\left[\Theta^{f_{_\tau}(\nu)f}_{f'} 1273 | % 1274 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_f}(\nu)f}_{f'}c^{(\nu\tau-1)(t)}_{f'}\right]\right)\;,\\ 1275 | % 1276 | o^{(\nu\tau)(t)}_f&=\sigma\left(\sum_{f'=0}^{F_{{\nu-1}}-1}\Theta^{o_{_\nu}(\nu)f}_{f'} 1277 | % 1278 | h^{(\nu-1\tau)(t)}_{f'}+\sum_{f'=0}^{F_{{\nu}}-1}\left[\Theta^{o_{_\tau}(\nu)f}_{f'} 1279 | % 1280 | h^{(\nu\tau-1)(t)}_{f'}+\Theta^{c_{_o}(\nu)f}_{f'}c^{(\nu\tau)(t)}_{f'}\right]\right)\;, 1281 | \end{align} 1282 | which also modifies the LSTM backpropagation algorithm in a non-trivial way. As it has been shown that different LSTM formulations lead to very similar results, we leave the derivation of the backpropagation update rules to the reader as an exercise.
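To make the peephole forward pass concrete, here is a minimal NumPy sketch of the gate updates above, for a single layer $\nu$, time step $\tau$ and mini-batch sample. It is only an illustration under assumptions of our own: the weight names (\texttt{i\_nu}, \texttt{c\_i}, \dots) mirror the corresponding $\Theta$'s of the text, the peephole matrices are taken dense exactly as the equations are written (many implementations restrict them to diagonal matrices), and the batch-normalisation factors used elsewhere in this appendix are omitted.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_step(h_below, h_prev, c_prev, W):
    # h_below : h^{(nu-1 tau)}, shape (F_prev,)
    # h_prev  : h^{(nu tau-1)}, shape (F,)
    # c_prev  : c^{(nu tau-1)}, shape (F,)
    # W       : dict of weight matrices, shapes (F, F_prev) or (F, F)
    # i and f peep at the previous cell state c^{(nu tau-1)}
    i = sigmoid(W['i_nu'] @ h_below + W['i_tau'] @ h_prev + W['c_i'] @ c_prev)
    f = sigmoid(W['f_nu'] @ h_below + W['f_tau'] @ h_prev + W['c_f'] @ c_prev)
    g = np.tanh(W['g_nu'] @ h_below + W['g_tau'] @ h_prev)
    c = f * c_prev + i * g                      # new cell state c^{(nu tau)}
    # o peeps at the freshly computed cell state c^{(nu tau)}
    o = sigmoid(W['o_nu'] @ h_below + W['o_tau'] @ h_prev + W['c_o'] @ c)
    h = o * np.tanh(c)                          # new hidden state h^{(nu tau)}
    return h, c
\end{verbatim}

Note that only the forward pass changes structurally: in the backpropagation, the derivatives of $i$, $f$ and $o$ with respect to the cell state acquire extra $\Theta^{c}$ terms, which is precisely the exercise left above.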
1283 | 1284 | \end{subappendices} 1285 | -------------------------------------------------------------------------------- /conv_2d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/conv_2d-crop.pdf -------------------------------------------------------------------------------- /conv_4d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/conv_4d-crop.pdf -------------------------------------------------------------------------------- /cover_page-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/cover_page-crop.pdf -------------------------------------------------------------------------------- /fc_equiv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_equiv.pdf -------------------------------------------------------------------------------- /fc_resnet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet.pdf -------------------------------------------------------------------------------- /fc_resnet_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet_2.pdf -------------------------------------------------------------------------------- /fc_resnet_3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fc_resnet_3.pdf -------------------------------------------------------------------------------- /formatAndDefs.tex: -------------------------------------------------------------------------------- 1 | \usepackage[T1]{fontenc} 2 | \usepackage{bm} 3 | \usepackage{bbm} 4 | \usepackage[utf8]{inputenc} 5 | \usepackage{latexsym} 6 | \usepackage[english]{babel} 7 | \usepackage{indentfirst} 8 | %\usepackage{fullpage} 9 | \usepackage{graphicx} 10 | \usepackage{lmodern} 11 | %\usepackage{epsfig} 12 | %\usepackage[math]{anttor} 13 | \usepackage[sc]{mathpazo} 14 | %\usepackage{fouriernc} 15 | %\usepackage[garamond]{mathdesign} 16 | \usepackage{geometry} 17 | \input Kramer.fd 18 | \usepackage{yfonts,color} 19 | \usepackage{minitoc} 20 | \usepackage{titlesec} 21 | \usepackage{subfigure} 22 | \usepackage{amsmath} 23 | \usepackage{amssymb} 24 | \usepackage{stmaryrd} 25 | \usepackage{url} 26 | \usepackage{pgfplots} 27 | \pgfplotsset{compat=1.5} 28 | \usepackage[colorlinks,linkcolor=red!80!black, 29 | citecolor=red!80!black,pdfpagelabels,hyperindex=true]{hyperref} 30 | \hypersetup{ 31 | % bookmarks=true, % show bookmarks bar? 32 | % unicode=false, % non-Latin characters in Acrobat’s bookmarks 33 | % pdftoolbar=true, % show Acrobat’s toolbar? 34 | % pdfmenubar=true, % show Acrobat’s menu? 
35 | % pdffitwindow=false, % window fit to page when opened 36 | % pdfstartview={FitH}, % fits the width of the page to the window 37 | % pdftitle={My title}, % title 38 | % pdfauthor={Author}, % author 39 | % pdfsubject={Subject}, % subject of the document 40 | % pdfcreator={Creator}, % creator of the document 41 | % pdfproducer={Producer}, % producer of the document 42 | % pdfkeywords={keyword1} {key2} {key3}, % list of keywords 43 | % pdfnewwindow=true, % links in new window 44 | colorlinks=true, % false: boxed links; true: colored links 45 | linkcolor=brown!80!black, % color of internal links (change box color with linkbordercolor) 46 | citecolor=green!50!black, % color of links to bibliography 47 | filecolor=magenta, % color of file links 48 | urlcolor=red!80!black % color of external links 49 | } 50 | \usepackage{csquotes} 51 | \usepackage[sorting=none,backend=bibtex,style=numeric-comp]{biblatex} 52 | \usepackage{cancel} 53 | %\usepackage{natbib} 54 | \usepackage{pifont} 55 | %\bibliographystyle{plain} 56 | %\bibliographystyle{unsrt} 57 | \usepackage{tikz} 58 | \usetikzlibrary{matrix,arrows,decorations,backgrounds,shapes,calc,fit} 59 | \usetikzlibrary{fadings} 60 | \usetikzlibrary{decorations.pathmorphing} 61 | \usetikzlibrary{decorations.markings} 62 | \usepackage{xcolor} 63 | \usepackage{amsfonts} 64 | \usepackage{slashed} 65 | \usepackage{fancybox} 66 | \usepackage{fancyhdr} 67 | \newenvironment {abstract}% 68 | {\cleardoublepage \null \vfill \begin{center}% 69 | \bfseries \abstractname\end{center}}% 70 | {\vfill \null} 71 | \newcommand{\intkaha}{\ensuremath{\int\frac{\d^3\ka}{4|\ka||p-k|(2\pi)^3}}} 72 | \newcommand{\slv}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$v$}} 73 | \newcommand{\slF}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$F$}} 74 | \newcommand{\slL}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$L$}} 75 | \newcommand{\slP}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$P$}} 76 | \newcommand{\slp}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$p$}} 77 | \newcommand{\slq}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$q$}} 78 | \newcommand{\slR}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$R$}} 79 | \newcommand{\slQ}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$Q$}} 80 | \newcommand{\slK}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$K$}} 81 | \newcommand{\slk}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$k$}} 82 | \newcommand{\slD}{\raise.15ex\hbox{$/$}\kern-.73em\hbox{$D$}} 83 | \newcommand{\slC}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$C$}} 84 | \newcommand{\slA}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$A$}} 85 | \newcommand{\slSigma}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$\Sigma$}} 86 | \newcommand{\slpartial}{\raise.15ex\hbox{$/$}\kern-.53em\hbox{$\partial$}} 87 | \newcommand{\slcalP}{\raise.15ex\hbox{$/$}\kern-.63em\hbox{$\cal P$}} 88 | \definecolor{purp}{RGB}{0,0,0} 89 | \def\p{{\boldsymbol p}} 90 | \def\P{{\boldsymbol P}} 91 | \def\q{{\boldsymbol q}} 92 | \def\Q{{\boldsymbol Q}} 93 | \def\l{{\boldsymbol l}} 94 | \def\k{{\boldsymbol k}} 95 | \def\kp{{\boldsymbol k}_{{\tiny \perp}}} 96 | \def\m{{\boldsymbol m}} 97 | \def\x{{\boldsymbol x}} 98 | \def\xp{{\boldsymbol x}_{{\tiny \perp}}} 99 | \def\yp{{\boldsymbol y}_{{\tiny \perp}}} 100 | \def\pp{{\boldsymbol p}_{{\tiny \perp}}} 101 | \def\y{{\boldsymbol y}} 102 | \def\X{{\boldsymbol X}} 103 | \def\Y{{\boldsymbol Y}} 104 | \def\D{{\boldsymbol D}} 105 | \def\r{{\boldsymbol r}} 106 | \def\z{{\boldsymbol z}} 107 | \def\v{{\boldsymbol v}} 108 | \def\w{{\boldsymbol w}} 109 | \def\b{{\boldsymbol b}} 110 | \def\u{{\boldsymbol u}} 111 | 
\newcommand{\intkk}{\ensuremath{\int\frac{\d^3\ka}{2|\ka|(2\pi)^3}}} 112 | \renewcommand{\d}{\ensuremath{\mathrm{d}}} 113 | \newcommand{\old}{\ensuremath{\text{old}}} 114 | \newcommand{\niou}{\ensuremath{\text{new}}} 115 | \newcommand{\nab}{\ensuremath{\text{\boldmath$\nabla$}}} 116 | \newcommand{\ix}{\ensuremath{\text{\boldmath $x$}}} 117 | \newcommand{\igrec}{\boldsymbol{y}} 118 | \newcommand{\ixl}{\ensuremath{\text{\scriptsize\boldmath $x$}}} 119 | \newcommand{\ka}{\ensuremath{\text{\boldmath $k$}}} 120 | \newcommand{\parti}{\ensuremath{\text{\boldmath $\partial$}}} 121 | \newcommand{\uu}{\ensuremath{\text{\boldmath $u$}}} 122 | \newcommand{\vv}{\ensuremath{\text{\boldmath $v$}}} 123 | \newcommand{\pel}{\ensuremath{\mbox{\scriptsize\boldmath $p$}}} 124 | \newcommand{\uul}{\ensuremath{\text{\scriptsize\boldmath $u$}}} 125 | \newcommand{\vvl}{\ensuremath{\text{\scriptsize\boldmath $v$}}} 126 | \newcommand{\kal}{\ensuremath{\text{\scriptsize\boldmath $k$}}} 127 | \newcommand{\el}{\ensuremath{\text{\scriptsize\boldmath $l$}}} 128 | \newcommand{\cigma}{\ensuremath{\text{\boldmath$\Sigma$}}} 129 | \newcommand{\Sig}{\ensuremath{\mbox{\boldmath $\Sigma$}}} 130 | \newcommand{\Sigl}{\ensuremath{\mbox{\scriptsize\boldmath $\Sigma$}}} 131 | \newcommand{\pe}{\ensuremath{\mbox{\boldmath $p$}}} 132 | \newcommand{\ine}{\ensuremath{\mathrm{in}}} 133 | \newcommand{\oute}{\ensuremath{\mathrm{out}}} 134 | \newcommand{\intk}{\ensuremath{\int\frac{\d^3\ka}{2|\k|(2\pi)^3}}} 135 | \newcommand{\intmod}[1]{\ensuremath{\int\frac{\d^3{\bm #1}}{2|{\bm #1}|(2\pi)^3}}} 136 | \newcommand{\intmodd}[2]{\ensuremath{\iint\frac{\d^3{\bm #1}\,\d^3{\bm #2}}{4|{\bm #1}||{\bm #2}|(2\pi)^6}}} 137 | \newcommand{\ma}[1]{{\mathcal{#1}}} 138 | \newcommand{\SK}{ \textsc{Schwinger-Keldysh} } 139 | \newcommand{\Fnman}{ \textsc{Feynman} } 140 | \newcommand{\pointe}[2]{ 141 | \node[place5] at (#1,#2) {}; 142 | \node[place4,xshift=rand*0.3mm,yshift=rand*0.3mm] at (#1,#2) {}; } 143 | \newcommand{\pointee}[2]{ 144 | \node[place,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 145 | \node[place2,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 146 | \node[place3,rotate around={45:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {};} 147 | \newcommand{\pointeee}[2]{ 148 | \node[place2,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 149 | \node[place3,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; 150 | \node[place,rotate around={135:(0,0)},xshift=rand*0.4mm,yshift=rand*2.75mm] at (#1,#2) {}; } 151 | \newcommand{\point}[2]{ 152 | \node[place,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 153 | \node[place2,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 154 | \node[place3,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {};} 155 | \newcommand{\pointt}[2]{ 156 | \node[place2,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 157 | \node[place3,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {}; 158 | \node[place,xshift=rand*1.5mm,yshift=rand*12.8mm] at (#1,#2) {};} 159 | \tikzstyle{blueballa} = [circle,shading=ball, ball color=blue!30,inner sep =1.2mm] 160 | \tikzstyle{redballa} = [circle,shading=ball, ball color=red,inner sep =1.2mm] 161 | \tikzstyle{greenballa} = [circle,shading=ball, ball color=green!70!black,inner sep =1.2mm] 162 | \tikzstyle{blueball} = [circle,shading=ball, ball color=blue!30,inner sep =0.3mm] 163 | \tikzstyle{redball} = [circle,shading=ball, ball color=red,inner sep =0.3mm] 164 | 
\tikzstyle{greenball} = [circle,shading=ball, ball color=green!70!black,inner sep =0.3mm] 165 | \tikzstyle{gluon} = [thick, style={decorate,decoration={coil,amplitude=4pt, segment length=4pt}}] 166 | \newcommand{\gluebr}[4]{ 167 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 168 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 169 | \draw[photon] (toto) --(tata);} 170 | \newcommand{\gluerb}[4]{ 171 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 172 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 173 | \draw[photon] (toto) --(tata);} 174 | \newcommand{\gluebg}[4]{ 175 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 176 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 177 | \draw[photon] (toto) --(tata);} 178 | \newcommand{\gluegb}[4]{ 179 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 180 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 181 | \draw[photon] (toto) --(tata);} 182 | \newcommand{\gluegr}[4]{ 183 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 184 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 185 | \draw[photon] (toto) --(tata);} 186 | \newcommand{\gluerg}[4]{ 187 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.2cm] (toto)at (#1,#2) {}; 188 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.2cm] (tata) at (#3,#4) {}; 189 | \draw[photon] (toto) --(tata);} 190 | \newcommand{\gluebra}[4]{ 191 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 192 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 193 | \draw[photon] (toto) --(tata);} 194 | \newcommand{\gluerba}[4]{ 195 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 196 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 197 | \draw[photon] (toto) --(tata);} 198 | \newcommand{\gluebga}[4]{ 199 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 200 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 201 | \draw[photon] (toto) --(tata);} 202 | \newcommand{\gluegba}[4]{ 203 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 204 | \node[blueball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 205 | \draw[photon] (toto) --(tata);} 206 | \newcommand{\gluegra}[4]{ 207 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 208 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 209 | \draw[photon] (toto) --(tata);} 210 | \newcommand{\gluerga}[4]{ 211 | \node[redball,xshift=rand*0.2cm,yshift=rand*1.8cm] (toto)at (#1,#2) {}; 212 | \node[greenball,xshift=rand*0.2cm,yshift=rand*1.8cm] (tata) at (#3,#4) {}; 213 | \draw[photon] (toto) --(tata);} 214 | \newcommand{\freezew}[2]{ 215 | \node[wball,xshift=rand*1.7cm,yshift=rand*1.4cm] at (#1,#2) {};} 216 | \newcommand{\freezeb}[2]{ 217 | \node[bball,xshift=rand*1.7cm,yshift=rand*1.4cm] at (#1,#2) {};} 218 | \tikzstyle{photon} = [very thin, style={decorate, decoration={snake,amplitude=0.4pt, segment length=2pt}}] 219 | \tikzfading [name=radialfade, inner color=transparent!40, outer color=transparent!100] 220 | \newcommand{\col}[1]{{\color{black} #1}} 221 | \newcommand{\intp}{\int \frac{\d^2 \p }{(2\pi)^2}} 222 | \newlength{\longueurAdHoc} 223 | \settodepth{\longueurAdHoc}{$\displaystyle\int\limits_{y^->0}$} 
224 | \newcommand{\esss}{\begin{tikzpicture}[] 225 | \draw[red!60!white,thick] (-0.06,-0.06) circle (0.12); 226 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(45:0.12); 227 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(135:0.12); 228 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(315:0.12); 229 | \draw[red!60!black,thick] (-0.06,-0.06) -- ++(225:0.12); 230 | \end{tikzpicture}} 231 | \newcommand{\matprodu}{\begin{tikzpicture}[] 232 | \draw[red!60!white,thick] (0,0) circle (0.12); 233 | \draw[red!60!black,thick] (0,0) -- ++(45:0.12); 234 | \draw[red!60!black,thick] (0,0) -- ++(135:0.12); 235 | \draw[red!60!black,thick] (0,0) -- ++(315:0.12); 236 | \draw[red!60!black,thick] (0,0) -- ++(225:0.12); 237 | \end{tikzpicture}} 238 | \newcommand{\matprodp}{\begin{tikzpicture}[] 239 | \draw[green!60!white,thick] (0,0) circle (0.12); 240 | \draw[green!60!black,thick] (0,0) -- ++(45:0.12); 241 | \draw[green!60!black,thick] (0,0) -- ++(135:0.12); 242 | \draw[green!60!black,thick] (0,0) -- ++(315:0.12); 243 | \draw[green!60!black,thick] (0,0) -- ++(225:0.12); 244 | \end{tikzpicture}} 245 | \newcommand{\matprode}{\begin{tikzpicture}[] 246 | \draw[blue!60!white,thick] (0,0) circle (0.12); 247 | \draw[blue!60!black,thick] (0,0) -- ++(45:0.12); 248 | \draw[blue!60!black,thick] (0,0) -- ++(135:0.12); 249 | \draw[blue!60!black,thick] (0,0) -- ++(315:0.12); 250 | \draw[blue!60!black,thick] (0,0) -- ++(225:0.12); 251 | \end{tikzpicture}} 252 | \newcommand{\circa}[1]{\begin{tikzpicture}[baseline=-0.65ex] 253 | \draw[thick] (0,0) node[black] {$#1$} circle (0.17); 254 | \end{tikzpicture}} 255 | \newcommand{\circb}[2]{\begin{tikzpicture}[baseline=(current bounding box.center)] 256 | \draw[thick] (0,0) node[black] {$#1$} circle (0.17); 257 | \node[anchor=north west] at (0.085,0) {\tiny $#2$}; 258 | \end{tikzpicture}} 259 | \newcommand{\ells}{\begin{tikzpicture}[] 260 | \draw[red!60!white,thick] (0,0) circle (0.12); 261 | \draw[red!60!black,thick] (0,0) -- ++(45:0.12); 262 | \draw[red!60!black,thick] (0,0) -- ++(135:0.12); 263 | \draw[red!60!black,thick] (0,0) -- ++(315:0.12); 264 | \draw[red!60!black,thick] (0,0) -- ++(225:0.12); 265 | \end{tikzpicture}} 266 | \newcommand{\etts}{\begin{tikzpicture}[] 267 | \draw[green!60!white,thick] (0,0) circle (0.12); 268 | \draw[green!60!black,thick] (0,0) -- ++(45:0.12); 269 | \draw[green!60!black,thick] (0,0) -- ++(135:0.12); 270 | \draw[green!60!black,thick] (0,0) -- ++(315:0.12); 271 | \draw[green!60!black,thick] (0,0) -- ++(225:0.12); 272 | \end{tikzpicture}} 273 | \newcommand{\umumnud}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 274 | \draw[] (0,0) -- node {\midarrow} (0.5,0) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0,0.5)node[anchor=south east] {$x$} --node {\midarrow} (0,0); 275 | \node[place] at (0,0.5) {}; 276 | \node[anchor=north] at (0.25,0) {\tiny $\hat{#1}$}; 277 | \node[anchor=east] at (0,0.25) {\tiny $\hat{#2}$}; 278 | \end{tikzpicture}} 279 | \newcommand{\umumnu}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 280 | \draw[] (0,0) -- node {\midarrow} (0,0.5) node[anchor=south east] {$x$} --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0.5,0)--node {\midarrow} (0,0); 281 | \node[place] at (0,0.5) {}; 282 | \node[anchor=south] at (0.25,0.5) {\tiny $\hat{#1}$}; 283 | \node[anchor=west] at (0.5,0.25) {\tiny $\hat{#2}$}; 284 | \end{tikzpicture}} 285 | 
\newcommand{\umunud}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 286 | \draw[] (0,0) node[anchor=north east] {$x$} -- node {\midarrow} (0,0.5) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0.5,0)--node {\midarrow} (0,0); 287 | \node[place] at (0,0) {}; 288 | \node[anchor=south] at (0.25,0.5) {\tiny $\hat{#1}$}; 289 | \node[anchor=east] at (0,0.25) {\tiny $\hat{#2}$}; 290 | \end{tikzpicture}} 291 | \newcommand{\umunu}[2]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 292 | \draw[] (0,0) node[anchor=north east] {$x$} -- node {\midarrow} (0.5,0) --node {\midarrow} (0.5,0.5) -- node {\midarrow} (0,0.5)--node {\midarrow} (0,0); 293 | \node[place] at (0,0) {}; 294 | \node[anchor=north] at (0.25,0) {\tiny $\hat{#1}$}; 295 | \node[anchor=west] at (0.5,0.25) {\tiny $\hat{#2}$}; 296 | \end{tikzpicture}} 297 | \newcommand{\Emu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=green!70!black},] 298 | \node[place3] at (0,0) {}; 299 | \node[anchor=south] at (0,0) {$x$}; 300 | \node[anchor=north] at (0,0) {\tiny $#1$}; 301 | \end{tikzpicture}} 302 | \newcommand{\emu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=blue!70!white},] 303 | \node[place3] at (0,0) {}; 304 | \node[anchor=south] at (0,0) {$x$}; 305 | \node[anchor=north] at (0,0) {\tiny $#1$}; 306 | \end{tikzpicture}} 307 | \newcommand{\emup}[1]{\begin{tikzpicture}[baseline=-0.5ex] 308 | \shade [ball color=blue!70!white] (0,0) circle [radius=0.075]; 309 | \node[anchor=south] at (0,0) {$x$}; 310 | \node[anchor=north] at (0,0) {\tiny $#1$}; 311 | \end{tikzpicture}} 312 | \newcommand{\Umu}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 313 | \draw[] (0,0) node[anchor=south] {$x$} -- node {\midarrow} (0.5,0); 314 | \node[place] at (0,0) {}; 315 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 316 | \end{tikzpicture}} 317 | \newcommand{\umu}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=red!90!black},] 318 | \node[place3] at (0,0) {}; 319 | \node[anchor=south] at (0,0) {$x$}; 320 | \node[anchor=north] at (0,0) {\tiny $#1$}; 321 | \end{tikzpicture}} 322 | \newcommand{\umup}[1]{\begin{tikzpicture}[baseline=-0.5ex] 323 | \shade [ball color=red!90!black] (0,0) circle [radius=0.075]; 324 | \node[anchor=south] at (0,0) {$x$}; 325 | \node[anchor=north] at (0,0) {\tiny $#1$}; 326 | \end{tikzpicture}} 327 | \newcommand{\umuo}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=orange},] 328 | \node[place3] at (0,0) {}; 329 | \node[anchor=south] at (0,0) {$x$}; 330 | \node[anchor=north] at (0,0) {\tiny $#1$}; 331 | \end{tikzpicture}} 332 | \newcommand{\umuop}[1]{\begin{tikzpicture}[baseline=-0.5ex,] 333 | \shade [ball color=orange] (0,0) circle [radius=0.075]; 334 | \node[anchor=south] at (0,0) {$x$}; 335 | \node[anchor=north] at (0,0) {\tiny $#1$}; 336 | \end{tikzpicture}} 337 | \newcommand{\umuoo}[1]{\begin{tikzpicture}[baseline=-0.5ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=yellow!80!black},] 338 | \node[place3] at (0,0) {}; 339 | \node[anchor=south] at (0,0) {$x$}; 340 | \node[anchor=north] at (0,0) {\tiny $#1$}; 341 | \end{tikzpicture}} 342 | 
\newcommand{\umuoop}[1]{\begin{tikzpicture}[baseline=-0.5ex] 343 | \shade [ball color=yellow!80!black] (0,0) circle [radius=0.075]; 344 | \node[anchor=south] at (0,0) {$x$}; 345 | \node[anchor=north] at (0,0) {\tiny $#1$}; 346 | \end{tikzpicture}} 347 | \newcommand{\Umud}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 348 | \draw[] (0.5,0) node[anchor=south west] {$x+\hat{#1}$} -- node {\midarrow} (0,0); 349 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 350 | \node[place] at (0.5,0) {}; 351 | \end{tikzpicture}} 352 | \newcommand{\Umudm}[1]{\begin{tikzpicture}[every node/.style={sloped,allow upside down},baseline=0ex,place/.style={inner sep =0.2mm,circle,draw=black,fill=black}] 353 | \draw[] (0.5,0) node[anchor=south west] {$x$} -- node {\midarrow} (0,0); 354 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#1}$}; 355 | \node[place] at (0.5,0) {}; 356 | \end{tikzpicture}} 357 | \newcommand{\EUmu}[2]{\begin{tikzpicture}[baseline=0ex,place3/.style={inner sep =0.4mm,circle,draw=black,fill=green!70!black},] 358 | \node[anchor=south] at (0,0) {$x$}; 359 | \draw[] (0,0) -- node {\midarrow} (0.5,0); 360 | \node[anchor=north] at (0,0) {\tiny $#1$}; 361 | \node[anchor=south] at (0.25,0) {\tiny $\hat{#2}$}; 362 | \node[place3] at (0,0) {}; 363 | \end{tikzpicture}} 364 | \newcommand{\tikzcuboid}[4]{% width, height, depth, scale 365 | \begin{tikzpicture}[scale=#4] 366 | \foreach \x in {0,...,#1} 367 | { 368 | \draw (\x,0,#3) -- (\x,#2,#3); 369 | \draw (\x,#2,#3) -- (\x,#2,0); 370 | } 371 | \foreach \x in {0,...,#2} 372 | { 373 | \draw (#1,\x,#3) -- (#1,\x,0); 374 | \draw (0,\x,#3) -- (#1,\x,#3); 375 | } 376 | \foreach \x in {0,...,#3} 377 | { 378 | \draw (#1,0,\x) -- (#1,#2,\x); 379 | \draw (0,#2,\x) -- (#1,#2,\x); 380 | } 381 | \foreach \x in {0,...,#1} 382 | { 383 | \foreach \y in {0,...,#2} 384 | { 385 | \node[redball] at (\x,\y,#3) {\tiny\phantom{a}}; 386 | } 387 | \foreach \y in {0,...,#3} 388 | { 389 | \node[redball] at (\x,#2,\y) {\tiny\phantom{a}}; 390 | } 391 | } 392 | \foreach \x in {0,...,#2} 393 | { 394 | \foreach \y in {0,...,#3} 395 | { 396 | \node[redball] at (#1,\x,\y) {\tiny \phantom{a}}; 397 | } 398 | } 399 | \end{tikzpicture} 400 | } 401 | 402 | \newcommand{\tikzcube}[2]{%lenght, scale 403 | \tikzcuboid{#1}{#1}{#1}{#2} 404 | } 405 | %\geometry{hmargin=2.5cm,vmargin=2cm} 406 | \usepackage{setspace} 407 | %\setstretch{1,5} 408 | \usepackage{pagedecouv} 409 | 410 | \usepackage[boxruled,vlined]{algorithm2e} 411 | \providecommand{\SetAlgoLined}{\SetLine} 412 | \usepackage{float} 413 | \floatstyle{plain} 414 | \newfloat{myalgo}{tbhp}{mya} 415 | \newenvironment{Algorithm}[2][tbh]% 416 | {\begin{myalgo}[#1] 417 | \centering 418 | \begin{minipage}{#2} 419 | \begin{algorithm}[H]}% 420 | {\end{algorithm} 421 | \end{minipage} 422 | \end{myalgo}} 423 | 424 | 425 | \setcounter{secnumdepth}{4} 426 | \setcounter{tocdepth}{1} 427 | \makeatletter 428 | \newcounter {subsubsubsection}[subsubsection] 429 | \renewcommand\thesubsubsubsection{\thesubsubsection .\@alph\c@subsubsubsection} 430 | \newcommand\subsubsubsection{\@startsection{subsubsubsection}{4}{\z@}% 431 | {-3.25ex\@plus -1ex \@minus -.2ex}% 432 | {1.5ex \@plus .2ex}% 433 | {\normalfont\normalsize\bfseries}} 434 | \renewcommand\paragraph{\@startsection{paragraph}{5}{\z@}% 435 | {3.25ex \@plus1ex \@minus.2ex}% 436 | {-1em}% 437 | {\normalfont\normalsize\bfseries}} 438 | \renewcommand\subparagraph{\@startsection{subparagraph}{6}{\parindent}% 439 
| {3.25ex \@plus1ex \@minus .2ex}% 440 | {-1em}% 441 | {\normalfont\normalsize\bfseries}} 442 | \newcommand*\l@subsubsubsection{\@dottedtocline{4}{10.0em}{4.1em}} 443 | \renewcommand*\l@paragraph{\@dottedtocline{5}{10em}{5em}} 444 | \renewcommand*\l@subparagraph{\@dottedtocline{6}{12em}{6em}} 445 | \newcommand*{\subsubsubsectionmark}[1]{} 446 | \makeatother 447 | 448 | \makeatletter 449 | \def\toclevel@subsubsubsection{4} 450 | \def\toclevel@paragraph{5} 451 | \def\toclevel@subparagraph{6} 452 | \makeatother 453 | \tikzset{ 454 | hyperlink node/.style={ 455 | alias=sourcenode, 456 | append after command={ 457 | let \p1 = (sourcenode.north west), 458 | \p2=(sourcenode.south east), 459 | \n1={\x2-\x1}, 460 | \n2={\y1-\y2} in 461 | node [inner sep=0pt, outer sep=0pt,anchor=north west,at=(\p1)] {\hyperlink{#1}{\phantom{\rule{\n1}{\n2}}}} 462 | } 463 | } 464 | } 465 | \usepackage{appendix} 466 | %\usepackage{chngcntr} 467 | %\counterwithout{figure}{chapter} 468 | 469 | 470 | \usepackage{etoolbox} 471 | \usepackage{lipsum} 472 | \AtBeginEnvironment{subappendices}{% 473 | \section*{Appendix} 474 | \addcontentsline{toc}{section}{Appendices} 475 | %\counterwithin{figure}{section} 476 | %\counterwithin{table}{section} 477 | } 478 | 479 | %%% Auteurs en gras %%% 480 | \renewbibmacro*{author}{% 481 | \mkbibbold{% 482 | \ifboolexpr{ 483 | test \ifuseauthor 484 | and 485 | not test {\ifnameundef{author}} 486 | } 487 | {\printnames{author}% 488 | \iffieldundef{authortype} 489 | {} 490 | {\setunit{\addcomma\space}% 491 | \usebibmacro{authorstrg}}} 492 | {}}} 493 | %%% Auteurs en gras %%% 494 | -------------------------------------------------------------------------------- /fully_connected.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/fully_connected.pdf -------------------------------------------------------------------------------- /input_layer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/input_layer.pdf -------------------------------------------------------------------------------- /lReLU.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/lReLU.pdf -------------------------------------------------------------------------------- /mathpazo.sty: -------------------------------------------------------------------------------- 1 | %% 2 | %% This is file `mathpazo.sty', 3 | %% generated with the docstrip utility. 4 | %% 5 | %% The original source files were: 6 | %% 7 | %% psfonts.dtx (with options: `mathpazo') 8 | %% 9 | %% IMPORTANT NOTICE: 10 | %% 11 | %% For the copyright see the source file. 12 | %% 13 | %% Any modified versions of this file must be renamed 14 | %% with new filenames distinct from mathpazo.sty. 15 | %% 16 | %% For distribution of the original source see the terms 17 | %% for copying and modification in the file psfonts.dtx. 18 | %% 19 | %% This generated file may be distributed as long as the 20 | %% original source files, as listed above, are part of the 21 | %% same distribution. (The sources need not necessarily be 22 | %% in the same archive or directory.) 
23 | \ProvidesPackage{mathpazo}% 24 | [2005/04/12 PSNFSS-v9.2a 25 | Palatino w/ Pazo Math (D.Puga, WaS) 26 | ] 27 | \let\s@ved@info\@font@info 28 | \let\@font@info\@gobble 29 | \newif\ifpazo@osf 30 | \newif\ifpazo@sc 31 | \newif\ifpazo@slGreek 32 | \newif\ifpazo@BB \pazo@BBtrue 33 | \DeclareOption{osf}{\pazo@osftrue} 34 | \DeclareOption{sc}{\pazo@sctrue} 35 | \DeclareOption{slantedGreek}{\pazo@slGreektrue} 36 | \DeclareOption{noBBpl}{\pazo@BBfalse} 37 | \DeclareOption{osfeqnnum}{\OptionNotUsed} 38 | \ProcessOptions\relax 39 | \ifpazo@osf 40 | \renewcommand{\rmdefault}{pplj} 41 | \renewcommand{\oldstylenums}[1]{% 42 | {\fontfamily{pplj}\selectfont #1}} 43 | \else\ifpazo@sc 44 | \renewcommand{\rmdefault}{pplx} 45 | \renewcommand{\oldstylenums}[1]{% 46 | {\fontfamily{pplj}\selectfont #1}} 47 | \else 48 | \renewcommand{\rmdefault}{ppl} 49 | \fi\fi 50 | \newcommand{\ppleuro}{{\fontencoding{U}\fontfamily{fplm}\selectfont \char160}} 51 | \AtBeginDocument{\@ifpackageloaded{europs}{\renewcommand{\EURtm}{\ppleuro}}{}} 52 | \ifpazo@sc 53 | \DeclareSymbolFont{operators} {OT1}{pplx}{m}{n} 54 | \SetSymbolFont{operators}{bold} {OT1}{pplx}{b}{n} 55 | \DeclareMathAlphabet{\mathit} {OT1}{pplx}{m}{it} 56 | \SetMathAlphabet{\mathit}{bold} {OT1}{pplx}{b}{it} 57 | \else 58 | \DeclareSymbolFont{operators} {OT1}{ppl}{m}{n} 59 | \SetSymbolFont{operators}{bold} {OT1}{ppl}{b}{n} 60 | \DeclareMathAlphabet{\mathit} {OT1}{ppl}{m}{it} 61 | \SetMathAlphabet{\mathit}{bold} {OT1}{ppl}{b}{it} 62 | \fi 63 | \DeclareSymbolFont{upright} {OT1}{zplm}{m}{n} 64 | \DeclareSymbolFont{letters} {OML}{zplm}{m}{it} 65 | \DeclareSymbolFont{symbols} {OMS}{zplm}{m}{n} 66 | \DeclareSymbolFont{largesymbols} {OMX}{zplm}{m}{n} 67 | \SetSymbolFont{upright}{bold} {OT1}{zplm}{b}{n} 68 | \SetSymbolFont{letters}{bold} {OML}{zplm}{b}{it} 69 | \SetSymbolFont{symbols}{bold} {OMS}{zplm}{b}{n} 70 | \SetSymbolFont{largesymbols}{bold}{OMX}{zplm}{m}{n} 71 | %\DeclareMathAlphabet{\mathbf} {OT1}{zplm}{b}{n} 72 | %\DeclareMathAlphabet{\mathbold} {OML}{zplm}{b}{it} 73 | \DeclareSymbolFontAlphabet{\mathrm} {operators} 74 | \DeclareSymbolFontAlphabet{\mathnormal}{letters} 75 | \DeclareSymbolFontAlphabet{\mathcal} {symbols} 76 | \DeclareMathSymbol{!}{\mathclose}{upright}{"21} 77 | \DeclareMathSymbol{+}{\mathbin}{upright}{"2B} 78 | \DeclareMathSymbol{:}{\mathrel}{upright}{"3A} 79 | \DeclareMathSymbol{=}{\mathrel}{upright}{"3D} 80 | \DeclareMathSymbol{?}{\mathclose}{upright}{"3F} 81 | \DeclareMathDelimiter{(}{\mathopen} {upright}{"28}{largesymbols}{"00} 82 | \DeclareMathDelimiter{)}{\mathclose}{upright}{"29}{largesymbols}{"01} 83 | \DeclareMathDelimiter{[}{\mathopen} {upright}{"5B}{largesymbols}{"02} 84 | \DeclareMathDelimiter{]}{\mathclose}{upright}{"5D}{largesymbols}{"03} 85 | \DeclareMathDelimiter{/}{\mathord}{upright}{"2F}{largesymbols}{"0E} 86 | \DeclareMathAccent{\acute}{\mathalpha}{upright}{"13} 87 | \DeclareMathAccent{\grave}{\mathalpha}{upright}{"12} 88 | \DeclareMathAccent{\ddot}{\mathalpha}{upright}{"7F} 89 | \DeclareMathAccent{\tilde}{\mathalpha}{upright}{"7E} 90 | \DeclareMathAccent{\bar}{\mathalpha}{upright}{"16} 91 | \DeclareMathAccent{\breve}{\mathalpha}{upright}{"15} 92 | \DeclareMathAccent{\check}{\mathalpha}{upright}{"14} 93 | \DeclareMathAccent{\hat}{\mathalpha}{upright}{"5E} 94 | \DeclareMathAccent{\dot}{\mathalpha}{upright}{"5F} 95 | \DeclareMathAccent{\mathring}{\mathalpha}{upright}{"17} 96 | \DeclareMathSymbol{\mathdollar}{\mathord}{upright}{"24} 97 | \DeclareMathSymbol{,}{\mathpunct}{operators}{44} 98 | 
\DeclareMathSymbol{.}{\mathord}{operators}{46} 99 | \ifpazo@BB 100 | \AtBeginDocument{% 101 | \let\mathbb\relax 102 | \DeclareMathAlphabet\PazoBB{U}{fplmbb}{m}{n} 103 | \newcommand{\mathbb}{\PazoBB} 104 | } 105 | \fi 106 | \medmuskip=3.5mu plus 1mu minus 1mu 107 | \def\joinrel{\mathrel{\mkern-3.45mu}} 108 | \renewcommand{\hbar}{{\mkern0.8mu\mathchar'26\mkern-6.8muh}} 109 | \ifpazo@slGreek 110 | \DeclareMathSymbol{\Gamma} {\mathalpha}{letters}{"00} 111 | \DeclareMathSymbol{\Delta} {\mathalpha}{letters}{"01} 112 | \DeclareMathSymbol{\Theta} {\mathalpha}{letters}{"02} 113 | \DeclareMathSymbol{\Lambda} {\mathalpha}{letters}{"03} 114 | \DeclareMathSymbol{\Xi} {\mathalpha}{letters}{"04} 115 | \DeclareMathSymbol{\Pi} {\mathalpha}{letters}{"05} 116 | \DeclareMathSymbol{\Sigma} {\mathalpha}{letters}{"06} 117 | \DeclareMathSymbol{\Upsilon}{\mathalpha}{letters}{"07} 118 | \DeclareMathSymbol{\Phi} {\mathalpha}{letters}{"08} 119 | \DeclareMathSymbol{\Psi} {\mathalpha}{letters}{"09} 120 | \DeclareMathSymbol{\Omega} {\mathalpha}{letters}{"0A} 121 | \else 122 | \DeclareMathSymbol{\Gamma}{\mathalpha}{upright}{"00} 123 | \DeclareMathSymbol{\Delta}{\mathalpha}{upright}{"01} 124 | \DeclareMathSymbol{\Theta}{\mathalpha}{upright}{"02} 125 | \DeclareMathSymbol{\Lambda}{\mathalpha}{upright}{"03} 126 | \DeclareMathSymbol{\Xi}{\mathalpha}{upright}{"04} 127 | \DeclareMathSymbol{\Pi}{\mathalpha}{upright}{"05} 128 | \DeclareMathSymbol{\Sigma}{\mathalpha}{upright}{"06} 129 | \DeclareMathSymbol{\Upsilon}{\mathalpha}{upright}{"07} 130 | \DeclareMathSymbol{\Phi}{\mathalpha}{upright}{"08} 131 | \DeclareMathSymbol{\Psi}{\mathalpha}{upright}{"09} 132 | \DeclareMathSymbol{\Omega}{\mathalpha}{upright}{"0A} 133 | \fi 134 | \DeclareMathSymbol{\upGamma}{\mathord}{upright}{0} 135 | \DeclareMathSymbol{\upDelta}{\mathord}{upright}{1} 136 | \DeclareMathSymbol{\upTheta}{\mathord}{upright}{2} 137 | \DeclareMathSymbol{\upLambda}{\mathord}{upright}{3} 138 | \DeclareMathSymbol{\upXi}{\mathord}{upright}{4} 139 | \DeclareMathSymbol{\upPi}{\mathord}{upright}{5} 140 | \DeclareMathSymbol{\upSigma}{\mathord}{upright}{6} 141 | \DeclareMathSymbol{\upUpsilon}{\mathord}{upright}{7} 142 | \DeclareMathSymbol{\upPhi}{\mathord}{upright}{8} 143 | \DeclareMathSymbol{\upPsi}{\mathord}{upright}{9} 144 | \DeclareMathSymbol{\upOmega}{\mathord}{upright}{10} 145 | \DeclareMathSymbol{\alpha}{\mathalpha}{letters}{"0B} 146 | \DeclareMathSymbol{\beta}{\mathalpha}{letters}{"0C} 147 | \DeclareMathSymbol{\gamma}{\mathalpha}{letters}{"0D} 148 | \DeclareMathSymbol{\delta}{\mathalpha}{letters}{"0E} 149 | \DeclareMathSymbol{\epsilon}{\mathalpha}{letters}{"0F} 150 | \DeclareMathSymbol{\zeta}{\mathalpha}{letters}{"10} 151 | \DeclareMathSymbol{\eta}{\mathalpha}{letters}{"11} 152 | \DeclareMathSymbol{\theta}{\mathalpha}{letters}{"12} 153 | \DeclareMathSymbol{\iota}{\mathalpha}{letters}{"13} 154 | \DeclareMathSymbol{\kappa}{\mathalpha}{letters}{"14} 155 | \DeclareMathSymbol{\lambda}{\mathalpha}{letters}{"15} 156 | \DeclareMathSymbol{\mu}{\mathalpha}{letters}{"16} 157 | \DeclareMathSymbol{\nu}{\mathalpha}{letters}{"17} 158 | \DeclareMathSymbol{\xi}{\mathalpha}{letters}{"18} 159 | \DeclareMathSymbol{\pi}{\mathalpha}{letters}{"19} 160 | \DeclareMathSymbol{\rho}{\mathalpha}{letters}{"1A} 161 | \DeclareMathSymbol{\sigma}{\mathalpha}{letters}{"1B} 162 | \DeclareMathSymbol{\tau}{\mathalpha}{letters}{"1C} 163 | \DeclareMathSymbol{\upsilon}{\mathalpha}{letters}{"1D} 164 | \DeclareMathSymbol{\phi}{\mathalpha}{letters}{"1E} 165 | \DeclareMathSymbol{\chi}{\mathalpha}{letters}{"1F} 166 | 
\DeclareMathSymbol{\psi}{\mathalpha}{letters}{"20} 167 | \DeclareMathSymbol{\omega}{\mathalpha}{letters}{"21} 168 | \DeclareMathSymbol{\varepsilon}{\mathalpha}{letters}{"22} 169 | \DeclareMathSymbol{\vartheta}{\mathalpha}{letters}{"23} 170 | \DeclareMathSymbol{\varpi}{\mathalpha}{letters}{"24} 171 | \DeclareMathSymbol{\varrho}{\mathalpha}{letters}{"25} 172 | \DeclareMathSymbol{\varsigma}{\mathalpha}{letters}{"26} 173 | \DeclareMathSymbol{\varphi}{\mathalpha}{letters}{"27} 174 | \let\s@vedhbar\hbar 175 | \AtBeginDocument{% 176 | \DeclareFontFamily{U}{msa}{}% 177 | \DeclareFontShape{U}{msa}{m}{n}{<->s*[1.042]msam10}{}% 178 | \DeclareFontFamily{U}{msb}{}% 179 | \DeclareFontShape{U}{msb}{m}{n}{<->s*[1.042]msbm10}{}% 180 | \DeclareFontFamily{U}{euf}{}% 181 | \DeclareFontShape{U}{euf}{m}{n}{<-6>eufm5<6-8>eufm7<8->eufm10}{}% 182 | \DeclareFontShape{U}{euf}{b}{n}{<-6>eufb5<6-8>eufb7<8->eufb10}{}% 183 | \@ifpackageloaded{amsfonts}{\let\hbar\s@vedhbar}{} 184 | \@ifpackageloaded{amsmath}{}{% 185 | \newdimen\big@size 186 | \addto@hook\every@math@size{\setbox\z@\vbox{\hbox{$($}\kern\z@}% 187 | \global\big@size 1.2\ht\z@} 188 | \def\bBigg@#1#2{% 189 | {\hbox{$\left#2\vcenter to#1\big@size{}\right.\n@space$}}} 190 | \def\big{\bBigg@\@ne} 191 | \def\Big{\bBigg@{1.5}} 192 | \def\bigg{\bBigg@\tw@} 193 | \def\Bigg{\bBigg@{2.5}} 194 | } 195 | } 196 | \def\defaultscriptratio{.76} 197 | \def\defaultscriptscriptratio{.6} 198 | \DeclareMathSizes{5} {5} {5} {5} 199 | \DeclareMathSizes{6} {6} {5} {5} 200 | \DeclareMathSizes{7} {7} {5} {5} 201 | \DeclareMathSizes{8} {8} {6} {5} 202 | \DeclareMathSizes{9} {9} {7} {5} 203 | \DeclareMathSizes{10} {10} {7.6} {6} 204 | \DeclareMathSizes{10.95}{10.95}{8} {6} 205 | \DeclareMathSizes{12} {12} {9} {7} 206 | \DeclareMathSizes{14.4} {14.4} {10} {8} 207 | \DeclareMathSizes{17.28}{17.28}{12} {10} 208 | \DeclareMathSizes{20.74}{20.74}{14.4} {12} 209 | \DeclareMathSizes{24.88}{24.88}{20.74}{14.4} 210 | \let\@font@info\s@ved@info 211 | \endinput 212 | %% 213 | %% End of file `mathpazo.sty'. 
214 | -------------------------------------------------------------------------------- /output_layer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/output_layer.pdf -------------------------------------------------------------------------------- /padding.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/padding.pdf -------------------------------------------------------------------------------- /pagedecouv.sty: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/pagedecouv.sty -------------------------------------------------------------------------------- /pool_4d-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/pool_4d-crop.pdf -------------------------------------------------------------------------------- /sigmoid.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/sigmoid.pdf -------------------------------------------------------------------------------- /softmax.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/softmax.pdf -------------------------------------------------------------------------------- /tanh.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/tanh.pdf -------------------------------------------------------------------------------- /tanh2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomepel/Technical_Book_DL/1246c33fa086b2169cf13b28a165cc8f7d69efd8/tanh2.pdf --------------------------------------------------------------------------------