├── .DS_Store
├── .gitignore
├── LICENSE
├── README.md
└── lessons
    ├── .DS_Store
    ├── 00_notebook_tutorial.ipynb
    ├── Lesson 2 review.pdf
    ├── crappify.py
    ├── images
    │   ├── download_images
    │   │   └── upload.png
    │   └── notebook_tutorial
    │       ├── add.png
    │       ├── busy.png
    │       ├── cat_example.jpg
    │       ├── markdown.png
    │       ├── new_notebook.png
    │       ├── run.png
    │       ├── save.png
    │       └── terminal.png
    ├── lesson1-pets.ipynb
    ├── lesson2-download.ipynb
    ├── lesson2-sgd.ipynb
    ├── lesson3-camvid-tiramisu.ipynb
    ├── lesson3-camvid.ipynb
    ├── lesson3-head-pose.ipynb
    ├── lesson3-imdb.ipynb
    ├── lesson3-planet.ipynb
    ├── lesson4-collab.ipynb
    ├── lesson4-tabular.ipynb
    ├── lesson5-sgd-mnist.ipynb
    ├── lesson6-pets-more.ipynb
    ├── lesson6-rossmann.ipynb
    ├── lesson7-human-numbers.ipynb
    ├── lesson7-resnet-mnist.ipynb
    ├── lesson7-superres-gan.ipynb
    ├── lesson7-superres-imagenet.ipynb
    ├── lesson7-superres.ipynb
    ├── lesson7-wgan.ipynb
    └── rossman_data_clean.ipynb

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.DS_Store
2 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2019 SaturdaysAI
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/SaturdaysAI/Itinerario_DeepLearning/master)
2 | 
3 | # Itinerario_DeepLearning (_work in progress_)
4 | ### Official Deep Learning itinerary - Saturdays.ai
5 | 
6 | __To graduate from the AI Saturdays program with the "_Deep Learning_" specialization, you must complete at least the content of this itinerary.
7 | It is designed to represent a little over 50% of the content of the "_code2learn_" phase of AI Saturdays.__
8 | 
9 | To work with the notebooks you can:
10 | 1) Clone this content onto your machine: `git clone https://github.com/SaturdaysAI/Itinerario_DeepLearning`
11 | 2) Run it online with Google Colab.
12 | 3) Run it online with Jupyter using mybinder (button above).
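Whichever of the three options above you choose, the lesson notebooks assume a working Python stack with NumPy, Pandas, PyTorch and fastai v1 available. The repository does not pin any versions, so the snippet below is only a minimal sketch of an environment check you can run before Session 1; the exact packages and versions you install are an assumption, not something this repo defines.

```python
# Minimal environment check (a sketch -- this repo does not pin versions;
# the notebooks in lessons/ are written against the fastai v1 API).
import sys

import numpy as np
import pandas as pd
import torch
import fastai

print("python :", sys.version.split()[0])
print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("torch  :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fastai :", fastai.__version__)
```

If one of these imports fails, install the missing package in the environment where you run Jupyter before starting the sessions.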
13 | 
14 | 
15 | ### Core track:
16 | 
17 | 
18 | **Session 1 (ML project pipeline and basic libraries):**
19 | - [Notebooks (machine learning project pipeline)](https://github.com/pablotalavante/ai6-madrid-demos/tree/master/session_1) (We will go through and explain these on Saturday)
20 | - [Slides - Machine Learning review](https://drive.google.com/file/d/1r4SBY6Dm6xjFqLH12tFb-Bf7wbvoIN_C/view)
21 | - [Introduction to Deep Learning - slides](http://introtodeeplearning.com/materials/2019_6S191_L1.pdf)
22 | - [Introduction to Deep Learning - video](https://www.youtube.com/watch?v=5v1JnYv_yWs&index=1&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI)
23 | - [Numpy](http://cs231n.github.io/python-numpy-tutorial/#numpy)
24 | - [Pandas video](https://www.youtube.com/watch?v=e60ItwlZTKM)
25 | - [Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
26 | - [OpenCV tutorial](https://www.pyimagesearch.com/2018/07/19/opencv-tutorial-a-guide-to-learn-opencv/)
27 | - [Python os module](https://stackabuse.com/introduction-to-python-os-module/)
28 | 
29 | The goal of this first week is for everyone to learn how to use the different libraries we will rely on throughout the course. Depending on how familiar you already are with them, you will need to spend more or less time on this, but it will be a useful review for all of us and make sure we all start from the same point.
30 | 
31 | We also include an introduction to Deep Learning (video and slides) that is part of an MIT course and that will introduce what we will see during this course.
32 | 
33 | On Saturday a challenge will be proposed to prepare a dataset with which we will later train an image recognition model.
34 | 
35 | Good luck everyone!
36 | 
37 | **Session 2 (Neural networks. CNN and Image Classification):**
38 | - [Notebook 1](https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/00_notebook_tutorial.ipynb)
39 | - [Notebook 2](https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson1-pets.ipynb)
40 | - [slides](https://github.com/hiromis/notes/blob/master/Lesson1.md)
41 | - [video](https://course.fast.ai/videos/?lesson=1)
42 | - [Saturday slides](https://github.com/SaturdaysAI/Itinerario_DeepLearning/blob/master/lessons/Lesson%202%20review.pdf)
43 | 
44 | In this second week we will study the problem of classifying images into categories. To that end, we leave you a set of slides and the first video (1h 40min) to watch during the week. The links above also include the notebook (2) that is used during the lesson video, in case you want to run and explore it as you go. The first notebook contains a short summary of how jupyter notebook works, in case you have not had the chance to work in this environment before.
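For orientation, the following is a rough sketch of the kind of classifier that notebook (2) (`lesson1-pets.ipynb`) builds with the fastai v1 API. The dataset, the label regex and the hyperparameters below follow the fast.ai lesson defaults rather than anything defined in this repository, so treat it as a reference outline, not as the exact code of the lesson.

```python
# Sketch of a lesson-1-style image classifier (fastai v1 API); values below
# follow the fast.ai lesson defaults and can be adapted to your own dataset.
from fastai.vision import *

path = untar_data(URLs.PETS)              # downloads the Oxford-IIIT Pet dataset
fnames = get_image_files(path/'images')
pat = r'/([^/]+)_\d+.jpg$'                # the class label is encoded in the file name

data = ImageDataBunch.from_name_re(
    path/'images', fnames, pat,
    ds_tfms=get_transforms(), size=224, bs=32
).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=error_rate)  # pretrained ResNet-34
learn.fit_one_cycle(4)                    # fine-tune for a few epochs
learn.save('stage-1')
```

The dataset you prepare for the Saturday challenge can be plugged into the same pipeline by pointing the `ImageDataBunch` factory methods (for example `ImageDataBunch.from_folder`) at your own image folder.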
45 | 
46 | 
47 | 
48 | **Session 3 (Multiclass classification and segmentation):**
49 | - [Notebooks](https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson3-camvid.ipynb)
50 | - [slides](https://github.com/hiromis/notes/blob/master/Lesson3.md)
51 | - [video](https://course.fast.ai/videos/?lesson=3)
52 | 
53 | 
54 | 
55 | **Session 4 (Recurrent Neural Networks and Genetic Algorithms):**
56 | - [Notebooks](https://github.com/pablotalavante/ai6-madrid-demos/blob/master/session_4/LSTM%20sesion%204.ipynb)
57 | - [slides](https://docs.google.com/presentation/d/1hqYB3LRwg_-ntptHxH18W1ax9kBwkaZ1Pa_s3L7R-2Y/edit)
58 | - [video](https://www.youtube.com/watch?v=WCUNPb-5EYI)
59 | 
60 | - the genetic algorithms material is still missing
61 | 
62 | 
63 | 
64 | **Session 5 (NLP):**
65 | - [Notebooks](https://github.com/pablotalavante/ai6-madrid-demos/blob/master/session_4/LSTM%20sesion%204.ipynb)
66 | - [slides](https://forums.fast.ai/t/deep-learning-lesson-4-notes/30983)
67 | - [video fast.ai](https://course.fast.ai/videos/?lesson=4)
68 | - [video NLP](https://www.youtube.com/watch?v=jpWqz85F_4Y)
69 | 
70 | 
71 | 
72 | **Session 6 (GAN and Autoencoders):**
73 | - [slides](https://github.com/pablotalavante/ai6-madrid-demos/blob/master/talk_23_march/VAEs%2C%20GANs%20and%20CPPNs_%20a%20visual%20approach%20to%20generative%20models.pdf)
74 | - [video](https://www.youtube.com/watch?v=ondivPiwQho)
75 | 
76 | 
77 | The remaining sessions up to the ___Demo Day___ are devoted to the "_build2learn_" phase, in which you will build an open project with social impact using _deep learning_; you can see projects from previous editions [here](http://github.com/saturdaysai/projects).
78 | 
79 | This itinerary is mainly inspired by "Practical DL for Coders" by Fast.ai.
80 | 
81 | Likewise, feel free to share or propose improvements by opening "issues" or sending a PR.
82 | 
83 | 
84 | 
85 | 
86 | ### Official Reinforcement Learning itinerary - Saturdays.ai
87 | 
88 | 
89 | We will follow the course below, which will guide us through the different sessions:
90 | 
91 | **Full course:**
92 | [A Free course in Deep Reinforcement Learning from beginner to expert](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
93 | 
94 | 
95 | **MIT Deep Learning Course (Session 3):**
96 | 
97 | The MIT Deep Learning course has one session and one lab dedicated to Reinforcement Learning.
98 | 
99 | [slides](http://introtodeeplearning.com/materials/2019_6S191_L5.pdf)
100 | [video](https://www.youtube.com/watch?v=i6Mi2_QM3rA&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=1)
101 | [code (lab3)](https://github.com/aamini/introtodeeplearning_labs)
102 | 
103 | 
104 | **Theory behind Deep Learning (Deep Mind):**
105 | 
106 | David Silver's DeepMind course takes a very theoretical approach, for those who want to go deeper into the mathematical side of RL.
107 | 
108 | [slides](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html)
109 | [videos](https://www.youtube.com/watch?v=2pWv7GOvuf0)
110 | 
111 | The videos link points to the first class; the rest can be found under related videos.
112 | 
--------------------------------------------------------------------------------
/lessons/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/.DS_Store
--------------------------------------------------------------------------------
/lessons/Lesson 2 review.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/Lesson 2 review.pdf
--------------------------------------------------------------------------------
/lessons/crappify.py:
--------------------------------------------------------------------------------
1 | import random
2 | 
3 | from fastai.vision import *       # provides resize_to() used below
4 | from PIL import Image, ImageDraw  # imported after fastai so PIL's Image is not shadowed by fastai's Image class
5 | 
6 | 
7 | class crappifier(object):
8 |     "Turn high-res images into low-quality JPEG copies with a random number stamped on them."
9 |     def __init__(self, path_lr, path_hr):
10 |         self.path_lr = path_lr   # destination folder for the degraded images
11 |         self.path_hr = path_hr   # source folder with the high-res originals
12 | 
13 |     def __call__(self, fn, i):
14 |         # mirror the source file's relative path under the low-res folder
15 |         dest = self.path_lr/fn.relative_to(self.path_hr)
16 |         dest.parent.mkdir(parents=True, exist_ok=True)
17 |         img = Image.open(fn)
18 |         # downscale so the smallest side is 96 px, keeping the aspect ratio
19 |         targ_sz = resize_to(img, 96, use_min=True)
20 |         img = img.resize(targ_sz, resample=Image.BILINEAR).convert('RGB')
21 |         w,h = img.size
22 |         # pick a random JPEG quality and draw that number somewhere on the image
23 |         q = random.randint(10,70)
24 |         ImageDraw.Draw(img).text((random.randint(0,w//2), random.randint(0,h//2)), str(q), fill=(255,255,255))
25 |         img.save(dest, quality=q)
--------------------------------------------------------------------------------
/lessons/images/download_images/upload.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/download_images/upload.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/add.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/add.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/busy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/busy.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/cat_example.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/cat_example.jpg
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/markdown.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/markdown.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/new_notebook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/new_notebook.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/run.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/run.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/save.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/save.png
--------------------------------------------------------------------------------
/lessons/images/notebook_tutorial/terminal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SaturdaysAI/Itinerario_DeepLearning/d93e41795917912d60733b1cd03c62f4007eec2f/lessons/images/notebook_tutorial/terminal.png
--------------------------------------------------------------------------------
/lessons/lesson3-imdb.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# IMDB"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "%reload_ext autoreload\n",
17 | "%autoreload 2\n",
18 | "%matplotlib inline"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "from fastai.text import *"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## Preparing the data"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "First let's download the dataset we are going to study. The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) has been curated by Andrew Maas et al. and contains a total of 100,000 reviews on IMDB. 25,000 of them are labelled as positive and negative for training, another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled data (but we will find a use for them nonetheless).\n",
42 | "\n",
43 | "We'll begin with a sample we've prepared for you, so that things run quickly before going over the full dataset."
44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "[PosixPath('/home/ubuntu/notebooks/data/imdb_sample/data_clas_export.pkl'),\n", 55 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/export_lm.pkl'),\n", 56 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/export.pkl'),\n", 57 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/texts.csv'),\n", 58 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/data_lm_export.pkl'),\n", 59 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/export_clas.pkl'),\n", 60 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/models'),\n", 61 | " PosixPath('/home/ubuntu/notebooks/data/imdb_sample/save_data_clas.pkl')]" 62 | ] 63 | }, 64 | "execution_count": null, 65 | "metadata": {}, 66 | "output_type": "execute_result" 67 | } 68 | ], 69 | "source": [ 70 | "path = untar_data(URLs.IMDB_SAMPLE)\n", 71 | "path.ls()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "It only contains one csv file, let's have a look at it." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/html": [ 89 | "
\n", 90 | "\n", 103 | "\n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | "
labeltextis_valid
0negativeUn-bleeping-believable! Meg Ryan doesn't even ...False
1positiveThis is a extremely well-made film. The acting...False
2negativeEvery once in a long while a movie will come a...False
3positiveName just says it all. I watched this movie wi...False
4negativeThis movie succeeds at being one of the most u...False
\n", 145 | "
" 146 | ], 147 | "text/plain": [ 148 | " label text is_valid\n", 149 | "0 negative Un-bleeping-believable! Meg Ryan doesn't even ... False\n", 150 | "1 positive This is a extremely well-made film. The acting... False\n", 151 | "2 negative Every once in a long while a movie will come a... False\n", 152 | "3 positive Name just says it all. I watched this movie wi... False\n", 153 | "4 negative This movie succeeds at being one of the most u... False" 154 | ] 155 | }, 156 | "execution_count": null, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "df = pd.read_csv(path/'texts.csv')\n", 163 | "df.head()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.

But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.

The result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen. But it is never painted as a black-and-white case. There is baseness and nobility on both sides, and also the hope for change in the younger generation.

There is redemption of a sort, in the end, when Puro has to make a hard choice between a man who has ruined her life, but also truly loved her, and her family which has disowned her, then later come looking for her. But by that point, she has no option that is without great pain for her.

This film carries the message that both Muslims and Hindus have their grave faults, and also that both can be dignified and caring people. The reality of partition makes that realisation all the more wrenching, since there can never be real reconciliation across the India/Pakistan border. In that sense, it is similar to \"Mr & Mrs Iyer\".

In the end, we were glad to have seen the film, even though the resolution was heartbreaking. If the UK and US could deal with their own histories of racism with this kind of frankness, they would certainly be better off.'" 175 | ] 176 | }, 177 | "execution_count": null, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "df['text'][1]" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "It contains one line per review, with the label ('negative' or 'positive'), the text and a flag to determine if it should be part of the validation set or the training set. If we ignore this flag, we can create a DataBunch containing this data in one line of code:" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "data_lm = TextDataBunch.from_csv(path, 'texts.csv')" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "By executing this line a process was launched that took a bit of time. Let's dig a bit into it. Images could be fed (almost) directly into a model because they're just a big array of pixel values that are floats between 0 and 1. A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two differents steps: tokenization and numericalization. A `TextDataBunch` does all of that behind the scenes for you.\n", 207 | "\n", 208 | "Before we delve into the explanations, let's take the time to save the things that were calculated." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "data_lm.save()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Next time we launch this notebook, we can skip the cell above that took a bit of time (and that will take a lot more when you get to the full dataset) and load those results like this:" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "data = load_data(path)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### Tokenization" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "The first step of processing we make the texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:\n", 248 | "\n", 249 | "- we need to take care of punctuation\n", 250 | "- some words are contractions of two different words, like isn't or don't\n", 251 | "- we may need to clean some parts of our texts, if there's HTML code for instance\n", 252 | "\n", 253 | "To see what the tokenizer had done behind the scenes, let's have a look at a few texts in a batch." 
254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "data": { 263 | "text/html": [ 264 | "\n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | "
texttarget
xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \\n \\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmajnegative
xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the sweetest and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up withpositive
xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj sydney , after xxunk ) , i can xxunk join both xxunk of \" xxmaj at xxmaj the xxmaj movies \" in taking xxmaj steven xxmaj soderbergh to task . \\n \\n xxmaj it 's usually satisfying to watch a film director change his style /negative
xxbos xxmaj this film sat on my xxmaj tivo for weeks before i watched it . i dreaded a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj yorkers . \\n \\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' \" xxmaj la xxmaj rondepositive
xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first stealth games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphicspositive
" 294 | ], 295 | "text/plain": [ 296 | "" 297 | ] 298 | }, 299 | "metadata": {}, 300 | "output_type": "display_data" 301 | } 302 | ], 303 | "source": [ 304 | "data = TextClasDataBunch.from_csv(path, 'texts.csv')\n", 305 | "data.show_batch()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "The texts are truncated at 100 tokens for more readability. We can see that it did more than just split on space and punctuation symbols: \n", 313 | "- the \"'s\" are grouped together in one token\n", 314 | "- the contractions are separated like this: \"did\", \"n't\"\n", 315 | "- content has been cleaned for any HTML symbol and lower cased\n", 316 | "- there are several special tokens (all those that begin by xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one)." 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "### Numericalization" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "Once we have extracted tokens from our texts, we convert to integers by creating a list of all the words used. We only keep the ones that appear at least twice with a maximum vocabulary size of 60,000 (by default) and replace the ones that don't make the cut by the unknown token `UNK`.\n", 331 | "\n", 332 | "The correspondance from ids to tokens is stored in the `vocab` attribute of our datasets, in a dictionary called `itos` (for int to string)." 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "['xxunk',\n", 344 | " 'xxpad',\n", 345 | " 'xxbos',\n", 346 | " 'xxfld',\n", 347 | " 'xxmaj',\n", 348 | " 'xxup',\n", 349 | " 'xxrep',\n", 350 | " 'xxwrep',\n", 351 | " 'the',\n", 352 | " '.']" 353 | ] 354 | }, 355 | "execution_count": null, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "data.vocab.itos[:10]" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "And if we look at what a what's in our datasets, we'll see the tokenized text as a representation:" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": {}, 375 | "outputs": [ 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "Text xxbos i know that originally , this film was xxup not a box office hit , but in light of recent xxmaj hollywood releases ( most of which have been decidedly formula - ridden , plot less , pointless , \" save - the - blonde - chick - no - matter - what \" xxunk ) , xxmaj xxunk of xxmaj all xxmaj xxunk , certainly in this sorry context deserves a second opinion . xxmaj the film -- like the book -- loses xxunk in some of the historical background , but it xxunk a uniquely xxmaj american dilemma set against the uniquely horrific xxmaj american xxunk of human xxunk , and some of its tragic ( and funny , and touching ) consequences . \n", 380 | "\n", 381 | " xxmaj and worthy of xxunk out is the youthful xxmaj robert xxmaj xxunk , cast as the leading figure , xxmaj xxunk , whose xxunk xxunk is truly universal as he sets out in the beginning of his ' coming of age , ' only to be xxunk disappointed at what turns out to become his true education in the ways of the xxmaj southern plantation world of xxmaj xxunk , at the xxunk of the xxunk period . 
xxmaj when i saw the previews featuring the ( xxunk ) blond - xxunk xxmaj xxunk , i expected a xxunk , a xxunk , a xxunk -- i was pleasantly surprised . \n", 382 | "\n", 383 | " xxmaj xxunk xxmaj davis , xxmaj ruby xxmaj dee , the late xxmaj ben xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxmaj victoria xxmaj xxunk and even xxmaj xxunk xxmaj guy xxunk vivid imagery and formidable skill as actors in the backdrop xxunk of xxunk , voodoo , xxmaj xxunk \" xxunk , \" and xxmaj xxunk revolt woven into this tale of human passion , hate , love , family , and racial xxunk in a society which is supposedly gone and yet somehow is still with us ." 384 | ] 385 | }, 386 | "execution_count": null, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "data.train_ds[0][0]" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "But the underlying data is all numbers" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "array([ 2, 18, 146, 19, 3788, 10, 20, 31, 25, 5])" 411 | ] 412 | }, 413 | "execution_count": null, 414 | "metadata": {}, 415 | "output_type": "execute_result" 416 | } 417 | ], 418 | "source": [ 419 | "data.train_ds[0][0].data[:10]" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "### With the data block API" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "We can use the data block API with NLP and have a lot more flexibility than what the default factory methods offer. In the previous example for instance, the data was randomly split between train and validation instead of reading the third column of the csv.\n", 434 | "\n", 435 | "With the data block API though, we have to manually call the tokenize and numericalize steps. This allows more flexibility, and if you're not using the defaults from fastai, the various arguments to pass will appear in the step they're revelant, so it'll be more readable." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "data = (TextList.from_csv(path, 'texts.csv', cols='text')\n", 445 | " .split_from_df(col=2)\n", 446 | " .label_from_df(cols=0)\n", 447 | " .databunch())" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "## Language model" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "Note that language models can use a lot of GPU, so you may need to decrease batchsize here." 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "bs=48" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "Now let's grab the full dataset for what follows." 
478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": {}, 484 | "outputs": [ 485 | { 486 | "data": { 487 | "text/plain": [ 488 | "[PosixPath('/home/ubuntu/.fastai/data/imdb/test'),\n", 489 | " PosixPath('/home/ubuntu/.fastai/data/imdb/tmp_clas'),\n", 490 | " PosixPath('/home/ubuntu/.fastai/data/imdb/README'),\n", 491 | " PosixPath('/home/ubuntu/.fastai/data/imdb/unsup'),\n", 492 | " PosixPath('/home/ubuntu/.fastai/data/imdb/train'),\n", 493 | " PosixPath('/home/ubuntu/.fastai/data/imdb/tmp_lm'),\n", 494 | " PosixPath('/home/ubuntu/.fastai/data/imdb/models'),\n", 495 | " PosixPath('/home/ubuntu/.fastai/data/imdb/imdb.vocab')]" 496 | ] 497 | }, 498 | "execution_count": null, 499 | "metadata": {}, 500 | "output_type": "execute_result" 501 | } 502 | ], 503 | "source": [ 504 | "path = untar_data(URLs.IMDB)\n", 505 | "path.ls()" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "[PosixPath('/home/ubuntu/.fastai/data/imdb/train/neg'),\n", 517 | " PosixPath('/home/ubuntu/.fastai/data/imdb/train/unsupBow.feat'),\n", 518 | " PosixPath('/home/ubuntu/.fastai/data/imdb/train/pos'),\n", 519 | " PosixPath('/home/ubuntu/.fastai/data/imdb/train/labeledBow.feat')]" 520 | ] 521 | }, 522 | "execution_count": null, 523 | "metadata": {}, 524 | "output_type": "execute_result" 525 | } 526 | ], 527 | "source": [ 528 | "(path/'train').ls()" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "The reviews are in a training and test set following an imagenet structure. The only difference is that there is an `unsup` folder on top of `train` and `test` that contains the unlabelled data.\n", 536 | "\n", 537 | "We're not going to train a model that classifies the reviews from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of wikipedia called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset)). That model has been trained to guess what the next word is, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.\n", 538 | "\n", 539 | "We are going to use that 'knowledge' of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset. Because the English of the reviews left by people on IMDB isn't the same as the English of wikipedia, we'll need to adjust the parameters of our model by a little bit. Plus there might be some words that would be extremely common in the reviews dataset but would be barely present in wikipedia, and therefore might not be part of the vocabulary the model was trained on." 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (next line takes a few minutes)." 
547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 555 | "data_lm = (TextList.from_folder(path)\n", 556 | " #Inputs: all the text files in path\n", 557 | " .filter_by_folder(include=['train', 'test', 'unsup']) \n", 558 | " #We may have other temp folders that contain text files so we only keep what's in train and test\n", 559 | " .split_by_rand_pct(0.1)\n", 560 | " #We randomly split and keep 10% (10,000 reviews) for validation\n", 561 | " .label_for_lm() \n", 562 | " #We want to do a language model so we label accordingly\n", 563 | " .databunch(bs=bs))\n", 564 | "data_lm.save('data_lm.pkl')" 565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "We have to use a special kind of `TextDataBunch` for the language model, that ignores the labels (that's why we put 0 everywhere), will shuffle the texts at each epoch before concatenating them all together (only for training, we don't shuffle for the validation set) and will send batches that read that text in order with targets that are the next word in the sentence.\n", 572 | "\n", 573 | "The line before being a bit long, we want to load quickly the final ids by using the following cell." 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": null, 579 | "metadata": {}, 580 | "outputs": [], 581 | "source": [ 582 | "data_lm = load_data(path, 'data_lm.pkl', bs=bs)" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": {}, 589 | "outputs": [ 590 | { 591 | "data": { 592 | "text/html": [ 593 | "\n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | "
idxtext
0original script that xxmaj david xxmaj dhawan has worked on . xxmaj this one was a complete bit y bit rip off xxmaj hitch . i have nothing against remakes as such , but this one is just so lousy that it makes you even hate the original one ( which was pretty decent ) . i fail to understand what actors like xxmaj salman and xxmaj govinda saw in
1' classic ' xxmaj the xxmaj big xxmaj doll xxmaj house ' , which takes xxmaj awful to a whole new level . i can heartily recommend these two xxunk as a double - bill . xxmaj you 'll laugh yourself silly . xxbos xxmaj this movie is a pure disaster , the story is stupid and the editing is the worst i have seen , it confuses you incredibly
2of xxmaj european cinema 's most quietly disturbing sociopaths and one of the most memorable finales of all time ( shamelessly stolen by xxmaj tarantino for xxmaj kill xxmaj bill xxmaj volume xxmaj two ) , but it has plenty more to offer than that . xxmaj playing around with chronology and inverting the usual clichés of standard ' lady vanishes ' plots , it also offers superb characterisation and
3but even xxmaj martin xxmaj short managed a distinct , supporting character . ) \\n\\n i can understand the attraction of an imaginary world created in a good romantic comedy . xxmaj but this film is the prozac version of an imaginary world . i 'm frightened to consider that anyone could enjoy it even as pure fantasy . xxbos movie i have ever seen . xxmaj actually i find
4xxmaj pre - xxmaj code film . xxbos xxmaj here 's a decidedly average xxmaj italian post apocalyptic take on the hunting / killing humans for sport theme ala xxmaj the xxmaj most xxmaj dangerous xxmaj game , xxmaj turkey xxmaj shoot , xxmaj gymkata and xxmaj the xxmaj running xxmaj man . \\n\\n xxmaj certainly the film reviewed here is nowhere near as much fun as the other listed
\n" 618 | ], 619 | "text/plain": [ 620 | "" 621 | ] 622 | }, 623 | "metadata": {}, 624 | "output_type": "display_data" 625 | } 626 | ], 627 | "source": [ 628 | "data_lm.show_batch()" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "We can then put this in a learner object very easily with a model loaded with the pretrained weights. They'll be downloaded the first time you'll execute the following line and stored in `~/.fastai/models/` (or elsewhere if you specified different paths in your config file)." 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "data": { 654 | "text/html": [], 655 | "text/plain": [ 656 | "" 657 | ] 658 | }, 659 | "metadata": {}, 660 | "output_type": "display_data" 661 | }, 662 | { 663 | "name": "stdout", 664 | "output_type": "stream", 665 | "text": [ 666 | "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" 667 | ] 668 | } 669 | ], 670 | "source": [ 671 | "learn.lr_find()" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "metadata": {}, 678 | "outputs": [ 679 | { 680 | "data": { 681 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZgAAAEKCAYAAAAvlUMdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xd8lfXZ+PHPlU0WARJGCBtZsglLFPdCRdFaFxYcRa1tqdZRH/1Vq4+j1T51VS21qIibqlWLoCJoRRESRth7hQAJCQkkZOf6/XHuYIoJCXDus3K9X6/z8j73OPf19YRc+c5bVBVjjDHG28L8HYAxxpjQZAnGGGOMKyzBGGOMcYUlGGOMMa6wBGOMMcYVlmCMMca4whKMMcYYV1iCMcYY4wpLMMYYY1wR4e8AvCU5OVm7du3q7zCMMSaoZGZm7lPVFDc+29UEIyLbgINANVClqulHHBfgGWAccAiYrKpLnWOTgAecU/9XVV872r26du1KRkaGdwtgjDEhTkS2u/XZvqjBnKmq+xo4diFwkvMaCbwIjBSR1sCDQDqgQKaIfKSq+30QrzHGGC/wdx/MpcAM9VgEJIlIB+B84HNVLXCSyufABf4M1BhjzLFxO8Eo8JmIZIrIlHqOdwR21nmf7exraP9/EZEpIpIhIhl5eXleDNsYY8yJcjvBjFHVoXiawm4XkbFHHJd6rtGj7P/vHarTVDVdVdNTUlzpozLGGHOcXE0wqprj/DcX+AAYccQp2UCnOu/TgJyj7DfGGBMkXEswIhInIgm128B5wKojTvsI+Jl4jAKKVHU3MBc4T0RaiUgr59q5bsVqjDHG+9wcRdYO+MAzEpkI4E1VnSMitwKo6kvAbDxDlDfhGaZ8g3OsQEQeAZY4n/Wwqha4GKsxxhgvk1B5ZHJ6erraPBhjTHMzKzObyuoarhnR+biuF5HMI+coeou/hykbY4w5AbMyd/L+0mx/h1EvSzDGGBPECkoqaB0X5e8w6mUJxhhjglhBSQVt4qP9HUa9LMEYY0yQqqlRT4KxGowxxhhvKiytpEaxJjJjjDHeVVBSDmBNZMYYY7wrv7gCwJrIjDHGeFd+iSfBWBOZMcYYr6pNMFaDMcYY41UFThNZK0swxhhjvKmgpJyWLSKJDA/MX+WBGZUxxphG7QvgOTBgCcYYY4JWQXEFbeItwRhjjPGyQF6HDCzBGGNM0MovKad1XGBOsgRLMMYYE5RqapT9hypJtiaywHagrJKKqhp/h2GMMU1WVFpJdY1aE1kg27qvhDGPf8mHy3f5OxRjjGmyfGcdMkswAaxrm1jSWsfy0lebqakJjcdHG2NC3w/rkDXjPhgRCReRZSLyST3HuojIPBHJEpEFIpJW59ifRGS1iKwVkWdFRFyKj9vO6MGWvBI+W7PHjVsYY4zXFdQuE9PM+2CmAmsbOPYUMENVBwIPA48DiMgpwBhgINAfGA6c7laA4/q3p0ubWF5YsBlVq8UYYwJfoK9DBi4nGKdGchHwcgOn9APmOdvzgUudbQVigCggGogE9roVZ0R4GLeM7UFWdhELN+W7dRtjjPGa/ABfhwzcr8E8DdwDNDREawVwhbM9AUgQkTaq+h2ehLPbec1V1R/VgkRkiohkiEhGXl7eCQV6+dCOpCRE88KCTSf0OcYY4wuBvg4ZuJhgRORiIFdVM49y2l3A6SKyDE8T2C6gSkR6An2BNKAjcJaIjD3yYlWdpqrpqpqekpJyQvHGRIZz86nd+HZzPst3Fp7QZxljjNvyA3wdMnC3BjMGGC8i24C38SSJmXVPUNUcVb1cVYcA9zv7ivDUZhaparGqFgOfAqNcjBWA60Z1ITEmghetFmOMCXD5xYG9TAy4mGBU9T5VTVPVrsDVwJeqOrHuOSKSLCK1MdwHTHe2d+Cp2USISCSe2k1DAwW8Jj46gkmndGXu6r0s2pJPtQ1bNsYEqEBfhwwgwtc3FJGHgQxV/Qg4A3hcRBT4GrjdOW0WcBawEk+H/xxV/dgX8U0+pSuvLtzG1dMWERMZRu92CfRLTeTk1JYMSkuid/sEoiICt83TGNM85JdU
MLRLK3+HcVQ+STCqugBY4Gz/vs7+WXiSyZHnVwO3+CK2I7WJj2bOHWNZtDmfNbsPsHb3AT5dtYe3Fu8EICoijH4dEpl0ShcmDElr5NOMMcb7POuQBX4fjM9rMMGgY1ILrhiWdnh4m6qSvb+UFdmFrNhZyFcb8rhnVhYD05LokRLv11iNMc1PMKxDBrZUTJOICJ1ax3LxwFTuv6gfb9w8ipjIcB7812qbmGmM8bn8IJjFD5ZgjktKQjR3n9+bbzbt498rd/s7HGNMM3N4mZgAXocMLMEct+tGduHk1EQe+WQNxeVV/g7HGNOM5BcH/krKYAnmuIWHCY9c1p+9B8p5dt5Gf4djjGlGrImsGRjauRVXpXdi+jdb2bD3oL/DMcY0E7VNZK1iLcGEtHsv7ENcdAT3/jOLQxXWVGaMcV9BSQWJMREBPycvsKMLAq3jonj88gGs2FnIz/6xmANllf4OyRgT4vYVl9MmPrA7+MESjFeMG9CB568dyvKdhUx8+XsKD1X4OyRjTAgrCIKFLsESjNeMG9CBv10/jHW7D3L1tEXsc0Z5GGOMtwXDOmRgCcarzu7bjn9MTmdbfgnXTFtkzWXGGFfsK64I+BFkYAnG6047KYXpk4azZV8Jv313BTW2IrMxxotq1yGzGkwzdUrPZO4f15fP1+y1J2QaY7zqQJlnHbJAn8UPlmBcc8OYrlw6OJU/f76BBetz/R2OMSZE7CsOjkmWYAnGNSLCE5cPpHe7BKa+vZwd+Yf8HZIxJgTUTrK0JrJmrkVUONOuT0dVmfJ6BvtLbPiyMebEFJR4RqhaE5mhc5tYnrt2KFvySrjipW/ZWWA1GWPM8QuWdcjAEoxPnN4rhddvGsG+g+VMeOFbVmYX+TskY0yQyi8OjnXIwAcJRkTCRWSZiHxSz7EuIjJPRLJEZIGIpNU51llEPhORtSKyRkS6uh2rm0Z2b8P7vziF6Igwrpr2HfOt498YcxwKSipICIJ1yMA3NZipwNoGjj0FzFDVgcDDwON1js0AnlTVvsAIIOh/I/dsm8AHvziF7ilx3PjqEn760nf845utZO+3ZjNjTNPkl1SQHATrkIHLCcapkVwEvNzAKf2Aec72fOBS57p+QISqfg6gqsWqGhK/hdsmxvD2lNH85uxeHCir5JFP1nDqH+cz/vlvWLfngL/DM8YEuPzi8qAYQQbu12CeBu4Baho4vgK4wtmeACSISBugF1AoIu87zWtPiki4y7H6THx0BFPPOYk5vxnLgrvO4L4L+7CnqIzJ05eQU1jq7/CMMQEsWNYhAxcTjIhcDOSqauZRTrsLOF1ElgGnA7uAKiACOM05PhzoDkyu5x5TRCRDRDLy8vK8XALf6Jocxy2n92DGTSMoKa9i8iuLKSq1NcyMMfXLO1hOSoI1kY0BxovINuBt4CwRmVn3BFXNUdXLVXUIcL+zrwjIBpap6hZVrQI+BIYeeQNVnaaq6aqanpKS4mJR3NenfSJ/u34YW/eVMGVGBuVV1f4OyRgTYCqra8gvqaBtc08wqnqfqqapalfgauBLVZ1Y9xwRSRaR2hjuA6Y720uAViJSmzXOAta4FWugOKVnMk9dOYjvtxbYQpnGmB+pHaJsNZgGiMjDIjLeeXsGsF5ENgDtgEcBVLUaT/PYPBFZCQjwd1/H6g+XDu7I7y7swydZu/n7f7b4OxxjTADJPVgGQNuEGD9H0jQRvriJqi4AFjjbv6+zfxYwq4FrPgcG+iC8gHPL2O5kbCvg2XkbuWxIR9olBscPkzHGXXkHPcvENPsmMnP8RIQHLupHZbXyx0/X+TscY0yAyHUSjDWRmRPSNTmOm0/rxvvLdpG5vcDf4RhjAkDuAU+CsYmW5oTdfmZP2iVG89BHa6i2Dn9jmr284jJaxUYGxTIxYAkmoMVFR/A/4/qyclcR72Xs9Hc4xhg/yz1QHjQd/GAJJuCNH5RKepdW/GnuepuAaUwzl3uwnLaJwdE8BpZgAp6I8ND4k9l/qILn5m30dzjGGD/KO1hOSpD0v4AlmKDQv2NLfjI0jRnfbbeVl41pplTVk2CsBmO87Y5ze4HA019YLcaY5qiotJKK6hqrwRjvS01qwaTRXXh/aTYb9h70dzjGGB87PMkyiCZeW4IJIr84oydxURH8ac56f4dijPGx3CCbxQ+WYIJKq7gobj2jB1+s3UvGNpt8aUxzkhdks/jBEkzQuWFMV1ISovnjnHWo2uRLY5qLHxa6tARjXBIbFcGvzz6JJdv2M29trr/DMcb4SO6BclpEhhMf7ZM1ir3CEkwQunp4J7qnxHHHO8v5ZuM+f4djjPGBvGLPkyxFxN+hNJklmCAUGR7GzJtG0rFVCya/sph3l9gyMsaEOs8yMcHTPAaWYIJWalIL3rt1NKN7tOGef2bx1Nz11idjTAjLPVgWVB38YAkmqCXERDJ98nCuHt6J5+dv4pdvLuNgma1XZkwoyjtoNRjjY5HhYTx++QDuu7APc1bv4eLnviEru9DfYRljvKisspoDZVVBNckSLMGEBBHhltN78PaUUVRW1XDFi9/yj2+2WpOZMSHi8ByYIFomBnyQYEQkXESWicgn9RzrIiLzRCRLRBaISNoRxxNFZJeIPO92nKFgeNfWzJ56Gqf3assjn6zhptcyyD1Q5u+wjDEn6PCjkoNooUvwTQ1mKrC2gWNPATNUdSDwMPD4EccfAb5yMbaQkxQbxd9/NoyHLunHwk37OPcvX/Phsl1WmzEmiOU5kyytBlOHUyO5CHi5gVP6AfOc7fnApXWuHQa0Az5zM8ZQJCJMHtON2VNPo0dKHL95ZzlTXs88PBPYGBNcfljo0hJMXU8D9wA1DRxfAVzhbE8AEkSkjYiEAX8G7nY5vpDWIyWe9249hfvH9eXrDXlc9vxCqqob+iqMMYEq92A5YQJt4izBACAiFwO5qpp5lNPuAk4XkWXA6cAuoAr4BTBbVY86g1BEpohIhohk5OXleSv0kBIeJvx8bHcenTCAnKIyNuwt9ndIxphjlHugnDbx0YSHBc8sfgA3F7UZA4wXkXFADJAoIjNVdWLtCaqaA1wOICLxwBWqWiQio4HTROQXQDwQJSLFqvq7ujdQ1WnANID09HTrZDiK4V1bAbAiu5B+qYl+jsYYcyzyioNvDgy4WINR1ftUNU1VuwJXA1/WTS4AIpLsNIcB3AdMd669TlU7O9fehWcgwH8lF3NsOreOJSk2kuU7bI6MMcEmGGfxgx/mwYjIwyIy3nl7BrBeRDbg6dB/1NfxNBciwqC0JFbYJExjgk4wzuIHd5vIDlPVBcACZ/v3dfbPAmY1cu2rwKuuBdeMDO6UxHNfbqSkvIq4IFry25jmrLpG2VdcQduE4JrFDzaTv1kZ3CmJGoWVu4r8HYoxpokKSiqorlFrIjOBbVCnJABW7LRmMmOCxeE5MJZgTCBrHRdF59axLLcEY0zQOPyo5CCbZAmWYJqdQZ2SrAZjTBA5vA5ZvPXBmAA3uFMSOUVltgimMUHi8ErK1kRmAt3gTi0BrJnMmCCRd7CchOgIWkSF+zuUY2YJppk
5ObUlEWFiCcaYIJF3sDzolumv1aQEIyI9RCTa2T5DRH4tIknuhmbcEBMZTp8OCTbh0pggkXuwLChHkEHTazD/BKpFpCfwD6Ab8KZrURlXDUpLImtnETU1tnybMYEu92A5KUE4yRKanmBqVLUKz5L6T6vqHUAH98IybhrcKYmD5VVs2XdsKyuv3X2AQxVVLkVljDmSqpJ7IDiXiYGmJ5hKEbkGmATUPvo40p2QjNsGOxMul+9s2oz+quoa/veTNVz4zH849/++5vM1e90MzxjjKCqtpLSymg4tQ7sGcwMwGnhUVbeKSDdgpnthGTd1T4knPjqC5Tv3N3ruvuJyJv7je17+Zis/GZZGfHQEP5+RwZQZGeQUlvogWmOar5xCz3SC1KQWfo7k+DRpxUNVXQP8GkBEWgEJqvqEm4EZ94SHCQPTWrLC6YfZsq+YFTuL2JxXTJv4aNJataBjUgsOllVxxzvL2X+ogv/76SAuH5pGZXUNL/9nK8/M28C5//cVT1wxkEsGpfq7SMaEpNo/4oK1BtOkBCMiC4DxzvnLgTwR+UpV73QxNuOiQZ2S+NtXmxn0h884WO7pVwkTOLLfP61VC/552yn07+iZPxMZHsZtZ/Tg4oEduPPd5dzxznLioyM4s0/bo96vpkb598rd5B4s54ZTuhIWZE/mM8Yfdhd5EkzHUK7BAC1V9YCI3Ay8oqoPikiWm4EZd108sANLt++nZ9t4BnVKYnCnJHqkxFNUWsmu/aVk7z9EYWklF/ZvT1Js1I+u79Q6llduGME10xZx2xuZvHHzSIZ1aV3vvb7fks9js9eyItvT55OxrYC/XDWYmMjgmzhmjC/lFJURGS4kxwdnJ39TE0yEiHQAfgrc72I8xkdOTm3JO7eM/tH+1nFRtI6LYkBay0Y/Iz46glduGM6VL33HDa8s4b1bT6F3+wTA8wyLrOxCXliwmc/X7KVDyxj+fOUg9h+q4NHZa9k9bREvT0oP2n84xvhCTmEp7RJjgrbG39QE8zAwF1ioqktEpDuw0b2wTLBIjo9mxo0juOLFb/nZ9O+5ZWwPFm8t4NvN+zhQVkV8dAR3n9+bm07tdrjGktYqlt+8s4wJLyzklckj6Nk23s+lMCYw7S4sI7VlcDaPAYhqaEy2S09P14yMDH+H0Wyt23OAn770HQfKquiY1IJTeyYz5qRkxp6UXG8T2/Kdhdz82hKqa5Qv7jydNlaTMeZHTv3jl6R3acXTVw9x7R4ikqmq6W58dlM7+dOA54AxgALfAFNVNduNoEzw6dM+kTm/GUtFVQ1d2sQicvQq/eBOSbz581GMe+Y/PDl3PU9cMdBHkRoTHKprlL0HyugQpB380PR5MK8AHwGpQEfgY2dfo0QkXESWicgn9RzrIiLzRCRLRBY4iQwRGSwi34nIaufYVU2M0/hRalILuibHNZpcavVql8DkU7ryTsZOsmxtNGP+y77iciqrldQgHaIMTU8wKar6iqpWOa9XgZQmXjsVWNvAsaeAGao6EE8/z+PO/kPAz1T1ZOAC4GlbXDM0TT3nJNrERfPgR6ttbTRj6qidAxOskyyh6Qlmn4hMdGoj4SIyEchv7CKnRnIR8HIDp/QD5jnb84FLAVR1g6pudLZzgFyantBMEEmIieR3F/Zh2Y5C3l+2y9/hGBMwdhd5ZvF3COJO/qYmmBvxDFHeA+wGfoJn+ZjGPA3cA9Q0cHwFcIWzPQFIEJE2dU8QkRFAFLC5ibGaIHP5kI4M6ZzEE5+u40BZpb/DMSYg/FCDCfEmMlXdoarjVTVFVduq6mXA5Ue7RkQuBnJVNfMop90FnC4iy4DTgV3A4eV6nbk3rwM3qOqPkpSITBGRDBHJyMvLa0pRTAAKCxP+MP5k8kvK+cvnG8g7WE5OYSnb9pWwYe9BVmYXkbGtgIWb9vHtpn1UVjf090poKi6vYsm2Asoqq/0divGhnMIyWkSG07JF8K4r3NR5MPW5E08NpSFjgPEiMg6IARJFZKaqTqw9wWn+uhxAROKBK1S1yHmfCPwbeEBVF9V3A1WdBkwDzzDlEyiL8bOBaUlcld6JVxZu45WF24567jl92/HCdUOJigj9B7LuyD/Eja8tYVNuMQkxEYzr34FLB6cysnsbBNh/qIJ9xRWUVFQxKC2J8CCdkGd+bHdRKalJMU0eNBOITiTBHLXUqnofcB94noIJ3FU3uTj7k4ECp3ZyHzDd2R8FfIBnAMB7JxCjCSK/v6QfA9OSqK6pITI8jKgIzys6IpzoiDCiI8JYtrOQJz5dxy/eyOSv1w0lOiJ0l5tZsq2AW17PpLpGeXRCfzK37+eTrBzeydhJXFQ4ZVU1VNcZGHFS23juPr835/ZrF9S/lIxHTlFZUHfww4klmOOqMYjIw0CGqn4EnAE8LiIKfA3c7pz2U2As0EZEJjv7Jqvq8hOI1wS42KgIrh3Z+ajnjOzehrjoCP7fh6u4beZSXpwYmknmg2XZ3DtrJR1bteAfk9LpnhLPdSO7UHpZNfPW7eX7LQW0bBFJcnwUyQnRlFXW8ML8TUx5PZOhnZO494I+jOzepvEbmYCVU1hK797BPbbpqDP5ReQg9ScSAVqo6okkKK+ymfzNy8xF23ngw1Wc2TuFFycOC5mFM1WVZ+dt4i9fbGBkt9b87fph9a6EUJ+q6hpmZWbz9Bcb2XOgjKeuHMRPhqW5HLFxQ0VVDb3/36f8+qyTuOPcXq7ey82Z/EdtxFbVBFVNrOeVEEjJxTQ/E0d14bEJA5i/Po8Lnv6aOat2E+zLHtXUKA99tJq/fLGBy4d25PWbRjY5uQBEhIdx9YjOLLj7DIZ2TuJPc9bZI66D1N4DZagG7zL9tUK/l9SErGtHdmbGjSOIDA/j1plLuepvi1ixMzhXBKisruGOd5fz2nfbuenUbjz1k0HHPYghJjKc/xnXl9yD5Y0OmDCB6fCDxoJ4iDJYgjFBbmyvFD6dehqPTujPln3FXPrXhby+aLu/wzompRXVTJmRwb+W53D3+b154KK+J7w8e3rX1pzbrx0vLdhMQUmFlyI1vpJTVPsky+CuwVgzlwl6EeFhXDeyC+MHpTJp+mL+/vUWJo7sHLAjqT5akcPc1XvYXVjK7qIycg+WU6OekWLXjezitfvcc35vzn/6a577ciMPXnKy1z7XuC+n0DOLP5gnWYLVYEwISYiJ5NqRXdhRcIilOwKzqeyjFTn8+q1lLN2+n+iIcEb3aMNtp/fgjZtHejW5AJzULoGfpndi5qLt7Mg/5NXPNu7aXVRKUmwksVHBXQcI7uiNOcL5J7fjgQ/D+HDZLoZ1aeXvcP7Loi353PXuCkZ0bc2Mm0b4ZOTbb87pxYfLd/Hnz9fzjIvPFDHelVNYFvTNY2A1GBNiEmIiOadvOz7JyvHKkjIl5VXMWbWbiqoT+6yNew8yZUYGnVq3YNrPfDesun3LGG4c041/Lc8hY1uBT+5pTlxOYWlQL9NfyxKMCTkThnRk/6FKvt5wYuvTVdcov3xzKbfOXMqlf13I6pyi4/qc3ANlTH5lCVER4bx6w4hjGnrsDbee0YN2id
FcPW0Rf5yzjtKKH9Y0U1W+35LPlBkZ3PnucnIPlPk0NlO/3SEwix+sicyEoLG9UmgVG8kHy3Zxdt92x/05j89ey/z1eVw/qgufrtrDpc8v5PYze3L7mT2POoS4rLKa1TlFrNhZxIrsQr7bnE9xeRXvTBlNp9axxx3P8UqMieTTqWN5bPZaXlywmY9X5PDwpSdTXQMvLtjE0h2FtImL4mB5FZ+v2cvvLuzDNcM7n/BINnN8SsqrKCqtDPohymAJxoSgyPAwLhmUyjtLdnKwrJKEmGNfjfbdJTt5+ZutTBrdhT9c2p/fnteLP3y8hmfmbWTu6j08f+1QeraN/9F1S7YVMGVGBvsPeR470D4xhsGdkrjp1G4MSGt5wmU7Xq3jog7P7H/gw1Xc+Kpn1YtOrVvwyGX9uXJYGruLyvif91dy/wer+GDpLp64YgA92yb4LebmarczRDk1BPpgjrpUTDCxpWJMXUt37OfyF77lyZ8M5Mr0Tsd07eKtBVz38iJGdmvDqzcMJyL8h9rKF2v2cu8/s6ioquGZawZzVp8fakhzV+/h128tIzWpBfde0IfBnZJoH4Dt6BVVNbyXuZOEmEjG9W//X+VTVWZlZvPo7LVUVtXw4sRhjO0V3OthBZuvN+Txs+mLefeW0Yzo1tr1+/ltqRhjgtWQTkl0aRPLh8uP7SmZy3bs59aZmXRqFctfrx36X798Ac7p146PfnUqXZJjuem1DP46fxOqysxF27ltZiZ9OiQy69bRXNC/fUAmF4CoiB/mDR1ZPhHhyvROzP3NWDq3iePGV5fw/tJsP0XaPO0+PMkyMH9+joU1kZmQJCJcOrgjz325kT1FZUf9ZV9eVc3slbt57dvtLN9ZSKvYSF6elE7L2Pqb1jomteC9W07hd+9n8eTc9cxeuZvVOQc4s3cKf71uaNDPXQBolxjDO7eM4tbXM7nz3RXsPVDOrad3D9jJq6FkV2EZIgTsHyjHIvj/JRjTgMsGp/LsvI08++VGfpreiR4pcSTERKKqbMs/ROb2/WRuL+Cz1XvJL6mge3IcD13Sj8uHpZHYSL9Ni6hwnr5qMCenJvLEp+u4clgaj10+gMjw0GkUSIyJ5JUbhnPXe1n8cc46CkrKuf+ifv4OK+TtLiwlJT46JH6WLMGYkNU9JZ6z+rTlze938Ob3OwBolxhNdY2yr9izPldCTARjeiRz3ajOjOmRfEwjp0SEKWN7cNXwziTGRITkX/fREeE8c9VgEmMi+Pt/tnLeye0Z3tX9foHmLFSGKIMlGBPi/v6zdLbnl7Apt5hNecVsyi1GEIZ1acWwLq04qW38CQ/HDeZnpjdFWJjwwEX9+GLtXh6bvZb3bzslJJNpoMgpKqVP+9AYvWcJxoS08DChe0o83VPiOc/fwQSxFlHh3HluL+7950rmrt7DBf07+DukkKSq5BSWcmbvtv4OxSuCv5HPGOMTVwxN46S28fxpznqvLMNjfiy/pIKyypqQaSKzBGOMaZKI8DDuvaAPW/aV8M6Snf4OJyRtySsBoEdKnJ8j8Q7XE4yIhIvIMhH5pJ5jXURknohkicgCEUmrc2ySiGx0XpPcjtMY07iz+7ZlRNfWPP3FRkrK7XHM3rYptxiAHik/XiUiGPmiBjMVWNvAsaeAGao6EHgYeBxARFoDDwIjgRHAgyISWGuvG9MMiQi/G9eHfcXl/P0/W/wdTsjZnFdMTGQYHUOkiczVTn6nRnIR8ChwZz2n9APucLbnAx862+cDn6tqgfM5nwMXAG+5Ga8xpnFDO7fiwv7tee7LTcxeuZuT2ibQo208Azu25Oy+bW2E2QnYlFtM9+QTH9kYKNweRfY0cA/Q0Ji7FcAVwDPABCBBRNoAHYG6jbzZzj5jTAB4dMIAuiXHsWHvQVblFDF71W5U4Zax3fndhX0syRwKMbnDAAAVH0lEQVSnzXnFDO0cOo01riUYEbkYyFXVTBE5o4HT7gKeF5HJwNfALqAKqO+n80ercorIFGAKQOfOnb0QtTGmKVrHRXHPBX0Ovy+rrObRf6/lb19vIbFFJLef2dOP0QWn0opqdhWWcuWwY1ucNZC5WYMZA4wXkXFADJAoIjNVdWLtCaqaA1wOICLxwBWqWiQi2cAZdT4rDVhw5A1UdRowDTyrKbtUDmNMI2Iiw/nD+JM5WFbJk3PXk9gikutHdfF3WEFly75iVKn3MRDByrVOflW9T1XTVLUrcDXwZd3kAiAiySJSG8N9wHRney5wnoi0cjr3z3P2GWMCVFiY8OSVgzinb1t+/69VfLjs2Faybu4OjyBrGxpDlMEP82BE5GERGe+8PQNYLyIbgHZ4BgPgdO4/AixxXg/XdvgbYwJXZHgYz187lJHdWvPb91Ywf12uv0MKGpvzSggT6NomdBKMPXDMGON1xeVVXD3tOzbnlvDOLaMYmJbk75AC3u1vLmXVriK+uvtMn97XHjhmjAkq8dERTJ88nDbxUdz46hK255f4O6SAtzm3mJ4hMsGyliUYY4wr2ibE8NqNI6iqUSZNX0x+cbm/QwpY1TXKln0l9AihDn6wBGOMcVGPlHj+MSmd3UVl3PRaBocqbHmZ+mTvP0RFVU3IrEFWyxKMMcZVw7q05tlrhpCVXcjNlmTqtTnPM4IslIYogyUYY4wPnH9ye/7800Es2pLPja8usSRzhFBb5LKWJRhjjE9MGJLGX64azOKtBUyevsRWY65jc24JyfFRJMVG+TsUr7IEY4zxmUsHd+SZq4eQuWM/k6YvptiSDACb8orpHmK1F7AEY4zxsUsGpfLcNUNYtrOQJ+es83c4fqeqbMotDrn+F7AEY4zxg3EDOnDpoFT+uXRXs6/F5JdUUFRaGXL9L2AJxhjjJ9eP7kJxeRUfNPM1yzbnhuYIMrAEY4zxk8GdkujfMZHXv9tGqCxZdTw25dWOIAutOTBgCcYY4yciws9GdWXD3mIWb22+a9luzi2hRWQ4qS1D4zHJdVmCMcb4zSWDUmnZIpIZi7b7OxS/2ZxXTPeUuJB5THJdlmCMMX7TIiqcK4elMXfVHnIPlPk7HL8I1RFkYAnGGONn143qQlWN8tbinf4OxedqH5MciiPIwBKMMcbPuiXHMbZXCm8u3k5ldY2/w/GpTSE8ggwswRhjAsD1o7qw90A5n6/Z6+9QfGrlriIA+qe29HMk7rAEY4zxu7P6tKVLm1junZXFpyt3+zscn1m5q5CWLSLp1Dr0RpCBJRhjTAAIDxPeuHkk3dvGc9sbS/nDx6upqAr95rKs7CIGprVEJPRGkIEPEoyIhIvIMhH5pJ5jnUVkvnM8S0TGOfsjReQ1EVkpImtF5D634zTG+Fdaq1jeu2U0k0/pyisLt/HTv33HrsJSf4flmrLKatbvOciAjqHZPAa+qcFMBdY2cOwB4F1VHQJcDbzg7L8SiFbVAcAw4BYR6epynMYYP4uKCOOh8SfzwnVD2ZRbzC9mZvo7JNes23OQqhplYJolmOMiImnARcDLDZyiQKKz3RLIqbM/TkQigBZABXDAxVCNMQFk3IAO3H1+b1ZkF7Eyu8jf4bjic
[base64 PNG data omitted: learning-rate finder plot produced by the learn.recorder.plot(skip_end=15) cell below]\n", 682 | "text/plain": [ 683 | "
" 684 | ] 685 | }, 686 | "metadata": { 687 | "needs_background": "light" 688 | }, 689 | "output_type": "display_data" 690 | } 691 | ], 692 | "source": [ 693 | "learn.recorder.plot(skip_end=15)" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "metadata": {}, 700 | "outputs": [ 701 | { 702 | "data": { 703 | "text/html": [ 704 | "Total time: 16:47

\n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | "
epoch  train_loss  valid_loss  accuracy
1      4.232034    4.060273    0.292894
\n" 718 | ], 719 | "text/plain": [ 720 | "" 721 | ] 722 | }, 723 | "metadata": {}, 724 | "output_type": "display_data" 725 | } 726 | ], 727 | "source": [ 728 | "learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "learn.save('fit_head')" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "metadata": {}, 744 | "outputs": [], 745 | "source": [ 746 | "learn.load('fit_head');" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "To complete the fine-tuning, we can then unfeeze and launch a new training." 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": null, 759 | "metadata": {}, 760 | "outputs": [], 761 | "source": [ 762 | "learn.unfreeze()" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "metadata": {}, 769 | "outputs": [ 770 | { 771 | "data": { 772 | "text/html": [ 773 | "Total time: 3:08:33

\n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | "
epoch  train_loss  valid_loss  accuracy
1      3.958489    3.885153    0.310139
2      3.871605    3.814774    0.319821
3      3.804589    3.767966    0.325793
4      3.771248    3.729666    0.330175
5      3.677534    3.699244    0.333532
6      3.644140    3.674071    0.336564
7      3.603597    3.655099    0.338747
8      3.524271    3.641979    0.340568
9      3.505476    3.636194    0.341246
10     3.461232    3.635963    0.341371
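A rough way to read these language-model numbers, assuming the reported losses are the usual cross-entropy in nats: exponentiating the validation loss gives the model's perplexity.

```python
import math

valid_loss = 3.635963              # final valid_loss from the table above
perplexity = math.exp(valid_loss)  # cross-entropy (nats) -> perplexity
print(round(perplexity, 1))        # ~37.9: roughly as uncertain as picking among ~38 equally likely tokens
```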
\n" 841 | ], 842 | "text/plain": [ 843 | "" 844 | ] 845 | }, 846 | "metadata": {}, 847 | "output_type": "display_data" 848 | } 849 | ], 850 | "source": [ 851 | "learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))" 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": {}, 858 | "outputs": [], 859 | "source": [ 860 | "learn.save('fine_tuned')" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "How good is our model? Well let's try to see what it predicts after a few given words." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": {}, 874 | "outputs": [], 875 | "source": [ 876 | "learn.load('fine_tuned');" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": null, 882 | "metadata": {}, 883 | "outputs": [], 884 | "source": [ 885 | "TEXT = \"I liked this movie because\"\n", 886 | "N_WORDS = 40\n", 887 | "N_SENTENCES = 2" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "metadata": {}, 894 | "outputs": [ 895 | { 896 | "name": "stdout", 897 | "output_type": "stream", 898 | "text": [ 899 | "I liked this movie because of the cool scenery and the high level of xxmaj british hunting . xxmaj the only thing this movie has going for it is the horrible acting and no script . xxmaj the movie was a big disappointment . xxmaj\n", 900 | "I liked this movie because it was one of the few movies that made me laugh so hard i did n't like it . xxmaj it was a hilarious film and it was very entertaining . \n", 901 | "\n", 902 | " xxmaj the acting was great , i 'm\n" 903 | ] 904 | } 905 | ], 906 | "source": [ 907 | "print(\"\\n\".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "We have to save not only the model, but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word." 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": null, 920 | "metadata": {}, 921 | "outputs": [], 922 | "source": [ 923 | "learn.save_encoder('fine_tuned_enc')" 924 | ] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "metadata": {}, 929 | "source": [ 930 | "## Classifier" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time." 
938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": null, 943 | "metadata": {}, 944 | "outputs": [], 945 | "source": [ 946 | "path = untar_data(URLs.IMDB)" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": null, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)\n", 956 | " #grab all the text files in path\n", 957 | " .split_by_folder(valid='test')\n", 958 | " #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)\n", 959 | " .label_from_folder(classes=['neg', 'pos'])\n", 960 | " #label them all with their folders\n", 961 | " .databunch(bs=bs))\n", 962 | "\n", 963 | "data_clas.save('data_clas.pkl')" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": null, 969 | "metadata": {}, 970 | "outputs": [], 971 | "source": [ 972 | "data_clas = load_data(path, 'data_clas.pkl', bs=bs)" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": null, 978 | "metadata": {}, 979 | "outputs": [ 980 | { 981 | "data": { 982 | "text/html": [ 983 | "\n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | "
texttarget
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rulespos
xxbos xxmaj titanic directed by xxmaj james xxmaj cameron presents a fictional love story on the historical setting of the xxmaj titanic . xxmaj the plot is simple , xxunk , or not for those who love plots that twist and turn and keep you in suspense . xxmaj the end of the movie can be figured out within minutes of the start of the film , but the lovepos
xxbos xxmaj here are the matches . . . ( adv . = advantage ) \\n\\n xxmaj the xxmaj warriors ( xxmaj ultimate xxmaj warrior , xxmaj texas xxmaj tornado and xxmaj legion of xxmaj doom ) v xxmaj the xxmaj perfect xxmaj team ( xxmaj mr xxmaj perfect , xxmaj ax , xxmaj smash and xxmaj crush of xxmaj demolition ) : xxmaj ax is the first to goneg
xxbos i felt duty bound to watch the 1983 xxmaj timothy xxmaj dalton / xxmaj zelah xxmaj clarke adaptation of \" xxmaj jane xxmaj eyre , \" because i 'd just written an article about the 2006 xxup bbc \" xxmaj jane xxmaj eyre \" for xxunk . \\n\\n xxmaj so , i approached watching this the way i 'd approach doing homework . \\n\\n i was irritated at firstpos
xxbos xxmaj no , this is n't a sequel to the fabulous xxup ova series , but rather a remake of the events that occurred after the death of xxmaj xxunk ( and the disappearance of xxmaj woodchuck ) . xxmaj it is also more accurate to the novels that inspired this wonderful series , which is why characters ( namely xxmaj orson and xxmaj xxunk ) are xxunk ,pos
\n" 1008 | ], 1009 | "text/plain": [ 1010 | "" 1011 | ] 1012 | }, 1013 | "metadata": {}, 1014 | "output_type": "display_data" 1015 | } 1016 | ], 1017 | "source": [ 1018 | "data_clas.show_batch()" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "We can then create a model to classify those reviews and load the encoder we saved before." 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": null, 1031 | "metadata": {}, 1032 | "outputs": [], 1033 | "source": [ 1034 | "learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)\n", 1035 | "learn.load_encoder('fine_tuned_enc')" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "metadata": {}, 1042 | "outputs": [], 1043 | "source": [ 1044 | "learn.lr_find()" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "learn.recorder.plot()" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": null, 1059 | "metadata": {}, 1060 | "outputs": [ 1061 | { 1062 | "data": { 1063 | "text/html": [ 1064 | "Total time: 03:40

\n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | "
epoch  train_loss  valid_loss  accuracy
1      0.310078    0.197204    0.926960
\n" 1078 | ], 1079 | "text/plain": [ 1080 | "" 1081 | ] 1082 | }, 1083 | "metadata": {}, 1084 | "output_type": "display_data" 1085 | } 1086 | ], 1087 | "source": [ 1088 | "learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": null, 1094 | "metadata": {}, 1095 | "outputs": [], 1096 | "source": [ 1097 | "learn.save('first')" 1098 | ] 1099 | }, 1100 | { 1101 | "cell_type": "code", 1102 | "execution_count": null, 1103 | "metadata": {}, 1104 | "outputs": [], 1105 | "source": [ 1106 | "learn.load('first');" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "metadata": {}, 1113 | "outputs": [ 1114 | { 1115 | "data": { 1116 | "text/html": [ 1117 | "Total time: 04:03

\n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | "
epoch  train_loss  valid_loss  accuracy
1      0.255913    0.169186    0.937800
\n" 1131 | ], 1132 | "text/plain": [ 1133 | "" 1134 | ] 1135 | }, 1136 | "metadata": {}, 1137 | "output_type": "display_data" 1138 | } 1139 | ], 1140 | "source": [ 1141 | "learn.freeze_to(-2)\n", 1142 | "learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))" 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "code", 1147 | "execution_count": null, 1148 | "metadata": {}, 1149 | "outputs": [], 1150 | "source": [ 1151 | "learn.save('second')" 1152 | ] 1153 | }, 1154 | { 1155 | "cell_type": "code", 1156 | "execution_count": null, 1157 | "metadata": {}, 1158 | "outputs": [], 1159 | "source": [ 1160 | "learn.load('second');" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": null, 1166 | "metadata": {}, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/html": [ 1171 | "Total time: 05:42

\n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | "
epoch  train_loss  valid_loss  accuracy
1      0.223174    0.165679    0.939600
\n" 1185 | ], 1186 | "text/plain": [ 1187 | "" 1188 | ] 1189 | }, 1190 | "metadata": {}, 1191 | "output_type": "display_data" 1192 | } 1193 | ], 1194 | "source": [ 1195 | "learn.freeze_to(-3)\n", 1196 | "learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "code", 1201 | "execution_count": null, 1202 | "metadata": {}, 1203 | "outputs": [], 1204 | "source": [ 1205 | "learn.save('third')" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "execution_count": null, 1211 | "metadata": {}, 1212 | "outputs": [], 1213 | "source": [ 1214 | "learn.load('third');" 1215 | ] 1216 | }, 1217 | { 1218 | "cell_type": "code", 1219 | "execution_count": null, 1220 | "metadata": {}, 1221 | "outputs": [ 1222 | { 1223 | "data": { 1224 | "text/html": [ 1225 | "Total time: 15:17

\n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | "
epoch  train_loss  valid_loss  accuracy
1      0.240424    0.155204    0.943160
2      0.217462    0.153421    0.943960
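Taken together, the classifier cells above follow a gradual-unfreezing schedule with discriminative learning rates; restated in one block (same calls and hyperparameters as in the notebook):

```python
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')                       # reuse the fine-tuned LM encoder

learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))              # 1) train only the new classifier head

learn.freeze_to(-2)                                        # 2) also train the last layer group of the encoder
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))

learn.freeze_to(-3)                                        # 3) unfreeze one more layer group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

learn.unfreeze()                                           # 4) finally fine-tune the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3), moms=(0.8, 0.7))
```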
\n" 1245 | ], 1246 | "text/plain": [ 1247 | "" 1248 | ] 1249 | }, 1250 | "metadata": {}, 1251 | "output_type": "display_data" 1252 | } 1253 | ], 1254 | "source": [ 1255 | "learn.unfreeze()\n", 1256 | "learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))" 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "code", 1261 | "execution_count": null, 1262 | "metadata": {}, 1263 | "outputs": [ 1264 | { 1265 | "data": { 1266 | "text/plain": [ 1267 | "(Category pos, tensor(1), tensor([7.5928e-04, 9.9924e-01]))" 1268 | ] 1269 | }, 1270 | "execution_count": null, 1271 | "metadata": {}, 1272 | "output_type": "execute_result" 1273 | } 1274 | ], 1275 | "source": [ 1276 | "learn.predict(\"I really loved that movie, it was awesome!\")" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "code", 1281 | "execution_count": null, 1282 | "metadata": {}, 1283 | "outputs": [], 1284 | "source": [] 1285 | } 1286 | ], 1287 | "metadata": { 1288 | "kernelspec": { 1289 | "display_name": "Python 3", 1290 | "language": "python", 1291 | "name": "python3" 1292 | } 1293 | }, 1294 | "nbformat": 4, 1295 | "nbformat_minor": 2 1296 | } 1297 | -------------------------------------------------------------------------------- /lessons/lesson4-tabular.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tabular models" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from fastai.tabular import *" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "Tabular data should be in a Pandas `DataFrame`." 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "path = untar_data(URLs.ADULT_SAMPLE)\n", 33 | "df = pd.read_csv(path/'adult.csv')" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "dep_var = 'salary'\n", 43 | "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']\n", 44 | "cont_names = ['age', 'fnlwgt', 'education-num']\n", 45 | "procs = [FillMissing, Categorify, Normalize]" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)\n", 64 | " .split_by_idx(list(range(800,1000)))\n", 65 | " .label_from_df(cols=dep_var)\n", 66 | " .add_test(test)\n", 67 | " .databunch())" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "data": { 77 | "text/html": [ 78 | "\n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " 
\n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | "
workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | target
Private | HS-grad | Never-married | Sales | Not-in-family | White | False | -1.2158 | 1.1004 | -0.4224 | <50k
? | HS-grad | Widowed | ? | Not-in-family | White | False | 1.8627 | 0.0976 | -0.4224 | <50k
Self-emp-not-inc | HS-grad | Never-married | Craft-repair | Own-child | Black | False | 0.0303 | 0.2092 | -0.4224 | <50k
Private | HS-grad | Married-civ-spouse | Protective-serv | Husband | White | False | 1.5695 | -0.5938 | -0.4224 | <50k
Private | HS-grad | Married-civ-spouse | Handlers-cleaners | Husband | White | False | -0.9959 | -0.0318 | -0.4224 | <50k
Private | 10th | Married-civ-spouse | Farming-fishing | Wife | White | False | -0.7027 | 0.6071 | -1.5958 | <50k
Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 0.1036 | -0.0968 | -0.4224 | <50k
Private | Some-college | Married-civ-spouse | Exec-managerial | Own-child | White | False | -0.7760 | -0.6653 | -0.0312 | >=50k
State-gov | Some-college | Never-married | Tech-support | Own-child | White | False | -0.8493 | -1.4959 | -0.0312 | <50k
Private | 11th | Never-married | Machine-op-inspct | Not-in-family | White | False | -1.0692 | -0.9516 | -1.2046 | <50k
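The preprocessed values above come from `procs = [FillMissing, Categorify, Normalize]`: `education-num_na` is the missing-value flag added by `FillMissing`, the categorical columns are encoded by `Categorify`, and the continuous columns (`age`, `fnlwgt`, `education-num`) appear as z-scores because `Normalize` standardizes them. A minimal sketch of that last step, assuming standard per-column scaling:

```python
# Standardize the continuous columns roughly the way Normalize does (a sketch, not the library internals)
means, stds = df[cont_names].mean(), df[cont_names].std()
age_z = (df['age'] - means['age']) / stds['age']   # e.g. -1.2158 means about 1.2 std devs below the mean age
```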
\n" 222 | ], 223 | "text/plain": [ 224 | "" 225 | ] 226 | }, 227 | "metadata": {}, 228 | "output_type": "display_data" 229 | } 230 | ], 231 | "source": [ 232 | "data.show_batch(rows=10)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "learn = tabular_learner(data, layers=[200,100], metrics=accuracy)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/html": [ 252 | "Total time: 00:03

\n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | "
epoch  train_loss  valid_loss  accuracy
1      0.354604    0.378520    0.820000
\n" 266 | ], 267 | "text/plain": [ 268 | "" 269 | ] 270 | }, 271 | "metadata": {}, 272 | "output_type": "display_data" 273 | } 274 | ], 275 | "source": [ 276 | "learn.fit(1, 1e-2)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "## Inference" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "row = df.iloc[0]" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/plain": [ 303 | "(Category >=50k, tensor(1), tensor([0.4402, 0.5598]))" 304 | ] 305 | }, 306 | "execution_count": null, 307 | "metadata": {}, 308 | "output_type": "execute_result" 309 | } 310 | ], 311 | "source": [ 312 | "learn.predict(row)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [] 321 | } 322 | ], 323 | "metadata": { 324 | "kernelspec": { 325 | "display_name": "Python 3", 326 | "language": "python", 327 | "name": "python3" 328 | } 329 | }, 330 | "nbformat": 4, 331 | "nbformat_minor": 2 332 | } 333 | -------------------------------------------------------------------------------- /lessons/lesson7-human-numbers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Human numbers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from fastai.text import *" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "bs=64" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "## Data" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/plain": [ 43 | "[PosixPath('/home/ubuntu/.fastai/data/human_numbers/train.txt'),\n", 44 | " PosixPath('/home/ubuntu/.fastai/data/human_numbers/valid.txt')]" 45 | ] 46 | }, 47 | "execution_count": null, 48 | "metadata": {}, 49 | "output_type": "execute_result" 50 | } 51 | ], 52 | "source": [ 53 | "path = untar_data(URLs.HUMAN_NUMBERS)\n", 54 | "path.ls()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'" 75 | ] 76 | }, 77 | "execution_count": null, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "train_txt = readnums('train.txt'); train_txt[0][:80]" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "data": { 93 | "text/plain": [ 94 | "' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'" 95 | ] 96 | }, 97 | "execution_count": null, 98 | "metadata": {}, 99 | "output_type": "execute_result" 100 | } 101 | ], 102 | "source": [ 103 | "valid_txt = 
readnums('valid.txt'); valid_txt[0][-80:]" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "train = TextList(train_txt, path=path)\n", 113 | "valid = TextList(valid_txt, path=path)\n", 114 | "\n", 115 | "src = ItemLists(path=path, train=train, valid=valid).label_for_lm()\n", 116 | "data = src.databunch(bs=bs)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'" 128 | ] 129 | }, 130 | "execution_count": null, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "train[0].text[:80]" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "13017" 148 | ] 149 | }, 150 | "execution_count": null, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "len(data.valid_ds[0][0].data)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "(70, 3)" 168 | ] 169 | }, 170 | "execution_count": null, 171 | "metadata": {}, 172 | "output_type": "execute_result" 173 | } 174 | ], 175 | "source": [ 176 | "data.bptt, len(data.valid_dl)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [ 184 | { 185 | "data": { 186 | "text/plain": [ 187 | "2.905580357142857" 188 | ] 189 | }, 190 | "execution_count": null, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "13017/70/bs" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "it = iter(data.valid_dl)\n", 206 | "x1,y1 = next(it)\n", 207 | "x2,y2 = next(it)\n", 208 | "x3,y3 = next(it)\n", 209 | "it.close()" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/plain": [ 220 | "13440" 221 | ] 222 | }, 223 | "execution_count": null, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "x1.numel()+x2.numel()+x3.numel()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/plain": [ 240 | "(torch.Size([64, 70]), torch.Size([64, 70]))" 241 | ] 242 | }, 243 | "execution_count": null, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "x1.shape,y1.shape" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "(torch.Size([64, 70]), torch.Size([64, 70]))" 261 | ] 262 | }, 263 | "execution_count": null, 264 | "metadata": {}, 265 | "output_type": "execute_result" 266 | } 267 | ], 268 | "source": [ 269 | "x2.shape,y2.shape" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/plain": [ 280 | 
"tensor([ 2, 8, 10, 11, 12, 10, 9, 8, 9, 13, 18, 24, 18, 14, 15, 10, 18, 8,\n", 281 | " 9, 8, 18, 24, 18, 10, 18, 10, 9, 8, 18, 19, 10, 25, 19, 22, 19, 19,\n", 282 | " 23, 19, 10, 13, 10, 10, 8, 13, 8, 19, 9, 19, 34, 16, 10, 9, 8, 16,\n", 283 | " 8, 19, 9, 19, 10, 19, 10, 19, 19, 19], device='cuda:0')" 284 | ] 285 | }, 286 | "execution_count": null, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "x1[:,0]" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/plain": [ 303 | "tensor([18, 18, 26, 9, 8, 11, 31, 18, 25, 9, 10, 14, 10, 9, 8, 14, 10, 18,\n", 304 | " 25, 18, 10, 17, 10, 17, 8, 17, 20, 18, 9, 9, 19, 8, 10, 15, 10, 10,\n", 305 | " 12, 10, 12, 8, 12, 13, 19, 9, 19, 10, 23, 10, 8, 8, 15, 16, 19, 9,\n", 306 | " 19, 10, 23, 10, 18, 8, 18, 10, 10, 9], device='cuda:0')" 307 | ] 308 | }, 309 | "execution_count": null, 310 | "metadata": {}, 311 | "output_type": "execute_result" 312 | } 313 | ], 314 | "source": [ 315 | "y1[:,0]" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "v = data.valid_ds.vocab" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": {}, 331 | "outputs": [ 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'" 336 | ] 337 | }, 338 | "execution_count": null, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "v.textify(x1[0])" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'" 356 | ] 357 | }, 358 | "execution_count": null, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "v.textify(y1[0])" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [ 372 | { 373 | "data": { 374 | "text/plain": [ 375 | "'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'" 376 | ] 377 | }, 378 | "execution_count": null, 379 | "metadata": {}, 380 | "output_type": "execute_result" 381 | } 382 | ], 383 | "source": [ 384 | "v.textify(x2[0])" 
385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [ 392 | { 393 | "data": { 394 | "text/plain": [ 395 | "'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'" 396 | ] 397 | }, 398 | "execution_count": null, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "v.textify(x3[0])" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'" 416 | ] 417 | }, 418 | "execution_count": null, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "v.textify(x1[1])" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/plain": [ 435 | "'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'" 436 | ] 437 | }, 438 | "execution_count": null, 439 | "metadata": {}, 440 | "output_type": "execute_result" 441 | } 442 | ], 443 | "source": [ 444 | "v.textify(x2[1])" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'" 456 | ] 457 | }, 458 | "execution_count": null, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "v.textify(x3[1])" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "data": { 474 | "text/plain": [ 475 | "'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine 
xxbos eight thousand one , eight'" 476 | ] 477 | }, 478 | "execution_count": null, 479 | "metadata": {}, 480 | "output_type": "execute_result" 481 | } 482 | ], 483 | "source": [ 484 | "v.textify(x3[-1])" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "data": { 494 | "text/html": [ 495 | "\n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | "
idxtext
0thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty
1eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one
2thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand
3three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three
4thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand
\n" 520 | ], 521 | "text/plain": [ 522 | "" 523 | ] 524 | }, 525 | "metadata": {}, 526 | "output_type": "display_data" 527 | } 528 | ], 529 | "source": [ 530 | "data.show_batch(ds_type=DatasetType.Valid)" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "## Single fully connected model" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": {}, 544 | "outputs": [], 545 | "source": [ 546 | "data = src.databunch(bs=bs, bptt=3)" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "metadata": {}, 553 | "outputs": [ 554 | { 555 | "data": { 556 | "text/plain": [ 557 | "(torch.Size([64, 3]), torch.Size([64, 3]))" 558 | ] 559 | }, 560 | "execution_count": null, 561 | "metadata": {}, 562 | "output_type": "execute_result" 563 | } 564 | ], 565 | "source": [ 566 | "x,y = data.one_batch()\n", 567 | "x.shape,y.shape" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [ 575 | { 576 | "data": { 577 | "text/plain": [ 578 | "38" 579 | ] 580 | }, 581 | "execution_count": null, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "nv = len(v.itos); nv" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": {}, 594 | "outputs": [], 595 | "source": [ 596 | "nh=64" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "def loss4(input,target): return F.cross_entropy(input, target[:,-1])\n", 606 | "def acc4 (input,target): return accuracy(input, target[:,-1])" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "class Model0(nn.Module):\n", 616 | " def __init__(self):\n", 617 | " super().__init__()\n", 618 | " self.i_h = nn.Embedding(nv,nh) # green arrow\n", 619 | " self.h_h = nn.Linear(nh,nh) # brown arrow\n", 620 | " self.h_o = nn.Linear(nh,nv) # blue arrow\n", 621 | " self.bn = nn.BatchNorm1d(nh)\n", 622 | " \n", 623 | " def forward(self, x):\n", 624 | " h = self.bn(F.relu(self.h_h(self.i_h(x[:,0]))))\n", 625 | " if x.shape[1]>1:\n", 626 | " h = h + self.i_h(x[:,1])\n", 627 | " h = self.bn(F.relu(self.h_h(h)))\n", 628 | " if x.shape[1]>2:\n", 629 | " h = h + self.i_h(x[:,2])\n", 630 | " h = self.bn(F.relu(self.h_h(h)))\n", 631 | " return self.h_o(h)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [ 640 | "learn = Learner(data, Model0(), loss_func=loss4, metrics=acc4)" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [ 648 | { 649 | "data": { 650 | "text/html": [ 651 | "Total time: 00:07

\n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | "
epoch  train_loss  valid_loss  acc4
1      3.596286    3.588869    0.046645
2      3.086100    3.205763    0.274816
3      2.494411    2.749365    0.392004
4      2.144753    2.463537    0.415671
5      2.010915    2.352887    0.409237
6      1.983992    2.336967    0.408778
\n" 695 | ], 696 | "text/plain": [ 697 | "" 698 | ] 699 | }, 700 | "metadata": {}, 701 | "output_type": "display_data" 702 | } 703 | ], 704 | "source": [ 705 | "learn.fit_one_cycle(6, 1e-4)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "## Same thing with a loop" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": null, 718 | "metadata": {}, 719 | "outputs": [], 720 | "source": [ 721 | "class Model1(nn.Module):\n", 722 | " def __init__(self):\n", 723 | " super().__init__()\n", 724 | " self.i_h = nn.Embedding(nv,nh) # green arrow\n", 725 | " self.h_h = nn.Linear(nh,nh) # brown arrow\n", 726 | " self.h_o = nn.Linear(nh,nv) # blue arrow\n", 727 | " self.bn = nn.BatchNorm1d(nh)\n", 728 | " \n", 729 | " def forward(self, x):\n", 730 | " h = torch.zeros(x.shape[0], nh).to(device=x.device)\n", 731 | " for i in range(x.shape[1]):\n", 732 | " h = h + self.i_h(x[:,i])\n", 733 | " h = self.bn(F.relu(self.h_h(h)))\n", 734 | " return self.h_o(h)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": null, 740 | "metadata": {}, 741 | "outputs": [], 742 | "source": [ 743 | "learn = Learner(data, Model1(), loss_func=loss4, metrics=acc4)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": {}, 750 | "outputs": [ 751 | { 752 | "data": { 753 | "text/html": [ 754 | "Total time: 00:07

\n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | "
epoch  train_loss  valid_loss  acc4
1      3.493525    3.420231    0.156250
2      2.987600    2.937893    0.376149
3      2.440199    2.477995    0.388787
4      2.132837    2.256569    0.391774
5      2.011305    2.181337    0.392923
6      1.985913    2.170874    0.393153
\n" 798 | ], 799 | "text/plain": [ 800 | "" 801 | ] 802 | }, 803 | "metadata": {}, 804 | "output_type": "display_data" 805 | } 806 | ], 807 | "source": [ 808 | "learn.fit_one_cycle(6, 1e-4)" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "## Multi fully connected model" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "data = src.databunch(bs=bs, bptt=20)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "metadata": {}, 831 | "outputs": [ 832 | { 833 | "data": { 834 | "text/plain": [ 835 | "(torch.Size([64, 20]), torch.Size([64, 20]))" 836 | ] 837 | }, 838 | "execution_count": null, 839 | "metadata": {}, 840 | "output_type": "execute_result" 841 | } 842 | ], 843 | "source": [ 844 | "x,y = data.one_batch()\n", 845 | "x.shape,y.shape" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": null, 851 | "metadata": {}, 852 | "outputs": [], 853 | "source": [ 854 | "class Model2(nn.Module):\n", 855 | " def __init__(self):\n", 856 | " super().__init__()\n", 857 | " self.i_h = nn.Embedding(nv,nh)\n", 858 | " self.h_h = nn.Linear(nh,nh)\n", 859 | " self.h_o = nn.Linear(nh,nv)\n", 860 | " self.bn = nn.BatchNorm1d(nh)\n", 861 | " \n", 862 | " def forward(self, x):\n", 863 | " h = torch.zeros(x.shape[0], nh).to(device=x.device)\n", 864 | " res = []\n", 865 | " for i in range(x.shape[1]):\n", 866 | " h = h + self.i_h(x[:,i])\n", 867 | " h = F.relu(self.h_h(h))\n", 868 | " res.append(self.h_o(self.bn(h)))\n", 869 | " return torch.stack(res, dim=1)" 870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "execution_count": null, 875 | "metadata": {}, 876 | "outputs": [], 877 | "source": [ 878 | "learn = Learner(data, Model2(), metrics=accuracy)" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": null, 884 | "metadata": {}, 885 | "outputs": [ 886 | { 887 | "data": { 888 | "text/html": [ 889 | "Total time: 00:06

\n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | "
epoch  train_loss  valid_loss  accuracy
1      3.639285    3.709278    0.058949
2      3.551151    3.565677    0.151776
3      3.439908    3.431850    0.207741
4      3.323083    3.314237    0.283949
5      3.213422    3.219906    0.321662
6      3.119673    3.151162    0.336790
7      3.046645    3.106630    0.341690
8      2.995379    3.082552    0.346662
9      2.963800    3.073327    0.349645
10     2.947312    3.071951    0.349787
\n" 957 | ], 958 | "text/plain": [ 959 | "" 960 | ] 961 | }, 962 | "metadata": {}, 963 | "output_type": "display_data" 964 | } 965 | ], 966 | "source": [ 967 | "learn.fit_one_cycle(10, 1e-4, pct_start=0.1)" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "## Maintain state" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": null, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "class Model3(nn.Module):\n", 984 | " def __init__(self):\n", 985 | " super().__init__()\n", 986 | " self.i_h = nn.Embedding(nv,nh)\n", 987 | " self.h_h = nn.Linear(nh,nh)\n", 988 | " self.h_o = nn.Linear(nh,nv)\n", 989 | " self.bn = nn.BatchNorm1d(nh)\n", 990 | " self.h = torch.zeros(bs, nh).cuda()\n", 991 | " \n", 992 | " def forward(self, x):\n", 993 | " res = []\n", 994 | " h = self.h\n", 995 | " for i in range(x.shape[1]):\n", 996 | " h = h + self.i_h(x[:,i])\n", 997 | " h = F.relu(self.h_h(h))\n", 998 | " res.append(self.bn(h))\n", 999 | " self.h = h.detach()\n", 1000 | " res = torch.stack(res, dim=1)\n", 1001 | " res = self.h_o(res)\n", 1002 | " return res" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": null, 1008 | "metadata": {}, 1009 | "outputs": [], 1010 | "source": [ 1011 | "learn = Learner(data, Model3(), metrics=accuracy)" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": {}, 1018 | "outputs": [ 1019 | { 1020 | "data": { 1021 | "text/html": [ 1022 | "Total time: 00:11

\n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | "
epoch  train_loss  valid_loss  accuracy
1      3.598183    3.556362    0.050710
2      3.274616    2.975699    0.401634
3      2.624206    2.036894    0.467330
4      2.022702    1.956439    0.316193
5      1.681813    1.934952    0.336861
6      1.453007    1.948201    0.351349
7      1.276971    2.005776    0.368679
8      1.138499    2.081261    0.360156
9      1.029217    2.145853    0.360795
10     0.939949    2.215388    0.372230
11     0.865441    2.240438    0.401491
12     0.805310    2.195846    0.409375
13     0.755035    2.324373    0.422727
14     0.713073    2.305542    0.449716
15     0.677393    2.350155    0.446449
16     0.645841    2.418738    0.446591
17     0.621809    2.456903    0.446165
18     0.605300    2.541699    0.443040
19     0.594099    2.539824    0.443040
20     0.587563    2.551423    0.442827
\n" 1150 | ], 1151 | "text/plain": [ 1152 | "" 1153 | ] 1154 | }, 1155 | "metadata": {}, 1156 | "output_type": "display_data" 1157 | } 1158 | ], 1159 | "source": [ 1160 | "learn.fit_one_cycle(20, 3e-3)" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "markdown", 1165 | "metadata": {}, 1166 | "source": [ 1167 | "## nn.RNN" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": {}, 1174 | "outputs": [], 1175 | "source": [ 1176 | "class Model4(nn.Module):\n", 1177 | " def __init__(self):\n", 1178 | " super().__init__()\n", 1179 | " self.i_h = nn.Embedding(nv,nh)\n", 1180 | " self.rnn = nn.RNN(nh,nh, batch_first=True)\n", 1181 | " self.h_o = nn.Linear(nh,nv)\n", 1182 | " self.bn = BatchNorm1dFlat(nh)\n", 1183 | " self.h = torch.zeros(1, bs, nh).cuda()\n", 1184 | " \n", 1185 | " def forward(self, x):\n", 1186 | " res,h = self.rnn(self.i_h(x), self.h)\n", 1187 | " self.h = h.detach()\n", 1188 | " return self.h_o(self.bn(res))" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "code", 1193 | "execution_count": null, 1194 | "metadata": {}, 1195 | "outputs": [], 1196 | "source": [ 1197 | "learn = Learner(data, Model4(), metrics=accuracy)" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "execution_count": null, 1203 | "metadata": {}, 1204 | "outputs": [ 1205 | { 1206 | "data": { 1207 | "text/html": [ 1208 | "Total time: 00:04

\n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | "
epoch  train_loss  valid_loss  accuracy
1      3.451432    3.268344    0.224148
2      2.974938    2.456569    0.466051
3      2.316732    1.946969    0.465625
4      1.866151    1.991952    0.314702
5      1.618516    1.802403    0.437216
6      1.411517    1.731107    0.436293
7      1.171916    1.655979    0.504048
8      0.965887    1.579963    0.522088
9      0.797046    1.479819    0.565057
10     0.659378    1.487831    0.579048
11     0.553282    1.441922    0.597798
12     0.475167    1.498148    0.600781
13     0.416131    1.546984    0.606463
14     0.372395    1.594261    0.607386
15     0.337093    1.578321    0.613352
16     0.311385    1.580973    0.623366
17     0.292869    1.625745    0.618253
18     0.279486    1.623960    0.626065
19     0.270054    1.682090    0.611719
20     0.263857    1.675676    0.614702
\n" 1336 | ], 1337 | "text/plain": [ 1338 | "" 1339 | ] 1340 | }, 1341 | "metadata": {}, 1342 | "output_type": "display_data" 1343 | } 1344 | ], 1345 | "source": [ 1346 | "learn.fit_one_cycle(20, 3e-3)" 1347 | ] 1348 | }, 1349 | { 1350 | "cell_type": "markdown", 1351 | "metadata": {}, 1352 | "source": [ 1353 | "## 2-layer GRU" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": null, 1359 | "metadata": {}, 1360 | "outputs": [], 1361 | "source": [ 1362 | "class Model5(nn.Module):\n", 1363 | " def __init__(self):\n", 1364 | " super().__init__()\n", 1365 | " self.i_h = nn.Embedding(nv,nh)\n", 1366 | " self.rnn = nn.GRU(nh, nh, 2, batch_first=True)\n", 1367 | " self.h_o = nn.Linear(nh,nv)\n", 1368 | " self.bn = BatchNorm1dFlat(nh)\n", 1369 | " self.h = torch.zeros(2, bs, nh).cuda()\n", 1370 | " \n", 1371 | " def forward(self, x):\n", 1372 | " res,h = self.rnn(self.i_h(x), self.h)\n", 1373 | " self.h = h.detach()\n", 1374 | " return self.h_o(self.bn(res))" 1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "code", 1379 | "execution_count": null, 1380 | "metadata": {}, 1381 | "outputs": [], 1382 | "source": [ 1383 | "learn = Learner(data, Model5(), metrics=accuracy)" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "code", 1388 | "execution_count": null, 1389 | "metadata": {}, 1390 | "outputs": [ 1391 | { 1392 | "data": { 1393 | "text/html": [ 1394 | "Total time: 00:02

\n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | "
epochtrain_lossvalid_lossaccuracy
12.8648542.3149430.454545
21.7989881.3571160.629688
30.9327291.3074630.796733
40.4519691.3296990.788636
50.2257871.2935700.800142
60.1180851.2659260.803338
70.0653061.2070960.806960
80.0380981.2053610.813920
90.0240691.2394110.807813
100.0170781.2534090.807102
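The key detail in `Model5` above is `self.h = h.detach()`: the hidden state is carried from one batch to the next as a value, but the computation graph is cut at the batch boundary (truncated backpropagation through time), so gradient computation and memory stay bounded. A self-contained sketch of that pattern in plain PyTorch, with hypothetical sizes (not the notebook's `nv`/`nh`/`bs`):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the point is the detach() at the end of each step.
gru = nn.GRU(input_size=8, hidden_size=8, num_layers=2, batch_first=True)
h = torch.zeros(2, 4, 8)              # (num_layers, batch, hidden), like self.h in Model5

for step in range(3):                 # three consecutive chunks of a long sequence
    x = torch.randn(4, 5, 8)          # (batch, seq_len, input_size)
    out, h = gru(x, h)
    loss = out.sum()                  # stand-in for the real cross-entropy loss
    loss.backward()                   # gradients flow only within this chunk...
    h = h.detach()                    # ...because the carried-over state is cut from the graph
```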
\n" 1462 | ], 1463 | "text/plain": [ 1464 | "" 1465 | ] 1466 | }, 1467 | "metadata": {}, 1468 | "output_type": "display_data" 1469 | } 1470 | ], 1471 | "source": [ 1472 | "learn.fit_one_cycle(10, 1e-2)" 1473 | ] 1474 | }, 1475 | { 1476 | "cell_type": "markdown", 1477 | "metadata": {}, 1478 | "source": [ 1479 | "## fin" 1480 | ] 1481 | }, 1482 | { 1483 | "cell_type": "code", 1484 | "execution_count": null, 1485 | "metadata": {}, 1486 | "outputs": [], 1487 | "source": [] 1488 | } 1489 | ], 1490 | "metadata": { 1491 | "kernelspec": { 1492 | "display_name": "Python 3", 1493 | "language": "python", 1494 | "name": "python3" 1495 | } 1496 | }, 1497 | "nbformat": 4, 1498 | "nbformat_minor": 2 1499 | } 1500 | -------------------------------------------------------------------------------- /lessons/rossman_data_clean.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%reload_ext autoreload\n", 10 | "%autoreload 2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "from fastai.basics import *" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "# Rossmann" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Data preparation / Feature engineering" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "In addition to the provided data, we will be using external datasets put together by participants in the Kaggle competition. You can download all of them [here](http://files.fast.ai/part2/lesson14/rossmann.tgz). Then you shold untar them in the dirctory to which `PATH` is pointing below.\n", 41 | "\n", 42 | "For completeness, the implementation used to put them together is included below." 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/plain": [ 53 | "(1017209, 41088)" 54 | ] 55 | }, 56 | "execution_count": null, 57 | "metadata": {}, 58 | "output_type": "execute_result" 59 | } 60 | ], 61 | "source": [ 62 | "PATH=Config().data_path()/Path('rossmann/')\n", 63 | "table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']\n", 64 | "tables = [pd.read_csv(PATH/f'{fname}.csv', low_memory=False) for fname in table_names]\n", 65 | "train, store, store_states, state_names, googletrend, weather, test = tables\n", 66 | "len(train),len(test)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We turn state Holidays to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "train.StateHoliday = train.StateHoliday!='0'\n", 83 | "test.StateHoliday = test.StateHoliday!='0'" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.\n", 91 | "\n", 92 | "Pandas does joins using the `merge` method. 
The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a \"\\_y\" to those on the right." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "def join_df(left, right, left_on, right_on=None, suffix='_y'):\n", 102 | " if right_on is None: right_on = left_on\n", 103 | " return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n", 104 | " suffixes=(\"\", suffix))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "Join weather/state names." 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "weather = join_df(weather, state_names, \"file\", \"StateName\")" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "In pandas you can add new columns to a dataframe by simply defining it. We'll do this for googletrends by extracting dates and state names from the given data and adding those columns.\n", 128 | "\n", 129 | "We're also going to replace all instances of state name 'NI' to match the usage in the rest of the data: 'HB,NI'. This is a good opportunity to highlight pandas indexing. We can use `.loc[rows, cols]` to select a list of rows and a list of columns from the dataframe. In this case, we're selecting rows w/ statename 'NI' by using a boolean list `googletrend.State=='NI'` and selecting \"State\"." 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]\n", 139 | "googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]\n", 140 | "googletrend.loc[googletrend.State=='NI', \"State\"] = 'HB,NI'" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.\n", 148 | "\n", 149 | "You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field." 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "def add_datepart(df, fldname, drop=True, time=False):\n", 159 | " \"Helper function that adds columns relevant to a date.\"\n", 160 | " fld = df[fldname]\n", 161 | " fld_dtype = fld.dtype\n", 162 | " if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):\n", 163 | " fld_dtype = np.datetime64\n", 164 | "\n", 165 | " if not np.issubdtype(fld_dtype, np.datetime64):\n", 166 | " df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)\n", 167 | " targ_pre = re.sub('[Dd]ate$', '', fldname)\n", 168 | " attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',\n", 169 | " 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']\n", 170 | " if time: attr = attr + ['Hour', 'Minute', 'Second']\n", 171 | " for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())\n", 172 | " df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9\n", 173 | " if drop: df.drop(fldname, axis=1, inplace=True)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "add_datepart(weather, \"Date\", drop=False)\n", 183 | "add_datepart(googletrend, \"Date\", drop=False)\n", 184 | "add_datepart(train, \"Date\", drop=False)\n", 185 | "add_datepart(test, \"Date\", drop=False)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "The Google trends data has a special category for the whole of the Germany - we'll pull that out so we can use it explicitly." 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "trend_de = googletrend[googletrend.file == 'Rossmann_DE']" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "Now we can outer join all of our data into a single dataframe. Recall that in outer joins everytime a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.\n", 209 | "\n", 210 | "*Aside*: Why not just do an inner join?\n", 211 | "If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null-check will catch it. (Comparing before/after # of rows for inner join is equivalent, but requires keeping track of before/after row #'s. 
Outer join is easier.)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [ 219 | { 220 | "data": { 221 | "text/plain": [ 222 | "0" 223 | ] 224 | }, 225 | "execution_count": null, 226 | "metadata": {}, 227 | "output_type": "execute_result" 228 | } 229 | ], 230 | "source": [ 231 | "store = join_df(store, store_states, \"Store\")\n", 232 | "len(store[store.State.isnull()])" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "data": { 242 | "text/plain": [ 243 | "(0, 0)" 244 | ] 245 | }, 246 | "execution_count": null, 247 | "metadata": {}, 248 | "output_type": "execute_result" 249 | } 250 | ], 251 | "source": [ 252 | "joined = join_df(train, store, \"Store\")\n", 253 | "joined_test = join_df(test, store, \"Store\")\n", 254 | "len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [ 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "(0, 0)" 266 | ] 267 | }, 268 | "execution_count": null, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "joined = join_df(joined, googletrend, [\"State\",\"Year\", \"Week\"])\n", 275 | "joined_test = join_df(joined_test, googletrend, [\"State\",\"Year\", \"Week\"])\n", 276 | "len(joined[joined.trend.isnull()]),len(joined_test[joined_test.trend.isnull()])" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [ 284 | { 285 | "data": { 286 | "text/plain": [ 287 | "(0, 0)" 288 | ] 289 | }, 290 | "execution_count": null, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "joined = joined.merge(trend_de, 'left', [\"Year\", \"Week\"], suffixes=('', '_DE'))\n", 297 | "joined_test = joined_test.merge(trend_de, 'left', [\"Year\", \"Week\"], suffixes=('', '_DE'))\n", 298 | "len(joined[joined.trend_DE.isnull()]),len(joined_test[joined_test.trend_DE.isnull()])" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "data": { 308 | "text/plain": [ 309 | "(0, 0)" 310 | ] 311 | }, 312 | "execution_count": null, 313 | "metadata": {}, 314 | "output_type": "execute_result" 315 | } 316 | ], 317 | "source": [ 318 | "joined = join_df(joined, weather, [\"State\",\"Date\"])\n", 319 | "joined_test = join_df(joined_test, weather, [\"State\",\"Date\"])\n", 320 | "len(joined[joined.Mean_TemperatureC.isnull()]),len(joined_test[joined_test.Mean_TemperatureC.isnull()])" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "for df in (joined, joined_test):\n", 330 | " for c in df.columns:\n", 331 | " if c.endswith('_y'):\n", 332 | " if c in df.columns: df.drop(c, inplace=True, axis=1)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Next we'll fill in missing values to avoid complications with `NA`'s. `NA` (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary *signal value* that doesn't otherwise appear in the data." 
340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "for df in (joined,joined_test):\n", 349 | " df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)\n", 350 | " df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)\n", 351 | " df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)\n", 352 | " df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "Next we'll extract features \"CompetitionOpenSince\" and \"CompetitionDaysOpen\". Note the use of `apply()` in mapping a function across dataframe values." 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "for df in (joined,joined_test):\n", 369 | " df[\"CompetitionOpenSince\"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, \n", 370 | " month=df.CompetitionOpenSinceMonth, day=15))\n", 371 | " df[\"CompetitionDaysOpen\"] = df.Date.subtract(df.CompetitionOpenSince).dt.days" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "We'll replace some erroneous / outlying data." 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "for df in (joined,joined_test):\n", 388 | " df.loc[df.CompetitionDaysOpen<0, \"CompetitionDaysOpen\"] = 0\n", 389 | " df.loc[df.CompetitionOpenSinceYear<1990, \"CompetitionDaysOpen\"] = 0" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "We add \"CompetitionMonthsOpen\" field, limiting the maximum to 2 years to limit number of unique categories." 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": {}, 403 | "outputs": [ 404 | { 405 | "data": { 406 | "text/plain": [ 407 | "array([24, 3, 19, 9, 0, 16, 17, 7, 15, 22, 11, 13, 2, 23, 12, 4, 10,\n", 408 | " 1, 14, 20, 8, 18, 6, 21, 5])" 409 | ] 410 | }, 411 | "execution_count": null, 412 | "metadata": {}, 413 | "output_type": "execute_result" 414 | } 415 | ], 416 | "source": [ 417 | "for df in (joined,joined_test):\n", 418 | " df[\"CompetitionMonthsOpen\"] = df[\"CompetitionDaysOpen\"]//30\n", 419 | " df.loc[df.CompetitionMonthsOpen>24, \"CompetitionMonthsOpen\"] = 24\n", 420 | "joined.CompetitionMonthsOpen.unique()" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "Same process for Promo dates. You may need to install the `isoweek` package first." 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "# If needed, uncomment:\n", 437 | "# ! 
pip install isoweek" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "from isoweek import Week\n", 447 | "for df in (joined,joined_test):\n", 448 | " df[\"Promo2Since\"] = pd.to_datetime(df.apply(lambda x: Week(\n", 449 | " x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))\n", 450 | " df[\"Promo2Days\"] = df.Date.subtract(df[\"Promo2Since\"]).dt.days" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "for df in (joined,joined_test):\n", 460 | " df.loc[df.Promo2Days<0, \"Promo2Days\"] = 0\n", 461 | " df.loc[df.Promo2SinceYear<1990, \"Promo2Days\"] = 0\n", 462 | " df[\"Promo2Weeks\"] = df[\"Promo2Days\"]//7\n", 463 | " df.loc[df.Promo2Weeks<0, \"Promo2Weeks\"] = 0\n", 464 | " df.loc[df.Promo2Weeks>25, \"Promo2Weeks\"] = 25\n", 465 | " df.Promo2Weeks.unique()" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "joined.to_pickle(PATH/'joined')\n", 475 | "joined_test.to_pickle(PATH/'joined_test')" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "## Durations" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:\n", 490 | "* Running averages\n", 491 | "* Time until next event\n", 492 | "* Time since last event\n", 493 | "\n", 494 | "This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.\n", 495 | "\n", 496 | "We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.\n", 497 | "\n", 498 | "Upon initialization, this will result in datetime na's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly." 
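Before the implementation below, a self-contained toy sketch (hypothetical data, not from the notebook) of the "days since the last event" quantity that `get_elapsed` computes per store:

```python
import numpy as np
import pandas as pd

# Toy version of the idea for a single store: days elapsed since Promo was last active.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-05']),
    'Promo': [0, 1, 0, 0],
})
last = pd.NaT
elapsed = []
for d, v in zip(toy.Date, toy.Promo):
    if v: last = d                                    # event seen: reset the reference date
    elapsed.append((d - last).days if last is not pd.NaT else np.nan)
toy['AfterPromo'] = elapsed
print(toy)                                            # AfterPromo: NaN, 0, 1, 3
```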
499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "def get_elapsed(fld, pre):\n", 508 | " day1 = np.timedelta64(1, 'D')\n", 509 | " last_date = np.datetime64()\n", 510 | " last_store = 0\n", 511 | " res = []\n", 512 | "\n", 513 | " for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):\n", 514 | " if s != last_store:\n", 515 | " last_date = np.datetime64()\n", 516 | " last_store = s\n", 517 | " if v: last_date = d\n", 518 | " res.append(((d-last_date).astype('timedelta64[D]') / day1))\n", 519 | " df[pre+fld] = res" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "We'll be applying this to a subset of columns:" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "columns = [\"Date\", \"Store\", \"Promo\", \"StateHoliday\", \"SchoolHoliday\"]" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "#df = train[columns]\n", 545 | "df = train[columns].append(test[columns])" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "Let's walk through an example.\n", 553 | "\n", 554 | "Say we're looking at School Holiday. We'll first sort by Store, then Date, and then call `add_elapsed('SchoolHoliday', 'After')`:\n", 555 | "This will apply to each row with School Holiday:\n", 556 | "* A applied to every row of the dataframe in order of store and date\n", 557 | "* Will add to the dataframe the days since seeing a School Holiday\n", 558 | "* If we sort in the other direction, this will count the days until another holiday." 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "fld = 'SchoolHoliday'\n", 568 | "df = df.sort_values(['Store', 'Date'])\n", 569 | "get_elapsed(fld, 'After')\n", 570 | "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n", 571 | "get_elapsed(fld, 'Before')" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "We'll do this for two more fields." 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "fld = 'StateHoliday'\n", 588 | "df = df.sort_values(['Store', 'Date'])\n", 589 | "get_elapsed(fld, 'After')\n", 590 | "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n", 591 | "get_elapsed(fld, 'Before')" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": null, 597 | "metadata": {}, 598 | "outputs": [], 599 | "source": [ 600 | "fld = 'Promo'\n", 601 | "df = df.sort_values(['Store', 'Date'])\n", 602 | "get_elapsed(fld, 'After')\n", 603 | "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n", 604 | "get_elapsed(fld, 'Before')" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "We're going to set the active index to Date." 
612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "df = df.set_index(\"Date\")" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "Then set null values from elapsed field calculations to 0." 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "columns = ['SchoolHoliday', 'StateHoliday', 'Promo']" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "for o in ['Before', 'After']:\n", 646 | " for p in columns:\n", 647 | " a = o+p\n", 648 | " df[a] = df[a].fillna(0).astype(int)" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "Next we'll demonstrate window functions in pandas to calculate rolling quantities.\n", 656 | "\n", 657 | "Here we're sorting by date (`sort_index()`) and counting the number of events of interest (`sum()`) defined in `columns` in the following week (`rolling()`), grouped by Store (`groupby()`). We do the same in the opposite direction." 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "bwd = df[['Store']+columns].sort_index().groupby(\"Store\").rolling(7, min_periods=1).sum()" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "fwd = df[['Store']+columns].sort_index(ascending=False\n", 676 | " ).groupby(\"Store\").rolling(7, min_periods=1).sum()" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "Next we want to drop the Store indices grouped together in the window function.\n", 684 | "\n", 685 | "Often in pandas, there is an option to do this in place. This is time and memory efficient when working with large datasets." 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": {}, 692 | "outputs": [], 693 | "source": [ 694 | "bwd.drop('Store',1,inplace=True)\n", 695 | "bwd.reset_index(inplace=True)" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": {}, 702 | "outputs": [], 703 | "source": [ 704 | "fwd.drop('Store',1,inplace=True)\n", 705 | "fwd.reset_index(inplace=True)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "df.reset_index(inplace=True)" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "Now we'll merge these values onto the df." 
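Before that merge, a toy sketch (hypothetical data, and a 3-day window instead of 7 to keep it small) of what the per-store rolling sums above produce:

```python
import pandas as pd

# Count Promo days over the current and previous 2 rows, separately for each store.
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03',
                            '2015-01-01', '2015-01-02']),
    'Promo': [1, 0, 1, 0, 1],
}).set_index('Date')

bwd_toy = toy.groupby('Store')['Promo'].rolling(3, min_periods=1).sum()
print(bwd_toy)   # store 1: 1.0, 1.0, 2.0   store 2: 0.0, 1.0
```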
722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": {}, 728 | "outputs": [], 729 | "source": [ 730 | "df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])\n", 731 | "df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": {}, 738 | "outputs": [], 739 | "source": [ 740 | "df.drop(columns,1,inplace=True)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": null, 746 | "metadata": {}, 747 | "outputs": [ 748 | { 749 | "data": { 750 | "text/html": [ 751 | "

\n", 752 | "\n", 765 | "\n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | "
DateStoreAfterSchoolHolidayBeforeSchoolHolidayAfterStateHolidayBeforeStateHolidayAfterPromoBeforePromoSchoolHoliday_bwStateHoliday_bwPromo_bwSchoolHoliday_fwStateHoliday_fwPromo_fw
02015-09-1711301050000.00.04.00.00.01.0
12015-09-1611201040000.00.03.00.00.02.0
22015-09-1511101030000.00.02.00.00.03.0
32015-09-1411001020000.00.01.00.00.04.0
42015-09-1319010109-10.00.00.00.00.04.0
\n", 873 | "
" 874 | ], 875 | "text/plain": [ 876 | " Date Store AfterSchoolHoliday BeforeSchoolHoliday \\\n", 877 | "0 2015-09-17 1 13 0 \n", 878 | "1 2015-09-16 1 12 0 \n", 879 | "2 2015-09-15 1 11 0 \n", 880 | "3 2015-09-14 1 10 0 \n", 881 | "4 2015-09-13 1 9 0 \n", 882 | "\n", 883 | " AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo \\\n", 884 | "0 105 0 0 0 \n", 885 | "1 104 0 0 0 \n", 886 | "2 103 0 0 0 \n", 887 | "3 102 0 0 0 \n", 888 | "4 101 0 9 -1 \n", 889 | "\n", 890 | " SchoolHoliday_bw StateHoliday_bw Promo_bw SchoolHoliday_fw \\\n", 891 | "0 0.0 0.0 4.0 0.0 \n", 892 | "1 0.0 0.0 3.0 0.0 \n", 893 | "2 0.0 0.0 2.0 0.0 \n", 894 | "3 0.0 0.0 1.0 0.0 \n", 895 | "4 0.0 0.0 0.0 0.0 \n", 896 | "\n", 897 | " StateHoliday_fw Promo_fw \n", 898 | "0 0.0 1.0 \n", 899 | "1 0.0 2.0 \n", 900 | "2 0.0 3.0 \n", 901 | "3 0.0 4.0 \n", 902 | "4 0.0 4.0 " 903 | ] 904 | }, 905 | "execution_count": null, 906 | "metadata": {}, 907 | "output_type": "execute_result" 908 | } 909 | ], 910 | "source": [ 911 | "df.head()" 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": {}, 917 | "source": [ 918 | "It's usually a good idea to back up large tables of extracted / wrangled features before you join them onto another one, that way you can go back to it easily if you need to make changes to it." 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": null, 924 | "metadata": {}, 925 | "outputs": [], 926 | "source": [ 927 | "df.to_pickle(PATH/'df')" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": null, 933 | "metadata": {}, 934 | "outputs": [], 935 | "source": [ 936 | "df[\"Date\"] = pd.to_datetime(df.Date)" 937 | ] 938 | }, 939 | { 940 | "cell_type": "code", 941 | "execution_count": null, 942 | "metadata": {}, 943 | "outputs": [ 944 | { 945 | "data": { 946 | "text/plain": [ 947 | "Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',\n", 948 | " 'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',\n", 949 | " 'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',\n", 950 | " 'StateHoliday_fw', 'Promo_fw'],\n", 951 | " dtype='object')" 952 | ] 953 | }, 954 | "execution_count": null, 955 | "metadata": {}, 956 | "output_type": "execute_result" 957 | } 958 | ], 959 | "source": [ 960 | "df.columns" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "execution_count": null, 966 | "metadata": {}, 967 | "outputs": [], 968 | "source": [ 969 | "joined = pd.read_pickle(PATH/'joined')\n", 970 | "joined_test = pd.read_pickle(PATH/f'joined_test')" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": {}, 977 | "outputs": [], 978 | "source": [ 979 | "joined = join_df(joined, df, ['Store', 'Date'])" 980 | ] 981 | }, 982 | { 983 | "cell_type": "code", 984 | "execution_count": null, 985 | "metadata": {}, 986 | "outputs": [], 987 | "source": [ 988 | "joined_test = join_df(joined_test, df, ['Store', 'Date'])" 989 | ] 990 | }, 991 | { 992 | "cell_type": "markdown", 993 | "metadata": {}, 994 | "source": [ 995 | "The authors also removed all instances where the store had zero sale / was closed. We speculate that this may have cost them a higher standing in the competition. One reason this may be the case is that a little exploratory data analysis reveals that there are often periods where stores are closed, typically for refurbishment. Before and after these periods, there are naturally spikes in sales that one might expect. 
By ommitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior." 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": {}, 1002 | "outputs": [], 1003 | "source": [ 1004 | "joined = joined[joined.Sales!=0]" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "markdown", 1009 | "metadata": {}, 1010 | "source": [ 1011 | "We'll back this up as well." 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": {}, 1018 | "outputs": [], 1019 | "source": [ 1020 | "joined.reset_index(inplace=True)\n", 1021 | "joined_test.reset_index(inplace=True)" 1022 | ] 1023 | }, 1024 | { 1025 | "cell_type": "code", 1026 | "execution_count": null, 1027 | "metadata": {}, 1028 | "outputs": [], 1029 | "source": [ 1030 | "joined.to_pickle(PATH/'train_clean')\n", 1031 | "joined_test.to_pickle(PATH/'test_clean')" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": null, 1037 | "metadata": {}, 1038 | "outputs": [], 1039 | "source": [] 1040 | } 1041 | ], 1042 | "metadata": { 1043 | "kernelspec": { 1044 | "display_name": "Python 3", 1045 | "language": "python", 1046 | "name": "python3" 1047 | } 1048 | }, 1049 | "nbformat": 4, 1050 | "nbformat_minor": 2 1051 | } 1052 | --------------------------------------------------------------------------------