├── .gitattributes ├── .gitignore ├── README.md ├── notebooks ├── Categorical Data.ipynb ├── Ordered Variable Length Features.ipynb ├── Real World Example.ipynb ├── Tabular Data.ipynb └── Variable Length Features.ipynb └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_store 2 | whats-cooking 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | MANIFEST 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .nox/ 45 | .coverage 46 | .coverage.* 47 | .cache 48 | nosetests.xml 49 | coverage.xml 50 | *.cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | 63 | # Flask stuff: 64 | instance/ 65 | .webassets-cache 66 | 67 | # Scrapy stuff: 68 | .scrapy 69 | 70 | # Sphinx documentation 71 | docs/_build/ 72 | 73 | # PyBuilder 74 | target/ 75 | 76 | # Jupyter Notebook 77 | .ipynb_checkpoints 78 | 79 | # IPython 80 | profile_default/ 81 | ipython_config.py 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # celery beat schedule file 87 | celerybeat-schedule 88 | 89 | # SageMath parsed files 90 | *.sage.py 91 | 92 | # Environments 93 | .env 94 | .venv 95 | env/ 96 | venv/ 97 | ENV/ 98 | env.bak/ 99 | venv.bak/ 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # mkdocs documentation 109 | /site 110 | 111 | # mypy 112 | .mypy_cache/ 113 | .dmypy.json 114 | dmypy.json 115 | 116 | # Pyre type checker 117 | .pyre/ 118 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # deep learning building blocks 2 | 3 | Welcome to deep learning building blocks. This is an intermediate tutorial on deep learning that focuses on how to design neural networks for various data types. 4 | 5 | Because this is intermediate, we are not going to be introducing what a convolution is or backpropagation. We assume that you either already know about them, or more care about solving problems than understanding theory. 
6 | 7 | This tutorial progresses by introducing different types of data (often data that is hard for traditional ML to take advantage of). We then present neural network designs that typically work well with that type of data. If you have an exotic type of data that you don't see listed here, let me know and I'd be happy to cover it! 8 | 9 | On a final note, because I feel that there are already pretty decent tutorials on working with image and text data out there, I'll start this series by talking about good old-fashioned tabular data. 10 | 11 | The order in which these tutorials go is as follows: 12 | 13 | 1. [Tabular Data](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Tabular%20Data.ipynb) 14 | 2. [Categorical Data](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Categorical%20Data.ipynb) 15 | 3. [Variable Length Features](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Variable%20Length%20Features.ipynb) 16 | 4. [Ordered Variable Length Features](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Ordered%20Variable%20Length%20Features.ipynb) 17 | 5. [Real World Example](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Real%20World%20Example.ipynb) 18 | 19 | ## Installing What You'll Need 20 | 21 | The first step to get running with these tutorials is to install virtualenv. Fortunately there is a [great tutorial](https://docs.python-guide.org/dev/virtualenvs/#lower-level-virtualenv) in The Hitchhiker's Guide to Python. Please follow the steps in the guide. 22 | 23 | Once you have installed virtualenv, let's make an environment with the following command: 24 | 25 | `virtualenv -p python3 env` 26 | 27 | You will want to be using Python 3.6, so please make sure your environment is running it. 
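One way to confirm which interpreter the activated environment is running is a short check like the sketch below. The helper name `python_version_ok` is made up for illustration, and it assumes that any version at or above 3.6 is acceptable, which the README does not state explicitly; only `sys.version_info` comes from the standard library.

```python
import sys


def python_version_ok(required=(3, 6), current=None):
    """Return True if `current` (default: the running interpreter) is at least `required`.

    Hypothetical helper for illustration; not part of this repository.
    """
    if current is None:
        current = sys.version_info[:2]
    # Tuples compare element by element, so (3, 10) >= (3, 6) is True.
    return tuple(current) >= tuple(required)


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    if python_version_ok():
        print("Interpreter looks new enough for these tutorials")
    else:
        print("Please recreate the environment with Python 3.6")
```

Run it with the environment activated; if the version printed is older than expected, recreate the environment and point `-p` at the right interpreter.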
28 | 29 | Again, the guide is a great resource that shows you how to do this on both Windows and Mac: 30 | 31 | [Activate your env](https://docs.python-guide.org/dev/virtualenvs/#lower-level-virtualenv) 32 | 33 | Next we need to install all the requirements: 34 | 35 | `pip install -r requirements.txt` 36 | 37 | Finally, run an ipython notebook from within the env and then we are ready to go: 38 | 39 | `ipython notebook` 40 | 41 | ## Other Downloads 42 | 43 | Downloading the [Kaggle Recipe Ingredients Dataset](https://www.kaggle.com/kaggle/recipe-ingredients-dataset) will be necessary for the final notebook. 44 | -------------------------------------------------------------------------------- /notebooks/Categorical Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.datasets import make_classification" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Categorical Data\n", 17 | "\n", 18 | "With this next dataset we start to move into deep learning territory. \n", 19 | "\n", 20 | "Not all categorical data is better suited to deep learning, but high cardinality categorical data (aka columns with a lot of categories) is. \n", 21 | "\n", 22 | "Traditional ML algorithms can only treat each category as a completely separate entity, whereas deep learning, with the use of embeddings, can capture the similarities between categories. The most classic version of this is word embeddings, but the same thing can be done with zipcodes.\n", 23 | "\n", 24 | "So let's get cracking by making a dataset." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 2, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "numeric_dataset = make_classification(\n", 34 | " n_samples=10_000, \n", 35 | " n_features=25, \n", 36 | " n_informative=10,\n", 37 | " n_classes=2)\n", 38 | "\n", 39 | "x, y = numeric_dataset" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 4, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "import pandas as pd\n", 49 | "import numpy as np\n", 50 | "\n", 51 | "np.set_printoptions(precision=2)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "num_categories = 100\n", 61 | "for i in range(5):\n", 62 | " x[:, i] = pd.cut(x[:, i], num_categories, labels=False)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 6, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/plain": [ 73 | "array([54. , 52. , 31. , 41. , 39. , -0.3 , 1.59, 1.12, -0.82,\n", 74 | " 0.3 , 1.25, 0.67, 0.09, -1.39, -0.45, 1.73, 0.89, -0.97,\n", 75 | " -2.52, -0.35, -0.06, -0.66, -2.65, 1.07, -1.3 ])" 76 | ] 77 | }, 78 | "execution_count": 6, 79 | "metadata": {}, 80 | "output_type": "execute_result" 81 | } 82 | ], 83 | "source": [ 84 | "x[0]" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 7, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "x_numeric = x[:, 5:]\n", 94 | "x_cat = x[:, :5]" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "We have 5 different variables with 100 categories each. \n", 102 | "\n", 103 | "The next step is to standardize the inputs. The nice thing about categoricals is that we won't need to standardize them. 
We will still need to standardize the numeric ones." 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 20, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "from sklearn.preprocessing import StandardScaler\n", 113 | "\n", 114 | "ss = StandardScaler()\n", 115 | "\n", 116 | "x_standardized = ss.fit_transform(x_numeric)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Now we can start to make our model. " 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 13, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "import tensorflow as tf\n", 133 | "\n", 134 | "p = .1\n", 135 | "\n", 136 | "numeric_inputs = tf.keras.layers.Input((20,), name='numeric_inputs')\n", 137 | "cat_inputs = tf.keras.layers.Input((5,), name='cat_inputs')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Notice that now our model takes two inputs, categorical and numeric. The categorical inputs are fed into an embedding layer:" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 14, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "def emb_sz_rule(n_cat): \n", 154 | " return min(600, round(1.6 * n_cat**0.56))\n", 155 | "\n", 156 | "embedding_layer = tf.keras.layers.Embedding(\n", 157 | " num_categories, \n", 158 | " emb_sz_rule(num_categories), \n", 159 | " input_length=5)\n", 160 | "cats = embedding_layer(cat_inputs)\n", 161 | "cats = tf.keras.layers.Flatten()(cats)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "Above we make an embedding layer. An embedding layer represents each category with a learned vector of weights, and in that way learns how the categories relate. To decide how many weights each vector should have, we use the `emb_sz_rule`. 
It's a pretty good rule of thumb (comes from fast.ai).\n", 169 | "\n", 170 | "Next we pass both the embeddings and the numeric inputs into the same network we used last time:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 15, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "x = tf.keras.layers.Concatenate()([cats, numeric_inputs])\n", 180 | "\n", 181 | "x = tf.keras.layers.Dropout(p)(x)\n", 182 | "x = tf.keras.layers.Dense(100, activation='relu')(x)\n", 183 | "\n", 184 | "x = tf.keras.layers.BatchNormalization()(x)\n", 185 | "x = tf.keras.layers.Dropout(p)(x)\n", 186 | "x = tf.keras.layers.Dense(20, activation='relu')(x)\n", 187 | "\n", 188 | "x = tf.keras.layers.BatchNormalization()(x)\n", 189 | "x = tf.keras.layers.Dropout(p)(x)\n", 190 | "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", 191 | "\n", 192 | "x = tf.keras.layers.BatchNormalization()(x)\n", 193 | "x = tf.keras.layers.Dropout(p)(x)\n", 194 | "out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 16, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "model = tf.keras.models.Model(\n", 204 | " inputs=[numeric_inputs, cat_inputs], outputs=out)\n", 205 | "model.compile(optimizer='rmsprop',\n", 206 | " loss='binary_crossentropy',\n", 207 | " metrics=['accuracy'])" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 17, 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "Model: \"model_1\"\n", 220 | "__________________________________________________________________________________________________\n", 221 | "Layer (type) Output Shape Param # Connected to \n", 222 | "==================================================================================================\n", 223 | "cat_inputs (InputLayer) [(None, 5)] 0 \n", 224 | 
"__________________________________________________________________________________________________\n", 225 | "embedding_1 (Embedding) (None, 5, 21) 2100 cat_inputs[0][0] \n", 226 | "__________________________________________________________________________________________________\n", 227 | "flatten_1 (Flatten) (None, 105) 0 embedding_1[0][0] \n", 228 | "__________________________________________________________________________________________________\n", 229 | "numeric_inputs (InputLayer) [(None, 20)] 0 \n", 230 | "__________________________________________________________________________________________________\n", 231 | "concatenate_1 (Concatenate) (None, 125) 0 flatten_1[0][0] \n", 232 | " numeric_inputs[0][0] \n", 233 | "__________________________________________________________________________________________________\n", 234 | "dropout_4 (Dropout) (None, 125) 0 concatenate_1[0][0] \n", 235 | "__________________________________________________________________________________________________\n", 236 | "dense_3 (Dense) (None, 100) 12600 dropout_4[0][0] \n", 237 | "__________________________________________________________________________________________________\n", 238 | "batch_normalization_v2_3 (Batch (None, 100) 400 dense_3[0][0] \n", 239 | "__________________________________________________________________________________________________\n", 240 | "dropout_5 (Dropout) (None, 100) 0 batch_normalization_v2_3[0][0] \n", 241 | "__________________________________________________________________________________________________\n", 242 | "dense_4 (Dense) (None, 20) 2020 dropout_5[0][0] \n", 243 | "__________________________________________________________________________________________________\n", 244 | "batch_normalization_v2_4 (Batch (None, 20) 80 dense_4[0][0] \n", 245 | "__________________________________________________________________________________________________\n", 246 | "dropout_6 (Dropout) (None, 20) 0 batch_normalization_v2_4[0][0] \n", 247 | 
"__________________________________________________________________________________________________\n", 248 | "dense_5 (Dense) (None, 10) 210 dropout_6[0][0] \n", 249 | "__________________________________________________________________________________________________\n", 250 | "batch_normalization_v2_5 (Batch (None, 10) 40 dense_5[0][0] \n", 251 | "__________________________________________________________________________________________________\n", 252 | "dropout_7 (Dropout) (None, 10) 0 batch_normalization_v2_5[0][0] \n", 253 | "__________________________________________________________________________________________________\n", 254 | "output (Dense) (None, 1) 11 dropout_7[0][0] \n", 255 | "==================================================================================================\n", 256 | "Total params: 17,461\n", 257 | "Trainable params: 17,201\n", 258 | "Non-trainable params: 260\n", 259 | "__________________________________________________________________________________________________\n" 260 | ] 261 | } 262 | ], 263 | "source": [ 264 | "model.summary()" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 25, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "import numpy as np\n", 274 | "\n", 275 | "def bootstrap_sample_generator(batch_size):\n", 276 | " while True:\n", 277 | " batch_idx = np.random.choice(\n", 278 | " x_standardized.shape[0], batch_size)\n", 279 | " yield ({'numeric_inputs': x_standardized[batch_idx],\n", 280 | " 'cat_inputs': x_cat[batch_idx]}, \n", 281 | " {'output': y[batch_idx]})" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 26, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "Epoch 1/5\n", 294 | "313/312 [==============================] - 4s 14ms/step - loss: 0.4936 - accuracy: 0.7644\n", 295 | "Epoch 2/5\n", 296 | "313/312 [==============================] - 1s 4ms/step - 
loss: 0.3458 - accuracy: 0.8538\n", 297 | "Epoch 3/5\n", 298 | "313/312 [==============================] - 1s 3ms/step - loss: 0.2670 - accuracy: 0.8915\n", 299 | "Epoch 4/5\n", 300 | "313/312 [==============================] - 1s 4ms/step - loss: 0.2487 - accuracy: 0.9066\n", 301 | "Epoch 5/5\n", 302 | "313/312 [==============================] - 1s 4ms/step - loss: 0.2096 - accuracy: 0.9228\n" 303 | ] 304 | }, 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "" 309 | ] 310 | }, 311 | "execution_count": 26, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "batch_size = 32\n", 318 | "\n", 319 | "model.fit_generator(\n", 320 | " bootstrap_sample_generator(batch_size),\n", 321 | " steps_per_epoch=10_000 / batch_size,\n", 322 | " epochs=5,\n", 323 | " max_queue_size=10,\n", 324 | ")" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "Definitely a lower accuracy (partly because we destroyed information by converting numbers into categories).\n", 332 | "\n", 333 | "Using embeddings can help out a ton with these sorts of problems. So if you have a dataset that for the most part is normal, but also has high cardinality categorical variables, then consider NNs.\n", 334 | "\n", 335 | "One more thing I'll say here is that initializing the embedding from another similar dataset can help a lot. For example, initializing from pretrained word vectors is very common in NLP." 
336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [] 351 | } 352 | ], 353 | "metadata": { 354 | "kernelspec": { 355 | "display_name": "Python 3", 356 | "language": "python", 357 | "name": "python3" 358 | }, 359 | "language_info": { 360 | "codemirror_mode": { 361 | "name": "ipython", 362 | "version": 3 363 | }, 364 | "file_extension": ".py", 365 | "mimetype": "text/x-python", 366 | "name": "python", 367 | "nbconvert_exporter": "python", 368 | "pygments_lexer": "ipython3", 369 | "version": "3.6.8" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 2 374 | } 375 | -------------------------------------------------------------------------------- /notebooks/Ordered Variable Length Features.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.datasets import make_classification" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Ordered Variable Length Features\n", 17 | "\n", 18 | "If you have not already looked at the [variable length features lesson](https://github.com/knathanieltucker/deep-learning-building-blocks/blob/master/notebooks/Variable%20Length%20Features.ipynb) please do so now, because this lesson relies heavily on that information.\n", 19 | "\n", 20 | "The difference between these lessons is that the features in our example will be ordered. So now think of a customer with a credit card statement. That statement has an intrinsic order, namely the chronological order. 
But that statement can also be variable length.\n", 21 | "\n", 22 | "---\n", 23 | "\n", 24 | "As before we will need to spend some time constructing an appropriate dataset." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 2, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "base_dataset = make_classification(\n", 34 | " n_samples=10_000, \n", 35 | " n_features=30, \n", 36 | " n_informative=10,\n", 37 | " n_clusters_per_class=2,\n", 38 | " n_classes=4)\n", 39 | "\n", 40 | "x, y = base_dataset" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "from sklearn.preprocessing import StandardScaler\n", 50 | "\n", 51 | "ss = StandardScaler()\n", 52 | "\n", 53 | "x_standardized = ss.fit_transform(x)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "We use the same trick as before to construct the dataset:" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 4, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "base_classes = []\n", 70 | "\n", 71 | "for i in range(4):\n", 72 | " base_classes.append(x_standardized[y == i])" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "import numpy as np\n", 82 | "\n", 83 | "num_points = 5_000\n", 84 | "class1_dist = np.array([.5, .5, 0, 0])\n", 85 | "class2_dist = np.array([0, .2, .6, .2])\n", 86 | "\n", 87 | "def make_var_len_feature_point(dist):\n", 88 | " sequence_dist = dist.copy()\n", 89 | " \n", 90 | " feature_sets = []\n", 91 | " previous_feature_set = np.zeros((1, 30))\n", 92 | " num_features = np.random.randint(3, 11)\n", 93 | " for i in range(num_features):\n", 94 | " # choose which distribution the transaction comes from\n", 95 | " base_class = np.random.choice([0, 1, 2, 3], 1, p=sequence_dist)\n", 96 | " base_class_points = 
base_classes[base_class[0]]\n", 97 | " feature_set_idx = np.random.choice(base_class_points.shape[0], 1)\n", 98 | " previous_feature_set += base_class_points[feature_set_idx]\n", 99 | " feature_sets.append(previous_feature_set.copy()) # copy: += mutates the array in place\n", 100 | " \n", 101 | " # now make it more likely to come from the same dist\n", 102 | " dist_update = np.zeros([4]); dist_update[base_class] = 1\n", 103 | " sequence_dist += dist_update\n", 104 | " sequence_dist = sequence_dist / sequence_dist.sum()\n", 105 | "\n", 106 | " \n", 107 | " for _ in range(10 - num_features):\n", 108 | " feature_sets.append(np.zeros((1, 30)))\n", 109 | "\n", 110 | " return np.concatenate(feature_sets)[np.newaxis, :, :]\n", 111 | "\n", 112 | "\n", 113 | "class1_points = []\n", 114 | "for _ in range(num_points):\n", 115 | " class1_points.append(\n", 116 | " make_var_len_feature_point(class1_dist))\n", 117 | "class1_points = np.concatenate(class1_points)\n", 118 | " \n", 119 | "class2_points = []\n", 120 | "for _ in range(num_points):\n", 121 | " class2_points.append(\n", 122 | " make_var_len_feature_point(class2_dist))\n", 123 | "class2_points = np.concatenate(class2_points)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 6, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "(5000, 10, 30)" 135 | ] 136 | }, 137 | "execution_count": 6, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "class2_points.shape" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Notice the one difference from the previous lesson: the distribution each point is drawn from changes depending on the points that came before it. 
For a concrete example, you might have more transactions of a particular type if you have had a lot of that type before.\n", 151 | "\n", 152 | "Once again we pad the input as well so that we can batch them all together.\n", 153 | "\n", 154 | "---\n", 155 | "\n", 156 | "Okay, now that we have the data, let's do the model" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "def bootstrap_sample_generator(batch_size):\n", 166 | " while True:\n", 167 | " batch_idx = np.random.choice(\n", 168 | " class1_points.shape[0], batch_size // 2)\n", 169 | " batch_x = np.concatenate([\n", 170 | " class1_points[batch_idx],\n", 171 | " class2_points[batch_idx],\n", 172 | " ])\n", 173 | " batch_y = np.concatenate([\n", 174 | " np.zeros(batch_size // 2),\n", 175 | " np.ones(batch_size // 2),\n", 176 | " ])\n", 177 | " yield ({'numeric_inputs': batch_x}, \n", 178 | " {'output': batch_y})" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 8, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "import tensorflow as tf\n", 188 | "\n", 189 | "p = .1" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 9, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "inputs = tf.keras.layers.Input((10, 30), name='numeric_inputs')" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "This time we are doing a different task. We want to look at each individual input in order, consider the information, and then use that to judge the next ones. \n", 206 | "\n", 207 | "The tool that we use to do this is an RNN, in particular a GRU (gated recurrent unit). We wrap that unit in Bidirectional so that we can read from both ends. 
Probably less effective in the particular case" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 10, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "x = tf.keras.layers.Dropout(p)(inputs)\n", 217 | "x = tf.keras.layers.Bidirectional(\n", 218 | " tf.keras.layers.GRU(10))(x)\n", 219 | "\n", 220 | "x = tf.keras.layers.BatchNormalization()(x)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "As with before we can add global context back into the inputs and repeat." 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 11, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "# bonus\n", 237 | "x = tf.keras.layers.RepeatVector(10)(x)\n", 238 | "x = tf.keras.layers.Concatenate()([inputs, x])\n", 239 | "\n", 240 | "x = tf.keras.layers.Dropout(p)(x)\n", 241 | "x = tf.keras.layers.Bidirectional(\n", 242 | " tf.keras.layers.GRU(10))(x)\n", 243 | "\n", 244 | "x = tf.keras.layers.BatchNormalization()(x)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "And the rest is the good old network from the past:" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 12, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "x = tf.keras.layers.Dropout(p)(x)\n", 261 | "x = tf.keras.layers.Dense(100, activation='relu')(x)\n", 262 | "\n", 263 | "x = tf.keras.layers.BatchNormalization()(x)\n", 264 | "x = tf.keras.layers.Dropout(p)(x)\n", 265 | "x = tf.keras.layers.Dense(20, activation='relu')(x)\n", 266 | "\n", 267 | "x = tf.keras.layers.BatchNormalization()(x)\n", 268 | "x = tf.keras.layers.Dropout(p)(x)\n", 269 | "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", 270 | "\n", 271 | "x = tf.keras.layers.BatchNormalization()(x)\n", 272 | "x = tf.keras.layers.Dropout(p)(x)\n", 273 | "out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)" 274 | 
] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 13, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "model = tf.keras.models.Model(inputs=inputs, outputs=out)\n", 283 | "model.compile(optimizer='rmsprop',\n", 284 | " loss='binary_crossentropy',\n", 285 | " metrics=['accuracy'])" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 14, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "Model: \"model\"\n", 298 | "__________________________________________________________________________________________________\n", 299 | "Layer (type) Output Shape Param # Connected to \n", 300 | "==================================================================================================\n", 301 | "numeric_inputs (InputLayer) [(None, 10, 30)] 0 \n", 302 | "__________________________________________________________________________________________________\n", 303 | "dropout (Dropout) (None, 10, 30) 0 numeric_inputs[0][0] \n", 304 | "__________________________________________________________________________________________________\n", 305 | "bidirectional (Bidirectional) (None, 20) 2520 dropout[0][0] \n", 306 | "__________________________________________________________________________________________________\n", 307 | "batch_normalization_v2 (BatchNo (None, 20) 80 bidirectional[0][0] \n", 308 | "__________________________________________________________________________________________________\n", 309 | "repeat_vector (RepeatVector) (None, 10, 20) 0 batch_normalization_v2[0][0] \n", 310 | "__________________________________________________________________________________________________\n", 311 | "concatenate (Concatenate) (None, 10, 50) 0 numeric_inputs[0][0] \n", 312 | " repeat_vector[0][0] \n", 313 | "__________________________________________________________________________________________________\n", 314 | "dropout_1 (Dropout) (None, 
10, 50) 0 concatenate[0][0] \n", 315 | "__________________________________________________________________________________________________\n", 316 | "bidirectional_1 (Bidirectional) (None, 20) 3720 dropout_1[0][0] \n", 317 | "__________________________________________________________________________________________________\n", 318 | "batch_normalization_v2_1 (Batch (None, 20) 80 bidirectional_1[0][0] \n", 319 | "__________________________________________________________________________________________________\n", 320 | "dropout_2 (Dropout) (None, 20) 0 batch_normalization_v2_1[0][0] \n", 321 | "__________________________________________________________________________________________________\n", 322 | "dense (Dense) (None, 100) 2100 dropout_2[0][0] \n", 323 | "__________________________________________________________________________________________________\n", 324 | "batch_normalization_v2_2 (Batch (None, 100) 400 dense[0][0] \n", 325 | "__________________________________________________________________________________________________\n", 326 | "dropout_3 (Dropout) (None, 100) 0 batch_normalization_v2_2[0][0] \n", 327 | "__________________________________________________________________________________________________\n", 328 | "dense_1 (Dense) (None, 20) 2020 dropout_3[0][0] \n", 329 | "__________________________________________________________________________________________________\n", 330 | "batch_normalization_v2_3 (Batch (None, 20) 80 dense_1[0][0] \n", 331 | "__________________________________________________________________________________________________\n", 332 | "dropout_4 (Dropout) (None, 20) 0 batch_normalization_v2_3[0][0] \n", 333 | "__________________________________________________________________________________________________\n", 334 | "dense_2 (Dense) (None, 10) 210 dropout_4[0][0] \n", 335 | "__________________________________________________________________________________________________\n", 336 | "batch_normalization_v2_4 (Batch (None, 10) 
40 dense_2[0][0] \n", 337 | "__________________________________________________________________________________________________\n", 338 | "dropout_5 (Dropout) (None, 10) 0 batch_normalization_v2_4[0][0] \n", 339 | "__________________________________________________________________________________________________\n", 340 | "output (Dense) (None, 1) 11 dropout_5[0][0] \n", 341 | "==================================================================================================\n", 342 | "Total params: 11,261\n", 343 | "Trainable params: 10,921\n", 344 | "Non-trainable params: 340\n", 345 | "__________________________________________________________________________________________________\n" 346 | ] 347 | } 348 | ], 349 | "source": [ 350 | "model.summary()" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 15, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "name": "stdout", 360 | "output_type": "stream", 361 | "text": [ 362 | "Epoch 1/5\n", 363 | "312/312 [==============================] - 13s 41ms/step - loss: 0.5876 - accuracy: 0.6863\n", 364 | "Epoch 2/5\n", 365 | "312/312 [==============================] - 5s 17ms/step - loss: 0.4570 - accuracy: 0.7860\n", 366 | "Epoch 3/5\n", 367 | "312/312 [==============================] - 5s 17ms/step - loss: 0.4006 - accuracy: 0.8173\n", 368 | "Epoch 4/5\n", 369 | "312/312 [==============================] - 5s 18ms/step - loss: 0.3839 - accuracy: 0.8281\n", 370 | "Epoch 5/5\n", 371 | "312/312 [==============================] - 5s 17ms/step - loss: 0.3584 - accuracy: 0.8398\n" 372 | ] 373 | }, 374 | { 375 | "data": { 376 | "text/plain": [ 377 | "" 378 | ] 379 | }, 380 | "execution_count": 15, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "batch_size = 32\n", 387 | "\n", 388 | "model.fit_generator(\n", 389 | " bootstrap_sample_generator(batch_size),\n", 390 | " steps_per_epoch=10_000 // batch_size,\n", 391 | " epochs=5,\n", 392 | " 
max_queue_size=10,\n", 393 | ")" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "Again this is only the tip of the iceberg for things that you could do with data like this.\n", 401 | "\n", 402 | "Next up we will go through a real world example to give us more intuition." 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": null, 415 | "metadata": {}, 416 | "outputs": [], 417 | "source": [] 418 | } 419 | ], 420 | "metadata": { 421 | "kernelspec": { 422 | "display_name": "Python 3", 423 | "language": "python", 424 | "name": "python3" 425 | }, 426 | "language_info": { 427 | "codemirror_mode": { 428 | "name": "ipython", 429 | "version": 3 430 | }, 431 | "file_extension": ".py", 432 | "mimetype": "text/x-python", 433 | "name": "python", 434 | "nbconvert_exporter": "python", 435 | "pygments_lexer": "ipython3", 436 | "version": "3.6.8" 437 | } 438 | }, 439 | "nbformat": 4, 440 | "nbformat_minor": 2 441 | } 442 | -------------------------------------------------------------------------------- /notebooks/Real World Example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import tensorflow as tf" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Whats Cooking\n", 19 | "\n", 20 | "Today we are going to be using the kaggle [What's Cooking Dataset](https://www.kaggle.com/c/whats-cooking-kernels-only/data). 
(Please download the data and load it appropriately to follow along below.)\n", 21 | "\n", 22 | "\n", 23 | "This is a list of recipes, and for each one we need to predict which cuisine it comes from. We can check out some of the data below:" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "
<table>
  <thead>
    <tr><th></th><th>cuisine</th><th>id</th><th>ingredients</th><th>ingredientsFlat</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>greek</td><td>10259</td><td>[romaine lettuce, black olives, grape tomatoes...</td><td>romaine lettuce black olives grape tomatoes ga...</td></tr>
    <tr><th>1</th><td>southern_us</td><td>25693</td><td>[plain flour, ground pepper, salt, tomatoes, g...</td><td>plain flour ground pepper salt tomatoes ground...</td></tr>
    <tr><th>2</th><td>filipino</td><td>20130</td><td>[eggs, pepper, salt, mayonaise, cooking oil, g...</td><td>eggs pepper salt mayonaise cooking oil green c...</td></tr>
    <tr><th>3</th><td>indian</td><td>22213</td><td>[water, vegetable oil, wheat, salt]</td><td>water vegetable oil wheat salt</td></tr>
    <tr><th>4</th><td>indian</td><td>13162</td><td>[black pepper, shallots, cornflour, cayenne pe...</td><td>black pepper shallots cornflour cayenne pepper...</td></tr>
  </tbody>
</table>
" 97 | ], 98 | "text/plain": [ 99 | " cuisine id ingredients \\\n", 100 | "0 greek 10259 [romaine lettuce, black olives, grape tomatoes... \n", 101 | "1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... \n", 102 | "2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... \n", 103 | "3 indian 22213 [water, vegetable oil, wheat, salt] \n", 104 | "4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... \n", 105 | "\n", 106 | " ingredientsFlat \n", 107 | "0 romaine lettuce black olives grape tomatoes ga... \n", 108 | "1 plain flour ground pepper salt tomatoes ground... \n", 109 | "2 eggs pepper salt mayonaise cooking oil green c... \n", 110 | "3 water vegetable oil wheat salt \n", 111 | "4 black pepper shallots cornflour cayenne pepper... " 112 | ] 113 | }, 114 | "execution_count": 2, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "import json\n", 121 | "recipeRaw = pd.read_json(\"../whats-cooking/train.json\")\n", 122 | "recipeRaw[\"ingredientsFlat\"] = recipeRaw[\"ingredients\"].apply(lambda x: ' '.join(x))\n", 123 | "recipeRaw.head()" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "So our goal is to predict the cuisine, which makes this a multiclass classification problem. 
We can see all the classes below:" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 3, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/plain": [ 141 | "array(['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',\n", 142 | " 'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',\n", 143 | " 'japanese', 'korean', 'mexican', 'moroccan', 'russian',\n", 144 | " 'southern_us', 'spanish', 'thai', 'vietnamese'], dtype=object)" 145 | ] 146 | }, 147 | "execution_count": 3, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "from sklearn import preprocessing\n", 154 | "le = preprocessing.LabelEncoder()\n", 155 | "le.fit(recipeRaw[\"cuisine\"].values)\n", 156 | "le.classes_" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "For Keras to work with these labels, we need to convert the strings into one-hot encodings:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 4, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "array([[0., 0., 0., ..., 0., 0., 0.],\n", 175 | " [0., 0., 0., ..., 0., 0., 0.],\n", 176 | " [0., 0., 0., ..., 0., 0., 0.],\n", 177 | " ...,\n", 178 | " [0., 0., 0., ..., 0., 0., 0.],\n", 179 | " [0., 0., 0., ..., 0., 0., 0.],\n", 180 | " [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)" 181 | ] 182 | }, 183 | "execution_count": 4, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "docs = recipeRaw[\"ingredientsFlat\"].values\n", 190 | "labels_enc = le.transform(recipeRaw[\"cuisine\"].values)\n", 191 | "labels = tf.keras.utils.to_categorical(labels_enc)\n", 192 | "labels" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "One useful numeric feature is the number of ingredients in each recipe:" 200 | ] 
201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 5, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "recipeRaw['ingredients_len'] = recipeRaw['ingredients'].apply(len)\n", 209 | "doc_lengths = recipeRaw[['ingredients_len']].values" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 6, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "name": "stderr", 219 | "output_type": "stream", 220 | "text": [ 221 | "/Users/tucker/Desktop/deep-learning-building-blocks/env/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", 222 | " warnings.warn(msg, DataConversionWarning)\n", 223 | "/Users/tucker/Desktop/deep-learning-building-blocks/env/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n", 224 | " warnings.warn(msg, DataConversionWarning)\n" 225 | ] 226 | } 227 | ], 228 | "source": [ 229 | "from sklearn.preprocessing import StandardScaler\n", 230 | "\n", 231 | "ss = StandardScaler()\n", 232 | "\n", 233 | "doc_lengths_standardized = ss.fit_transform(doc_lengths)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "Next we need to transform the ingredients into categories. 
In one sense this is a pretty typical NLP problem, but the cool thing about it is that the order of the ingredients does not matter, so this is an unordered variable length features problem with high cardinality categorical variables.\n", 241 | "\n", 242 | "---\n", 243 | "\n", 244 | "To transform the ingredients into integer categories, we tokenize and pad them below:" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "3065" 256 | ] 257 | }, 258 | "execution_count": 7, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "pad_sequences = tf.keras.preprocessing.sequence.pad_sequences\n", 265 | "\n", 266 | "t = tf.keras.preprocessing.text.Tokenizer()\n", 267 | "t.fit_on_texts(docs)\n", 268 | "vocab_size = len(t.word_index) + 1  # index 0 is reserved for padding\n", 269 | "\n", 270 | "# map each word to its integer index\n", 271 | "encoded_docs = t.texts_to_sequences(docs)\n", 272 | "\n", 273 | "# pad (or truncate) documents to a length of 40 words\n", 274 | "max_length = 40\n", 275 | "padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')\n", 276 | "\n", 277 | "vocab_size" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "And now we are ready for modeling:" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 8, 290 | "metadata": {}, 291 | "outputs": [], 292 | "source": [ 293 | "def bootstrap_sample_generator(batch_size):\n", 294 | "    while True:\n", 295 | "        batch_idx = np.random.choice(\n", 296 | "            padded_docs.shape[0], batch_size)\n", 297 | "        # feed the standardized lengths computed above, not the raw counts\n", 298 | "        yield ({'cat_inputs': padded_docs[batch_idx],\n", 299 | "                'numeric_inputs': doc_lengths_standardized[batch_idx]\n", 300 | "               },\n", 301 | "               {'output': labels[batch_idx]})" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 9, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "def emb_sz_rule(n_cat): \n", 311 | " 
return min(600, round(1.6 * n_cat**0.56))\n", 311 | "\n", 312 | "p = .1" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "Notice that again we have two types of inputs:" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 10, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "cat_inputs = tf.keras.layers.Input((40,), name='cat_inputs')\n", 329 | "numeric_inputs = tf.keras.layers.Input((1,), name='numeric_inputs')" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "And we use the same rules as last time to make and add in the embedding layer:" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 11, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "embedding_layer = tf.keras.layers.Embedding(\n", 346 | " vocab_size, \n", 347 | " emb_sz_rule(vocab_size), \n", 348 | " input_length=40)\n", 349 | "cat_x = embedding_layer(cat_inputs)\n", 350 | "\n", 351 | "global_ave = tf.keras.layers.GlobalAveragePooling1D()(cat_x)\n", 352 | "global_max = tf.keras.layers.GlobalMaxPool1D()(cat_x)\n", 353 | "x = tf.keras.layers.Concatenate()([global_ave, global_max])" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 12, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "# bonus\n", 363 | "x = tf.keras.layers.RepeatVector(40)(x)\n", 364 | "x = tf.keras.layers.Concatenate()([cat_x, x])\n", 365 | "\n", 366 | "x = tf.keras.layers.Dropout(p)(x)\n", 367 | "x = tf.keras.layers.Conv1D(20, 1)(x)\n", 368 | "x = tf.keras.layers.Activation('relu')(x)\n", 369 | "\n", 370 | "global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)\n", 371 | "global_max = tf.keras.layers.GlobalMaxPool1D()(x)\n", 372 | "x = tf.keras.layers.Concatenate()([global_ave, global_max])" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "And then 
after we process the variable length data, we will add on the fixed numeric inputs (notice they go in right where they went in the first two lessons):" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 13, 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "x = tf.keras.layers.Concatenate()([x, numeric_inputs])" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 14, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "x = tf.keras.layers.Dropout(p)(x)\n", 398 | "x = tf.keras.layers.Dense(100, activation='relu')(x)\n", 399 | "\n", 400 | "x = tf.keras.layers.BatchNormalization()(x)\n", 401 | "x = tf.keras.layers.Dropout(p)(x)\n", 402 | "x = tf.keras.layers.Dense(20, activation='relu')(x)\n", 403 | "\n", 404 | "x = tf.keras.layers.BatchNormalization()(x)\n", 405 | "x = tf.keras.layers.Dropout(p)(x)\n", 406 | "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", 407 | "\n", 408 | "x = tf.keras.layers.BatchNormalization()(x)\n", 409 | "x = tf.keras.layers.Dropout(p)(x)\n", 410 | "out = tf.keras.layers.Dense(20, activation='softmax', name='output')(x)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 15, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "model = tf.keras.models.Model(inputs=[cat_inputs, numeric_inputs], outputs=out)\n", 420 | "model.compile(optimizer='rmsprop',\n", 421 | " loss='categorical_crossentropy',\n", 422 | " metrics=['accuracy'])" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 16, 428 | "metadata": {}, 429 | "outputs": [ 430 | { 431 | "name": "stdout", 432 | "output_type": "stream", 433 | "text": [ 434 | "Model: \"model\"\n", 435 | "__________________________________________________________________________________________________\n", 436 | "Layer (type) Output Shape Param # Connected to \n", 437 | 
"==================================================================================================\n", 438 | "cat_inputs (InputLayer) [(None, 40)] 0 \n", 439 | "__________________________________________________________________________________________________\n", 440 | "embedding (Embedding) (None, 40, 143) 438295 cat_inputs[0][0] \n", 441 | "__________________________________________________________________________________________________\n", 442 | "global_average_pooling1d (Globa (None, 143) 0 embedding[0][0] \n", 443 | "__________________________________________________________________________________________________\n", 444 | "global_max_pooling1d (GlobalMax (None, 143) 0 embedding[0][0] \n", 445 | "__________________________________________________________________________________________________\n", 446 | "concatenate (Concatenate) (None, 286) 0 global_average_pooling1d[0][0] \n", 447 | " global_max_pooling1d[0][0] \n", 448 | "__________________________________________________________________________________________________\n", 449 | "repeat_vector (RepeatVector) (None, 40, 286) 0 concatenate[0][0] \n", 450 | "__________________________________________________________________________________________________\n", 451 | "concatenate_1 (Concatenate) (None, 40, 429) 0 embedding[0][0] \n", 452 | " repeat_vector[0][0] \n", 453 | "__________________________________________________________________________________________________\n", 454 | "dropout (Dropout) (None, 40, 429) 0 concatenate_1[0][0] \n", 455 | "__________________________________________________________________________________________________\n", 456 | "conv1d (Conv1D) (None, 40, 20) 8600 dropout[0][0] \n", 457 | "__________________________________________________________________________________________________\n", 458 | "activation (Activation) (None, 40, 20) 0 conv1d[0][0] \n", 459 | "__________________________________________________________________________________________________\n", 460 | 
"global_average_pooling1d_1 (Glo (None, 20) 0 activation[0][0] \n", 461 | "__________________________________________________________________________________________________\n", 462 | "global_max_pooling1d_1 (GlobalM (None, 20) 0 activation[0][0] \n", 463 | "__________________________________________________________________________________________________\n", 464 | "concatenate_2 (Concatenate) (None, 40) 0 global_average_pooling1d_1[0][0] \n", 465 | " global_max_pooling1d_1[0][0] \n", 466 | "__________________________________________________________________________________________________\n", 467 | "numeric_inputs (InputLayer) [(None, 1)] 0 \n", 468 | "__________________________________________________________________________________________________\n", 469 | "concatenate_3 (Concatenate) (None, 41) 0 concatenate_2[0][0] \n", 470 | " numeric_inputs[0][0] \n", 471 | "__________________________________________________________________________________________________\n", 472 | "dropout_1 (Dropout) (None, 41) 0 concatenate_3[0][0] \n", 473 | "__________________________________________________________________________________________________\n", 474 | "dense (Dense) (None, 100) 4200 dropout_1[0][0] \n", 475 | "__________________________________________________________________________________________________\n", 476 | "batch_normalization_v2 (BatchNo (None, 100) 400 dense[0][0] \n", 477 | "__________________________________________________________________________________________________\n", 478 | "dropout_2 (Dropout) (None, 100) 0 batch_normalization_v2[0][0] \n", 479 | "__________________________________________________________________________________________________\n", 480 | "dense_1 (Dense) (None, 20) 2020 dropout_2[0][0] \n", 481 | "__________________________________________________________________________________________________\n", 482 | "batch_normalization_v2_1 (Batch (None, 20) 80 dense_1[0][0] \n", 483 | 
"__________________________________________________________________________________________________\n", 484 | "dropout_3 (Dropout) (None, 20) 0 batch_normalization_v2_1[0][0] \n", 485 | "__________________________________________________________________________________________________\n", 486 | "dense_2 (Dense) (None, 10) 210 dropout_3[0][0] \n", 487 | "__________________________________________________________________________________________________\n", 488 | "batch_normalization_v2_2 (Batch (None, 10) 40 dense_2[0][0] \n", 489 | "__________________________________________________________________________________________________\n", 490 | "dropout_4 (Dropout) (None, 10) 0 batch_normalization_v2_2[0][0] \n", 491 | "__________________________________________________________________________________________________\n", 492 | "output (Dense) (None, 20) 220 dropout_4[0][0] \n", 493 | "==================================================================================================\n", 494 | "Total params: 454,065\n", 495 | "Trainable params: 453,805\n", 496 | "Non-trainable params: 260\n", 497 | "__________________________________________________________________________________________________\n" 498 | ] 499 | } 500 | ], 501 | "source": [ 502 | "model.summary()" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 17, 508 | "metadata": {}, 509 | "outputs": [ 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "Epoch 1/5\n", 515 | "625/625 [==============================] - 10s 15ms/step - loss: 2.4451 - accuracy: 0.3290\n", 516 | "Epoch 2/5\n", 517 | "625/625 [==============================] - 6s 9ms/step - loss: 1.7605 - accuracy: 0.5229\n", 518 | "Epoch 3/5\n", 519 | "625/625 [==============================] - 6s 10ms/step - loss: 1.5414 - accuracy: 0.5713\n", 520 | "Epoch 4/5\n", 521 | "625/625 [==============================] - 6s 9ms/step - loss: 1.4575 - accuracy: 0.5908\n", 522 | "Epoch 5/5\n", 523 | 
"625/625 [==============================] - 5s 8ms/step - loss: 1.3314 - accuracy: 0.6231\n" 524 | ] 525 | }, 526 | { 527 | "data": { 528 | "text/plain": [ 529 | "" 530 | ] 531 | }, 532 | "execution_count": 17, 533 | "metadata": {}, 534 | "output_type": "execute_result" 535 | } 536 | ], 537 | "source": [ 538 | "batch_size = 16\n", 539 | "\n", 540 | "model.fit_generator(\n", 541 | " bootstrap_sample_generator(batch_size),\n", 542 | " steps_per_epoch=10_000 // batch_size,\n", 543 | " epochs=5,\n", 544 | " max_queue_size=10,\n", 545 | ")" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "metadata": {}, 551 | "source": [ 552 | "Not as good absolute accuracy, but hey we are looking at a different dataset with a different loss metric\n", 553 | "\n", 554 | "---\n", 555 | "\n", 556 | "I hope you can start to see how you can transform your old techniques into deep learning ones. There are a ton more things to do of course and a couple of different things I'm thinking about including:\n", 557 | "\n", 558 | "* Time series\n", 559 | "* Natural data (images, language, sound, etc)" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [] 568 | } 569 | ], 570 | "metadata": { 571 | "kernelspec": { 572 | "display_name": "Python 3", 573 | "language": "python", 574 | "name": "python3" 575 | }, 576 | "language_info": { 577 | "codemirror_mode": { 578 | "name": "ipython", 579 | "version": 3 580 | }, 581 | "file_extension": ".py", 582 | "mimetype": "text/x-python", 583 | "name": "python", 584 | "nbconvert_exporter": "python", 585 | "pygments_lexer": "ipython3", 586 | "version": "3.6.8" 587 | } 588 | }, 589 | "nbformat": 4, 590 | "nbformat_minor": 2 591 | } 592 | -------------------------------------------------------------------------------- /notebooks/Tabular Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.datasets import make_classification" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Tabular Data\n", 17 | "\n", 18 | "Tabular data is the data that you most often see. It is data that you can cleanly write in a table. It has a set number of rows and columns, and for our example below, all the data is numeric.\n", 19 | "\n", 20 | "This is the one type of data that we will go over that is not necessarily suited to neural networks. Because it is so simple and so well studied, traditional ML can do quite well on it. \n", 21 | "\n", 22 | "That being said it makes a nice springboard to begin the rest of the tutorial." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "To make this data we will be using sklearn `make_classification`. This will generate a dummy classification dataset:" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "dataset = make_classification(n_samples=10_000, n_features=20, n_classes=2)\n", 39 | "x, y = dataset" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "(array([[-1.47591055, 0.25345616, 0.6174182 , ..., 0.44527873,\n", 51 | " -2.02793885, -0.25553664],\n", 52 | " [ 1.19614338, -1.66752205, -1.60501694, ..., -0.1298167 ,\n", 53 | " -1.5453044 , -0.56323096],\n", 54 | " [ 1.136674 , -0.53942846, -0.97723932, ..., 0.68611902,\n", 55 | " 0.9081234 , 0.86679452],\n", 56 | " ...,\n", 57 | " [ 1.12859474, 0.62318725, -0.17071723, ..., -0.37103146,\n", 58 | " -2.11036497, 1.72595764],\n", 59 | " [-0.94219602, 0.31865075, 0.04442349, ..., 0.60564122,\n", 60 | " -1.12027859, 0.74158706],\n", 61 | " [ 1.00780519, 1.14463957, -0.50560505, ..., 
0.31718227,\n", 62 | " 0.38186864, -0.4792807 ]]), array([1, 1, 0, ..., 1, 1, 0]))" 63 | ] 64 | }, 65 | "execution_count": 3, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "x, y" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Because we have two classes, this is binary classification: we predict either 0 or 1 from these 20 features.\n", 79 | "\n", 80 | "So now that we have the data we can just throw it into a NN, right? \n", 81 | "\n", 82 | "Well, not quite yet. A NN is built from linear layers trained with gradient descent, so, much like a linear model, it is sensitive to the scale of its inputs. We first need to scale all the inputs:" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "from sklearn.preprocessing import StandardScaler\n", 92 | "\n", 93 | "ss = StandardScaler()\n", 94 | "\n", 95 | "standardized_x = ss.fit_transform(x)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Perfect, now we can just throw it into a NN :) \n", 103 | "\n", 104 | "Yup, for this data there is not much else to do but build the NN." 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 7, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "import tensorflow as tf\n", 114 | "\n", 115 | "# dropout probability\n", 116 | "p = .1" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "We are going to be using Keras to build our NN. Because this is tabular data, each block of the NN can follow a fairly simple structure:\n", 124 | "\n", 125 | "1. Standardize/Normalize\n", 126 | "2. (Optional) Regularize/Dropout\n", 127 | "3. Apply a Dense Layer\n", 128 | "\n", 129 | "Let me talk about the first and the last, then come back to dropout.\n", 130 | "\n", 131 | "Standardizing is important because NNs are trained with gradient descent. 
If a particular layer's input is too big, then the gradients might be massive and training can become unstable. \n", 132 | "\n", 133 | "The dense layer is the core of the NN. It applies a linear transformation followed by a non-linear activation, and stacking such layers is what lets the NN approximate non-linear functions. Without it the network could not learn.\n", 134 | "\n", 135 | "Dropout is a simple way of regularizing NNs. The reason I put this as optional is that there is some debate on whether you need dropout in addition to batch normalization.\n", 136 | "\n", 137 | "Ultimately you can experiment with the amount of dropout you need in your network, and if it's none, so be it.\n", 138 | "\n", 139 | "---\n", 140 | "\n", 141 | "So, all that being said, below is our first NN." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 8, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "inputs = tf.keras.layers.Input((20,), name='numeric_inputs')" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 9, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "x = tf.keras.layers.Dropout(p)(inputs)\n", 160 | "x = tf.keras.layers.Dense(100, activation='relu')(x)\n", 161 | "\n", 162 | "x = tf.keras.layers.BatchNormalization()(x)\n", 163 | "x = tf.keras.layers.Dropout(p)(x)\n", 164 | "x = tf.keras.layers.Dense(20, activation='relu')(x)\n", 165 | "\n", 166 | "x = tf.keras.layers.BatchNormalization()(x)\n", 167 | "x = tf.keras.layers.Dropout(p)(x)\n", 168 | "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", 169 | "\n", 170 | "x = tf.keras.layers.BatchNormalization()(x)\n", 171 | "x = tf.keras.layers.Dropout(p)(x)\n", 172 | "out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "Now there are probably a couple of questions about the above:\n", 180 | "\n", 181 | "* Why so many 
layers?\n", 182 | "* Why so many neurons in each layer?\n", 183 | "\n", 184 | "Well, a good rule of thumb is that your NN can have about as many parameters as you have data points. The above NN has roughly half as many (4,861 parameters for 10,000 data points), so we could probably increase the number of parameters. \n", 185 | "\n", 186 | "As for the width versus the depth of the network, there are plenty of results on either side of the aisle, and honestly I'm not sure what to tell you other than to experiment.\n", 187 | "\n", 188 | "Some things you might want to keep in mind are:\n", 189 | "\n", 190 | "* Skip connections seem to be pretty cool\n", 191 | "* Alternating small and large layers might be a thing too" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 10, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "model = tf.keras.models.Model(inputs=inputs, outputs=out)\n", 201 | "model.compile(optimizer='rmsprop',\n", 202 | " loss='binary_crossentropy',\n", 203 | " metrics=['accuracy'])" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 11, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "Model: \"model\"\n", 216 | "_________________________________________________________________\n", 217 | "Layer (type) Output Shape Param # \n", 218 | "=================================================================\n", 219 | "numeric_inputs (InputLayer) [(None, 20)] 0 \n", 220 | "_________________________________________________________________\n", 221 | "dropout (Dropout) (None, 20) 0 \n", 222 | "_________________________________________________________________\n", 223 | "dense (Dense) (None, 100) 2100 \n", 224 | "_________________________________________________________________\n", 225 | "batch_normalization_v2 (Batc (None, 100) 400 \n", 226 | "_________________________________________________________________\n", 227 | "dropout_1 (Dropout) (None, 100) 0 \n", 
228 | "_________________________________________________________________\n", 229 | "dense_1 (Dense) (None, 20) 2020 \n", 230 | "_________________________________________________________________\n", 231 | "batch_normalization_v2_1 (Ba (None, 20) 80 \n", 232 | "_________________________________________________________________\n", 233 | "dropout_2 (Dropout) (None, 20) 0 \n", 234 | "_________________________________________________________________\n", 235 | "dense_2 (Dense) (None, 10) 210 \n", 236 | "_________________________________________________________________\n", 237 | "batch_normalization_v2_2 (Ba (None, 10) 40 \n", 238 | "_________________________________________________________________\n", 239 | "dropout_3 (Dropout) (None, 10) 0 \n", 240 | "_________________________________________________________________\n", 241 | "output (Dense) (None, 1) 11 \n", 242 | "=================================================================\n", 243 | "Total params: 4,861\n", 244 | "Trainable params: 4,601\n", 245 | "Non-trainable params: 260\n", 246 | "_________________________________________________________________\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "model.summary()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "As a final touch, I like to use Keras's `fit_generator` function, so I will often write a generator to feed data to the NN instead of passing the arrays to the default `fit` function." 
259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 12, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "import numpy as np\n", 268 | "\n", 269 | "def bootstrap_sample_generator(batch_size):\n", 270 | " while True:\n", 271 | " batch_idx = np.random.choice(\n", 272 | " standardized_x.shape[0], batch_size)\n", 273 | " yield ({'numeric_inputs': standardized_x[batch_idx]}, \n", 274 | " {'output': y[batch_idx]})" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 13, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "Epoch 1/5\n", 287 | "312/312 [==============================] - 3s 11ms/step - loss: 0.5338 - accuracy: 0.7359\n", 288 | "Epoch 2/5\n", 289 | "312/312 [==============================] - 1s 3ms/step - loss: 0.4264 - accuracy: 0.8096\n", 290 | "Epoch 3/5\n", 291 | "312/312 [==============================] - 1s 3ms/step - loss: 0.4016 - accuracy: 0.8192\n", 292 | "Epoch 4/5\n", 293 | "312/312 [==============================] - 1s 4ms/step - loss: 0.4074 - accuracy: 0.8144\n", 294 | "Epoch 5/5\n", 295 | "312/312 [==============================] - 1s 3ms/step - loss: 0.3830 - accuracy: 0.8279\n" 296 | ] 297 | }, 298 | { 299 | "data": { 300 | "text/plain": [ 301 | "" 302 | ] 303 | }, 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "batch_size = 32\n", 311 | "\n", 312 | "model.fit_generator(\n", 313 | " bootstrap_sample_generator(batch_size),\n", 314 | " steps_per_epoch=10_000 // batch_size,\n", 315 | " epochs=5,\n", 316 | " max_queue_size=10,\n", 317 | ")" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [] 326 | } 327 | ], 328 | "metadata": { 329 | "kernelspec": { 330 | "display_name": "Python 3", 331 | "language": "python", 332 | "name": 
"python3" 333 | }, 334 | "language_info": { 335 | "codemirror_mode": { 336 | "name": "ipython", 337 | "version": 3 338 | }, 339 | "file_extension": ".py", 340 | "mimetype": "text/x-python", 341 | "name": "python", 342 | "nbconvert_exporter": "python", 343 | "pygments_lexer": "ipython3", 344 | "version": "3.6.8" 345 | } 346 | }, 347 | "nbformat": 4, 348 | "nbformat_minor": 2 349 | } 350 | -------------------------------------------------------------------------------- /notebooks/Variable Length Features.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.datasets import make_classification" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Variable Length Features\n", 17 | "\n", 18 | "Now we start to get into the stuff that NNs shine at. \n", 19 | "\n", 20 | "So we are still focusing on typical datasets, so no NL or images etc. But this time we are adding one more caveat, we can have variable length features. \n", 21 | "\n", 22 | "One example of this is trying to classify whether somebody will default on their loan given all of the credit cards that they have. \n", 23 | "\n", 24 | "Before what you'd have to do is look at aggregations of those features like: average balance of all the credit cards, max balance, etc.\n", 25 | "\n", 26 | "Now with NNs we can use all of those features directly.\n", 27 | "\n", 28 | "---\n", 29 | "\n", 30 | "To practice with this data we will need to do some work create it. We will start by using some more advanced features from the make classification function:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "make_classification?" 
40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "base_dataset = make_classification(\n", 49 | " n_samples=10_000, \n", 50 | " n_features=30, \n", 51 | " n_informative=10,\n", 52 | " n_clusters_per_class=2,\n", 53 | " n_classes=4)\n", 54 | "\n", 55 | "x, y = base_dataset" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "Notice that this time we have four classes. We will use those to create two classes below. But before that we will normalize the data:" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "from sklearn.preprocessing import StandardScaler\n", 72 | "\n", 73 | "ss = StandardScaler()\n", 74 | "\n", 75 | "x_standardized = ss.fit_transform(x)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "base_classes = []\n", 85 | "\n", 86 | "for i in range(4):\n", 87 | " base_classes.append(x_standardized[y == i])" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 6, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "import numpy as np\n", 97 | "\n", 98 | "num_points = 5_000\n", 99 | "class1_dist = [.5, .5, 0, 0]\n", 100 | "class2_dist = [0, .2, .6, .2]\n", 101 | "\n", 102 | "def make_var_len_feature_point(dist):\n", 103 | " feature_sets = []\n", 104 | " num_features = np.random.randint(3, 11)\n", 105 | " for _ in range(num_features):\n", 106 | " # choose which distribution the credit card comes from\n", 107 | " base_class = np.random.choice([0, 1, 2, 3], 1, p=dist)\n", 108 | " base_class_points = base_classes[base_class[0]]\n", 109 | " feature_set_idx = np.random.choice(base_class_points.shape[0], 1)\n", 110 | " feature_sets.append(base_class_points[feature_set_idx])\n", 111 | " \n", 112 | " for _ in range(10 - 
num_features):\n", 113 | " feature_sets.append(np.zeros((1, 30)))\n", 114 | "\n", 115 | " return np.concatenate(feature_sets)[np.newaxis, :, :]\n", 116 | "\n", 117 | "\n", 118 | "class1_points = []\n", 119 | "for _ in range(num_points):\n", 120 | " class1_points.append(\n", 121 | " make_var_len_feature_point(class1_dist))\n", 122 | "class1_points = np.concatenate(class1_points)\n", 123 | " \n", 124 | "class2_points = []\n", 125 | "for _ in range(num_points):\n", 126 | " class2_points.append(\n", 127 | " make_var_len_feature_point(class2_dist))\n", 128 | "class2_points = np.concatenate(class2_points)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 7, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "(5000, 10, 30)" 140 | ] 141 | }, 142 | "execution_count": 7, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "class2_points.shape" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Notice that we have two classes above and that they have a variable number of feature sets (or in concrete terms, our customers have a variable number of credit cards). Each feature set represents information about a single credit card (thus it is a series of numbers).\n", 156 | "\n", 157 | "I'm making the customers in the two classes distinct by saying that the credit cards they generally have are distinct. Thus the two class distributions above signify that they generally have different types of credit cards.\n", 158 | "\n", 159 | "The final thing to notice here is that we pad customers who have fewer than 10 cards up to 10 with rows of zeros. Unfortunately this is necessary if you want to have batch sizes greater than 1. 
That being said, in more sophisticated applications, you will see people group customers with a similar number of cards together and run on batches of the same size.\n", 160 | "\n", 161 | "---\n", 162 | "\n", 163 | "Ultimately we end up with data that consists of customers coming from different classes that have different credit cards. \n", 164 | "\n", 165 | "The next step is to make the model." 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 8, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "def bootstrap_sample_generator(batch_size):\n", 175 | " while True:\n", 176 | " batch_idx = np.random.choice(\n", 177 | " class1_points.shape[0], batch_size // 2)\n", 178 | " batch_x = np.concatenate([\n", 179 | " class1_points[batch_idx],\n", 180 | " class2_points[batch_idx],\n", 181 | " ])\n", 182 | " batch_y = np.concatenate([\n", 183 | " np.zeros(batch_size // 2),\n", 184 | " np.ones(batch_size // 2),\n", 185 | " ])\n", 186 | " yield ({'numeric_inputs': batch_x}, \n", 187 | " {'output': batch_y})" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 9, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "import tensorflow as tf\n", 197 | "\n", 198 | "p = .1" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "Notice that we are back to just having one input." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 10, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "inputs = tf.keras.layers.Input((10, 30), name='numeric_inputs')" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "This is where the big difference lies. We want to operate on a variable number of inputs. So sometimes there are 4 cards and sometimes 10. 
Moreover, there is no order to these inputs.\n", 222 | "\n", 223 | "It would be nice if we could process each card separately and then combine the information about all the cards together.\n", 224 | "\n", 225 | "And we can do that with two layers:\n", 226 | "\n", 227 | "1. Conv1D: we use a convolution layer to apply the same operation to each feature set, thus processing each card separately\n", 228 | "2. GlobalMax/AveragePool: we use this layer to combine information from all the cards together into one vector" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 11, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "x = tf.keras.layers.Dropout(p)(inputs)\n", 238 | "# notice I use a kernel size of 1\n", 239 | "# this is because there is no information given by adjacency\n", 240 | "x = tf.keras.layers.Conv1D(10, 1)(x)\n", 241 | "x = tf.keras.layers.Activation('relu')(x)\n", 242 | "\n", 243 | "global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)\n", 244 | "global_max = tf.keras.layers.GlobalMaxPool1D()(x)\n", 245 | "x = tf.keras.layers.Concatenate()([global_ave, global_max])\n", 246 | "\n", 247 | "x = tf.keras.layers.BatchNormalization()(x)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "Notice that we still use batch norm and dropout like before. This time, though, the work is done in the convolution and the pooling layers.\n", 255 | "\n", 256 | "---\n", 257 | "\n", 258 | "The next step is a bit of a bonus, but I think it is a cool addition. The one problem with the above is that we consider each card separately. 
So one technique that has been highly effective is adding in global information to the original inputs.\n", 259 | "\n", 260 | "The way I think about this is: let's first consider all the credit cards separately and combine that information, then let's re-examine them all in light of that information.\n", 261 | "\n", 262 | "We do this by adding that global information back onto the original inputs and then repeating the same operations we did above:" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 12, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "# bonus\n", 272 | "x = tf.keras.layers.RepeatVector(10)(x)\n", 273 | "x = tf.keras.layers.Concatenate()([inputs, x])\n", 274 | "\n", 275 | "x = tf.keras.layers.Dropout(p)(x)\n", 276 | "x = tf.keras.layers.Conv1D(10, 1)(x)\n", 277 | "x = tf.keras.layers.Activation('relu')(x)\n", 278 | "\n", 279 | "global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)\n", 280 | "global_max = tf.keras.layers.GlobalMaxPool1D()(x)\n", 281 | "x = tf.keras.layers.Concatenate()([global_ave, global_max])\n", 282 | "\n", 283 | "x = tf.keras.layers.BatchNormalization()(x)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "Now that we have gathered all this information about the credit cards, we will feed it through the same old network we had before." 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 13, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "x = tf.keras.layers.Dropout(p)(x)\n", 300 | "x = tf.keras.layers.Dense(100, activation='relu')(x)\n", 301 | "\n", 302 | "x = tf.keras.layers.BatchNormalization()(x)\n", 303 | "x = tf.keras.layers.Dropout(p)(x)\n", 304 | "x = tf.keras.layers.Dense(20, activation='relu')(x)\n", 305 | "\n", 306 | "x = tf.keras.layers.BatchNormalization()(x)\n", 307 | "x = tf.keras.layers.Dropout(p)(x)\n", 308 | "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", 309 
| "\n", 310 | "x = tf.keras.layers.BatchNormalization()(x)\n", 311 | "x = tf.keras.layers.Dropout(p)(x)\n", 312 | "out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 14, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "model = tf.keras.models.Model(inputs=inputs, outputs=out)\n", 322 | "model.compile(optimizer='rmsprop',\n", 323 | " loss='binary_crossentropy',\n", 324 | " metrics=['accuracy'])" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 15, 330 | "metadata": {}, 331 | "outputs": [ 332 | { 333 | "name": "stdout", 334 | "output_type": "stream", 335 | "text": [ 336 | "Model: \"model\"\n", 337 | "__________________________________________________________________________________________________\n", 338 | "Layer (type) Output Shape Param # Connected to \n", 339 | "==================================================================================================\n", 340 | "numeric_inputs (InputLayer) [(None, 10, 30)] 0 \n", 341 | "__________________________________________________________________________________________________\n", 342 | "dropout (Dropout) (None, 10, 30) 0 numeric_inputs[0][0] \n", 343 | "__________________________________________________________________________________________________\n", 344 | "conv1d (Conv1D) (None, 10, 10) 310 dropout[0][0] \n", 345 | "__________________________________________________________________________________________________\n", 346 | "activation (Activation) (None, 10, 10) 0 conv1d[0][0] \n", 347 | "__________________________________________________________________________________________________\n", 348 | "global_average_pooling1d (Globa (None, 10) 0 activation[0][0] \n", 349 | "__________________________________________________________________________________________________\n", 350 | "global_max_pooling1d (GlobalMax (None, 10) 0 activation[0][0] \n", 351 | 
"__________________________________________________________________________________________________\n", 352 | "concatenate (Concatenate) (None, 20) 0 global_average_pooling1d[0][0] \n", 353 | " global_max_pooling1d[0][0] \n", 354 | "__________________________________________________________________________________________________\n", 355 | "batch_normalization_v2 (BatchNo (None, 20) 80 concatenate[0][0] \n", 356 | "__________________________________________________________________________________________________\n", 357 | "repeat_vector (RepeatVector) (None, 10, 20) 0 batch_normalization_v2[0][0] \n", 358 | "__________________________________________________________________________________________________\n", 359 | "concatenate_1 (Concatenate) (None, 10, 50) 0 numeric_inputs[0][0] \n", 360 | " repeat_vector[0][0] \n", 361 | "__________________________________________________________________________________________________\n", 362 | "dropout_1 (Dropout) (None, 10, 50) 0 concatenate_1[0][0] \n", 363 | "__________________________________________________________________________________________________\n", 364 | "conv1d_1 (Conv1D) (None, 10, 10) 510 dropout_1[0][0] \n", 365 | "__________________________________________________________________________________________________\n", 366 | "activation_1 (Activation) (None, 10, 10) 0 conv1d_1[0][0] \n", 367 | "__________________________________________________________________________________________________\n", 368 | "global_average_pooling1d_1 (Glo (None, 10) 0 activation_1[0][0] \n", 369 | "__________________________________________________________________________________________________\n", 370 | "global_max_pooling1d_1 (GlobalM (None, 10) 0 activation_1[0][0] \n", 371 | "__________________________________________________________________________________________________\n", 372 | "concatenate_2 (Concatenate) (None, 20) 0 global_average_pooling1d_1[0][0] \n", 373 | " global_max_pooling1d_1[0][0] \n", 374 | 
"__________________________________________________________________________________________________\n", 375 | "batch_normalization_v2_1 (Batch (None, 20) 80 concatenate_2[0][0] \n", 376 | "__________________________________________________________________________________________________\n", 377 | "dropout_2 (Dropout) (None, 20) 0 batch_normalization_v2_1[0][0] \n", 378 | "__________________________________________________________________________________________________\n", 379 | "dense (Dense) (None, 100) 2100 dropout_2[0][0] \n", 380 | "__________________________________________________________________________________________________\n", 381 | "batch_normalization_v2_2 (Batch (None, 100) 400 dense[0][0] \n", 382 | "__________________________________________________________________________________________________\n", 383 | "dropout_3 (Dropout) (None, 100) 0 batch_normalization_v2_2[0][0] \n", 384 | "__________________________________________________________________________________________________\n", 385 | "dense_1 (Dense) (None, 20) 2020 dropout_3[0][0] \n", 386 | "__________________________________________________________________________________________________\n", 387 | "batch_normalization_v2_3 (Batch (None, 20) 80 dense_1[0][0] \n", 388 | "__________________________________________________________________________________________________\n", 389 | "dropout_4 (Dropout) (None, 20) 0 batch_normalization_v2_3[0][0] \n", 390 | "__________________________________________________________________________________________________\n", 391 | "dense_2 (Dense) (None, 10) 210 dropout_4[0][0] \n", 392 | "__________________________________________________________________________________________________\n", 393 | "batch_normalization_v2_4 (Batch (None, 10) 40 dense_2[0][0] \n", 394 | "__________________________________________________________________________________________________\n", 395 | "dropout_5 (Dropout) (None, 10) 0 batch_normalization_v2_4[0][0] \n", 396 | 
"__________________________________________________________________________________________________\n", 397 | "output (Dense) (None, 1) 11 dropout_5[0][0] \n", 398 | "==================================================================================================\n", 399 | "Total params: 5,841\n", 400 | "Trainable params: 5,501\n", 401 | "Non-trainable params: 340\n", 402 | "__________________________________________________________________________________________________\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "model.summary()" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 16, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "name": "stdout", 417 | "output_type": "stream", 418 | "text": [ 419 | "Epoch 1/5\n", 420 | "312/312 [==============================] - 6s 19ms/step - loss: 0.5587 - accuracy: 0.7149\n", 421 | "Epoch 2/5\n", 422 | "312/312 [==============================] - 2s 7ms/step - loss: 0.3034 - accuracy: 0.8735\n", 423 | "Epoch 3/5\n", 424 | "312/312 [==============================] - 2s 6ms/step - loss: 0.2233 - accuracy: 0.9131\n", 425 | "Epoch 4/5\n", 426 | "312/312 [==============================] - 2s 5ms/step - loss: 0.1833 - accuracy: 0.9295\n", 427 | "Epoch 5/5\n", 428 | "312/312 [==============================] - 2s 6ms/step - loss: 0.1676 - accuracy: 0.9359\n" 429 | ] 430 | }, 431 | { 432 | "data": { 433 | "text/plain": [ 434 | "" 435 | ] 436 | }, 437 | "execution_count": 16, 438 | "metadata": {}, 439 | "output_type": "execute_result" 440 | } 441 | ], 442 | "source": [ 443 | "batch_size = 32\n", 444 | "\n", 445 | "model.fit_generator(\n", 446 | " bootstrap_sample_generator(batch_size),\n", 447 | " steps_per_epoch=10_000 // batch_size,\n", 448 | " epochs=5,\n", 449 | " max_queue_size=10,\n", 450 | ")" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "Our next lesson will be pretty similar to this one, but we will be working with ordered data 
instead." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": {}, 471 | "outputs": [], 472 | "source": [] 473 | } 474 | ], 475 | "metadata": { 476 | "kernelspec": { 477 | "display_name": "Python 3", 478 | "language": "python", 479 | "name": "python3" 480 | }, 481 | "language_info": { 482 | "codemirror_mode": { 483 | "name": "ipython", 484 | "version": 3 485 | }, 486 | "file_extension": ".py", 487 | "mimetype": "text/x-python", 488 | "name": "python", 489 | "nbconvert_exporter": "python", 490 | "pygments_lexer": "ipython3", 491 | "version": "3.6.8" 492 | } 493 | }, 494 | "nbformat": 4, 495 | "nbformat_minor": 2 496 | } 497 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.7.1 2 | alabaster==0.7.12 3 | appnope==0.1.0 4 | astor==0.7.1 5 | attrs==19.1.0 6 | Babel==2.6.0 7 | backcall==0.1.0 8 | bleach==3.1.1 9 | certifi==2019.3.9 10 | chardet==3.0.4 11 | decorator==4.4.0 12 | defusedxml==0.6.0 13 | docutils==0.14 14 | entrypoints==0.3 15 | gast==0.2.2 16 | google-pasta==0.1.5 17 | grpcio==1.20.0 18 | h5py==2.9.0 19 | idna==2.8 20 | imagesize==1.1.0 21 | ipykernel==5.1.0 22 | ipyparallel==6.2.3 23 | ipython==7.4.0 24 | ipython-genutils==0.2.0 25 | ipywidgets==7.4.2 26 | jedi==0.13.3 27 | Jinja2==2.10.1 28 | jsonschema==3.0.1 29 | jupyter-client==5.2.4 30 | jupyter-core==4.4.0 31 | Keras-Applications==1.0.7 32 | Keras-Preprocessing==1.0.9 33 | Markdown==3.1 34 | MarkupSafe==1.1.1 35 | mistune==0.8.4 36 | nbconvert==5.4.1 37 | nbformat==4.4.0 38 | nose==1.3.7 39 | notebook==5.7.8 40 | numpy==1.16.3 41 | packaging==19.0 42 | pandas==0.24.2 43 | pandocfilters==1.4.2 44 | parso==0.4.0 45 | pexpect==4.7.0 46 | pickleshare==0.7.5 47 | 
prometheus-client==0.6.0 48 | prompt-toolkit==2.0.9 49 | protobuf==3.7.1 50 | ptyprocess==0.6.0 51 | Pygments==2.3.1 52 | pyparsing==2.4.0 53 | pyrsistent==0.14.11 54 | python-dateutil==2.8.0 55 | pytz==2019.1 56 | pyzmq==18.0.1 57 | qtconsole==4.4.3 58 | requests==2.21.0 59 | scikit-learn==0.20.3 60 | scipy==1.2.1 61 | Send2Trash==1.5.0 62 | six==1.12.0 63 | sklearn==0.0 64 | snowballstemmer==1.2.1 65 | Sphinx==2.0.1 66 | sphinxcontrib-applehelp==1.0.1 67 | sphinxcontrib-devhelp==1.0.1 68 | sphinxcontrib-htmlhelp==1.0.2 69 | sphinxcontrib-jsmath==1.0.1 70 | sphinxcontrib-qthelp==1.0.2 71 | sphinxcontrib-serializinghtml==1.1.3 72 | tb-nightly==1.14.0a20190301 73 | tensorflow==2.0.0a0 74 | termcolor==1.1.0 75 | terminado==0.8.2 76 | testpath==0.4.2 77 | tf-estimator-nightly==1.14.0.dev2019030115 78 | tornado==6.0.2 79 | traitlets==4.3.2 80 | urllib3==1.24.2 81 | wcwidth==0.1.7 82 | webencodings==0.5.1 83 | Werkzeug==0.15.3 84 | widgetsnbextension==3.4.2 85 | --------------------------------------------------------------------------------