├── README.md ├── utf-8''TFDS-Week1-Question.ipynb ├── utf-8''TFDS-Week3-Question.ipynb ├── utf-8''TFDS-Week2-Question.ipynb └── utf-8''TFDS-Week4-Question.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Data-Pipelines-with-TensorFlow-Data-Services-Exercises 2 | This repository contains the exercise notebooks for the [Data Pipelines with TensorFlow Data Services (Coursera) course](https://www.coursera.org/learn/data-pipelines-tensorflow). 3 | -------------------------------------------------------------------------------- /utf-8''TFDS-Week1-Question.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Rock, Paper, Scissors\n", 8 | "\n", 9 | "In this week's exercise you will be working with TFDS and the rock-paper-scissors dataset. You'll do a few tasks such as exploring the info of the dataset in order to figure out the name of the splits. You'll also write code to see if the dataset supports the new S3 API before creating your own versions of the dataset." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Setup" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "colab": {}, 24 | "colab_type": "code", 25 | "id": "TTBSvHcSLBzc" 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "# Use all imports\n", 30 | "from os import getcwd\n", 31 | "\n", 32 | "import tensorflow as tf\n", 33 | "import tensorflow_datasets as tfds" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Extract the Rock, Paper, Scissors Dataset\n", 41 | "\n", 42 | "In the cell below, you will extract the `rock_paper_scissors` dataset and then print its info. Take note of the splits, what they're called, and their size." 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": { 49 | "colab": {}, 50 | "colab_type": "code", 51 | "id": "KGsVrzy84WI2" 52 | }, 53 | "outputs": [ 54 | { 55 | "name": "stderr", 56 | "output_type": "stream", 57 | "text": [ 58 | "WARNING:absl:Found a different version 3.0.0 of dataset rock_paper_scissors in data_dir /tf/week1/../tmp2. 
Using currently defined version 1.0.0.\n" 59 | ] 60 | }, 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "tfds.core.DatasetInfo(\n", 66 | " name='rock_paper_scissors',\n", 67 | " version=1.0.0,\n", 68 | " description='Images of hands playing rock, paper, scissor game.',\n", 69 | " urls=['http://laurencemoroney.com/rock-paper-scissors-dataset'],\n", 70 | " features=FeaturesDict({\n", 71 | " 'image': Image(shape=(300, 300, 3), dtype=tf.uint8),\n", 72 | " 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),\n", 73 | " }),\n", 74 | " total_num_examples=2892,\n", 75 | " splits={\n", 76 | " 'test': 372,\n", 77 | " 'train': 2520,\n", 78 | " },\n", 79 | " supervised_keys=('image', 'label'),\n", 80 | " citation=\"\"\"@ONLINE {rps,\n", 81 | " author = \"Laurence Moroney\",\n", 82 | " title = \"Rock, Paper, Scissors Dataset\",\n", 83 | " month = \"feb\",\n", 84 | " year = \"2019\",\n", 85 | " url = \"http://laurencemoroney.com/rock-paper-scissors-dataset\"\n", 86 | " }\"\"\",\n", 87 | " redistribution_info=,\n", 88 | ")\n", 89 | "\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "# EXERCISE: Use tfds.load to extract the rock_paper_scissors dataset.\n", 95 | "\n", 96 | "filePath = f\"{getcwd()}/../tmp2\"\n", 97 | "data, info = tfds.load(name='rock_paper_scissors', with_info=True, data_dir=filePath)# YOUR CODE HERE (Include the following argument in your code: data_dir=filePath)\n", 98 | "print(info)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 11, 104 | "metadata": { 105 | "colab": {}, 106 | "colab_type": "code", 107 | "id": "epPGTUqE5Z2E" 108 | }, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "test:372\n", 115 | "train:2520\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "# EXERCISE: In the space below, write code that iterates through the splits\n", 121 | "# without hardcoding any keys. The code should extract 'test' and 'train' as\n", 122 | "# the keys, and then print out the number of items in the dataset for each key. \n", 123 | "# HINT: num_examples property is very useful here.\n", 124 | "\n", 125 | "for key, value in data.items():# YOUR CODE HERE:\n", 126 | " print(key + \":\" + str(info.splits[key].num_examples))\n", 127 | "\n", 128 | "# EXPECTED OUTPUT\n", 129 | "# test:372\n", 130 | "# train:2520" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## Use the New S3 API\n", 138 | "\n", 139 | "Before using the new S3 API, you must first find out whether the `rock_paper_scissors` dataset implements the new S3 API. In the cell below you should use version `3.*.*` of the `rock_paper_scissors` dataset." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 12, 145 | "metadata": { 146 | "colab": {}, 147 | "colab_type": "code", 148 | "id": "Ms5ld5Ov6_OP" 149 | }, 150 | "outputs": [ 151 | { 152 | "name": "stdout", 153 | "output_type": "stream", 154 | "text": [ 155 | "True\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "# EXERCISE: In the space below, use the tfds.builder to fetch the\n", 161 | "# rock_paper_scissors dataset and check to see if it supports the\n", 162 | "# new S3 API. 
\n", 163 | "# HINT: The builder should 'implement' something\n", 164 | "\n", 165 | "rps_builder = tfds.builder(\"rock_paper_scissors:3.*.*\", data_dir=filePath)# YOUR CODE HERE (Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 166 | "\n", 167 | "print(rps_builder.version.implements(tfds.core.Experiment.S3))\n", 168 | "\n", 169 | "# Expected output:\n", 170 | "# True" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "## Create New Datasets with the S3 API\n", 178 | "\n", 179 | "Sometimes datasets are too big for prototyping. In the cell below, you will create a smaller dataset, where instead of using all of the training data and all of the test data, you instead have a `small_train` and `small_test` each of which are comprised of the first 10% of the records in their respective datasets." 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 21, 185 | "metadata": { 186 | "colab": {}, 187 | "colab_type": "code", 188 | "id": "QMGkJW6j7Ldl" 189 | }, 190 | "outputs": [], 191 | "source": [ 192 | "# EXERCISE: In the space below, create two small datasets, `small_train`\n", 193 | "# and `small_test`, each of which are comprised of the first 10% of the\n", 194 | "# records in their respective datasets.\n", 195 | "\n", 196 | "small_train = tfds.load(\"rock_paper_scissors:3.*.*\", data_dir=filePath, split='train[:10%]') # YOUR CODE HERE (Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 197 | "small_test = tfds.load(\"rock_paper_scissors:3.*.*\", data_dir=filePath, split='test[:10%]') # Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 198 | "\n", 199 | "# No expected output yet, that's in the next cell" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 22, 205 | "metadata": { 206 | "colab": {}, 207 | "colab_type": "code", 208 | "id": "SOm99-zO_nAe" 209 | }, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "252\n", 216 | "37\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "# EXERCISE: Print out the size (length) of the small versions of the datasets.\n", 222 | "\n", 223 | "print(len(list(small_train)))\n", 224 | "print(len(list(small_test)))\n", 225 | "\n", 226 | "# Expected output\n", 227 | "# 252\n", 228 | "# 37" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "The original dataset doesn't have a validation set, just training and testing sets. In the cell below, you will use TFDS to create new datasets according to these rules:\n", 236 | "\n", 237 | "* `new_train`: The new training set should be the first 90% of the original training set.\n", 238 | "\n", 239 | "\n", 240 | "* `new_test`: The new test set should be the first 90% of the original test set.\n", 241 | "\n", 242 | "\n", 243 | "* `validation`: The new validation set should be the last 10% of the original training set + the last 10% of the original test set." 
244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 25, 249 | "metadata": { 250 | "colab": {}, 251 | "colab_type": "code", 252 | "id": "jL7KXYi17s_1" 253 | }, 254 | "outputs": [ 255 | { 256 | "name": "stdout", 257 | "output_type": "stream", 258 | "text": [ 259 | "2268\n", 260 | "335\n", 261 | "289\n" 262 | ] 263 | } 264 | ], 265 | "source": [ 266 | "# EXERCISE: In the space below, create 3 new datasets according to\n", 267 | "# the rules indicated above.\n", 268 | "\n", 269 | "new_train = tfds.load(\"rock_paper_scissors:3.*.*\", data_dir=filePath, split='train[:90%]') # YOUR CODE HERE (Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 270 | "print(len(list(new_train)))\n", 271 | "\n", 272 | "new_test = tfds.load(\"rock_paper_scissors:3.*.*\", data_dir=filePath, split='test[:90%]')# YOUR CODE HERE (Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 273 | "print(len(list(new_test)))\n", 274 | "\n", 275 | "validation = tfds.load(\"rock_paper_scissors:3.*.*\", data_dir=filePath, split='train[-10%:] + test[-10%:]')# YOUR CODE HERE (Include the following arguments in your code: \"rock_paper_scissors:3.*.*\", data_dir=filePath)\n", 276 | "print(len(list(validation)))\n", 277 | "\n", 278 | "# Expected output\n", 279 | "# 2268\n", 280 | "# 335\n", 281 | "# 289" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "# Submission Instructions" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "# Now click the 'Submit Assignment' button above." 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "# When you're done or would like to take a break, please run the two cells below to save your work and close the Notebook. This frees up resources for your fellow learners." 
305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 26, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "application/javascript": [ 315 | "\n", 316 | "IPython.notebook.save_checkpoint();\n" 317 | ], 318 | "text/plain": [ 319 | "" 320 | ] 321 | }, 322 | "metadata": {}, 323 | "output_type": "display_data" 324 | } 325 | ], 326 | "source": [ 327 | "%%javascript\n", 328 | "\n", 329 | "IPython.notebook.save_checkpoint();" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "%%javascript\n", 339 | "\n", 340 | "window.onbeforeunload = null\n", 341 | "window.close();\n", 342 | "IPython.notebook.session.delete();" 343 | ] 344 | } 345 | ], 346 | "metadata": { 347 | "accelerator": "GPU", 348 | "colab": { 349 | "collapsed_sections": [], 350 | "name": "Part 25 - Exercise - Question.ipynb", 351 | "provenance": [ 352 | { 353 | "file_id": "1JCok9fYE1xBsFr0GC-vFa0cUocr-Eo3j", 354 | "timestamp": 1569508700122 355 | } 356 | ] 357 | }, 358 | "coursera": { 359 | "course_slug": "data-pipelines-tensorflow", 360 | "graded_item_id": "kYLnd", 361 | "launcher_item_id": "YBlDH" 362 | }, 363 | "kernelspec": { 364 | "display_name": "Python 3", 365 | "language": "python", 366 | "name": "python3" 367 | }, 368 | "language_info": { 369 | "codemirror_mode": { 370 | "name": "ipython", 371 | "version": 3 372 | }, 373 | "file_extension": ".py", 374 | "mimetype": "text/x-python", 375 | "name": "python", 376 | "nbconvert_exporter": "python", 377 | "pygments_lexer": "ipython3", 378 | "version": "3.6.8" 379 | } 380 | }, 381 | "nbformat": 4, 382 | "nbformat_minor": 1 383 | } 384 | -------------------------------------------------------------------------------- /utf-8''TFDS-Week3-Question.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Parallelization with TFDS" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this week's exercise, we'll go back to the classic cats versus dogs example, but instead of just naively loading the data to train a model, you will be parallelizing various stages of the Extract, Transform and Load processes. In particular, you will be performing following tasks: \n", 15 | "\n", 16 | "1. Parallelize the extraction of the stored TFRecords of the cats_vs_dogs dataset by using the interleave operation.\n", 17 | "2. Parallelize the transformation during the preprocessing of the raw dataset by using the map operation.\n", 18 | "3. Cache the processed dataset in memory by using the cache operation for faster retrieval.\n", 19 | "4. Parallelize the loading of the cached dataset during the training cycle by using the prefetch operation." 
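Taken together, the four steps assemble into a single `tf.data` pipeline. The sketch below shows the overall shape, assuming `files` (a dataset of TFRecord file names) and `read_tfrecord` (the parse-and-decode function) exist; both are built step by step in the cells that follow:

```python
import multiprocessing
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def build_pipeline(files, read_tfrecord):
    # 1. Parallel extraction: interleave reads across TFRecord shards.
    ds = files.interleave(tf.data.TFRecordDataset,
                          cycle_length=4,
                          num_parallel_calls=AUTOTUNE)
    # 2. Parallel transformation: parse/decode/resize on all CPU cores.
    ds = ds.map(read_tfrecord,
                num_parallel_calls=multiprocessing.cpu_count())
    # 3. Cache the decoded examples in memory for faster retrieval.
    ds = ds.cache()
    # 4. Parallel loading: prefetch overlaps training with input preparation.
    return ds.shuffle(1024).batch(32).prefetch(AUTOTUNE)
```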
20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## Setup" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": { 33 | "colab": {}, 34 | "colab_type": "code", 35 | "id": "RoPuCbDtBlYK" 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "import multiprocessing\n", 40 | "\n", 41 | "import tensorflow as tf\n", 42 | "import tensorflow_datasets as tfds\n", 43 | "\n", 44 | "from os import getcwd" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Create and Compile the Model" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": { 58 | "colab": {}, 59 | "colab_type": "code", 60 | "id": "WOI6Dk_oJQEK" 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "def create_model():\n", 65 | " input_layer = tf.keras.layers.Input(shape=(224, 224, 3))\n", 66 | " base_model = tf.keras.applications.MobileNetV2(input_tensor=input_layer,\n", 67 | " weights='imagenet',\n", 68 | " include_top=False)\n", 69 | " base_model.trainable = False\n", 70 | " x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)\n", 71 | " x = tf.keras.layers.Dense(2, activation='softmax')(x)\n", 72 | " \n", 73 | " model = tf.keras.models.Model(inputs=input_layer, outputs=x)\n", 74 | " model.compile(optimizer='adam',\n", 75 | " loss='sparse_categorical_crossentropy',\n", 76 | " metrics=['acc'])\n", 77 | " return model" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Naive Approach\n", 85 | "\n", 86 | "Just for comparison, let's start by using the naive approach to Extract, Transform, and Load the data to train the model defined above. By naive approach we mean that we won't apply any of the new concepts of parallelization that we learned about in this module." 
87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "colab": {}, 94 | "colab_type": "code", 95 | "id": "SPjns6UfCCSn" 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "dataset_name = 'cats_vs_dogs'\n", 100 | "filePath = f\"{getcwd()}/../tmp2\"\n", 101 | "dataset, info = tfds.load(name=dataset_name, split=tfds.Split.TRAIN, with_info=True, data_dir=filePath)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": { 108 | "colab": {}, 109 | "colab_type": "code", 110 | "id": "hN3P7OWKQLG2" 111 | }, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "2.0.1\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "print(info.version)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 5, 128 | "metadata": { 129 | "colab": {}, 130 | "colab_type": "code", 131 | "id": "I3Q7Etb8ENRG" 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "def preprocess(features):\n", 136 | " image = features['image']\n", 137 | " image = tf.image.resize(image, (224, 224))\n", 138 | " image = image / 255.0\n", 139 | " return image, features['label']" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 6, 145 | "metadata": { 146 | "colab": {}, 147 | "colab_type": "code", 148 | "id": "sQCfvf4WENg2" 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "train_dataset = dataset.map(preprocess).batch(32)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": { 158 | "colab": {}, 159 | "colab_type": "code", 160 | "id": "8jyjiJd8Cvwc" 161 | }, 162 | "source": [ 163 | "The next step will be to train the model using the following code:\n", 164 | "\n", 165 | "```python\n", 166 | "model = create_model()\n", 167 | "model.fit(train_dataset, epochs=5)\n", 168 | "```\n", 169 | "Since we want to focus on the parallelization techniques, we won't go through the training process here, as this can take some time. " 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": { 175 | "colab_type": "text", 176 | "id": "c5fzrFnXLEJW" 177 | }, 178 | "source": [ 179 | "# Parallelize Various Stages of the ETL Processes\n", 180 | "\n", 181 | "The following exercises are about parallelizing various stages of Extract, Transform and Load processes. In particular, you will be tasked with performing following tasks: \n", 182 | "\n", 183 | "1. Parallelize the extraction of the stored TFRecords of the cats_vs_dogs dataset by using the interleave operation.\n", 184 | "2. Parallelize the transformation during the preprocessing of the raw dataset by using the map operation.\n", 185 | "3. Cache the processed dataset in memory by using the cache operation for faster retrieval.\n", 186 | "4. Parallelize the loading of the cached dataset during the training cycle by using the prefetch operation.\n", 187 | "\n", 188 | "We start by creating a dataset of strings corresponding to the `file_pattern` of the TFRecords of the cats_vs_dogs dataset." 
189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 8, 194 | "metadata": { 195 | "colab": {}, 196 | "colab_type": "code", 197 | "id": "S9Tqn9gALFaE" 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "file_pattern = f'{getcwd()}/../tmp2/{dataset_name}/{info.version}/{dataset_name}-train.tfrecord*'\n", 202 | "files = tf.data.Dataset.list_files(file_pattern)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Let's recall that the TFRecord format is a simple format for storing a sequence of binary records. This is very useful because by serializing the data and storing it in a set of files (100-200MB each) that can each be read linearly greatly increases the efficiency when reading the data.\n", 210 | "\n", 211 | "Since we will use it later, we should also recall that a `tf.Example` message (or protobuf) is a flexible message type that represents a `{\"string\": tf.train.Feature}` mapping." 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "colab_type": "text", 218 | "id": "bqvYsWmVS9EW" 219 | }, 220 | "source": [ 221 | "## Parallelize Extraction\n", 222 | "\n", 223 | "In the cell below you will use the [interleave](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave) operation with certain [arguments](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#args_38) to parallelize the extraction of the stored TFRecords of the cats_vs_dogs dataset.\n", 224 | "\n", 225 | "Recall that `tf.data.experimental.AUTOTUNE` will delegate the decision about what level of parallelism to use to the `tf.data` runtime." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 9, 231 | "metadata": { 232 | "colab": {}, 233 | "colab_type": "code", 234 | "id": "2zYCJMSoSHhd" 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# EXERCISE: Parallelize the extraction of the stored TFRecords of\n", 239 | "# the cats_vs_dogs dataset by using the interleave operation with\n", 240 | "# cycle_length = 4 and the number of parallel calls set to tf.data.experimental.AUTOTUNE.\n", 241 | "train_dataset = files.interleave(tf.data.TFRecordDataset, \n", 242 | " cycle_length=4,\n", 243 | " num_parallel_calls=tf.data.experimental.AUTOTUNE)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": { 249 | "colab_type": "text", 250 | "id": "OiL5S0GdTKPK" 251 | }, 252 | "source": [ 253 | "## Parse and Decode\n", 254 | "\n", 255 | "At this point the `train_dataset` contains serialized `tf.train.Example` messages. When iterated over, it returns these as scalar string tensors. The sample output for one record is given below:\n", 256 | "\n", 257 | "```\n", 258 | "\n", 259 | "```\n", 260 | "\n", 261 | "In order to be able to use these tensors to train our model, we must first parse them and decode them. We can parse and decode these string tensors by using a function. In the cell below you will create a `read_tfrecord` function that will read the serialized `tf.train.Example` messages and decode them. The function will also normalize and resize the images after they have been decoded. \n", 262 | "\n", 263 | "In order to parse the `tf.train.Example` messages we need to create a `feature_description` dictionary. We need the `feature_description` dictionary because TFDS uses graph-execution and therefore, needs this description to build their shape and type signature. 
The basic structure of the `feature_description` dictionary looks like this:\n", 264 | "\n", 265 | "```python\n", 266 | "feature_description = {'feature': tf.io.FixedLenFeature([], tf.Dtype, default_value)}\n", 267 | "```\n", 268 | "\n", 269 | "The number of features in your `feature_description` dictionary will vary depending on your dataset. In our particular case, the features are `'image'` and `'label'` and can be seen in the sample output of the string tensor above. Therefore, our `feature_description` dictionary will look like this:\n", 270 | "\n", 271 | "```python\n", 272 | "feature_description = {\n", 273 | " 'image': tf.io.FixedLenFeature((), tf.string, \"\"),\n", 274 | " 'label': tf.io.FixedLenFeature((), tf.int64, -1),\n", 275 | "}\n", 276 | "```\n", 277 | "\n", 278 | "where we have given the default values of `\"\"` and `-1` to the `'image'` and `'label'` respectively.\n", 279 | "\n", 280 | "The next step will be to parse the serialized `tf.train.Example` message using the `feature_description` dictionary given above. This can be done with the following code:\n", 281 | "\n", 282 | "```python\n", 283 | "example = tf.io.parse_single_example(serialized_example, feature_description)\n", 284 | "```\n", 285 | "\n", 286 | "Finally, we can decode the image by using:\n", 287 | "\n", 288 | "```python\n", 289 | "image = tf.io.decode_jpeg(example['image'], channels=3)\n", 290 | "```\n", 291 | "\n", 292 | "Use the code given above to complete the exercise below." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 10, 298 | "metadata": { 299 | "colab": {}, 300 | "colab_type": "code", 301 | "id": "5iWEqIYQSYgN" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "# EXERCISE: Fill in the missing code below.\n", 306 | "\n", 307 | "def read_tfrecord(serialized_example):\n", 308 | " \n", 309 | " # Create the feature description dictionary\n", 310 | " feature_description = {\n", 311 | " 'image': tf.io.FixedLenFeature((), tf.string, \"\"),\n", 312 | " 'label': tf.io.FixedLenFeature((), tf.int64, -1),\n", 313 | " }\n", 314 | " # Parse the serialized_example and decode the image\n", 315 | " example = tf.io.parse_single_example(serialized_example, feature_description)\n", 316 | " image = tf.io.decode_jpeg(example['image'], channels=3)\n", 317 | " \n", 318 | " image = tf.cast(image, tf.float32)\n", 319 | " \n", 320 | " # Normalize the pixels in the image\n", 321 | " image = image/255.\n", 322 | " \n", 323 | " # Resize the image to (224, 224) using tf.image.resize\n", 324 | " image = tf.image.resize(image, [224, 224])\n", 325 | " \n", 326 | " return image, example['label']" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Parallelize Transformation\n", 334 | "\n", 335 | "You can now apply the `read_tfrecord` function to each item in the `train_dataset` by using the `map` method. You can parallelize the transformation of the `train_dataset` by using the `map` method with the `num_parallel_calls` set to the number of CPU cores." 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 11, 341 | "metadata": { 342 | "colab": {}, 343 | "colab_type": "code", 344 | "id": "mRFO7n7odLTk" 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "8\n" 352 | ] 353 | } 354 | ], 355 | "source": [ 356 | "# EXERCISE: Fill in the missing code below.\n", 357 | "\n", 358 | "# Get the number of CPU cores. 
\n", 359 | "cores = multiprocessing.cpu_count()\n", 360 | "\n", 361 | "print(cores)\n", 362 | "\n", 363 | "# Parallelize the transformation of the train_dataset by using\n", 364 | "# the map operation with the number of parallel calls set to\n", 365 | "# the number of CPU cores.\n", 366 | "train_dataset = train_dataset.map(read_tfrecord, num_parallel_calls=cores)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": { 372 | "colab_type": "text", 373 | "id": "43XLYAvGTsew" 374 | }, 375 | "source": [ 376 | "## Cache the Dataset" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 12, 382 | "metadata": { 383 | "colab": {}, 384 | "colab_type": "code", 385 | "id": "D0zWUJ3gTuRx" 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "# EXERCISE: Cache the train_dataset in-memory.\n", 390 | "train_dataset = train_dataset.cache()" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": { 396 | "colab_type": "text", 397 | "id": "KhpFlwM8TTxO" 398 | }, 399 | "source": [ 400 | "## Parallelize Loading" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 13, 406 | "metadata": { 407 | "colab": {}, 408 | "colab_type": "code", 409 | "id": "FdZ-aTECSE2a" 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "# EXERCISE: Fill in the missing code below.\n", 414 | "\n", 415 | "# Shuffle and batch the train_dataset. Use a buffer size of 1024\n", 416 | "# for shuffling and a batch size 32 for batching. \n", 417 | "train_dataset = train_dataset.shuffle(1024).batch(32)\n", 418 | "\n", 419 | "# Parallelize the loading by prefetching the train_dataset.\n", 420 | "# Set the prefetching buffer size to tf.data.experimental.AUTOTUNE.\n", 421 | "train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": { 427 | "colab": {}, 428 | "colab_type": "code", 429 | "id": "zSMpkNrbLFoa" 430 | }, 431 | "source": [ 432 | "The next step will be to train your model using the following code:\n", 433 | "\n", 434 | "```python\n", 435 | "model = create_model()\n", 436 | "model.fit(train_dataset, epochs=5)\n", 437 | "```\n", 438 | "We won't go through the training process here as this can take some time. However, due to the parallelization of the various stages of the ETL processes, you should see a decrease in training time as compared to the naive approach depicted at beginning of the notebook." 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "# Submission Instructions" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 14, 451 | "metadata": { 452 | "colab": {}, 453 | "colab_type": "code", 454 | "id": "uJPiA98oPfrg" 455 | }, 456 | "outputs": [], 457 | "source": [ 458 | "# Now click the 'Submit Assignment' button above." 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "# When you're done or would like to take a break, please run the two cells below to save your work and close the Notebook. This frees up resources for your fellow learners." 
466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "%%javascript\n", 475 | "\n", 476 | "IPython.notebook.save_checkpoint();" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "%%javascript\n", 486 | "\n", 487 | "window.onbeforeunload = null\n", 488 | "window.close();\n", 489 | "IPython.notebook.session.delete();" 490 | ] 491 | } 492 | ], 493 | "metadata": { 494 | "accelerator": "GPU", 495 | "colab": { 496 | "collapsed_sections": [], 497 | "machine_shape": "hm", 498 | "name": "Module3 Exercise-Question.ipynb", 499 | "provenance": [], 500 | "toc_visible": true 501 | }, 502 | "coursera": { 503 | "course_slug": "data-pipelines-tensorflow", 504 | "graded_item_id": "PPCTl", 505 | "launcher_item_id": "84r9D" 506 | }, 507 | "kernelspec": { 508 | "display_name": "Python 3", 509 | "language": "python", 510 | "name": "python3" 511 | }, 512 | "language_info": { 513 | "codemirror_mode": { 514 | "name": "ipython", 515 | "version": 3 516 | }, 517 | "file_extension": ".py", 518 | "mimetype": "text/x-python", 519 | "name": "python", 520 | "nbconvert_exporter": "python", 521 | "pygments_lexer": "ipython3", 522 | "version": "3.6.8" 523 | } 524 | }, 525 | "nbformat": 4, 526 | "nbformat_minor": 1 527 | } 528 | -------------------------------------------------------------------------------- /utf-8''TFDS-Week2-Question.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "c05P9g5WjizZ" 8 | }, 9 | "source": [ 10 | "# Classify Structured Data" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "VxyBFc_kKazA" 18 | }, 19 | "source": [ 20 | "## Import TensorFlow and Other Libraries" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 24, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "9dEreb4QKizj" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import pandas as pd\n", 34 | "import tensorflow as tf\n", 35 | "\n", 36 | "from tensorflow.keras import layers\n", 37 | "from tensorflow import feature_column\n", 38 | "\n", 39 | "from os import getcwd\n", 40 | "from sklearn.model_selection import train_test_split" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": { 46 | "colab_type": "text", 47 | "id": "KCEhSZcULZ9n" 48 | }, 49 | "source": [ 50 | "## Use Pandas to Create a Dataframe\n", 51 | "\n", 52 | "[Pandas](https://pandas.pydata.org/) is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset and load it into a dataframe." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 25, 58 | "metadata": { 59 | "colab": {}, 60 | "colab_type": "code", 61 | "id": "REZ57BXCLdfG" 62 | }, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/html": [ 67 | "
\n", 68 | "\n", 81 | "\n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
agesexcptrestbpscholfbsrestecgthalachexangoldpeakslopecathaltarget
063111452331215002.330fixed0
167141602860210811.523normal1
267141202290212912.622reversible0
337131302500018703.530normal0
441021302040217201.410normal0
\n", 189 | "
" 190 | ], 191 | "text/plain": [ 192 | " age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \\\n", 193 | "0 63 1 1 145 233 1 2 150 0 2.3 3 \n", 194 | "1 67 1 4 160 286 0 2 108 1 1.5 2 \n", 195 | "2 67 1 4 120 229 0 2 129 1 2.6 2 \n", 196 | "3 37 1 3 130 250 0 0 187 0 3.5 3 \n", 197 | "4 41 0 2 130 204 0 2 172 0 1.4 1 \n", 198 | "\n", 199 | " ca thal target \n", 200 | "0 0 fixed 0 \n", 201 | "1 3 normal 1 \n", 202 | "2 2 reversible 0 \n", 203 | "3 0 normal 0 \n", 204 | "4 0 normal 0 " 205 | ] 206 | }, 207 | "execution_count": 25, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "filePath = f\"{getcwd()}/../tmp2/heart.csv\"\n", 214 | "dataframe = pd.read_csv(filePath)\n", 215 | "dataframe.head()" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": { 221 | "colab_type": "text", 222 | "id": "u0zhLtQqMPem" 223 | }, 224 | "source": [ 225 | "## Split the Dataframe Into Train, Validation, and Test Sets\n", 226 | "\n", 227 | "The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets." 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 26, 233 | "metadata": { 234 | "colab": {}, 235 | "colab_type": "code", 236 | "id": "YEOpw7LhMYsI" 237 | }, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "193 train examples\n", 244 | "49 validation examples\n", 245 | "61 test examples\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "train, test = train_test_split(dataframe, test_size=0.2)\n", 251 | "train, val = train_test_split(train, test_size=0.2)\n", 252 | "print(len(train), 'train examples')\n", 253 | "print(len(val), 'validation examples')\n", 254 | "print(len(test), 'test examples')" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": { 260 | "colab_type": "text", 261 | "id": "84ef46LXMfvu" 262 | }, 263 | "source": [ 264 | "## Create an Input Pipeline Using `tf.data`\n", 265 | "\n", 266 | "Next, we will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets). This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly." 
267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 27, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "# EXERCISE: A utility method to create a tf.data dataset from a Pandas Dataframe.\n", 276 | "\n", 277 | "def df_to_dataset(dataframe, shuffle=True, batch_size=32):\n", 278 | " dataframe = dataframe.copy()\n", 279 | " # Use Pandas dataframe's pop method to get the list of targets.\n", 280 | " labels = dataframe.pop('target')\n", 281 | "# dataframe['thal'] = pd.Categorical(dataframe['thal'])\n", 282 | "# dataframe['thal'] = dataframe.thal.cat.codes\n", 283 | " \n", 284 | " # Create a tf.data.Dataset from the dataframe and labels.\n", 285 | " ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels.values))\n", 286 | " \n", 287 | " if shuffle:\n", 288 | " # Shuffle dataset.\n", 289 | " ds = ds.shuffle(1024)\n", 290 | " \n", 291 | " # Batch dataset with specified batch_size parameter.\n", 292 | " ds = ds.batch(batch_size)\n", 293 | " \n", 294 | " return ds" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 28, 300 | "metadata": { 301 | "colab": {}, 302 | "colab_type": "code", 303 | "id": "CXbbXkJvMy34" 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "batch_size = 5 # A small batch sized is used for demonstration purposes\n", 308 | "train_ds = df_to_dataset(train, batch_size=batch_size)\n", 309 | "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", 310 | "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": { 316 | "colab_type": "text", 317 | "id": "qRLGSMDzM-dl" 318 | }, 319 | "source": [ 320 | "## Understand the Input Pipeline\n", 321 | "\n", 322 | "Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable." 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 29, 328 | "metadata": { 329 | "colab": {}, 330 | "colab_type": "code", 331 | "id": "CSBo3dUVNFc9" 332 | }, 333 | "outputs": [ 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']\n", 339 | "A batch of ages: tf.Tensor([57 50 46 60 46], shape=(5,), dtype=int32)\n", 340 | "A batch of targets: tf.Tensor([1 1 1 0 0], shape=(5,), dtype=int64)\n" 341 | ] 342 | } 343 | ], 344 | "source": [ 345 | "for feature_batch, label_batch in train_ds.take(1):\n", 346 | " print('Every feature:', list(feature_batch.keys()))\n", 347 | " print('A batch of ages:', feature_batch['age'])\n", 348 | " print('A batch of targets:', label_batch )" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": { 354 | "colab_type": "text", 355 | "id": "OT5N6Se-NQsC" 356 | }, 357 | "source": [ 358 | "We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe." 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": { 364 | "colab_type": "text", 365 | "id": "ttIvgLRaNoOQ" 366 | }, 367 | "source": [ 368 | "## Create Several Types of Feature Columns\n", 369 | "\n", 370 | "TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe." 
371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 30, 376 | "metadata": { 377 | "colab": {}, 378 | "colab_type": "code", 379 | "id": "mxwiHFHuNhmf" 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "# Try to demonstrate several types of feature columns by getting an example.\n", 384 | "example_batch = next(iter(train_ds))[0]" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 31, 390 | "metadata": { 391 | "colab": {}, 392 | "colab_type": "code", 393 | "id": "0wfLB8Q3N3UH" 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "# A utility method to create a feature column and to transform a batch of data.\n", 398 | "def demo(feature_column):\n", 399 | " feature_layer = layers.DenseFeatures(feature_column, dtype='float64')\n", 400 | " print(feature_layer(example_batch).numpy())" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": { 406 | "colab_type": "text", 407 | "id": "Q7OEKe82N-Qb" 408 | }, 409 | "source": [ 410 | "### Numeric Columns\n", 411 | "\n", 412 | "The output of a feature column becomes the input to the model (using the demo function defined above, we will be able to see exactly how each column from the dataframe is transformed). A [numeric column](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) is the simplest type of column. It is used to represent real valued features. " 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 32, 418 | "metadata": { 419 | "colab": {}, 420 | "colab_type": "code", 421 | "id": "QZTZ0HnHOCxC" 422 | }, 423 | "outputs": [ 424 | { 425 | "name": "stdout", 426 | "output_type": "stream", 427 | "text": [ 428 | "[[44.]\n", 429 | " [48.]\n", 430 | " [52.]\n", 431 | " [67.]\n", 432 | " [44.]]\n" 433 | ] 434 | } 435 | ], 436 | "source": [ 437 | "# EXERCISE: Create a numeric feature column out of 'age' and demo it.\n", 438 | "age = tf.feature_column.numeric_column('age')\n", 439 | "\n", 440 | "demo(age)" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": { 446 | "colab_type": "text", 447 | "id": "7a6ddSyzOKpq" 448 | }, 449 | "source": [ 450 | "In the heart disease dataset, most columns from the dataframe are numeric." 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": { 456 | "colab_type": "text", 457 | "id": "IcSxUoYgOlA1" 458 | }, 459 | "source": [ 460 | "### Bucketized Columns\n", 461 | "\n", 462 | "Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column). " 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 34, 468 | "metadata": { 469 | "colab": {}, 470 | "colab_type": "code", 471 | "id": "wJ4Wt3SAOpTQ" 472 | }, 473 | "outputs": [ 474 | { 475 | "name": "stdout", 476 | "output_type": "stream", 477 | "text": [ 478 | "[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]\n", 479 | " [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]\n", 480 | " [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]\n", 481 | " [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]\n", 482 | " [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 
0.]]\n" 483 | ] 484 | } 485 | ], 486 | "source": [ 487 | "# EXERCISE: Create a bucketized feature column out of 'age' with\n", 488 | "# the following boundaries and demo it.\n", 489 | "boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]\n", 490 | "\n", 491 | "age_buckets = tf.feature_column.bucketized_column(age, boundaries)\n", 492 | "\n", 493 | "demo(age_buckets)" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": { 499 | "colab_type": "text", 500 | "id": "-me1NKJ4BIEB" 501 | }, 502 | "source": [ 503 | "Notice the one-hot values above describe which age range each row matches." 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": { 509 | "colab_type": "text", 510 | "id": "r1tArzewPb-b" 511 | }, 512 | "source": [ 513 | "### Categorical Columns\n", 514 | "\n", 515 | "In this dataset, thal is represented as a string (e.g. 'fixed', 'normal', or 'reversible'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). \n", 516 | "\n", 517 | "**Note**: You will probably see some warning messages when running some of the code cell below. These warnings have to do with software updates and should not cause any errors or prevent your code from running." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 40, 523 | "metadata": { 524 | "colab": {}, 525 | "colab_type": "code", 526 | "id": "DJ6QnSHkPtOC" 527 | }, 528 | "outputs": [ 529 | { 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4276: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", 534 | "Instructions for updating:\n", 535 | "The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.\n", 536 | "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", 537 | "Instructions for updating:\n", 538 | "The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.\n", 539 | "[[1. 0. 0.]\n", 540 | " [0. 1. 0.]\n", 541 | " [0. 0. 1.]\n", 542 | " [0. 0. 1.]\n", 543 | " [0. 1. 
0.]]\n" 544 | ] 545 | } 546 | ], 547 | "source": [ 548 | "# EXERCISE: Create a categorical vocabulary column out of the\n", 549 | "# above mentioned categories with the key specified as 'thal'.\n", 550 | "thal = tf.feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])\n", 551 | "\n", 552 | "# EXERCISE: Create an indicator column out of the created categorical column.\n", 553 | "thal_one_hot = tf.feature_column.indicator_column(thal)\n", 554 | "\n", 555 | "demo(thal_one_hot)" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": { 561 | "colab_type": "text", 562 | "id": "zQT4zecNBtji" 563 | }, 564 | "source": [ 565 | "The vocabulary can be passed as a list using [categorical_column_with_vocabulary_list](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list), or loaded from a file using [categorical_column_with_vocabulary_file](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file)." 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": { 571 | "colab_type": "text", 572 | "id": "LEFPjUr6QmwS" 573 | }, 574 | "source": [ 575 | "### Embedding Columns\n", 576 | "\n", 577 | "Suppose instead of having just a few possible strings, we have thousands (or more) values per category. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an [embedding column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column) represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. You can tune the size of the embedding with the `dimension` parameter." 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 41, 583 | "metadata": { 584 | "colab": {}, 585 | "colab_type": "code", 586 | "id": "hSlohmr2Q_UU" 587 | }, 588 | "outputs": [ 589 | { 590 | "name": "stdout", 591 | "output_type": "stream", 592 | "text": [ 593 | "[[ 0.28724766 0.21722521 0.42196316 0.39877316 -0.34977257 -0.36904067\n", 594 | " -0.25824752 -0.12017009]\n", 595 | " [ 0.06605755 0.311711 -0.18009546 -0.5645003 -0.4292082 0.00976347\n", 596 | " 0.08982086 0.09808949]\n", 597 | " [ 0.29693002 -0.0981563 -0.24770111 -0.5202013 -0.11236827 -0.6415816\n", 598 | " 0.58798677 -0.24337412]\n", 599 | " [ 0.29693002 -0.0981563 -0.24770111 -0.5202013 -0.11236827 -0.6415816\n", 600 | " 0.58798677 -0.24337412]\n", 601 | " [ 0.06605755 0.311711 -0.18009546 -0.5645003 -0.4292082 0.00976347\n", 602 | " 0.08982086 0.09808949]]\n" 603 | ] 604 | } 605 | ], 606 | "source": [ 607 | "# EXERCISE: Create an embedding column out of the categorical\n", 608 | "# vocabulary you just created (thal). 
Set the size of the \n", 609 | "# embedding to 8, by using the dimension parameter.\n", 610 | "\n", 611 | "thal_embedding = tf.feature_column.embedding_column(thal, 8)\n", 612 | "\n", 613 | "\n", 614 | "demo(thal_embedding)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": { 620 | "colab_type": "text", 621 | "id": "urFCAvTVRMpB" 622 | }, 623 | "source": [ 624 | "### Hashed Feature Columns\n", 625 | "\n", 626 | "Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space." 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 42, 632 | "metadata": { 633 | "colab": {}, 634 | "colab_type": "code", 635 | "id": "YHU_Aj2nRRDC" 636 | }, 637 | "outputs": [ 638 | { 639 | "name": "stdout", 640 | "output_type": "stream", 641 | "text": [ 642 | "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: HashedCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", 643 | "Instructions for updating:\n", 644 | "The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.\n", 645 | "[[0. 0. 0. ... 0. 0. 0.]\n", 646 | " [0. 0. 0. ... 0. 0. 0.]\n", 647 | " [0. 0. 0. ... 0. 0. 0.]\n", 648 | " [0. 0. 0. ... 0. 0. 0.]\n", 649 | " [0. 0. 0. ... 0. 0. 0.]]\n" 650 | ] 651 | } 652 | ], 653 | "source": [ 654 | "# EXERCISE: Create a hashed feature column with 'thal' as the key and \n", 655 | "# 1000 hash buckets.\n", 656 | "thal_hashed = tf.feature_column.categorical_column_with_hash_bucket('thal', 1000)\n", 657 | "\n", 658 | "demo(feature_column.indicator_column(thal_hashed))" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": { 664 | "colab_type": "text", 665 | "id": "fB94M27DRXtZ" 666 | }, 667 | "source": [ 668 | "### Crossed Feature Columns\n", 669 | "Combining features into a single feature, better known as [feature crosses](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of age and thal. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a `hashed_column`, so you can choose how large the table is." 
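Conceptually, both hash-bucket and crossed columns rely on the hashing trick: each string (or combined feature value) is hashed and reduced modulo the bucket count, trading occasional collisions for a bounded table size. A simplified, deterministic stand-in for the idea (TensorFlow internally uses its own fingerprint hash, not `crc32`):

```python
import zlib

def bucket_for(value, num_buckets=1000):
    # Deterministic stand-in for TensorFlow's internal fingerprint hash.
    return zlib.crc32(value.encode('utf-8')) % num_buckets

print(bucket_for('normal'))                  # a 'thal' category on its own
print(bucket_for('age_40_45_X_reversible'))  # a crossed (age bucket, thal) value
```

With 1000 buckets and only a few hundred distinct crosses, collisions are rare; shrinking `num_buckets` saves space at the cost of more values sharing a slot.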
670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 46, 675 | "metadata": { 676 | "colab": {}, 677 | "colab_type": "code", 678 | "id": "oaPVERd9Rep6" 679 | }, 680 | "outputs": [ 681 | { 682 | "name": "stdout", 683 | "output_type": "stream", 684 | "text": [ 685 | "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: CrossedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", 686 | "Instructions for updating:\n", 687 | "The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.\n", 688 | "[[0. 0. 0. ... 0. 0. 0.]\n", 689 | " [0. 0. 0. ... 0. 0. 0.]\n", 690 | " [0. 0. 0. ... 0. 0. 0.]\n", 691 | " [0. 0. 0. ... 0. 0. 0.]\n", 692 | " [0. 0. 0. ... 0. 0. 0.]]\n" 693 | ] 694 | } 695 | ], 696 | "source": [ 697 | "# EXERCISE: Create a crossed column using the bucketized column (age_buckets),\n", 698 | "# the categorical vocabulary column (thal) previously created, and 1000 hash buckets.\n", 699 | "crossed_feature = tf.feature_column.crossed_column([age_buckets, thal], 1000)\n", 700 | "\n", 701 | "demo(feature_column.indicator_column(crossed_feature))" 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "metadata": { 707 | "colab_type": "text", 708 | "id": "ypkI9zx6Rj1q" 709 | }, 710 | "source": [ 711 | "## Choose Which Columns to Use\n", 712 | "\n", 713 | "We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this exercise is to show you the complete code needed to work with feature columns. We have selected a few columns to train our model below arbitrarily.\n", 714 | "\n", 715 | "If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented." 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 47, 721 | "metadata": { 722 | "colab": {}, 723 | "colab_type": "code", 724 | "id": "Eu8bJWmCScfC" 725 | }, 726 | "outputs": [ 727 | { 728 | "data": { 729 | "text/plain": [ 730 | "age int64\n", 731 | "sex int64\n", 732 | "cp int64\n", 733 | "trestbps int64\n", 734 | "chol int64\n", 735 | "fbs int64\n", 736 | "restecg int64\n", 737 | "thalach int64\n", 738 | "exang int64\n", 739 | "oldpeak float64\n", 740 | "slope int64\n", 741 | "ca int64\n", 742 | "thal object\n", 743 | "target int64\n", 744 | "dtype: object" 745 | ] 746 | }, 747 | "execution_count": 47, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "dataframe.dtypes" 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": { 759 | "colab_type": "text", 760 | "id": "2pV4tSI3SkuX" 761 | }, 762 | "source": [ 763 | "You can use the above list of column datatypes to map the appropriate feature column to every column in the dataframe." 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 48, 769 | "metadata": { 770 | "colab": {}, 771 | "colab_type": "code", 772 | "id": "4PlLY7fORuzA" 773 | }, 774 | "outputs": [], 775 | "source": [ 776 | "# EXERCISE: Fill in the missing code below\n", 777 | "feature_columns = []\n", 778 | "\n", 779 | "# Numeric Cols.\n", 780 | "# Create a list of numeric columns. 
Use the following list of columns\n", 781 | "# that have a numeric datatype: ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca'].\n", 782 | "numeric_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']\n", 783 | "\n", 784 | "for header in numeric_columns:\n", 785 | " # Create a numeric feature column out of the header.\n", 786 | " numeric_feature_column = tf.feature_column.numeric_column(header)\n", 787 | " \n", 788 | " feature_columns.append(numeric_feature_column)\n", 789 | "\n", 790 | "# Bucketized Cols.\n", 791 | "# Create a bucketized feature column out of the age column (numeric column)\n", 792 | "# that you've already created. Use the following boundaries:\n", 793 | "# [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]\n", 794 | "age_buckets = tf.feature_column.bucketized_column(feature_columns[0], boundaries)\n", 795 | "\n", 796 | "feature_columns.append(age_buckets)\n", 797 | "\n", 798 | "# Indicator Cols.\n", 799 | "# Create a categorical vocabulary column out of the categories\n", 800 | "# ['fixed', 'normal', 'reversible'] with the key specified as 'thal'.\n", 801 | "thal = tf.feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])\n", 802 | "\n", 803 | "# Create an indicator column out of the created thal categorical column\n", 804 | "thal_one_hot = tf.feature_column.indicator_column(thal)\n", 805 | "\n", 806 | "feature_columns.append(thal_one_hot)\n", 807 | "\n", 808 | "# Embedding Cols.\n", 809 | "# Create an embedding column out of the categorical vocabulary you\n", 810 | "# just created (thal). Set the size of the embedding to 8, by using\n", 811 | "# the dimension parameter.\n", 812 | "thal_embedding = tf.feature_column.embedding_column(thal, 8)\n", 813 | "\n", 814 | "feature_columns.append(thal_embedding)\n", 815 | "\n", 816 | "# Crossed Cols.\n", 817 | "# Create a crossed column using the bucketized column (age_buckets),\n", 818 | "# the categorical vocabulary column (thal) previously created, and 1000 hash buckets.\n", 819 | "crossed_feature = tf.feature_column.crossed_column([age_buckets, thal], 1000)\n", 820 | "\n", 821 | "# Create an indicator column out of the crossed column created above to one-hot encode it.\n", 822 | "crossed_feature = tf.feature_column.indicator_column(crossed_feature)\n", 823 | "\n", 824 | "feature_columns.append(crossed_feature)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": { 830 | "colab_type": "text", 831 | "id": "M-nDp8krS_ts" 832 | }, 833 | "source": [ 834 | "### Create a Feature Layer\n", 835 | "\n", 836 | "Now that we have defined our feature columns, we will use a [DenseFeatures](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to input them to our Keras model." 837 | ] 838 | }, 839 | { 840 | "cell_type": "code", 841 | "execution_count": 49, 842 | "metadata": { 843 | "colab": {}, 844 | "colab_type": "code", 845 | "id": "6o-El1R2TGQP" 846 | }, 847 | "outputs": [], 848 | "source": [ 849 | "# EXERCISE: Create a Keras DenseFeatures layer and pass the feature_columns you just created.\n", 850 | "feature_layer = tf.keras.layers.DenseFeatures(feature_columns)" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": { 856 | "colab_type": "text", 857 | "id": "8cf6vKfgTH0U" 858 | }, 859 | "source": [ 860 | "Earlier, we used a small batch size to demonstrate how feature columns worked. We create a new input pipeline with a larger batch size." 
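For orientation, wiring the `feature_layer` into a Keras model looks roughly as follows. This is a sketch, not the graded cell: the hidden-layer sizes and the binary loss are illustrative assumptions.

```python
# Assemble a small classifier on top of the DenseFeatures layer.
model = tf.keras.Sequential([
    feature_layer,                         # maps raw feature dicts to a dense vector
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # binary heart-disease target
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=100)
```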
861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": 50, 866 | "metadata": { 867 | "colab": {}, 868 | "colab_type": "code", 869 | "id": "gcemszoGSse_" 870 | }, 871 | "outputs": [], 872 | "source": [ 873 | "batch_size = 32\n", 874 | "train_ds = df_to_dataset(train, batch_size=batch_size)\n", 875 | "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", 876 | "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": { 882 | "colab_type": "text", 883 | "id": "bBx4Xu0eTXWq" 884 | }, 885 | "source": [ 886 | "## Create, Compile, and Train the Model" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": 51, 892 | "metadata": { 893 | "colab": {}, 894 | "colab_type": "code", 895 | "id": "_YJPPb3xTPeZ" 896 | }, 897 | "outputs": [ 898 | { 899 | "name": "stdout", 900 | "output_type": "stream", 901 | "text": [ 902 | "WARNING:tensorflow:Layer sequential is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because it's dtype defaults to floatx.\n", 903 | "\n", 904 | "If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.\n", 905 | "\n", 906 | "To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.\n", 907 | "\n", 908 | "Epoch 1/100\n", 909 | "7/7 [==============================] - 4s 627ms/step - loss: 1.9296 - accuracy: 0.6114 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00\n", 910 | "Epoch 2/100\n", 911 | "7/7 [==============================] - 0s 57ms/step - loss: 1.3997 - accuracy: 0.5699 - val_loss: 0.5146 - val_accuracy: 0.7551\n", 912 | "Epoch 3/100\n", 913 | "7/7 [==============================] - 0s 45ms/step - loss: 0.8760 - accuracy: 0.7358 - val_loss: 0.5962 - val_accuracy: 0.6327\n", 914 | "Epoch 4/100\n", 915 | "7/7 [==============================] - 0s 57ms/step - loss: 0.6013 - accuracy: 0.7254 - val_loss: 0.5584 - val_accuracy: 0.7755\n", 916 | "Epoch 5/100\n", 917 | "7/7 [==============================] - 0s 55ms/step - loss: 0.5858 - accuracy: 0.7461 - val_loss: 1.2739 - val_accuracy: 0.4082\n", 918 | "Epoch 6/100\n", 919 | "7/7 [==============================] - 0s 56ms/step - loss: 0.8936 - accuracy: 0.6218 - val_loss: 0.8761 - val_accuracy: 0.7551\n", 920 | "Epoch 7/100\n", 921 | "7/7 [==============================] - 0s 44ms/step - loss: 0.7993 - accuracy: 0.6995 - val_loss: 0.3965 - val_accuracy: 0.7959\n", 922 | "Epoch 8/100\n", 923 | "7/7 [==============================] - 0s 56ms/step - loss: 0.6975 - accuracy: 0.7098 - val_loss: 0.3979 - val_accuracy: 0.8163\n", 924 | "Epoch 9/100\n", 925 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5593 - accuracy: 0.7617 - val_loss: 0.4653 - val_accuracy: 0.7551\n", 926 | "Epoch 10/100\n", 927 | "7/7 [==============================] - 0s 46ms/step - loss: 0.4394 - accuracy: 0.7927 - val_loss: 0.4132 - val_accuracy: 0.7755\n", 928 | "Epoch 11/100\n", 929 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5097 - accuracy: 0.7720 - val_loss: 0.4286 - val_accuracy: 0.8163\n", 930 | "Epoch 12/100\n", 
931 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4794 - accuracy: 0.7824 - val_loss: 0.3486 - val_accuracy: 0.8571\n", 932 | "Epoch 13/100\n", 933 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4476 - accuracy: 0.7772 - val_loss: 0.4103 - val_accuracy: 0.8163\n", 934 | "Epoch 14/100\n", 935 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4218 - accuracy: 0.7824 - val_loss: 0.3641 - val_accuracy: 0.7755\n", 936 | "Epoch 15/100\n", 937 | "7/7 [==============================] - 0s 45ms/step - loss: 0.5187 - accuracy: 0.7461 - val_loss: 0.4110 - val_accuracy: 0.7959\n", 938 | "Epoch 16/100\n", 939 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4351 - accuracy: 0.7772 - val_loss: 0.3188 - val_accuracy: 0.8776\n", 940 | "Epoch 17/100\n", 941 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4072 - accuracy: 0.7979 - val_loss: 0.7014 - val_accuracy: 0.6122\n", 942 | "Epoch 18/100\n", 943 | "7/7 [==============================] - 0s 46ms/step - loss: 0.6399 - accuracy: 0.6528 - val_loss: 0.3580 - val_accuracy: 0.7959\n", 944 | "Epoch 19/100\n", 945 | "7/7 [==============================] - 0s 56ms/step - loss: 0.6151 - accuracy: 0.6788 - val_loss: 0.5508 - val_accuracy: 0.7551\n", 946 | "Epoch 20/100\n", 947 | "7/7 [==============================] - 0s 55ms/step - loss: 0.6014 - accuracy: 0.7824 - val_loss: 0.7927 - val_accuracy: 0.6327\n", 948 | "Epoch 21/100\n", 949 | "7/7 [==============================] - 0s 56ms/step - loss: 0.6057 - accuracy: 0.7461 - val_loss: 0.3545 - val_accuracy: 0.8367\n", 950 | "Epoch 22/100\n", 951 | "7/7 [==============================] - 0s 46ms/step - loss: 0.5995 - accuracy: 0.7461 - val_loss: 0.4451 - val_accuracy: 0.7755\n", 952 | "Epoch 23/100\n", 953 | "7/7 [==============================] - 0s 56ms/step - loss: 0.6724 - accuracy: 0.7461 - val_loss: 0.9457 - val_accuracy: 0.5714\n", 954 | "Epoch 24/100\n", 955 | "7/7 [==============================] - 0s 56ms/step - loss: 0.8102 - accuracy: 0.6114 - val_loss: 0.3960 - val_accuracy: 0.8163\n", 956 | "Epoch 25/100\n", 957 | "7/7 [==============================] - 0s 56ms/step - loss: 0.6388 - accuracy: 0.7720 - val_loss: 0.3012 - val_accuracy: 0.8571\n", 958 | "Epoch 26/100\n", 959 | "7/7 [==============================] - 0s 44ms/step - loss: 0.4745 - accuracy: 0.7979 - val_loss: 0.3281 - val_accuracy: 0.8163\n", 960 | "Epoch 27/100\n", 961 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4129 - accuracy: 0.8238 - val_loss: 0.2976 - val_accuracy: 0.8571\n", 962 | "Epoch 28/100\n", 963 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3809 - accuracy: 0.8342 - val_loss: 0.2889 - val_accuracy: 0.8367\n", 964 | "Epoch 29/100\n", 965 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3797 - accuracy: 0.8238 - val_loss: 0.6418 - val_accuracy: 0.6735\n", 966 | "Epoch 30/100\n", 967 | "7/7 [==============================] - 0s 55ms/step - loss: 0.8321 - accuracy: 0.6373 - val_loss: 0.8086 - val_accuracy: 0.7551\n", 968 | "Epoch 31/100\n", 969 | "7/7 [==============================] - 0s 45ms/step - loss: 0.7858 - accuracy: 0.7720 - val_loss: 0.6868 - val_accuracy: 0.6531\n", 970 | "Epoch 32/100\n", 971 | "7/7 [==============================] - 0s 45ms/step - loss: 0.5676 - accuracy: 0.7565 - val_loss: 0.3067 - val_accuracy: 0.8367\n", 972 | "Epoch 33/100\n", 973 | "7/7 [==============================] - 0s 56ms/step - loss: 0.7333 - accuracy: 0.6891 - val_loss: 0.3283 - val_accuracy: 
0.8163\n", 974 | "Epoch 34/100\n", 975 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5698 - accuracy: 0.7824 - val_loss: 0.3153 - val_accuracy: 0.8367\n", 976 | "Epoch 35/100\n", 977 | "7/7 [==============================] - 0s 46ms/step - loss: 0.4736 - accuracy: 0.8031 - val_loss: 0.5187 - val_accuracy: 0.7959\n", 978 | "Epoch 36/100\n", 979 | "7/7 [==============================] - 0s 55ms/step - loss: 1.3414 - accuracy: 0.7150 - val_loss: 0.2790 - val_accuracy: 0.8367\n", 980 | "Epoch 37/100\n", 981 | "7/7 [==============================] - 0s 56ms/step - loss: 1.4810 - accuracy: 0.4767 - val_loss: 0.5076 - val_accuracy: 0.7755\n", 982 | "Epoch 38/100\n", 983 | "7/7 [==============================] - 0s 45ms/step - loss: 1.2869 - accuracy: 0.7150 - val_loss: 0.3337 - val_accuracy: 0.7959\n", 984 | "Epoch 39/100\n", 985 | "7/7 [==============================] - 0s 55ms/step - loss: 1.0146 - accuracy: 0.5596 - val_loss: 0.3137 - val_accuracy: 0.8163\n", 986 | "Epoch 40/100\n", 987 | "7/7 [==============================] - 0s 56ms/step - loss: 0.9725 - accuracy: 0.7202 - val_loss: 0.4498 - val_accuracy: 0.7551\n", 988 | "Epoch 41/100\n", 989 | "7/7 [==============================] - 0s 46ms/step - loss: 0.5430 - accuracy: 0.7565 - val_loss: 0.2688 - val_accuracy: 0.8163\n", 990 | "Epoch 42/100\n", 991 | "7/7 [==============================] - 0s 55ms/step - loss: 0.6374 - accuracy: 0.7461 - val_loss: 0.3064 - val_accuracy: 0.8571\n", 992 | "Epoch 43/100\n", 993 | "7/7 [==============================] - 0s 56ms/step - loss: 0.7502 - accuracy: 0.6062 - val_loss: 0.3151 - val_accuracy: 0.7959\n", 994 | "Epoch 44/100\n", 995 | "7/7 [==============================] - 0s 46ms/step - loss: 0.9650 - accuracy: 0.7202 - val_loss: 0.5780 - val_accuracy: 0.7551\n", 996 | "Epoch 45/100\n", 997 | "7/7 [==============================] - 0s 55ms/step - loss: 0.5589 - accuracy: 0.7254 - val_loss: 0.4608 - val_accuracy: 0.8163\n", 998 | "Epoch 46/100\n", 999 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5084 - accuracy: 0.7720 - val_loss: 0.3232 - val_accuracy: 0.7959\n", 1000 | "Epoch 47/100\n", 1001 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4185 - accuracy: 0.8083 - val_loss: 0.3897 - val_accuracy: 0.7959\n", 1002 | "Epoch 48/100\n", 1003 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4356 - accuracy: 0.8187 - val_loss: 0.2732 - val_accuracy: 0.8163\n", 1004 | "Epoch 49/100\n", 1005 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3936 - accuracy: 0.8342 - val_loss: 0.2642 - val_accuracy: 0.8571\n", 1006 | "Epoch 50/100\n", 1007 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3721 - accuracy: 0.8238 - val_loss: 0.2742 - val_accuracy: 0.8163\n", 1008 | "Epoch 51/100\n", 1009 | "7/7 [==============================] - 0s 55ms/step - loss: 0.3724 - accuracy: 0.8342 - val_loss: 0.3011 - val_accuracy: 0.8571\n", 1010 | "Epoch 52/100\n", 1011 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4092 - accuracy: 0.7979 - val_loss: 0.2632 - val_accuracy: 0.8163\n", 1012 | "Epoch 53/100\n", 1013 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3680 - accuracy: 0.8290 - val_loss: 0.2762 - val_accuracy: 0.8163\n", 1014 | "Epoch 54/100\n", 1015 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4386 - accuracy: 0.7979 - val_loss: 0.2679 - val_accuracy: 0.7959\n", 1016 | "Epoch 55/100\n", 1017 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3843 - 
accuracy: 0.8238 - val_loss: 0.2639 - val_accuracy: 0.8163\n", 1018 | "Epoch 56/100\n", 1019 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3764 - accuracy: 0.8394 - val_loss: 0.2660 - val_accuracy: 0.8163\n", 1020 | "Epoch 57/100\n", 1021 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3658 - accuracy: 0.8446 - val_loss: 0.2595 - val_accuracy: 0.8163\n", 1022 | "Epoch 58/100\n", 1023 | "7/7 [==============================] - 0s 55ms/step - loss: 0.4064 - accuracy: 0.8031 - val_loss: 0.2756 - val_accuracy: 0.8163\n", 1024 | "Epoch 59/100\n", 1025 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3705 - accuracy: 0.8342 - val_loss: 0.2693 - val_accuracy: 0.8163\n", 1026 | "Epoch 60/100\n", 1027 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3650 - accuracy: 0.8342 - val_loss: 0.2717 - val_accuracy: 0.8163\n", 1028 | "Epoch 61/100\n", 1029 | "7/7 [==============================] - 0s 55ms/step - loss: 0.3812 - accuracy: 0.8031 - val_loss: 0.2976 - val_accuracy: 0.7959\n", 1030 | "Epoch 62/100\n", 1031 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4379 - accuracy: 0.7720 - val_loss: 0.2973 - val_accuracy: 0.7959\n", 1032 | "Epoch 63/100\n", 1033 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3999 - accuracy: 0.7824 - val_loss: 0.2830 - val_accuracy: 0.8367\n", 1034 | "Epoch 64/100\n", 1035 | "7/7 [==============================] - 0s 57ms/step - loss: 0.3726 - accuracy: 0.8290 - val_loss: 0.2960 - val_accuracy: 0.7959\n", 1036 | "Epoch 65/100\n", 1037 | "7/7 [==============================] - 0s 58ms/step - loss: 0.4268 - accuracy: 0.7824 - val_loss: 0.2883 - val_accuracy: 0.8571\n", 1038 | "Epoch 66/100\n", 1039 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3477 - accuracy: 0.8601 - val_loss: 0.3709 - val_accuracy: 0.7755\n", 1040 | "Epoch 67/100\n", 1041 | "7/7 [==============================] - 0s 56ms/step - loss: 0.7493 - accuracy: 0.7254 - val_loss: 0.2933 - val_accuracy: 0.8163\n", 1042 | "Epoch 68/100\n", 1043 | "7/7 [==============================] - 0s 56ms/step - loss: 0.9944 - accuracy: 0.5751 - val_loss: 0.3116 - val_accuracy: 0.8571\n", 1044 | "Epoch 69/100\n", 1045 | "7/7 [==============================] - 0s 56ms/step - loss: 0.7653 - accuracy: 0.7306 - val_loss: 0.2910 - val_accuracy: 0.8163\n", 1046 | "Epoch 70/100\n", 1047 | "7/7 [==============================] - 0s 45ms/step - loss: 0.6731 - accuracy: 0.6995 - val_loss: 0.3410 - val_accuracy: 0.8776\n", 1048 | "Epoch 71/100\n", 1049 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5814 - accuracy: 0.7617 - val_loss: 0.4832 - val_accuracy: 0.7551\n", 1050 | "Epoch 72/100\n", 1051 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4929 - accuracy: 0.7927 - val_loss: 0.3046 - val_accuracy: 0.8776\n", 1052 | "Epoch 73/100\n", 1053 | "7/7 [==============================] - 0s 55ms/step - loss: 0.4462 - accuracy: 0.7668 - val_loss: 0.2881 - val_accuracy: 0.8367\n", 1054 | "Epoch 74/100\n", 1055 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3851 - accuracy: 0.8394 - val_loss: 0.2639 - val_accuracy: 0.7959\n", 1056 | "Epoch 75/100\n", 1057 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5075 - accuracy: 0.7979 - val_loss: 0.3609 - val_accuracy: 0.7755\n", 1058 | "Epoch 76/100\n", 1059 | "7/7 [==============================] - 0s 57ms/step - loss: 0.4599 - accuracy: 0.7565 - val_loss: 0.3936 - val_accuracy: 0.7959\n", 1060 | "Epoch 77/100\n", 
1061 | "7/7 [==============================] - 0s 46ms/step - loss: 0.4148 - accuracy: 0.8187 - val_loss: 0.3239 - val_accuracy: 0.7959\n", 1062 | "Epoch 78/100\n", 1063 | "7/7 [==============================] - 0s 57ms/step - loss: 0.4405 - accuracy: 0.7772 - val_loss: 0.2770 - val_accuracy: 0.8163\n", 1064 | "Epoch 79/100\n", 1065 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3465 - accuracy: 0.8549 - val_loss: 0.2951 - val_accuracy: 0.8367\n", 1066 | "Epoch 80/100\n", 1067 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3660 - accuracy: 0.8290 - val_loss: 0.2705 - val_accuracy: 0.8163\n", 1068 | "Epoch 81/100\n", 1069 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3853 - accuracy: 0.8187 - val_loss: 0.3365 - val_accuracy: 0.8776\n", 1070 | "Epoch 82/100\n", 1071 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4300 - accuracy: 0.7979 - val_loss: 0.2742 - val_accuracy: 0.7755\n", 1072 | "Epoch 83/100\n", 1073 | "7/7 [==============================] - 0s 55ms/step - loss: 0.3489 - accuracy: 0.8653 - val_loss: 0.3677 - val_accuracy: 0.7959\n", 1074 | "Epoch 84/100\n", 1075 | "7/7 [==============================] - 0s 57ms/step - loss: 0.6761 - accuracy: 0.6684 - val_loss: 0.2885 - val_accuracy: 0.8163\n", 1076 | "Epoch 85/100\n", 1077 | "7/7 [==============================] - 0s 45ms/step - loss: 0.4812 - accuracy: 0.7461 - val_loss: 0.3174 - val_accuracy: 0.8163\n", 1078 | "Epoch 86/100\n", 1079 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4925 - accuracy: 0.7513 - val_loss: 0.2871 - val_accuracy: 0.7959\n", 1080 | "Epoch 87/100\n", 1081 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5317 - accuracy: 0.7979 - val_loss: 0.3800 - val_accuracy: 0.7959\n", 1082 | "Epoch 88/100\n", 1083 | "7/7 [==============================] - 0s 44ms/step - loss: 0.4481 - accuracy: 0.7720 - val_loss: 0.3335 - val_accuracy: 0.8571\n", 1084 | "Epoch 89/100\n", 1085 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4836 - accuracy: 0.7876 - val_loss: 0.3751 - val_accuracy: 0.7959\n", 1086 | "Epoch 90/100\n", 1087 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4298 - accuracy: 0.7927 - val_loss: 0.4078 - val_accuracy: 0.8163\n", 1088 | "Epoch 91/100\n", 1089 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3987 - accuracy: 0.8238 - val_loss: 0.3872 - val_accuracy: 0.7551\n", 1090 | "Epoch 92/100\n", 1091 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4831 - accuracy: 0.7513 - val_loss: 0.2874 - val_accuracy: 0.8367\n", 1092 | "Epoch 93/100\n", 1093 | "7/7 [==============================] - 0s 56ms/step - loss: 0.4355 - accuracy: 0.8187 - val_loss: 0.2981 - val_accuracy: 0.8367\n", 1094 | "Epoch 94/100\n", 1095 | "7/7 [==============================] - 0s 46ms/step - loss: 0.3641 - accuracy: 0.8394 - val_loss: 0.2862 - val_accuracy: 0.8367\n", 1096 | "Epoch 95/100\n", 1097 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3740 - accuracy: 0.8187 - val_loss: 0.2772 - val_accuracy: 0.8367\n", 1098 | "Epoch 96/100\n", 1099 | "7/7 [==============================] - 0s 55ms/step - loss: 0.3477 - accuracy: 0.8549 - val_loss: 0.2714 - val_accuracy: 0.8163\n", 1100 | "Epoch 97/100\n", 1101 | "7/7 [==============================] - 0s 56ms/step - loss: 0.3527 - accuracy: 0.8446 - val_loss: 0.2709 - val_accuracy: 0.8163\n", 1102 | "Epoch 98/100\n", 1103 | "7/7 [==============================] - 0s 45ms/step - loss: 0.3518 - accuracy: 
0.8187 - val_loss: 0.3093 - val_accuracy: 0.8571\n", 1104 | "Epoch 99/100\n", 1105 | "7/7 [==============================] - 0s 56ms/step - loss: 0.5058 - accuracy: 0.7720 - val_loss: 0.2779 - val_accuracy: 0.8163\n", 1106 | "Epoch 100/100\n", 1107 | "7/7 [==============================] - 0s 57ms/step - loss: 0.7768 - accuracy: 0.7254 - val_loss: 0.5054 - val_accuracy: 0.7551\n" 1108 | ] 1109 | }, 1110 | { 1111 | "data": { 1112 | "text/plain": [ 1113 | "" 1114 | ] 1115 | }, 1116 | "execution_count": 51, 1117 | "metadata": {}, 1118 | "output_type": "execute_result" 1119 | } 1120 | ], 1121 | "source": [ 1122 | "model = tf.keras.Sequential([\n", 1123 | " feature_layer,\n", 1124 | " layers.Dense(128, activation='relu'),\n", 1125 | " layers.Dense(128, activation='relu'),\n", 1126 | " layers.Dense(1, activation='sigmoid')\n", 1127 | "])\n", 1128 | "\n", 1129 | "model.compile(optimizer='adam',\n", 1130 | " loss='binary_crossentropy',\n", 1131 | " metrics=['accuracy'])\n", 1132 | "\n", 1133 | "model.fit(train_ds,\n", 1134 | " validation_data=val_ds,\n", 1135 | " epochs=100)" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "code", 1140 | "execution_count": 52, 1141 | "metadata": { 1142 | "colab": {}, 1143 | "colab_type": "code", 1144 | "id": "GnFmMOW0Tcaa" 1145 | }, 1146 | "outputs": [ 1147 | { 1148 | "name": "stdout", 1149 | "output_type": "stream", 1150 | "text": [ 1151 | "2/2 [==============================] - 0s 6ms/step - loss: 0.5217 - accuracy: 0.7541\n", 1152 | "Accuracy 0.75409836\n" 1153 | ] 1154 | } 1155 | ], 1156 | "source": [ 1157 | "loss, accuracy = model.evaluate(test_ds)\n", 1158 | "print(\"Accuracy\", accuracy)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "markdown", 1163 | "metadata": {}, 1164 | "source": [ 1165 | "# Submission Instructions" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "code", 1170 | "execution_count": null, 1171 | "metadata": {}, 1172 | "outputs": [], 1173 | "source": [ 1174 | "# Now click the 'Submit Assignment' button above." 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "markdown", 1179 | "metadata": {}, 1180 | "source": [ 1181 | "# When you're done or would like to take a break, please run the two cells below to save your work and close the Notebook. This frees up resources for your fellow learners." 
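Since the evaluation step above reports only aggregate metrics, it can also be instructive to look at individual predictions. A minimal sketch, assuming the `model` and `test_ds` objects from the cells above:

```python
# Score one batch from the test pipeline and compare against the labels.
for features, labels in test_ds.take(1):
    probabilities = model.predict(features)
    for p, y in zip(probabilities[:5], labels[:5]):
        # The sigmoid output is the predicted probability that target == 1.
        print(f"predicted={p[0]:.3f}  actual={int(y)}")
```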
1182 | ] 1183 | }, 1184 | { 1185 | "cell_type": "code", 1186 | "execution_count": null, 1187 | "metadata": {}, 1188 | "outputs": [], 1189 | "source": [ 1190 | "%%javascript\n", 1191 | "\n", 1192 | "IPython.notebook.save_checkpoint();" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": null, 1198 | "metadata": {}, 1199 | "outputs": [], 1200 | "source": [ 1201 | "%%javascript\n", 1202 | "\n", 1203 | "window.onbeforeunload = null\n", 1204 | "window.close();\n", 1205 | "IPython.notebook.session.delete();" 1206 | ] 1207 | } 1208 | ], 1209 | "metadata": { 1210 | "colab": { 1211 | "collapsed_sections": [ 1212 | "qRLGSMDzM-dl" 1213 | ], 1214 | "name": "Week 2-Question.ipynb", 1215 | "private_outputs": true, 1216 | "provenance": [], 1217 | "toc_visible": true 1218 | }, 1219 | "coursera": { 1220 | "course_slug": "data-pipelines-tensorflow", 1221 | "graded_item_id": "rQvlJ", 1222 | "launcher_item_id": "DXaub" 1223 | }, 1224 | "kernelspec": { 1225 | "display_name": "Python 3", 1226 | "language": "python", 1227 | "name": "python3" 1228 | }, 1229 | "language_info": { 1230 | "codemirror_mode": { 1231 | "name": "ipython", 1232 | "version": 3 1233 | }, 1234 | "file_extension": ".py", 1235 | "mimetype": "text/x-python", 1236 | "name": "python", 1237 | "nbconvert_exporter": "python", 1238 | "pygments_lexer": "ipython3", 1239 | "version": "3.6.8" 1240 | } 1241 | }, 1242 | "nbformat": 4, 1243 | "nbformat_minor": 1 1244 | } 1245 | -------------------------------------------------------------------------------- /utf-8''TFDS-Week4-Question.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "aKwi1_4l0wev" 8 | }, 9 | "source": [ 10 | "# Adding a Dataset of Your Own to TFDS" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": { 17 | "colab": {}, 18 | "colab_type": "code", 19 | "id": "w9nZyRcLhtiX" 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "import os\n", 24 | "import textwrap\n", 25 | "import scipy.io\n", 26 | "import pandas as pd\n", 27 | "\n", 28 | "from os import getcwd" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "colab_type": "text", 35 | "id": "wooh61rn2FvF" 36 | }, 37 | "source": [ 38 | "## IMDB Faces Dataset\n", 39 | "\n", 40 | "This is the largest publicly available dataset of face images with gender and age labels for training.\n", 41 | "\n", 42 | "Source: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/\n", 43 | "\n", 44 | "The IMDb Faces dataset provides a separate .mat file which can be loaded with Matlab containing all the meta information. The format is as follows: \n", 45 | "**dob**: date of birth (Matlab serial date number) \n", 46 | "**photo_taken**: year when the photo was taken \n", 47 | "**full_path**: path to file \n", 48 | "**gender**: 0 for female and 1 for male, NaN if unknown \n", 49 | "**name**: name of the celebrity \n", 50 | "**face_location**: location of the face (bounding box) \n", 51 | "**face_score**: detector score (the higher the better). Inf implies that no face was found in the image and the face_location then just returns the entire image \n", 52 | "**second_face_score**: detector score of the face with the second highest score. This is useful to ignore images with more than one face. second_face_score is NaN if no second face was detected. 
\n", 53 | "**celeb_names**: list of all celebrity names \n", 54 | "**celeb_id**: index of celebrity name " 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Next, let's inspect the dataset" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": { 67 | "colab_type": "text", 68 | "id": "uspGC84pWmjR" 69 | }, 70 | "source": [ 71 | "## Exploring the Data" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": { 78 | "colab": {}, 79 | "colab_type": "code", 80 | "id": "sp7bUzZr3ZUQ" 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "imdb.mat\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "# Inspect the directory structure\n", 93 | "imdb_crop_file_path = f\"{getcwd()}/../tmp2/imdb_crop\"\n", 94 | "files = os.listdir(imdb_crop_file_path)\n", 95 | "print(textwrap.fill(' '.join(sorted(files)), 80))" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 3, 101 | "metadata": { 102 | "colab": {}, 103 | "colab_type": "code", 104 | "id": "1aPlCn9E2PMj" 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "# Inspect the meta data\n", 109 | "imdb_mat_file_path = f\"{getcwd()}/../tmp2/imdb_crop/imdb.mat\"\n", 110 | "meta = scipy.io.loadmat(imdb_mat_file_path)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 4, 116 | "metadata": { 117 | "colab": {}, 118 | "colab_type": "code", 119 | "id": "aFj-jsz-6z-I" 120 | }, 121 | "outputs": [ 122 | { 123 | "data": { 124 | "text/plain": [ 125 | "{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Jan 17 11:30:27 2016',\n", 126 | " '__version__': '1.0',\n", 127 | " '__globals__': [],\n", 128 | " 'imdb': array([[(array([[693726, 693726, 693726, ..., 726831, 726831, 726831]], dtype=int32), array([[1968, 1970, 1968, ..., 2011, 2011, 2011]], dtype=uint16), array([[array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='\n", 557 | "\n", 570 | "\n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | "
\n", 648 | "" 649 | ], 650 | "text/plain": [ 651 | " dob photo_taken full_path \\\n", 652 | "0 693726 1968 [01/nm0000001_rm124825600_1899-5-10_1968.jpg] \n", 653 | "1 693726 1970 [01/nm0000001_rm3343756032_1899-5-10_1970.jpg] \n", 654 | "2 693726 1968 [01/nm0000001_rm577153792_1899-5-10_1968.jpg] \n", 655 | "3 693726 1968 [01/nm0000001_rm946909184_1899-5-10_1968.jpg] \n", 656 | "4 693726 1968 [01/nm0000001_rm980463616_1899-5-10_1968.jpg] \n", 657 | "\n", 658 | " gender name face_location \\\n", 659 | "0 1.0 [Fred Astaire] [[1072.926, 161.838, 1214.7839999999999, 303.6... \n", 660 | "1 1.0 [Fred Astaire] [[477.184, 100.352, 622.592, 245.76]] \n", 661 | "2 1.0 [Fred Astaire] [[114.96964308962852, 114.96964308962852, 451.... \n", 662 | "3 1.0 [Fred Astaire] [[622.8855056426588, 424.21750383700805, 844.3... \n", 663 | "4 1.0 [Fred Astaire] [[1013.8590023603723, 233.8820422075853, 1201.... \n", 664 | "\n", 665 | " face_score second_face_score celeb_id \n", 666 | "0 1.459693 1.118973 6488 \n", 667 | "1 2.543198 1.852008 6488 \n", 668 | "2 3.455579 2.985660 6488 \n", 669 | "3 1.872117 NaN 6488 \n", 670 | "4 1.158766 NaN 6488 " 671 | ] 672 | }, 673 | "execution_count": 13, 674 | "metadata": {}, 675 | "output_type": "execute_result" 676 | } 677 | ], 678 | "source": [ 679 | "df = pd.DataFrame(values, columns=names)\n", 680 | "df.head()" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": { 686 | "colab_type": "text", 687 | "id": "w-wdFD8uIyRf" 688 | }, 689 | "source": [ 690 | "The Pandas dataframe may contain some Null values or nan. We will have to filter them later on." 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 14, 696 | "metadata": { 697 | "colab": {}, 698 | "colab_type": "code", 699 | "id": "YGsTHc2VIoJh" 700 | }, 701 | "outputs": [ 702 | { 703 | "data": { 704 | "text/plain": [ 705 | "dob 0\n", 706 | "photo_taken 0\n", 707 | "full_path 0\n", 708 | "gender 8462\n", 709 | "name 0\n", 710 | "face_location 0\n", 711 | "face_score 0\n", 712 | "second_face_score 246926\n", 713 | "celeb_id 0\n", 714 | "dtype: int64" 715 | ] 716 | }, 717 | "execution_count": 14, 718 | "metadata": {}, 719 | "output_type": "execute_result" 720 | } 721 | ], 722 | "source": [ 723 | "df.isna().sum()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": { 729 | "colab_type": "text", 730 | "id": "DS-9rLTR065l" 731 | }, 732 | "source": [ 733 | "# TensorFlow Datasets\n", 734 | "\n", 735 | "TFDS provides a way to transform all those datasets into a standard format, do the preprocessing necessary to make them ready for a machine learning pipeline, and provides a standard input pipeline using `tf.data`.\n", 736 | "\n", 737 | "To enable this, each dataset implements a subclass of `DatasetBuilder`, which specifies:\n", 738 | "\n", 739 | "* Where the data is coming from (i.e. its URL). \n", 740 | "* What the dataset looks like (i.e. its features). \n", 741 | "* How the data should be split (e.g. TRAIN and TEST). \n", 742 | "* The individual records in the dataset.\n", 743 | "\n", 744 | "The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly." 745 | ] 746 | }, 747 | { 748 | "cell_type": "markdown", 749 | "metadata": { 750 | "colab_type": "text", 751 | "id": "6bGCSA-jX0Uw" 752 | }, 753 | "source": [ 754 | "## Clone the TFDS Repository\n", 755 | "\n", 756 | "The next step will be to clone the GitHub TFDS Repository. 
For this notebook, we will clone a specific version (v1.2.0) of the repository. You can clone the repository by running the following command:\n",
757 | "\n",
758 | "```\n",
759 | "!git clone https://github.com/tensorflow/datasets.git -b v1.2.0\n",
760 | "```\n",
761 | "\n",
762 | "However, for simplicity, we have already cloned this repository for you and placed the files locally. Therefore, there is no need to run the above command if you are running this notebook in the Coursera environment.\n",
763 | "\n",
764 | "Next, we set the current working directory to `/datasets/`."
765 | ]
766 | },
767 | {
768 | "cell_type": "code",
769 | "execution_count": 15,
770 | "metadata": {
771 | "colab": {},
772 | "colab_type": "code",
773 | "id": "KhYXnLCf5F-Y"
774 | },
775 | "outputs": [
776 | {
777 | "name": "stdout",
778 | "output_type": "stream",
779 | "text": [
780 | "/tf/week4/datasets\n"
781 | ]
782 | }
783 | ],
784 | "source": [
785 | "cd datasets"
786 | ]
787 | },
788 | {
789 | "cell_type": "markdown",
790 | "metadata": {
791 | "colab_type": "text",
792 | "id": "6Fct97VEYxlT"
793 | },
794 | "source": [
795 | "If you want to contribute to the TFDS repo and add a new dataset, you can use the following script to help you generate a template of the required Python file. To use it, you must first clone the tfds repository and then run the following command:"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 16,
801 | "metadata": {
802 | "colab": {},
803 | "colab_type": "code",
804 | "id": "wZ3psFN65G9u"
805 | },
806 | "outputs": [
807 | {
808 | "name": "stdout",
809 | "output_type": "stream",
810 | "text": [
811 | "Dataset generated in /usr/local/lib/python3.6/dist-packages/tensorflow_datasets\n",
812 | "You can start with searching TODO(my_dataset).\n",
813 | "Please check this `https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md`for details.\n"
814 | ]
815 | }
816 | ],
817 | "source": [
818 | "%%bash\n",
819 | "\n",
820 | "python tensorflow_datasets/scripts/create_new_dataset.py \\\n",
821 | "  --dataset my_dataset \\\n",
822 | "  --type image"
823 | ]
824 | },
825 | {
826 | "cell_type": "markdown",
827 | "metadata": {
828 | "colab_type": "text",
829 | "id": "a5UbwBVRTmb2"
830 | },
831 | "source": [
832 | "If you wish to see the template generated by the `create_new_dataset.py` file, navigate to the folder indicated in the above cell output. Then go to the `/image/` folder and look for a file called `my_dataset.py`. Feel free to open the file and inspect it. You will see a template with placeholders, indicated with the word `TODO`, where you have to fill in the information. \n",
833 | "\n",
834 | "Now we will use IPython's built-in `%%writefile` magic command to write whatever is in the current cell into a file. 
To create or overwrite a file you can use:\n", 835 | "```\n", 836 | "%%writefile filename\n", 837 | "```\n", 838 | "\n", 839 | "Let's see an example:" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": 17, 845 | "metadata": { 846 | "colab": {}, 847 | "colab_type": "code", 848 | "id": "qkspG9KV7X7i" 849 | }, 850 | "outputs": [ 851 | { 852 | "name": "stdout", 853 | "output_type": "stream", 854 | "text": [ 855 | "Overwriting something.py\n" 856 | ] 857 | } 858 | ], 859 | "source": [ 860 | "%%writefile something.py\n", 861 | "x = 10" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": { 867 | "colab_type": "text", 868 | "id": "TQ--c2h0K6R1" 869 | }, 870 | "source": [ 871 | "Now that the file has been written, let's inspect its contents." 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 18, 877 | "metadata": { 878 | "colab": {}, 879 | "colab_type": "code", 880 | "id": "VqBEa9UrK4-Z" 881 | }, 882 | "outputs": [ 883 | { 884 | "name": "stdout", 885 | "output_type": "stream", 886 | "text": [ 887 | "x = 10\r\n" 888 | ] 889 | } 890 | ], 891 | "source": [ 892 | "!cat something.py" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": { 898 | "colab_type": "text", 899 | "id": "UJT2Mh-bYmYa" 900 | }, 901 | "source": [ 902 | "## Define the Dataset with `GeneratorBasedBuilder`\n", 903 | "\n", 904 | "Most datasets subclass `tfds.core.GeneratorBasedBuilder`, which is a subclass of `tfds.core.DatasetBuilder` that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:\n", 905 | "\n", 906 | "* `_info`: builds the DatasetInfo object describing the dataset\n", 907 | "\n", 908 | "\n", 909 | "* `_split_generators`: downloads the source data and defines the dataset splits\n", 910 | "\n", 911 | "\n", 912 | "* `_generate_examples`: yields (key, example) tuples in the dataset from the source data\n", 913 | "\n", 914 | "In this exercise, you will use the `GeneratorBasedBuilder`.\n", 915 | "\n", 916 | "### EXERCISE: Fill in the missing code below." 
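Before filling in the full `imdb_faces` implementation below, it may help to see the three methods in miniature. The following is a stripped-down, hypothetical builder; the class name, URL, file layout, and labels are illustrative placeholders only, not part of the exercise:

```python
import os

import tensorflow as tf
import tensorflow_datasets.public_api as tfds


class MyToyDataset(tfds.core.GeneratorBasedBuilder):
  """Minimal illustrative builder; not a real dataset."""

  VERSION = tfds.core.Version("0.1.0")

  def _info(self):
    # Describes the dataset: its features and the supervised (input, label) pair.
    return tfds.core.DatasetInfo(
        builder=self,
        description="Toy example of the GeneratorBasedBuilder contract.",
        features=tfds.features.FeaturesDict({
            "image": tfds.features.Image(),
            "label": tfds.features.ClassLabel(names=["negative", "positive"]),
        }),
        supervised_keys=("image", "label"),
    )

  def _split_generators(self, dl_manager):
    # Downloads and extracts the source data, then defines a single TRAIN split.
    path = dl_manager.download_and_extract("https://example.com/toy_images.zip")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"images_dir": path},
        ),
    ]

  def _generate_examples(self, images_dir):
    # Yields (key, example) tuples; keys must be unique and deterministic.
    for fname in sorted(tf.io.gfile.listdir(images_dir)):
      yield fname, {
          "image": os.path.join(images_dir, fname),
          "label": "positive" if fname.startswith("pos_") else "negative",
      }
```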
917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": 19, 922 | "metadata": { 923 | "colab": {}, 924 | "colab_type": "code", 925 | "id": "cYyTvIoO7FqS" 926 | }, 927 | "outputs": [ 928 | { 929 | "name": "stdout", 930 | "output_type": "stream", 931 | "text": [ 932 | "Overwriting tensorflow_datasets/image/imdb_faces.py\n" 933 | ] 934 | } 935 | ], 936 | "source": [ 937 | "%%writefile tensorflow_datasets/image/imdb_faces.py\n", 938 | "\n", 939 | "# coding=utf-8\n", 940 | "# Copyright 2019 The TensorFlow Datasets Authors.\n", 941 | "#\n", 942 | "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", 943 | "# you may not use this file except in compliance with the License.\n", 944 | "# You may obtain a copy of the License at\n", 945 | "#\n", 946 | "# http://www.apache.org/licenses/LICENSE-2.0\n", 947 | "#\n", 948 | "# Unless required by applicable law or agreed to in writing, software\n", 949 | "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", 950 | "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", 951 | "# See the License for the specific language governing permissions and\n", 952 | "# limitations under the License.\n", 953 | "\n", 954 | "\"\"\"IMDB Faces dataset.\"\"\"\n", 955 | "from __future__ import absolute_import\n", 956 | "from __future__ import division\n", 957 | "from __future__ import print_function\n", 958 | "\n", 959 | "import collections\n", 960 | "import os\n", 961 | "import re\n", 962 | "\n", 963 | "import tensorflow as tf\n", 964 | "import tensorflow_datasets.public_api as tfds\n", 965 | "\n", 966 | "_DESCRIPTION = \"\"\"Since the publicly available face image datasets are often of small to medium size, rarely exceeding tens of thousands of images, this is an attempt to put together a diverse dataset in that domain.\"\"\"\n", 967 | "\n", 968 | "_URL = (\"https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/\")\n", 969 | "_DATASET_ROOT_DIR = 'imdb_crop'\n", 970 | "_ANNOTATION_FILE = 'imdb.mat'\n", 971 | "\n", 972 | "\n", 973 | "_CITATION = \"\"\"@article{Rothe-IJCV-2016,\n", 974 | " author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},\n", 975 | " title = {Deep expectation of real and apparent age from a single image without facial landmarks},\n", 976 | " journal = {International Journal of Computer Vision (IJCV)},\n", 977 | " year = {2016},\n", 978 | " month = {July},\n", 979 | "}\n", 980 | "@InProceedings{Rothe-ICCVW-2015,\n", 981 | " author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},\n", 982 | " title = {DEX: Deep EXpectation of apparent age from a single image},\n", 983 | " booktitle = {IEEE International Conference on Computer Vision Workshops (ICCVW)},\n", 984 | " year = {2015},\n", 985 | " month = {December},\n", 986 | "}\n", 987 | "\"\"\"\n", 988 | "\n", 989 | "# Source URL of the IMDB faces dataset\n", 990 | "_TARBALL_URL = \"https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar\"\n", 991 | "\n", 992 | "class ImdbFaces(tfds.core.GeneratorBasedBuilder):\n", 993 | " \"\"\"IMDB Faces dataset.\"\"\"\n", 994 | "\n", 995 | " VERSION = tfds.core.Version(\"0.1.0\")\n", 996 | " \n", 997 | " def _info(self):\n", 998 | " return tfds.core.DatasetInfo(\n", 999 | " builder=self,\n", 1000 | " description=_DESCRIPTION,\n", 1001 | " # Describe the features of the dataset by following this url\n", 1002 | " # https://www.tensorflow.org/datasets/api_docs/python/tfds/features\n", 1003 | " features=tfds.features.FeaturesDict({\n", 1004 | " \"image\": 
tfds.features.Image(),\n",
1005 | "            \"gender\": tfds.features.ClassLabel(num_classes=2),\n",
1006 | "            \"dob\": tf.int32,\n",
1007 | "            \"photo_taken\": tf.int32,\n",
1008 | "            \"face_location\": tfds.features.BBoxFeature(),\n",
1009 | "            \"face_score\": tf.float32,\n",
1010 | "            \"second_face_score\": tf.float32,\n",
1011 | "            \"celeb_id\": tfds.features.ClassLabel()  # names are assigned later, in _split_generators\n",
1012 | "        }),\n",
1013 | "        supervised_keys=(\"image\", \"gender\"),\n",
1014 | "        urls=[_URL],\n",
1015 | "        citation=_CITATION)\n",
1016 | "\n",
1017 | "  def _split_generators(self, dl_manager):\n",
1018 | "    # Download the dataset and then extract it. download_and_extract does both\n",
1019 | "    # in one call and returns a list with one extraction path per URL passed in.\n",
1020 | "    extracted_path = dl_manager.download_and_extract([_TARBALL_URL])\n",
1021 | "\n",
1022 | "    # Parsing the mat file which contains the list of train images\n",
1023 | "    def parse_mat_file(file_name):\n",
1024 | "      with tf.io.gfile.GFile(file_name, \"rb\") as f:\n",
1025 | "        # Add a lazy import for scipy.io and use its loadmat method to\n",
1026 | "        # load the annotation file\n",
1027 | "        dataset = tfds.core.lazy_imports.scipy.io.loadmat(f)['imdb']\n",
1028 | "      return dataset\n",
1029 | "\n",
1030 | "    # Parse the mat file by using scipy's loadmat method.\n",
1031 | "    # Pass the path to the annotation file using the extracted path above\n",
1032 | "    meta = parse_mat_file(os.path.join(extracted_path[0], _DATASET_ROOT_DIR, _ANNOTATION_FILE))\n",
1033 | "\n",
1034 | "    # Get the names of celebrities from the metadata\n",
1035 | "    celeb_names = meta[0, 0]['celeb_names'][0]\n",
1036 | "\n",
1037 | "    # Create tuples out of the distinct set of genders and celeb names\n",
1038 | "    self.info.features['gender'].names = ('Female', 'Male')\n",
1039 | "    self.info.features['celeb_id'].names = tuple([x[0] for x in celeb_names])\n",
1040 | "\n",
1041 | "    return [\n",
1042 | "        tfds.core.SplitGenerator(\n",
1043 | "            name=tfds.Split.TRAIN,\n",
1044 | "            gen_kwargs={\n",
1045 | "                \"image_dir\": extracted_path[0],\n",
1046 | "                \"metadata\": meta,\n",
1047 | "            })\n",
1048 | "    ]\n",
1049 | "\n",
1050 | "  def _get_bounding_box_values(self, bbox_annotations, img_width, img_height):\n",
1051 | "    \"\"\"Function to get normalized bounding box values.\n",
1052 | "\n",
1053 | "    Args:\n",
1054 | "      bbox_annotations: list of bbox values in kitti format\n",
1055 | "      img_width: image width\n",
1056 | "      img_height: image height\n",
1057 | "\n",
1058 | "    Returns:\n",
1059 | "      Normalized bounding box ymin, xmin, ymax, xmax values\n",
1060 | "    \"\"\"\n",
1061 | "\n",
1062 | "    ymin = bbox_annotations[0] / img_height\n",
1063 | "    xmin = bbox_annotations[1] / img_width\n",
1064 | "    ymax = bbox_annotations[2] / img_height\n",
1065 | "    xmax = bbox_annotations[3] / img_width\n",
1066 | "    return ymin, xmin, ymax, xmax\n",
1067 | "\n",
1068 | "  def _get_image_shape(self, image_path):\n",
1069 | "    image = tf.io.read_file(image_path)\n",
1070 | "    image = tf.image.decode_image(image, channels=3)\n",
1071 | "    shape = image.shape[:2]  # (height, width)\n",
1072 | "    return shape\n",
1073 | "\n",
1074 | "  def _generate_examples(self, image_dir, metadata):\n",
1075 | "    # Add a lazy import for pandas here (pd)\n",
1076 | "    pd = tfds.core.lazy_imports.pandas\n",
1077 | "\n",
1078 | "    # Extract the root dictionary from the metadata so that you can query all the keys inside it\n",
1079 | "    root = metadata[0, 0]\n",
1080 | "\n",
1081 | "    \"\"\"Extract image names, dobs, genders,\n",
1082 | "    face locations,\n",
1083 | "    year when the photos were taken,\n",
1084 | "    face scores 
(second face score too),\n",
1085 | "    celeb ids\n",
1086 | "    \"\"\"\n",
1087 | "    image_names = root[\"full_path\"][0]\n",
1088 | "    # Do the same for other attributes (dob, genders etc)\n",
1089 | "    dobs = root['dob'][0]\n",
1090 | "    genders = root['gender'][0]\n",
1091 | "    face_locations = root['face_location'][0]\n",
1092 | "    photo_taken_years = root['photo_taken'][0]\n",
1093 | "    face_scores = root['face_score'][0]\n",
1094 | "    second_face_scores = root['second_face_score'][0]\n",
1095 | "    celeb_id = root['celeb_id'][0]\n",
1096 | "\n",
1097 | "    # Now create a dataframe out of all the features like you've seen before\n",
1098 | "    df = pd.DataFrame(\n",
1099 | "        list(zip(image_names,\n",
1100 | "                 dobs,\n",
1101 | "                 genders,\n",
1102 | "                 face_locations,\n",
1103 | "                 photo_taken_years,\n",
1104 | "                 face_scores,\n",
1105 | "                 second_face_scores,\n",
1106 | "                 celeb_id)),\n",
1107 | "        columns=['image_names', 'dobs', 'genders', 'face_locations', 'photo_taken_years',\n",
1108 | "                 'face_scores', 'second_face_scores', 'celeb_id'])\n",
1109 | "\n",
1110 | "    # Filter the dataframe down to the rows with face_scores > 1.0\n",
1111 | "    df = df[df['face_scores'] > 1.0]\n",
1112 | "\n",
1113 | "\n",
1114 | "    # Remove any records that contain Nulls/NaNs by checking for NaN with .isna()\n",
1115 | "    df = df[~df['genders'].isna()]\n",
1116 | "    df = df[~df['second_face_scores'].isna()]\n",
1117 | "\n",
1118 | "    # Cast genders to integers so that mapping can take place\n",
1119 | "    df.genders = df.genders.astype(int)\n",
1120 | "\n",
1121 | "    # Iterate over all the rows in the dataframe and map each feature\n",
1122 | "    for _, row in df.iterrows():\n",
1123 | "      # Extract filename, gender, dob, photo_taken,\n",
1124 | "      # face_score, second_face_score and celeb_id from the current row\n",
1125 | "      filename = os.path.join(image_dir, _DATASET_ROOT_DIR, row['image_names'][0])\n",
1126 | "      gender = row['genders']\n",
1127 | "      dob = row['dobs']\n",
1128 | "      photo_taken = row['photo_taken_years']\n",
1129 | "      face_score = row['face_scores']\n",
1130 | "      second_face_score = row['second_face_scores']\n",
1131 | "      celeb_id = row['celeb_id']\n",
1132 | "\n",
1133 | "      # Get the image shape; _get_image_shape returns (height, width)\n",
1134 | "      image_height, image_width = self._get_image_shape(filename)\n",
1135 | "      # Normalize the bounding boxes by using the face coordinates and the image shape\n",
1136 | "      bbox = self._get_bounding_box_values(row['face_locations'][0],\n",
1137 | "                                           image_width, image_height)\n",
1138 | "\n",
1139 | "      # Yield a feature dictionary\n",
1140 | "      yield filename, {\n",
1141 | "          \"image\": filename,\n",
1142 | "          \"gender\": gender,\n",
1143 | "          \"dob\": dob,\n",
1144 | "          \"photo_taken\": photo_taken,\n",
1145 | "          \"face_location\": tfds.features.BBox(\n",
1146 | "              ymin=min(bbox[0], 1),\n",
1147 | "              xmin=min(bbox[1], 1),\n",
1148 | "              ymax=min(bbox[2], 1),\n",
1149 | "              xmax=min(bbox[3], 1)\n",
1150 | "          ),\n",
1151 | "          \"face_score\": face_score,\n",
1152 | "          \"second_face_score\": second_face_score,\n",
1153 | "          \"celeb_id\": celeb_id\n",
1154 | "      }\n"
1155 | ]
1156 | },
1157 | {
1158 | "cell_type": "markdown",
1159 | "metadata": {
1160 | "colab_type": "text",
1161 | "id": "7Lu65xXYZC8m"
1162 | },
1163 | "source": [
1164 | "## Add an Import for Registration\n",
1165 | "\n",
1166 | "All subclasses of `tfds.core.DatasetBuilder` are automatically registered when their module is imported such that they can be accessed through `tfds.builder` and `tfds.load`.\n",
1167 | "\n",
1168 | "If you're contributing the dataset to `tensorflow/datasets`, you must add 
the module import to its subdirectory's `__init__.py` (e.g. `image/__init__.py`), as shown below:" 1169 | ] 1170 | }, 1171 | { 1172 | "cell_type": "code", 1173 | "execution_count": 20, 1174 | "metadata": { 1175 | "colab": {}, 1176 | "colab_type": "code", 1177 | "id": "pKC49eVJXJLe" 1178 | }, 1179 | "outputs": [ 1180 | { 1181 | "name": "stdout", 1182 | "output_type": "stream", 1183 | "text": [ 1184 | "Overwriting tensorflow_datasets/image/__init__.py\n" 1185 | ] 1186 | } 1187 | ], 1188 | "source": [ 1189 | "%%writefile tensorflow_datasets/image/__init__.py\n", 1190 | "# coding=utf-8\n", 1191 | "# Copyright 2019 The TensorFlow Datasets Authors.\n", 1192 | "#\n", 1193 | "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", 1194 | "# you may not use this file except in compliance with the License.\n", 1195 | "# You may obtain a copy of the License at\n", 1196 | "#\n", 1197 | "# http://www.apache.org/licenses/LICENSE-2.0\n", 1198 | "#\n", 1199 | "# Unless required by applicable law or agreed to in writing, software\n", 1200 | "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", 1201 | "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", 1202 | "# See the License for the specific language governing permissions and\n", 1203 | "# limitations under the License.\n", 1204 | "\n", 1205 | "\"\"\"Image datasets.\"\"\"\n", 1206 | "\n", 1207 | "from tensorflow_datasets.image.abstract_reasoning import AbstractReasoning\n", 1208 | "from tensorflow_datasets.image.aflw2k3d import Aflw2k3d\n", 1209 | "from tensorflow_datasets.image.bigearthnet import Bigearthnet\n", 1210 | "from tensorflow_datasets.image.binarized_mnist import BinarizedMNIST\n", 1211 | "from tensorflow_datasets.image.binary_alpha_digits import BinaryAlphaDigits\n", 1212 | "from tensorflow_datasets.image.caltech import Caltech101\n", 1213 | "from tensorflow_datasets.image.caltech_birds import CaltechBirds2010\n", 1214 | "from tensorflow_datasets.image.cats_vs_dogs import CatsVsDogs\n", 1215 | "from tensorflow_datasets.image.cbis_ddsm import CuratedBreastImagingDDSM\n", 1216 | "from tensorflow_datasets.image.celeba import CelebA\n", 1217 | "from tensorflow_datasets.image.celebahq import CelebAHq\n", 1218 | "from tensorflow_datasets.image.chexpert import Chexpert\n", 1219 | "from tensorflow_datasets.image.cifar import Cifar10\n", 1220 | "from tensorflow_datasets.image.cifar import Cifar100\n", 1221 | "from tensorflow_datasets.image.cifar10_corrupted import Cifar10Corrupted\n", 1222 | "from tensorflow_datasets.image.clevr import CLEVR\n", 1223 | "from tensorflow_datasets.image.coco import Coco\n", 1224 | "from tensorflow_datasets.image.coco2014_legacy import Coco2014\n", 1225 | "from tensorflow_datasets.image.coil100 import Coil100\n", 1226 | "from tensorflow_datasets.image.colorectal_histology import ColorectalHistology\n", 1227 | "from tensorflow_datasets.image.colorectal_histology import ColorectalHistologyLarge\n", 1228 | "from tensorflow_datasets.image.cycle_gan import CycleGAN\n", 1229 | "from tensorflow_datasets.image.deep_weeds import DeepWeeds\n", 1230 | "from tensorflow_datasets.image.diabetic_retinopathy_detection import DiabeticRetinopathyDetection\n", 1231 | "from tensorflow_datasets.image.downsampled_imagenet import DownsampledImagenet\n", 1232 | "from tensorflow_datasets.image.dsprites import Dsprites\n", 1233 | "from tensorflow_datasets.image.dtd import Dtd\n", 1234 | "from tensorflow_datasets.image.eurosat import Eurosat\n", 1235 | "from 
tensorflow_datasets.image.flowers import TFFlowers\n", 1236 | "from tensorflow_datasets.image.food101 import Food101\n", 1237 | "from tensorflow_datasets.image.horses_or_humans import HorsesOrHumans\n", 1238 | "from tensorflow_datasets.image.image_folder import ImageLabelFolder\n", 1239 | "from tensorflow_datasets.image.imagenet import Imagenet2012\n", 1240 | "from tensorflow_datasets.image.imagenet2012_corrupted import Imagenet2012Corrupted\n", 1241 | "from tensorflow_datasets.image.kitti import Kitti\n", 1242 | "from tensorflow_datasets.image.lfw import LFW\n", 1243 | "from tensorflow_datasets.image.lsun import Lsun\n", 1244 | "from tensorflow_datasets.image.mnist import EMNIST\n", 1245 | "from tensorflow_datasets.image.mnist import FashionMNIST\n", 1246 | "from tensorflow_datasets.image.mnist import KMNIST\n", 1247 | "from tensorflow_datasets.image.mnist import MNIST\n", 1248 | "from tensorflow_datasets.image.mnist_corrupted import MNISTCorrupted\n", 1249 | "from tensorflow_datasets.image.omniglot import Omniglot\n", 1250 | "from tensorflow_datasets.image.open_images import OpenImagesV4\n", 1251 | "from tensorflow_datasets.image.oxford_flowers102 import OxfordFlowers102\n", 1252 | "from tensorflow_datasets.image.oxford_iiit_pet import OxfordIIITPet\n", 1253 | "from tensorflow_datasets.image.patch_camelyon import PatchCamelyon\n", 1254 | "from tensorflow_datasets.image.pet_finder import PetFinder\n", 1255 | "from tensorflow_datasets.image.quickdraw import QuickdrawBitmap\n", 1256 | "from tensorflow_datasets.image.resisc45 import Resisc45\n", 1257 | "from tensorflow_datasets.image.rock_paper_scissors import RockPaperScissors\n", 1258 | "from tensorflow_datasets.image.scene_parse_150 import SceneParse150\n", 1259 | "from tensorflow_datasets.image.shapes3d import Shapes3d\n", 1260 | "from tensorflow_datasets.image.smallnorb import Smallnorb\n", 1261 | "from tensorflow_datasets.image.so2sat import So2sat\n", 1262 | "from tensorflow_datasets.image.stanford_dogs import StanfordDogs\n", 1263 | "from tensorflow_datasets.image.stanford_online_products import StanfordOnlineProducts\n", 1264 | "from tensorflow_datasets.image.sun import Sun397\n", 1265 | "from tensorflow_datasets.image.svhn import SvhnCropped\n", 1266 | "from tensorflow_datasets.image.uc_merced import UcMerced\n", 1267 | "from tensorflow_datasets.image.visual_domain_decathlon import VisualDomainDecathlon\n", 1268 | "\n", 1269 | "# EXERCISE: Import your dataset module here\n", 1270 | "\n", 1271 | "from tensorflow_datasets.image.imdb_faces import ImdbFaces" 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "markdown", 1276 | "metadata": { 1277 | "colab_type": "text", 1278 | "id": "QYmgS2SrYXtP" 1279 | }, 1280 | "source": [ 1281 | "## URL Checksums\n", 1282 | "\n", 1283 | "If you're contributing the dataset to `tensorflow/datasets`, add a checksums file for the dataset. On first download, the DownloadManager will automatically add the sizes and checksums for all downloaded URLs to that file. This ensures that on subsequent data generation, the downloaded files are as expected." 
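Once the import is in place, the builder can be looked up by its snake_case name. A quick sanity check, runnable only after the `imdb_faces` module above has been written and the local `tensorflow_datasets` package is importable:

```python
import tensorflow_datasets as tfds

# The ImdbFaces class registers itself under the snake_case name 'imdb_faces'.
builder = tfds.builder("imdb_faces")
print(builder.info.name)
print(builder.info.features)
```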
1284 | ] 1285 | }, 1286 | { 1287 | "cell_type": "code", 1288 | "execution_count": 21, 1289 | "metadata": { 1290 | "colab": {}, 1291 | "colab_type": "code", 1292 | "id": "cvrp-iHuYG_e" 1293 | }, 1294 | "outputs": [], 1295 | "source": [ 1296 | "!touch tensorflow_datasets/url_checksums/imdb_faces.txt" 1297 | ] 1298 | }, 1299 | { 1300 | "cell_type": "markdown", 1301 | "metadata": { 1302 | "colab_type": "text", 1303 | "id": "JwnUAn49U-U8" 1304 | }, 1305 | "source": [ 1306 | "## Build the Dataset" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": 22, 1312 | "metadata": { 1313 | "colab": {}, 1314 | "colab_type": "code", 1315 | "id": "Y8uKiqWrU_C0" 1316 | }, 1317 | "outputs": [], 1318 | "source": [ 1319 | "# EXERCISE: Fill in the name of your dataset.\n", 1320 | "# The name must be a string.\n", 1321 | "DATASET_NAME = \"imdb_faces\"" 1322 | ] 1323 | }, 1324 | { 1325 | "cell_type": "markdown", 1326 | "metadata": { 1327 | "colab_type": "text", 1328 | "id": "S7evoTtpon7I" 1329 | }, 1330 | "source": [ 1331 | "We then run the `download_and_prepare` script locally to build it, using the following command:\n", 1332 | "\n", 1333 | "```\n", 1334 | "%%bash -s $DATASET_NAME\n", 1335 | "python -m tensorflow_datasets.scripts.download_and_prepare \\\n", 1336 | " --register_checksums \\\n", 1337 | " --datasets=$1\n", 1338 | "```\n", 1339 | "\n", 1340 | "**NOTE:** It may take more than 30 minutes to download the dataset and then write all the preprocessed files as TFRecords. Due to the enormous size of the data involved, we are unable to run the above script in the Coursera environment. " 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "metadata": { 1346 | "colab_type": "text", 1347 | "id": "7hNPD2rraN5o" 1348 | }, 1349 | "source": [ 1350 | "## Load the Dataset\n", 1351 | "\n", 1352 | "Once the dataset is built you can load it in the usual way, by using `tfds.load`, as shown below:\n", 1353 | "\n", 1354 | "```python\n", 1355 | "import tensorflow_datasets as tfds\n", 1356 | "dataset, info = tfds.load('imdb_faces', with_info=True)\n", 1357 | "```\n", 1358 | "\n", 1359 | "**Note:** Since we couldn't build the `imdb_faces` dataset due to its size, we are unable to run the above code in the Coursera environment." 1360 | ] 1361 | }, 1362 | { 1363 | "cell_type": "markdown", 1364 | "metadata": {}, 1365 | "source": [ 1366 | "## Explore the Dataset\n", 1367 | "\n", 1368 | "Once the dataset is loaded, you can explore it by using the following loop:\n", 1369 | "\n", 1370 | "```python\n", 1371 | "for feature in tfds.as_numpy(dataset['train']):\n", 1372 | " for key, value in feature.items():\n", 1373 | " if key == 'image':\n", 1374 | " value = value.shape\n", 1375 | " print(key, value)\n", 1376 | " break\n", 1377 | "```\n", 1378 | "\n", 1379 | "**Note:** Since we couldn't build the `imdb_faces` dataset due to its size, we are unable to run the above code in the Coursera environment.\n", 1380 | "\n", 1381 | "The expected output from the code block shown above should be:\n", 1382 | "\n", 1383 | "```python\n", 1384 | ">>>\n", 1385 | "celeb_id 12387\n", 1386 | "dob 722957\n", 1387 | "face_location [1. 0.56327355 1. 1. 
]\n",
1388 | "face_score 4.0612864\n",
1389 | "gender 0\n",
1390 | "image (96, 97, 3)\n",
1391 | "photo_taken 2007\n",
1392 | "second_face_score 3.6680346\n",
1393 | "```"
1394 | ]
1395 | },
1396 | {
1397 | "cell_type": "markdown",
1398 | "metadata": {
1399 | "colab_type": "text",
1400 | "id": "BhUO2vXDZw8q"
1401 | },
1402 | "source": [
1403 | "# Next steps for publishing\n",
1404 | "\n",
1405 | "**Double-check the citation** \n",
1406 | "\n",
1407 | "It's important that DatasetInfo.citation includes a good citation for the dataset. Contributing a dataset to the community is hard and important work, and we want to make it easy for dataset users to cite your work.\n",
1408 | "\n",
1409 | "If the dataset's website has a specifically requested citation, use that (in BibTeX format).\n",
1410 | "\n",
1411 | "If the paper is on arXiv, find it there and click the bibtex link on the right-hand side.\n",
1412 | "\n",
1413 | "If the paper is not on arXiv, find the paper on Google Scholar, click the double-quotation mark underneath the title and, on the popup, click BibTeX.\n",
1414 | "\n",
1415 | "If there is no associated paper (for example, there's just a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).\n",
1416 | " \n",
1417 | "\n",
1418 | "**Add a test** \n",
1419 | "\n",
1420 | "Most datasets in TFDS should have a unit test, and your reviewer may ask you to add one if you haven't already; see the test sketch at the end of this notebook. \n",
1421 | "**Check your code style** \n",
1422 | "\n",
1423 | "Follow the PEP 8 Python style guide, except TensorFlow uses 2 spaces instead of 4. Please conform to the Google Python Style Guide.\n",
1424 | "\n",
1425 | "Most importantly, use tensorflow_datasets/oss_scripts/lint.sh to ensure your code is properly formatted. For example, to lint the image directory, pass `tensorflow_datasets/image` to the script.\n",
1426 | "See the TensorFlow code style guide for more information.\n",
1427 | "\n",
1428 | "**Add release notes** \n",
1429 | "Add the dataset to the release notes. The release note will be published for the next release.\n",
1430 | "\n",
1431 | "**Send for review!** \n",
1432 | "Send the pull request for review.\n",
1433 | "\n",
1434 | "For more information, visit https://www.tensorflow.org/datasets/add_dataset"
1435 | ]
1436 | },
1437 | {
1438 | "cell_type": "markdown",
1439 | "metadata": {},
1440 | "source": [
1441 | "# Submission Instructions"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 23,
1447 | "metadata": {},
1448 | "outputs": [],
1449 | "source": [
1450 | "# Now click the 'Submit Assignment' button above."
1451 | ]
1452 | },
1453 | {
1454 | "cell_type": "markdown",
1455 | "metadata": {},
1456 | "source": [
1457 | "# When you're done or would like to take a break, please run the two cells below to save your work and close the Notebook. This frees up resources for your fellow learners."
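For the "Add a test" step mentioned above, TFDS ships a test harness that exercises a builder against small fake data checked into the repository (under `testing/test_data/fake_examples/<dataset_name>/`). A minimal sketch for this dataset; the split count is a placeholder that must match however many fake examples you actually create:

```python
from tensorflow_datasets import testing
from tensorflow_datasets.image import imdb_faces


class ImdbFacesTest(testing.DatasetBuilderTestCase):
  DATASET_CLASS = imdb_faces.ImdbFaces
  SPLITS = {"train": 3}  # Expected number of fake training examples.


if __name__ == "__main__":
  testing.test_main()
```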
1458 | ] 1459 | }, 1460 | { 1461 | "cell_type": "code", 1462 | "execution_count": null, 1463 | "metadata": {}, 1464 | "outputs": [], 1465 | "source": [ 1466 | "%%javascript\n", 1467 | "\n", 1468 | "IPython.notebook.save_checkpoint();" 1469 | ] 1470 | }, 1471 | { 1472 | "cell_type": "code", 1473 | "execution_count": null, 1474 | "metadata": {}, 1475 | "outputs": [], 1476 | "source": [ 1477 | "%%javascript\n", 1478 | "\n", 1479 | "window.onbeforeunload = null\n", 1480 | "window.close();\n", 1481 | "IPython.notebook.session.delete();" 1482 | ] 1483 | } 1484 | ], 1485 | "metadata": { 1486 | "accelerator": "GPU", 1487 | "colab": { 1488 | "collapsed_sections": [], 1489 | "name": "TFDS Week 4 - Question.ipynb", 1490 | "provenance": [], 1491 | "toc_visible": true 1492 | }, 1493 | "coursera": { 1494 | "course_slug": "data-pipelines-tensorflow", 1495 | "graded_item_id": "fqOvf", 1496 | "launcher_item_id": "QCJEw" 1497 | }, 1498 | "kernelspec": { 1499 | "display_name": "Python 3", 1500 | "language": "python", 1501 | "name": "python3" 1502 | }, 1503 | "language_info": { 1504 | "codemirror_mode": { 1505 | "name": "ipython", 1506 | "version": 3 1507 | }, 1508 | "file_extension": ".py", 1509 | "mimetype": "text/x-python", 1510 | "name": "python", 1511 | "nbconvert_exporter": "python", 1512 | "pygments_lexer": "ipython3", 1513 | "version": "3.6.8" 1514 | } 1515 | }, 1516 | "nbformat": 4, 1517 | "nbformat_minor": 1 1518 | } 1519 | --------------------------------------------------------------------------------