├── .github
│   └── PULL_REQUEST_TEMPLATE.md
├── 0_Running_TensorFlow_In_SageMaker.ipynb
├── 1_Monitoring_your_TensorFlow_scripts.ipynb
├── 2_Using_Pipemode_input_for_big_datasets.ipynb
├── 3_Distributed_training_with_Horovod.ipynb
├── 4_Deploying_your_TensorFlow_model.ipynb
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── generate_cifar10_tfrecords.py
└── training_script
    └── cifar10_keras.py

/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Issue #, if available:*
2 | 
3 | *Description of changes:*
4 | 
5 | 
6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
7 | 
--------------------------------------------------------------------------------
/0_Running_TensorFlow_In_SageMaker.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Train a Keras Sequential Model\n",
8 | "This notebook shows how to train and host a Keras Sequential model on SageMaker. The model used for this notebook is a simple deep CNN that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py)."
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "## The dataset\n",
16 | "The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:\n",
17 | "\n",
18 | "![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)\n",
19 | "\n",
20 | "In this tutorial, we will train a deep CNN to recognize these images.\n",
21 | "\n",
22 | "We'll compare training with File mode, Pipe mode datasets, and distributed training with Horovod."
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": null,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "!pip install sagemaker-experiments sagemaker boto3 awscli --upgrade"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "## Getting the data\n",
39 | "To use the CIFAR-10 dataset, we first need to download it and convert it to TFRecords. This step takes around 5 minutes.\n",
40 | "\n",
41 | "You can use the following command:"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "!python generate_cifar10_tfrecords.py --data-dir ./data"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "## Run the training locally"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "The script uses arguments for configuration. It requires the following configurations:\n",
65 | "1. model_dir - the location where it'll save checkpoints and logs\n",
66 | "2. train, validation, eval - the locations of the relevant TFRecords\n",
67 | "\n",
68 | "Run the script locally:"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "!mkdir -p logs\n",
78 | "!python training_script/cifar10_keras.py --model_dir ./logs \\\n",
79 | " --train data/train \\\n",
80 | " --validation data/validation \\\n",
81 | " --eval data/eval \\\n",
82 | " --epochs 1\n",
83 | "!rm -rf logs"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "**Although the script ran on a SageMaker notebook instance, you can run the same script on your computer using the same command.**"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "## Use TensorFlow Script Mode\n",
98 | "For TensorFlow versions 1.11 and later, the Amazon SageMaker Python SDK supports script mode training scripts. Script mode has the following advantages over legacy mode training scripts:\n",
99 | "\n",
100 | "* Script mode training scripts are more similar to training scripts you write for TensorFlow in general, so it is easier to modify your existing TensorFlow training scripts to work with Amazon SageMaker.\n",
101 | "\n",
102 | "* Script mode supports both Python 2.7- and Python 3.6-compatible source files.\n",
103 | "\n",
104 | "* Script mode supports Horovod for distributed training.\n",
105 | "\n",
106 | "For information about writing TensorFlow script mode training scripts and using TensorFlow script mode estimators and models with Amazon SageMaker, see https://sagemaker.readthedocs.io/en/stable/using_tf.html."
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "### Preparing your script for training in SageMaker\n",
114 | "The training script is very similar to a training script you might run outside of SageMaker.\n",
115 | "SageMaker runs the script with one argument, model_dir, an S3 location that is used for logs and artifacts.\n",
116 | "\n",
117 | "You can access useful properties about the training environment through various environment variables.\n",
118 | "In this example, we are sending 3 data channels to the script: Train, Validation, Eval.\n",
119 | "\n",
120 | "**Create a copy of the script (training_script/cifar10_keras.py) and save it as training_script/cifar10_keras_sm.py.**\n",
121 | "\n",
122 | "In cifar10_keras_sm.py, scroll down to the **if __name__ == '__main__':** section. \n",
123 | "Update the train, validation, and eval arguments to get the data locations by default from the relevant environment variables: SM_CHANNEL_TRAIN, SM_CHANNEL_VALIDATION, SM_CHANNEL_EVAL. \n",
124 | "Add the default configuration to the arguments in **cifar10_keras_sm.py**. \n",
\n", 125 | "The lines should look as following:\n", 126 | "```python\n", 127 | "parser.add_argument(\n", 128 | " '--train',\n", 129 | " type=str,\n", 130 | " required=False,\n", 131 | " default=os.environ.get('SM_CHANNEL_TRAIN'),\n", 132 | " help='The directory where the CIFAR-10 input data is stored.')\n", 133 | "parser.add_argument(\n", 134 | " '--validation',\n", 135 | " type=str,\n", 136 | " required=False,\n", 137 | " default=os.environ.get('SM_CHANNEL_VALIDATION'),\n", 138 | " help='The directory where the CIFAR-10 input data is stored.')\n", 139 | "parser.add_argument(\n", 140 | " '--eval',\n", 141 | " type=str,\n", 142 | " required=False,\n", 143 | " default=os.environ.get('SM_CHANNEL_EVAL'),\n", 144 | " help='The directory where the CIFAR-10 input data is stored.')\n", 145 | "```\n", 146 | "\n", 147 | "For info see the SageMaker-python-sdk [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script)\n", 148 | "\n", 149 | "SageMaker will not send the locations as arguments, it'll use environment variables instead.\n", 150 | "\n", 151 | "SageMaker send different useful environment variables to your scripts, e.g.:\n", 152 | "* `SM_MODEL_DIR`: A string that represents the local path where the training job can write the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. This is different than the model_dir argument passed to your training script which is a S3 location. `SM_MODEL_DIR` is always set to /opt/ml/model.\n", 153 | "* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.\n", 154 | "* `SM_OUTPUT_DATA_DIR`: A string that represents the path to the directory to write output artifacts to. Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. These artifacts are compressed and uploaded to S3 to an S3 bucket with the same prefix as the model artifacts.\n", 155 | "\n", 156 | "In this Example, to reduce the network latency. we would like to save the model checkpoints locally, they will be uploaded to S3 at the end of the job.\n", 157 | "\n", 158 | "Add the following argument to your script:\n", 159 | "```python\n", 160 | "parser.add_argument(\n", 161 | " '--model_output_dir',\n", 162 | " type=str,\n", 163 | " default=os.environ.get('SM_MODEL_DIR'))\n", 164 | "```\n", 165 | "Change the ModelCheckPoint line to use to new location:\n", 166 | "```python\n", 167 | "checkpoint = ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5')\n", 168 | "```\n", 169 | "\n", 170 | "Change the save_model call to use that folder. \n", 171 | "From: \n", 172 | "```python\n", 173 | "return save_model(model, args.model_dir)\n", 174 | "```\n", 175 | "To: \n", 176 | "```python\n", 177 | "return save_model(model, args.model_output_dir)\n", 178 | "```" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Test your script locally\n", 186 | "For testing, run the new script with the same command as above, make sure it runs as expected. \n", 187 | "Add the new model_output_dir as an argument for the script. 
" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "# Run the script locally\n", 197 | "!mkdir -p logs\n", 198 | "!python training_script/cifar10_keras_sm.py --model_dir ./logs \\\n", 199 | " --model_output_dir ./logs \\\n", 200 | " --train data/train \\\n", 201 | " --validation data/validation \\\n", 202 | " --eval data/eval \\\n", 203 | " --epochs 1\n", 204 | "!rm -rf logs" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "### Use SageMaker local for local testing\n", 212 | "The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the TensorFlow or MXNet estimators. To do this, it uses Docker compose and NVIDIA Docker. It will also pull the Amazon SageMaker TensorFlow container from Amazon ECS.\n", 213 | "\n", 214 | "Training in local mode also allows us to easily monitor metrics like GPU consumption to ensure that our code is written properly to take advantage of the hardware we’re using." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "import os\n", 224 | "import sagemaker\n", 225 | "from sagemaker import get_execution_role\n", 226 | "\n", 227 | "sagemaker_session = sagemaker.Session()\n", 228 | "\n", 229 | "role = get_execution_role()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "Using the sagemaker.tensorflow class we will create a new SageMaker TensorFlow job\n", 237 | "We can use the command to pass different configuration or hyperparameters to the script\n", 238 | "\n", 239 | "For info see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "from sagemaker.tensorflow import TensorFlow\n", 249 | "estimator = TensorFlow(base_job_name='cifar10',\n", 250 | " entry_point='cifar10_keras_sm.py',\n", 251 | " source_dir='training_script',\n", 252 | " role=role,\n", 253 | " framework_version='1.12.0',\n", 254 | " py_version='py3',\n", 255 | " hyperparameters={'epochs' : 1},\n", 256 | " train_instance_count=1, train_instance_type='local')" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "estimator.fit({'train' : 'file://data/train',\n", 266 | " 'validation' : 'file://data/validation',\n", 267 | " 'eval' : 'file://data/eval'})" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "The first time the estimator runs, it needs to download the container image from its Amazon ECR repository, but then training can begin immediately. There’s no need to wait for a separate training cluster to be provisioned. In addition, on subsequent runs, which may be necessary when iterating and testing, changes to your MXNet or TensorFlow script will start to run instantaneously." 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## Using SageMaker\n", 282 | "In the next part, we'll use a GPU machine for faster training time\n", 283 | "First, We'll upload the data to S3. 
\n", 284 | "SageMaker creates a default bucket per region" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "### Start a new SageMaker experiment\n", 292 | "Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you organize, track, compare, and evaluate your machine learning experiments.\n", 293 | "\n", 294 | "Machine learning is an iterative process. You need to experiment with multiple combinations of data, algorithm and parameters, all the while observing the impact of incremental changes on model accuracy. Over time this iterative experimentation can result in thousands of model training runs and model versions. This makes it hard to track the best performing models and their input configurations. It’s also difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements.\n", 295 | "\n", 296 | "Amazon SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. Experiments is integrated with Amazon SageMaker Studio providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best performing models." 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "from smexperiments.experiment import Experiment\n", 306 | "from smexperiments.trial import Trial\n", 307 | "import time\n", 308 | "\n", 309 | "cifar10_experiment = Experiment.create(\n", 310 | " experiment_name=\"TensorFlow-cifar10-experiment\",\n", 311 | " description=\"Classification of cifar10 images\")" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n", 321 | "display(dataset_location)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "First we'll create a new trial in this trial we'll run a simple 20 epochs training job on a GPU instance" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# create trial\n", 338 | "trial_name = f\"cifar10-training-job-{int(time.time())}\"\n", 339 | "trial = Trial.create(\n", 340 | " trial_name=trial_name, \n", 341 | " experiment_name=cifar10_experiment.experiment_name\n", 342 | ")" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Create a new estimator (You can copy the command from above), this time use the **ml.p3.2xlarge** as the instance type and configure **epochs:20**" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": {}, 356 | "outputs": [], 357 | "source": [ 358 | "from sagemaker.tensorflow import TensorFlow\n", 359 | "estimator = ..." 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "This time, use the S3 data location for each of the channels\n", 367 | "```python\n", 368 | "dataset_location + '/train'\n", 369 | "dataset_location + '/validation' \n", 370 | "dataset_location + '/eval'\n", 371 | "```\n", 372 | "\n", 373 | "Connect the trial configured above to the job. 
373 | "Connect the trial configured above to the job. Add the experiment config to the fit function:\n",
374 | "```python\n",
375 | "experiment_config={\n",
376 | " \"ExperimentName\": cifar10_experiment.experiment_name, \n",
377 | " \"TrialName\": trial.trial_name,\n",
378 | " \"TrialComponentDisplayName\": \"Training\"}\n",
379 | "```"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": null,
385 | "metadata": {},
386 | "outputs": [],
387 | "source": [
388 | "estimator.fit({'train' : 'train_data_location',\n",
389 | " 'validation' : 'validation_data_location',\n",
390 | " 'eval' : 'eval_data_location'},\n",
391 | " experiment_config=)"
392 | ]
393 | },
394 | {
395 | "cell_type": "markdown",
396 | "metadata": {},
397 | "source": [
398 | "**Good job!** \n",
399 | "You were able to run 20 epochs on a bigger instance in SageMaker. \n",
400 | "Before continuing to the next notebook, take a look at the training jobs section in the SageMaker console; find your job and look at its configuration."
401 | ]
402 | }
403 | ],
404 | "metadata": {
405 | "kernelspec": {
406 | "display_name": "conda_tensorflow_p36",
407 | "language": "python",
408 | "name": "conda_tensorflow_p36"
409 | },
410 | "language_info": {
411 | "codemirror_mode": {
412 | "name": "ipython",
413 | "version": 3
414 | },
415 | "file_extension": ".py",
416 | "mimetype": "text/x-python",
417 | "name": "python",
418 | "nbconvert_exporter": "python",
419 | "pygments_lexer": "ipython3",
420 | "version": "3.6.5"
421 | },
422 | "pycharm": {
423 | "stem_cell": {
424 | "cell_type": "raw",
425 | "metadata": {
426 | "collapsed": false
427 | },
428 | "source": []
429 | }
430 | }
431 | },
432 | "nbformat": 4,
433 | "nbformat_minor": 4
434 | }
435 | 
--------------------------------------------------------------------------------
/1_Monitoring_your_TensorFlow_scripts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Monitor and Analyze Training Jobs Using Metrics\n",
8 | "An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset. \n",
9 | "Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. \n",
10 | "\n",
11 | "The training algorithm writes the values of these metrics to logs, which Amazon SageMaker monitors and sends to Amazon CloudWatch in real time.\n",
12 | "\n",
13 | "If you want Amazon SageMaker to parse logs from a custom algorithm and send metrics that the algorithm emits to CloudWatch, you have to specify the metrics that you want Amazon SageMaker to send to CloudWatch when you configure the training job. \n",
14 | "You specify the names of the metrics that you want to send and the regular expressions that Amazon SageMaker uses to parse the logs that your algorithm emits to find those metrics."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Defining Training Metrics (Amazon SageMaker Python SDK)\n",
22 | "Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions as the metric_definitions argument when you initialize an Estimator object. For example, if you want to monitor both the train:error and validation:error metrics in CloudWatch, your Estimator initialization would look like the following:\n",
23 | "```python \n",
24 | "estimator = TensorFlow(base_job_name='cifar10',\n",
25 | " entry_point='cifar10_keras_sm.py',\n",
26 | " source_dir='training_script',\n",
27 | " role=role,\n",
28 | " framework_version='1.12.0',\n",
29 | " py_version='py3',\n",
30 | " metric_definitions=[\n",
31 | " {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},\n",
32 | " {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}\n",
33 | " ],\n",
34 | " hyperparameters={'epochs' : 20},\n",
35 | " train_instance_count=1, train_instance_type='ml.p3.2xlarge')\n",
36 | "```"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Monitoring the cifar10 training\n",
44 | "Find your previous (cifar10_keras_sm) training job in the SageMaker console.\n",
45 | "Open the job details and look at the job's CloudWatch logs. \n",
46 | "Configure the metric regexes to fit the logs. Use regex tools to check your expressions, and use () to capture each metric.\n",
47 | "One possible solution is shown below."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "metric_definitions = [\n",
57 | " {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\\\.]+) - acc: [0-9\\\\.]+'},\n",
58 | " {'Name': 'train:accuracy', 'Regex': 'loss: [0-9\\\\.]+ - acc: ([0-9\\\\.]+)'},\n",
59 | " {'Name': 'validation:accuracy', 'Regex': 'val_loss: [0-9\\\\.]+ - val_acc: ([0-9\\\\.]+)'},\n",
60 | " {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\\\.]+) - val_acc: [0-9\\\\.]+'},\n",
61 | "]"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "Continue with the notebook and run the same job as before (same estimator configuration). This time, add the ```metric_definitions=metric_definitions``` argument (a quick way to sanity-check a regex locally is sketched below). \n",
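If you want to verify a regex before launching a job, a quick local test might look like this (a sketch; the sample line is made up to mimic the shape of Keras progress output):

```python
import re

# A made-up log line in the shape of Keras progress output
sample = 'loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6000'

pattern = 'loss: ([0-9\\.]+) - acc: [0-9\\.]+'
match = re.search(pattern, sample)
if match:
    print('train:loss =', match.group(1))  # prints: train:loss = 1.2345
```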
\n", 69 | "Run the job for 20 epochs" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "import os\n", 79 | "import sagemaker\n", 80 | "from sagemaker import get_execution_role\n", 81 | "\n", 82 | "sagemaker_session = sagemaker.Session()\n", 83 | "\n", 84 | "role = get_execution_role()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### Load the SageMaker experiment" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "from smexperiments.experiment import Experiment\n", 101 | "from smexperiments.trial import Trial\n", 102 | "import time\n", 103 | "cifar10_experiment = Experiment.load(\n", 104 | " experiment_name=\"TensorFlow-cifar10-experiment\")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# create a new trial\n", 114 | "trial_name = f\"cifar10-training-job-{int(time.time())}\"\n", 115 | "trial = Trial.create(\n", 116 | " trial_name=trial_name, \n", 117 | " experiment_name=cifar10_experiment.experiment_name\n", 118 | ")" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "outputs": [], 125 | "source": [ 126 | "# Configure the dataset location variable\n", 127 | "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n", 128 | "display(dataset_location)" 129 | ], 130 | "metadata": { 131 | "collapsed": false, 132 | "pycharm": { 133 | "name": "#%%\n" 134 | } 135 | } 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "from sagemaker.tensorflow import TensorFlow\n", 144 | "estimator = ... # Make sure you use the metric_definitions=metric_definitions argument." 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Connect the trial configured above to the job. add the experiment config to the fit function.\n", 152 | "```python\n", 153 | "experiment_config={\n", 154 | " \"ExperimentName\": cifar10_experiment.experiment_name, \n", 155 | " \"TrialName\": trial.trial_name,\n", 156 | " \"TrialComponentDisplayName\": \"Training\"}\n", 157 | "```" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "estimator.fit({'train' : 'train_data_location',\n", 167 | " 'validation' : 'validation_data_location',\n", 168 | " 'eval' : 'eval_data_location'},\n", 169 | " experiment_config=)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "### View the job training metrics\n", 177 | "SageMaker used the regular expression configured above, to send the job metrics to CloudWatch metrics.\n", 178 | "You can now view the job metrics directly from the SageMaker console. \n", 179 | "\n", 180 | "login to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home) choose the latest training job, scroll down to the monitor section. 
\n", 181 | "Using CloudWatch metrics, you can change the period and configure the statistics\n", 182 | "\n", 183 | "Find the metrics using the following cell:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "from IPython.core.display import Markdown\n", 193 | "\n", 194 | "link = 'https://console.aws.amazon.com/cloudwatch/home?region='+sagemaker_session.boto_region_name+'#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20'+estimator.latest_training_job.job_name\n", 195 | "display(Markdown('CloudWatch metrics: [link]('+link+')'))\n", 196 | "display(Markdown('After you choose a metric, change the period to 1 Minute (Graphed Metrics -> Period)'))" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "## Monitor with TensorBoard\n", 204 | "TensorBoard provides the visualization and tooling needed for machine learning experimentation:\n", 205 | "* Tracking and visualizing metrics such as loss and accuracy\n", 206 | "* Visualizing the model graph (ops and layers)\n", 207 | "* Viewing histograms of weights, biases, or other tensors as they change over time\n", 208 | "* Projecting embeddings to a lower dimensional space\n", 209 | "* Displaying images, text, and audio data\n", 210 | "* And much more\n", 211 | "\n", 212 | "In the next section we'll update the script to save TensorBoard logs. \n", 213 | "We'll be able to use TensorBoard for monitoring our jobs in real time. \n", 214 | "\n", 215 | "Update your cifar10-keras-sm.py script to send logs to TensorBoard. \n", 216 | "Add the `from keras.callbacks import TensorBoard` import.\n", 217 | "\n", 218 | "Keras will send TensorBoard logs in each batch. sending the logs to S3 will slow down the training job, change the TensorBoard callback to send the logs only at the end of an epoch.\n", 219 | "\n", 220 | "Add the TensorBoard callback to your script (add this line after the ModelCheckpoint callback)\n", 221 | "```python\n", 222 | "tb_callback = TensorBoard(log_dir=args.model_dir,update_freq='epoch')\n", 223 | "```\n", 224 | "\n", 225 | "Add tb_callback to the model.fit callbacks list.\n", 226 | "```python\n", 227 | "callbacks=[checkpoint, tb_callback]\n", 228 | "```" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "### Run a training job with TensorBoard support" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "# create a new trial\n", 245 | "trial_name = f\"cifar10-training-job-{int(time.time())}\"\n", 246 | "trial = Trial.create(\n", 247 | " trial_name=trial_name, \n", 248 | " experiment_name=cifar10_experiment.experiment_name\n", 249 | ")" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "estimator = ..." 
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "estimator.fit({'train' : 'train_data_location',\n",
268 | " 'validation' : 'validation_data_location',\n",
269 | " 'eval' : 'eval_data_location'},\n",
270 | " experiment_config=,\n",
271 | " wait=False) # Use wait=False to run async jobs"
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "### Install TensorBoard on your local machine\n",
279 | "Install [TensorBoard](https://github.com/tensorflow/tensorboard) locally (on your computer) using `pip install tensorboard`. \n",
280 | "To access an S3 log directory, configure the TensorBoard default region. \n",
281 | "You can do this by configuring an environment variable named AWS_REGION, and setting the value of the environment variable to the AWS region your training job runs in. \n",
282 | "For example, `AWS_REGION='us-east-1' tensorboard --logdir model_dir` \n",
283 | "\n",
284 | "**You can get your TensorBoard command from the next cell.**\n",
285 | "\n",
286 | "You'll need an access key and secret key with access to model_dir for this event; get those from https://dashboard.eventengine.run/dashboard. \n",
287 | "\n",
288 | "AWS Console -> copy the Credentials / CLI Snippets and run them in your Terminal / CMD."
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": null,
294 | "metadata": {},
295 | "outputs": [],
296 | "source": [
297 | "from IPython.core.display import Markdown\n",
298 | "\n",
299 | "link = 'AWS_REGION=\\''+sagemaker_session.boto_region_name+'\\' tensorboard --logdir ' + estimator.model_dir\n",
300 | "display(Markdown('TensorBoard command: '+link))"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "You can access TensorBoard locally at http://localhost:6006"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "### Analyze the experiments"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "search_expression = {\n",
324 | " \"Filters\":[\n",
325 | " {\n",
326 | " \"Name\": \"DisplayName\",\n",
327 | " \"Operator\": \"Equals\",\n",
328 | " \"Value\": \"Training\",\n",
329 | " }\n",
330 | " ],\n",
331 | "}"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "import pandas as pd \n",
341 | "pd.options.display.max_columns = 500\n",
342 | "\n",
343 | "from sagemaker.analytics import ExperimentAnalytics\n",
344 | "trial_component_analytics = ExperimentAnalytics(\n",
345 | " sagemaker_session=sagemaker_session, \n",
346 | " experiment_name=cifar10_experiment.experiment_name,\n",
347 | " search_expression=search_expression\n",
348 | ")\n",
349 | "\n",
350 | "table = trial_component_analytics.dataframe(force_refresh=True)\n",
351 | "display(table)"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "**Good job!** \n",
359 | "You can now monitor your job using both CloudWatch metrics and TensorBoard. \n",
360 | "Before continuing to the next notebook, take a look at the [TensorBoard callback configuration](https://keras.io/callbacks/#tensorboard) for other TensorBoard configurations."
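For example, a few of the other options the callback accepts (a sketch based on the Keras 2.x TensorBoard callback signature; some options add logging overhead, especially when log_dir is an S3 path):

```python
from keras.callbacks import TensorBoard

tb_callback = TensorBoard(log_dir=args.model_dir,
                          update_freq='epoch',  # write logs once per epoch instead of per batch
                          write_graph=True,     # log the model graph for the Graphs tab
                          histogram_freq=0)     # set >0 to log weight histograms every N epochs
```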
361 | ]
362 | }
363 | ],
364 | "metadata": {
365 | "kernelspec": {
366 | "display_name": "conda_tensorflow_p36",
367 | "language": "python",
368 | "name": "conda_tensorflow_p36"
369 | },
370 | "language_info": {
371 | "codemirror_mode": {
372 | "name": "ipython",
373 | "version": 3
374 | },
375 | "file_extension": ".py",
376 | "mimetype": "text/x-python",
377 | "name": "python",
378 | "nbconvert_exporter": "python",
379 | "pygments_lexer": "ipython3",
380 | "version": "3.6.5"
381 | },
382 | "pycharm": {
383 | "stem_cell": {
384 | "cell_type": "raw",
385 | "source": [],
386 | "metadata": {
387 | "collapsed": false
388 | }
389 | }
390 | }
391 | },
392 | "nbformat": 4,
393 | "nbformat_minor": 4
394 | }
--------------------------------------------------------------------------------
/2_Using_Pipemode_input_for_big_datasets.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Training with Pipe Mode using PipeModeDataset\n",
8 | "Amazon SageMaker allows users to create training jobs using Pipe input mode. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space.\n",
9 | "\n",
10 | "SageMaker TensorFlow provides an implementation of `tf.data.Dataset` that makes it easy to take advantage of Pipe input mode in SageMaker. You can replace your `tf.data.Dataset` with a `sagemaker_tensorflow.PipeModeDataset` to read TFRecords as they are streamed to your training instances.\n",
11 | "\n",
12 | "In your entry_point script, you can use `PipeModeDataset` like a `Dataset`. In this example, we create a `PipeModeDataset` to read TFRecords from the ‘training’ channel:\n",
13 | "\n",
14 | "```python\n",
15 | "from sagemaker_tensorflow import PipeModeDataset\n",
16 | "\n",
17 | "features = {\n",
18 | " 'data': tf.FixedLenFeature([], tf.string),\n",
19 | " 'labels': tf.FixedLenFeature([], tf.int64),\n",
20 | "}\n",
21 | "\n",
22 | "def parse(record):\n",
23 | " parsed = tf.parse_single_example(record, features)\n",
24 | " return ({\n",
25 | " 'data': tf.decode_raw(parsed['data'], tf.float64)\n",
26 | " }, parsed['labels'])\n",
27 | "\n",
28 | "def train_input_fn(training_dir, hyperparameters):\n",
29 | " ds = PipeModeDataset(channel='training', record_format='TFRecord')\n",
30 | " ds = ds.repeat(20)\n",
31 | " ds = ds.prefetch(10)\n",
32 | " ds = ds.map(parse, num_parallel_calls=10)\n",
33 | " ds = ds.batch(64)\n",
34 | " return ds\n",
35 | "```\n",
36 | "\n",
37 | "To run a training job with Pipe input mode, pass in input_mode='Pipe' to your TensorFlow Estimator:\n",
38 | "\n",
39 | "```python\n",
40 | "from sagemaker.tensorflow import TensorFlow\n",
41 | "\n",
42 | "tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole',\n",
43 | " train_instance_count=1, train_instance_type='ml.c5.2xlarge',\n",
44 | " framework_version='1.12.0', input_mode='Pipe')\n",
45 | "\n",
46 | "tf_estimator.fit('s3://bucket/path/to/training/data')\n",
47 | "```"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Create a training script that supports Pipe mode datasets\n",
55 | "Create a copy of the script (training_script/cifar10_keras_sm.py) and save it as training_script/cifar10_keras_pipe.py.\n",
56 | "\n",
57 | "In cifar10_keras_pipe.py, import the PipeModeDataset using:\n",
58 | "```python\n",
59 | "from sagemaker_tensorflow import PipeModeDataset\n",
60 | "```\n",
61 | "Update \n",
62 | "```python\n",
63 | "def _input(epochs, batch_size, channel, channel_name):\n",
64 | "```\n",
65 | "to create the dataset variable using\n",
66 | "```python\n",
67 | "dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')\n",
68 | "```\n",
69 | "\n",
70 | "The new _input function should look like the following:\n",
71 | "```python\n",
72 | "def _input(epochs, batch_size, channel, channel_name):\n",
73 | " dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')\n",
74 | "\n",
75 | " dataset = dataset.repeat(epochs)\n",
76 | " dataset = dataset.prefetch(10)\n",
77 | " ...\n",
78 | "```\n",
79 | "For info see the SageMaker Python SDK [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-pipe-mode-using-pipemodedataset)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "Run the previous job again; this time, use the new script (cifar10_keras_pipe.py).\n",
87 | "Run the job for 20 epochs and configure it with `input_mode='Pipe'`."
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "import os\n",
97 | "import sagemaker\n",
98 | "from sagemaker import get_execution_role\n",
99 | "\n",
100 | "sagemaker_session = sagemaker.Session()\n",
101 | "\n",
102 | "role = get_execution_role()"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "### Load the SageMaker experiment"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "from smexperiments.experiment import Experiment\n",
119 | "from smexperiments.trial import Trial\n",
120 | "import time\n",
121 | "cifar10_experiment = Experiment.load(\n",
122 | " experiment_name=\"TensorFlow-cifar10-experiment\")"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# create a new trial\n",
132 | "trial_name = f\"cifar10-training-job-pipemode-{int(time.time())}\"\n",
133 | "trial = Trial.create(\n",
134 | " trial_name=trial_name, \n",
135 | " experiment_name=cifar10_experiment.experiment_name\n",
136 | ")"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "outputs": [],
143 | "source": [
144 | "# Configure the dataset location variable\n",
145 | "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n",
146 | "display(dataset_location)"
147 | ],
148 | "metadata": {
149 | "collapsed": false,
150 | "pycharm": {
151 | "name": "#%%\n"
152 | }
153 | }
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "outputs": [],
159 | "source": [
160 | "metric_definitions = [\n",
161 | " {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\\\.]+) - acc: [0-9\\\\.]+'},\n",
162 | " {'Name': 'train:accuracy', 'Regex': 'loss: [0-9\\\\.]+ - acc: ([0-9\\\\.]+)'},\n",
163 | " {'Name': 'validation:accuracy', 'Regex': 'val_loss: [0-9\\\\.]+ - val_acc: ([0-9\\\\.]+)'},\n",
164 | " {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\\\.]+) - val_acc: [0-9\\\\.]+'},\n",
165 | "]"
166 | ],
167 | "metadata": {
168 | "collapsed": false,
169 | "pycharm": {
170 | "name": "#%%\n"
171 | }
172 | }
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "from sagemaker.tensorflow import TensorFlow\n",
181 | "# You should add the metric_definitions argument to all of your jobs\n",
182 | "# Change base_job_name to 'cifar10-pipe' for console visibility\n",
183 | "# Remember to configure input_mode='Pipe' \n",
184 | "estimator = ... "
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "Connect the trial configured above to the job. Add the experiment config to the fit function:\n",
192 | "```python\n",
193 | "experiment_config={\n",
194 | " \"ExperimentName\": cifar10_experiment.experiment_name, \n",
195 | " \"TrialName\": trial.trial_name,\n",
196 | " \"TrialComponentDisplayName\": \"Training\"}\n",
197 | "```"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "estimator.fit({'train' : 'train_data_location',\n",
207 | " 'validation' : 'validation_data_location',\n",
208 | " 'eval' : 'eval_data_location'},\n",
209 | " experiment_config=)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "### Analyze the experiments"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "search_expression = {\n",
226 | " \"Filters\":[\n",
227 | " {\n",
228 | " \"Name\": \"DisplayName\",\n",
229 | " \"Operator\": \"Equals\",\n",
230 | " \"Value\": \"Training\",\n",
231 | " }\n",
232 | " ],\n",
233 | "}"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "import pandas as pd \n",
243 | "pd.options.display.max_columns = 500\n",
244 | "\n",
245 | "from sagemaker.analytics import ExperimentAnalytics\n",
246 | "trial_component_analytics = ExperimentAnalytics(\n",
247 | " sagemaker_session=sagemaker_session, \n",
248 | " experiment_name=cifar10_experiment.experiment_name,\n",
249 | " search_expression=search_expression\n",
250 | ")\n",
251 | "\n",
252 | "table = trial_component_analytics.dataframe(force_refresh=True)\n",
253 | "display(table)"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "**Good job!** \n",
261 | "You can now use Pipe mode datasets. With big datasets, this will reduce the training time and the amount of local disk space needed.\n",
262 | "Before continuing to the next notebook, look at the Pipe mode job metrics in CloudWatch and TensorBoard."
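For reference, a complete Pipe mode `_input` function might look like the sketch below. The feature names, image shape, and parsing details are assumptions modeled on the CIFAR-10 TFRecord layout used in this workshop; check them against the output of generate_cifar10_tfrecords.py:

```python
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

HEIGHT, WIDTH, DEPTH, NUM_CLASSES = 32, 32, 3, 10

def _dataset_parser(record):
    # Assumed feature names; match them to generate_cifar10_tfrecords.py
    features = tf.parse_single_example(record, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.reshape(image, [HEIGHT, WIDTH, DEPTH])
    image = tf.cast(image, tf.float32)
    label = tf.one_hot(tf.cast(features['label'], tf.int32), NUM_CLASSES)
    return image, label

def _input(epochs, batch_size, channel, channel_name):
    # Stream TFRecords from the channel instead of reading files from disk
    dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')
    dataset = dataset.repeat(epochs)
    dataset = dataset.prefetch(10)
    dataset = dataset.map(_dataset_parser, num_parallel_calls=10)
    dataset = dataset.shuffle(buffer_size=10 * batch_size)
    dataset = dataset.batch(batch_size)
    return dataset
```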
263 | ]
264 | }
265 | ],
266 | "metadata": {
267 | "kernelspec": {
268 | "display_name": "conda_tensorflow_p36",
269 | "language": "python",
270 | "name": "conda_tensorflow_p36"
271 | },
272 | "language_info": {
273 | "codemirror_mode": {
274 | "name": "ipython",
275 | "version": 3
276 | },
277 | "file_extension": ".py",
278 | "mimetype": "text/x-python",
279 | "name": "python",
280 | "nbconvert_exporter": "python",
281 | "pygments_lexer": "ipython3",
282 | "version": "3.6.5"
283 | },
284 | "pycharm": {
285 | "stem_cell": {
286 | "cell_type": "raw",
287 | "source": [],
288 | "metadata": {
289 | "collapsed": false
290 | }
291 | }
292 | }
293 | },
294 | "nbformat": 4,
295 | "nbformat_minor": 4
296 | }
--------------------------------------------------------------------------------
/3_Distributed_training_with_Horovod.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Distributed training with Horovod\n",
8 | "Horovod is a distributed training framework based on MPI. Horovod is only available with TensorFlow version 1.12 or newer. You can find more details at [Horovod README](https://github.com/uber/horovod).\n",
9 | "\n",
10 | "To enable Horovod, we need to make small changes to our script."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "## Create a training script that supports Horovod distributed training\n",
18 | "Create a copy of the script (training_script/cifar10_keras_sm.py, **not the pipe script**) and save it as training_script/cifar10_keras_dist.py.\n",
19 | "In:\n",
20 | "```python\n",
21 | "def main(args):\n",
22 | "```\n",
23 | "\n",
24 | "### Start Horovod\n",
25 | "Add Horovod support using the following code:\n",
26 | "```python\n",
27 | " import horovod.keras as hvd\n",
28 | " hvd.init()\n",
29 | " config = tf.ConfigProto()\n",
30 | " config.gpu_options.allow_growth = True\n",
31 | " config.gpu_options.visible_device_list = str(hvd.local_rank())\n",
32 | " K.set_session(tf.Session(config=config))\n",
33 | "```\n",
34 | "\n",
35 | "### Configure callbacks\n",
36 | "Add the following callbacks:\n",
37 | "```python\n",
38 | "hvdBroadcast = hvd.callbacks.BroadcastGlobalVariablesCallback(0)\n",
39 | "hvdMetricAverage = hvd.callbacks.MetricAverageCallback()\n",
40 | "hvdLearningRate = hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)\n",
41 | "```\n",
42 | "\n",
43 | "Change the checkpoint and TensorBoard callbacks to run only on `hvd.rank() == 0` (you want only a single process to send logs):\n",
44 | "```python\n",
45 | "callbacks = [hvdBroadcast,hvdMetricAverage,hvdLearningRate]\n",
46 | "if hvd.rank() == 0:\n",
47 | " callbacks.append(checkpoint)\n",
48 | " callbacks.append(tb_callback)\n",
49 | "```\n",
50 | "Update model.fit to use the new callbacks list.\n",
51 | "\n",
52 | "### Configure the optimizer\n",
53 | "In\n",
54 | "```python\n",
55 | "# Add hvd to the function. Also add it in the function call\n",
56 | "def keras_model_fn(learning_rate, weight_decay, optimizer, momentum, hvd): \n",
57 | "```\n",
58 | "configure the Horovod optimizer.\n",
59 | "Change `size=1` to `size=hvd.size()` \n",
60 | "\n",
61 | "Add \n",
62 | "```python\n",
63 | "opt = hvd.DistributedOptimizer(opt)\n",
64 | "```\n",
65 | "before \n",
66 | "```python\n",
67 | " model.compile(loss='categorical_crossentropy',\n",
68 | " optimizer=opt,\n",
69 | " metrics=['accuracy'])\n",
70 | "```"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "## Run distributed training\n",
78 | "To start a distributed training job with Horovod, configure the job distribution:\n",
79 | "```python\n",
80 | "distributions = {'mpi': {\n",
81 | " 'enabled': True,\n",
82 | " 'processes_per_host': # Number of Horovod processes per host\n",
83 | " }\n",
84 | " }\n",
85 | "```\n",
86 | "\n",
87 | "Run the same job using 2 ml.p3.2xlarge instances (processes_per_host: 1). \n",
88 | "Add the distributions configuration."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "import os\n",
98 | "import sagemaker\n",
99 | "from sagemaker import get_execution_role\n",
100 | "\n",
101 | "sagemaker_session = sagemaker.Session()\n",
102 | "\n",
103 | "role = get_execution_role()"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "### Load the SageMaker experiment"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "from smexperiments.experiment import Experiment\n",
120 | "from smexperiments.trial import Trial\n",
121 | "import time\n",
122 | "cifar10_experiment = Experiment.load(\n",
123 | " experiment_name=\"TensorFlow-cifar10-experiment\")"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "# create a new trial\n",
133 | "trial_name = f\"cifar10-training-job-distributed-{int(time.time())}\"\n",
134 | "trial = Trial.create(\n",
135 | " trial_name=trial_name, \n",
136 | " experiment_name=cifar10_experiment.experiment_name\n",
137 | ")"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "outputs": [],
144 | "source": [
145 | "# Configure the dataset location variable\n",
146 | "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n",
147 | "display(dataset_location)"
148 | ],
149 | "metadata": {
150 | "collapsed": false,
151 | "pycharm": {
152 | "name": "#%%\n"
153 | }
154 | }
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "outputs": [],
160 | "source": [
161 | "metric_definitions = [\n",
162 | " {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\\\.]+) - acc: [0-9\\\\.]+'},\n",
163 | " {'Name': 'train:accuracy', 'Regex': 'loss: [0-9\\\\.]+ - acc: ([0-9\\\\.]+)'},\n",
164 | " {'Name': 'validation:accuracy', 'Regex': 'val_loss: [0-9\\\\.]+ - val_acc: ([0-9\\\\.]+)'},\n",
165 | " {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\\\.]+) - val_acc: [0-9\\\\.]+'},\n",
166 | "]"
167 | ],
168 | "metadata": {
169 | "collapsed": false,
170 | "pycharm": {
171 | "name": "#%%\n"
172 | }
173 | }
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "from sagemaker.tensorflow import TensorFlow\n",
182 | "# Change base_job_name to 'cifar10-dist' for console visibility\n",
183 | "# Remember to configure distributions = ...\n",
184 | "estimator = ... "
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "Connect the trial configured above to the job. Add the experiment config to the fit function:\n",
192 | "```python\n",
193 | "experiment_config={\n",
194 | " \"ExperimentName\": cifar10_experiment.experiment_name, \n",
195 | " \"TrialName\": trial.trial_name,\n",
196 | " \"TrialComponentDisplayName\": \"Training\"}\n",
197 | "```"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "estimator.fit({'train' : 'train_data_location',\n",
207 | " 'validation' : 'validation_data_location',\n",
208 | " 'eval' : 'eval_data_location'},\n",
209 | " experiment_config=)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "### Analyze the experiments"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "search_expression = {\n",
226 | " \"Filters\":[\n",
227 | " {\n",
228 | " \"Name\": \"DisplayName\",\n",
229 | " \"Operator\": \"Equals\",\n",
230 | " \"Value\": \"Training\",\n",
231 | " }\n",
232 | " ],\n",
233 | "}"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "import pandas as pd \n",
243 | "pd.options.display.max_columns = 500\n",
244 | "\n",
245 | "from sagemaker.analytics import ExperimentAnalytics\n",
246 | "trial_component_analytics = ExperimentAnalytics(\n",
247 | " sagemaker_session=sagemaker_session, \n",
248 | " experiment_name=cifar10_experiment.experiment_name,\n",
249 | " search_expression=search_expression\n",
250 | ")\n",
251 | "\n",
252 | "table = trial_component_analytics.dataframe(force_refresh=True)\n",
253 | "display(table)"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "**Good job!** \n",
261 | "You can now use SageMaker training jobs for distributed training.\n",
262 | "Before continuing to the next notebook, look at the distributed job metrics in CloudWatch and TensorBoard. \n",
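For reference, one possible configuration for the distributed exercise (a sketch; the instance count, instance type, and hyperparameters follow the instructions above):

```python
from sagemaker.tensorflow import TensorFlow

distributions = {'mpi': {'enabled': True,
                         'processes_per_host': 1}}  # one Horovod process per GPU host

estimator = TensorFlow(base_job_name='cifar10-dist',
                       entry_point='cifar10_keras_dist.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       metric_definitions=metric_definitions,
                       hyperparameters={'epochs': 20},
                       train_instance_count=2,
                       train_instance_type='ml.p3.2xlarge',
                       distributions=distributions)
```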
\n", 263 | "You can use TensorBoard to compare between the different jobs that you ran.\n", 264 | "Run TensorBoard with \n", 265 | "`--logdir dist:dist_model_dir,pipe:pipe_model_dir,file:normal_job_model_dir`" 266 | ] 267 | } 268 | ], 269 | "metadata": { 270 | "kernelspec": { 271 | "display_name": "conda_tensorflow_p36", 272 | "language": "python", 273 | "name": "conda_tensorflow_p36" 274 | }, 275 | "language_info": { 276 | "codemirror_mode": { 277 | "name": "ipython", 278 | "version": 3 279 | }, 280 | "file_extension": ".py", 281 | "mimetype": "text/x-python", 282 | "name": "python", 283 | "nbconvert_exporter": "python", 284 | "pygments_lexer": "ipython3", 285 | "version": "3.6.5" 286 | }, 287 | "pycharm": { 288 | "stem_cell": { 289 | "cell_type": "raw", 290 | "source": [], 291 | "metadata": { 292 | "collapsed": false 293 | } 294 | } 295 | } 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 4 299 | } -------------------------------------------------------------------------------- /4_Deploying_your_TensorFlow_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deploy trained Keras or TensorFlow models using Amazon SageMaker\n", 8 | "Amazon SageMaker makes it easier for any developer or data scientist to build, train, and deploy machine learning (ML) models. While it’s designed to alleviate the undifferentiated heavy lifting from the full life cycle of ML models, Amazon SageMaker’s capabilities can also be used independently of one another; that is, models trained in Amazon SageMaker can be optimized and deployed outside of Amazon SageMaker (or even out of the cloud on mobile or IoT devices at the edge). Conversely, Amazon SageMaker can deploy and host pre-trained models from model zoos, or other members of your team.\n", 9 | "\n", 10 | "Amazon SageMaker deploy your model to a TensorFlow Serving-based server. The server provides a super-set of the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest).\n", 11 | "\n", 12 | "To Deploy a Keras/TensorFlow Model, You'll need to save your model in a TensorFlow SavedModel format (Already implemented in the script). \n", 13 | "\n", 14 | "for More Info take a look at the ```def save_model(model, output):``` in the script" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Deploy a trained Model\n", 22 | "Instead of retraining the model, we'll start from one of the previously trained models." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "import os\n", 32 | "import sagemaker\n", 33 | "from sagemaker import get_execution_role\n", 34 | "\n", 35 | "sagemaker_session = sagemaker.Session()\n", 36 | "\n", 37 | "role = get_execution_role()" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Configure the estimator as it was configured in notebook: 1 - Monitoring your TensorFlow scripts" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "from sagemaker.tensorflow import TensorFlow\n", 54 | "estimator = ..." 
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "estimator = estimator.attach(training_job_name=) ## Configure with your previous cifar10 job name (first completed job). You can find that name in the SageMaker console."
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "## Make some predictions\n",
80 | "To verify that the endpoint functions properly, we generate random data in the correct shape and get a prediction."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "# Creating fake prediction data\n",
90 | "import numpy as np\n",
91 | "data = np.random.randn(1, 32, 32, 3)\n",
92 | "print(\"Predicted class is {}\".format(np.argmax(predictor.predict(data)['predictions'])))"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "### Calculating accuracy and creating a confusion matrix based on the test dataset\n",
100 | "\n",
101 | "Our endpoint works as expected. We'll now use the test dataset for predictions and calculate our model accuracy."
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "from keras.datasets import cifar10\n",
111 | "from keras.preprocessing.image import ImageDataGenerator\n",
112 | "from sklearn.metrics import confusion_matrix\n",
113 | "datagen = ImageDataGenerator()\n",
114 | "\n",
115 | "(x_train, y_train), (x_test, y_test) = cifar10.load_data()\n",
116 | "\n",
117 | "def predict(data):\n",
118 | " predictions = predictor.predict(data)['predictions']\n",
119 | " return predictions"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "batch_size = 128\n",
129 | "predicted = []\n",
130 | "actual = []\n",
131 | "batches = 0\n",
132 | "for data in datagen.flow(x_test,y_test,batch_size=batch_size):\n",
133 | " for i,prediction in enumerate(predict(data[0])):\n",
134 | " predicted.append(np.argmax(prediction))\n",
135 | " actual.append(data[1][i][0])\n",
136 | " batches += 1\n",
137 | " if batches >= len(x_test) / batch_size:\n",
138 | " break"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "from sklearn.metrics import accuracy_score, confusion_matrix\n",
148 | "\n",
149 | "accuracy = accuracy_score(y_pred=predicted,y_true=actual)\n",
150 | "display('Average accuracy: {}%'.format(round(accuracy*100,2)))"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "%matplotlib inline\n",
160 | "import seaborn as sn\n",
161 | "import pandas as pd\n",
162 | "import matplotlib.pyplot as plt\n",
163 | "\n",
164 | "cm = confusion_matrix(y_pred=predicted,y_true=actual)\n",
165 | "cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
166 | "sn.set(rc={'figure.figsize':(11.7,8.27)})\n",
167 | "sn.set(font_scale=1.4)  # label size\n",
168 | "sn.heatmap(cm, annot=True,annot_kws={\"size\": 10})  # font size"
size" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Using this heatmap we can calculate the accuracy of each one of the labels" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "# Cleaning up\n", 183 | "To avoid incurring charges to your AWS account for the resources used in this tutorial you need to delete the SageMaker Endpoint:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "sagemaker_session.delete_endpoint(predictor.endpoint)" 193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "conda_tensorflow_p36", 199 | "language": "python", 200 | "name": "conda_tensorflow_p36" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.6.5" 213 | }, 214 | "pycharm": { 215 | "stem_cell": { 216 | "cell_type": "raw", 217 | "source": [], 218 | "metadata": { 219 | "collapsed": false 220 | } 221 | } 222 | } 223 | }, 224 | "nbformat": 4, 225 | "nbformat_minor": 2 226 | } -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/issues), or [recently closed](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. 
You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Running your TensorFlow Models in SageMaker Workshop 2 | 3 | TensorFlow™ enables developers to quickly and easily get started with deep learning in the cloud. 4 | The framework has broad support in the industry and has become a popular choice for deep learning research and application development, particularly in areas such as computer vision, natural language understanding and speech translation. 5 | You can get started on AWS with a fully-managed TensorFlow experience with Amazon SageMaker, a platform to build, train, and deploy machine learning models at scale. 6 | 7 | ## Use Machine Learning Frameworks with Amazon SageMaker 8 | The Amazon SageMaker Python SDK provides open source APIs and containers that make it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks. For general information about the Amazon SageMaker Python SDK, see https://sagemaker.readthedocs.io/. 9 | 10 | You can use Amazon SageMaker to train and deploy a model using custom TensorFlow code. The Amazon SageMaker Python SDK TensorFlow estimators and models and the Amazon SageMaker open-source TensorFlow containers make writing a TensorFlow script and running it in Amazon SageMaker easier. 11 | 12 | In this workshop, you will port a working TensorFlow script to run on SageMaker and utilize some of the features available for TensorFlow in SageMaker. 13 | 14 | The workshop consists of 5 modules: 15 | 16 | 1. [Porting a TensorFlow script to run in SageMaker using SageMaker script mode.](0_Running_TensorFlow_In_SageMaker.ipynb) 17 | 2. [Monitoring your training job using TensorBoard and Amazon CloudWatch metrics.](1_Monitoring_your_TensorFlow_scripts.ipynb) 18 | 3. [Optimizing your training job using SageMaker pipemode input.](2_Using_Pipemode_input_for_big_datasets.ipynb) 19 | 4. [Running a distributed training job.](3_Distributed_training_with_Horovod.ipynb) 20 | 5. [Deploying your trained model on Amazon SageMaker.](4_Deploying_your_TensorFlow_model.ipynb) 21 | 22 | ## License Summary 23 | 24 | This sample code is made available under the MIT-0 license. See the LICENSE file. 25 | -------------------------------------------------------------------------------- /generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # https://aws.amazon.com/apache-2-0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 
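# A summary of what follows (derived from _get_file_names() and main() below): the script
# downloads CIFAR-10 and writes one TFRecords file per channel under --data-dir:
#   <data-dir>/train/train.tfrecords            (data_batch_1 .. data_batch_4)
#   <data-dir>/validation/validation.tfrecords  (data_batch_5)
#   <data-dir>/eval/eval.tfrecords              (test_batch)
# Each tf.train.Example holds an 'image' bytes feature and a 'label' int64 feature.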
13 | from __future__ import absolute_import 14 | 15 | import argparse 16 | import os 17 | import shutil 18 | import sys 19 | import tarfile 20 | 21 | import tensorflow as tf 22 | from six.moves import cPickle as pickle 23 | from six.moves import xrange 24 | 25 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 26 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 27 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 28 | 29 | 30 | def download_and_extract(data_dir): 31 | # Download CIFAR-10 if it is not already present. 32 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, CIFAR_DOWNLOAD_URL) 33 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 'r:gz').extractall(data_dir) 34 | 35 | 36 | def _int64_feature(value): 37 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 38 | 39 | 40 | def _bytes_feature(value): 41 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 42 | 43 | 44 | def _get_file_names(): 45 | """Returns the file names expected to exist in the input_dir.""" 46 | return { 47 | 'train': ['data_batch_%d' % i for i in xrange(1, 5)], # batches 1-4 (40,000 images) 48 | 'validation': ['data_batch_5'], # batch 5 (10,000 images) 49 | 'eval': ['test_batch'], # the held-out test batch (10,000 images) 50 | } 51 | 52 | 53 | def read_pickle_from_file(filename): 54 | with tf.io.gfile.GFile(filename, 'rb') as f: 55 | if sys.version_info.major >= 3: 56 | return pickle.load(f, encoding='bytes') 57 | else: 58 | return pickle.load(f) 59 | 60 | 61 | def convert_to_tfrecord(input_files, output_file): 62 | """Converts CIFAR-10 pickle batch files into a single TFRecords file.""" 63 | print('Generating %s' % output_file) 64 | with tf.io.TFRecordWriter(output_file) as record_writer: 65 | for input_file in input_files: 66 | data_dict = read_pickle_from_file(input_file) 67 | data = data_dict[b'data'] 68 | labels = data_dict[b'labels'] 69 | 70 | num_entries_in_batch = len(labels) 71 | for i in range(num_entries_in_batch): 72 | example = tf.train.Example(features=tf.train.Features( 73 | feature={ 74 | 'image': _bytes_feature(data[i].tobytes()), 75 | 'label': _int64_feature(labels[i]) 76 | })) 77 | record_writer.write(example.SerializeToString()) 78 | 79 | 80 | def main(data_dir): 81 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 82 | download_and_extract(data_dir) 83 | 84 | file_names = _get_file_names() 85 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 86 | for mode, files in file_names.items(): 87 | input_files = [os.path.join(input_dir, f) for f in files] 88 | 89 | mode_dir = os.path.join(data_dir, mode) 90 | output_file = os.path.join(mode_dir, mode + '.tfrecords') 91 | if not os.path.exists(mode_dir): 92 | os.makedirs(mode_dir) 93 | try: 94 | os.remove(output_file) 95 | except OSError: 96 | pass 97 | 98 | # Convert to tf.train.Example records and write them to TFRecords. 99 | convert_to_tfrecord(input_files, output_file) 100 | 101 | print('Done!') 102 | shutil.rmtree(os.path.join(data_dir, CIFAR_LOCAL_FOLDER)) 103 | os.remove(os.path.join(data_dir, CIFAR_FILENAME)) # Remove the original .tar.gz file 104 | 105 | 106 | if __name__ == '__main__': 107 | parser = argparse.ArgumentParser() 108 | parser.add_argument( 109 | '--data-dir', 110 | type=str, 111 | default='', 112 | help='Directory to download and extract CIFAR-10 to.' 
113 | ) 114 | 115 | args = parser.parse_args() 116 | main(args.data_dir) 117 | -------------------------------------------------------------------------------- /training_script/cifar10_keras.py: -------------------------------------------------------------------------------- 1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import argparse 21 | import logging 22 | import os 23 | 24 | from keras.callbacks import ModelCheckpoint 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 26 | from keras.models import Sequential 27 | from keras.optimizers import Adam, SGD, RMSprop 28 | import tensorflow as tf 29 | from keras import backend as K 30 | 31 | sess = tf.Session() 32 | K.set_session(sess) 33 | 34 | logging.getLogger().setLevel(logging.INFO) 35 | 36 | HEIGHT = 32 37 | WIDTH = 32 38 | DEPTH = 3 39 | NUM_CLASSES = 10 40 | NUM_DATA_BATCHES = 5 41 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 42 | INPUT_TENSOR_NAME = 'inputs_input' # needs to match the name of the first layer + "_input" 43 | 44 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum): 45 | """keras_model_fn receives hyperparameters from the training job and returns a compiled Keras model. 46 | The model is trained with Keras and saved in the 47 | TensorFlow Serving SavedModel format at the end of training. 48 | 49 | Args: 50 | learning_rate, weight_decay, optimizer, momentum: the hyperparameters passed to the SageMaker 51 | training job that runs this training script. 
52 | Returns: A compiled Keras model 53 | """ 54 | model = Sequential() 55 | model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH))) 56 | model.add(BatchNormalization()) 57 | model.add(Activation('relu')) 58 | model.add(Conv2D(32, (3, 3))) 59 | model.add(BatchNormalization()) 60 | model.add(Activation('relu')) 61 | model.add(MaxPooling2D(pool_size=(2, 2))) 62 | model.add(Dropout(0.2)) 63 | 64 | model.add(Conv2D(64, (3, 3), padding='same')) 65 | model.add(BatchNormalization()) 66 | model.add(Activation('relu')) 67 | model.add(Conv2D(64, (3, 3))) 68 | model.add(BatchNormalization()) 69 | model.add(Activation('relu')) 70 | model.add(MaxPooling2D(pool_size=(2, 2))) 71 | model.add(Dropout(0.3)) 72 | 73 | model.add(Conv2D(128, (3, 3), padding='same')) 74 | model.add(BatchNormalization()) 75 | model.add(Activation('relu')) 76 | model.add(Conv2D(128, (3, 3))) 77 | model.add(BatchNormalization()) 78 | model.add(Activation('relu')) 79 | model.add(MaxPooling2D(pool_size=(2, 2))) 80 | model.add(Dropout(0.4)) 81 | 82 | model.add(Flatten()) 83 | model.add(Dense(512)) 84 | model.add(Activation('relu')) 85 | model.add(Dropout(0.5)) 86 | model.add(Dense(NUM_CLASSES)) 87 | model.add(Activation('softmax')) 88 | 89 | size = 1 90 | 91 | if optimizer.lower() == 'sgd': 92 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 93 | elif optimizer.lower() == 'rmsprop': 94 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 95 | else: 96 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 97 | 98 | model.compile(loss='categorical_crossentropy', 99 | optimizer=opt, 100 | metrics=['accuracy']) 101 | return model 102 | 103 | 104 | def get_filenames(channel_name, channel): 105 | if channel_name in ['train', 'validation', 'eval']: 106 | return [os.path.join(channel, channel_name + '.tfrecords')] 107 | else: 108 | raise ValueError('Invalid data subset "%s"' % channel_name) 109 | 110 | 111 | def train_input_fn(): 112 | return _input(args.epochs, args.batch_size, args.train, 'train') 113 | 114 | 115 | def eval_input_fn(): 116 | return _input(args.epochs, args.batch_size, args.eval, 'eval') 117 | 118 | 119 | def validation_input_fn(): 120 | return _input(args.epochs, args.batch_size, args.validation, 'validation') 121 | 122 | 123 | def _input(epochs, batch_size, channel, channel_name): 124 | 125 | filenames = get_filenames(channel_name, channel) 126 | dataset = tf.data.TFRecordDataset(filenames) 127 | 128 | dataset = dataset.repeat(epochs) 129 | dataset = dataset.prefetch(10) 130 | 131 | # Parse records. 132 | dataset = dataset.map( 133 | _dataset_parser, num_parallel_calls=10) 134 | 135 | # Potentially shuffle records. 136 | if channel_name == 'train': 137 | # Ensure that the capacity is sufficiently large to provide good random 138 | # shuffling. 139 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 140 | dataset = dataset.shuffle(buffer_size=buffer_size) 141 | 142 | # Batch it up. 143 | dataset = dataset.batch(batch_size, drop_remainder=True) 144 | iterator = dataset.make_one_shot_iterator() 145 | image_batch, label_batch = iterator.get_next() 146 | 147 | return {INPUT_TENSOR_NAME: image_batch}, label_batch 148 | 149 | 150 | def _train_preprocess_fn(image): 151 | """Preprocess a single training image of layout [height, width, depth].""" 152 | # Resize the image to add four extra pixels on each side. 
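# (Added note: HEIGHT + 8 = WIDTH + 8 = 40, i.e. 4 pixels of zero padding per side; the
# random 32x32 crop below then shifts the image content by up to 8 pixels, a standard
# CIFAR-10 augmentation.)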
153 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 154 | 155 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 156 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 157 | 158 | # Randomly flip the image horizontally. 159 | image = tf.image.random_flip_left_right(image) 160 | 161 | return image 162 | 163 | 164 | def _dataset_parser(value): 165 | """Parse a CIFAR-10 record from value.""" 166 | featdef = { 167 | 'image': tf.FixedLenFeature([], tf.string), 168 | 'label': tf.FixedLenFeature([], tf.int64), 169 | } 170 | 171 | example = tf.parse_single_example(value, featdef) 172 | image = tf.decode_raw(example['image'], tf.uint8) 173 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 174 | 175 | # Reshape from [depth * height * width] to [height, width, depth] and cast to float. 176 | image = tf.cast( 177 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 178 | tf.float32) 179 | label = tf.cast(example['label'], tf.int32) 180 | image = _train_preprocess_fn(image) # NOTE: this applies the random augmentation to every channel, including validation and eval 181 | return image, tf.one_hot(label, NUM_CLASSES) 182 | 183 | def save_model(model, output): 184 | signature = tf.saved_model.signature_def_utils.predict_signature_def( 185 | inputs={'inputs': model.input}, outputs={'scores': model.output}) 186 | 187 | builder = tf.saved_model.builder.SavedModelBuilder(output + '/1/') # TensorFlow Serving expects a numeric version subdirectory 188 | builder.add_meta_graph_and_variables( 189 | sess=K.get_session(), 190 | tags=[tf.saved_model.tag_constants.SERVING], 191 | signature_def_map={"serving_default": signature}) 192 | builder.save() 193 | 194 | logging.info("Model successfully saved at: {}".format(output)) 195 | return 196 | 197 | def main(args): 198 | logging.info("getting data") 199 | train_dataset = train_input_fn() 200 | eval_dataset = eval_input_fn() 201 | validation_dataset = validation_input_fn() 202 | 203 | logging.info("configuring model") 204 | model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum) 205 | checkpoint = ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5') 206 | 207 | logging.info("Starting training") 208 | model.fit(x=train_dataset[0], y=train_dataset[1], 209 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size), 210 | epochs=args.epochs, validation_data=validation_dataset, 211 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size), callbacks=[checkpoint]) 212 | 213 | score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=num_examples_per_epoch('eval') // args.batch_size, 214 | verbose=0) 215 | 216 | logging.info('Test loss: {}'.format(score[0])) 217 | logging.info('Test accuracy: {}'.format(score[1])) 218 | 219 | return save_model(model, args.model_dir) 220 | 221 | def num_examples_per_epoch(subset='train'): 222 | if subset == 'train': 223 | return 40000 224 | elif subset == 'validation': 225 | return 10000 226 | elif subset == 'eval': 227 | return 10000 228 | else: 229 | raise ValueError('Invalid data subset "%s"' % subset) 230 | 231 | 232 | if __name__ == '__main__': 233 | parser = argparse.ArgumentParser() 234 | parser.add_argument( 235 | '--train', 236 | type=str, 237 | required=False, 238 | help='The directory where the CIFAR-10 training data is stored.') 239 | parser.add_argument( 240 | '--validation', 241 | type=str, 242 | required=False, 243 | help='The directory where the CIFAR-10 validation data is stored.') 244 | parser.add_argument( 245 | '--eval', 246 | type=str, 247 | required=False, 248 | help='The directory where the CIFAR-10 evaluation data is stored.') 249 | parser.add_argument( 250 | '--model_dir', 
251 | type=str, 252 | required=True, 253 | help='The directory where the model will be stored.') 254 | parser.add_argument( 255 | '--weight-decay', 256 | type=float, 257 | default=2e-4, 258 | help='Decay value passed to the optimizer (see keras_model_fn).') 259 | parser.add_argument( 260 | '--learning-rate', 261 | type=float, 262 | default=0.001, 263 | help="""\ 264 | This is the initial learning rate value. The learning rate will decrease 265 | during training. For more details check the keras_model_fn implementation in 266 | this file.\ 267 | """) 268 | parser.add_argument( 269 | '--epochs', 270 | type=int, 271 | default=10, 272 | help='The number of epochs to use for training.') 273 | parser.add_argument( 274 | '--batch-size', 275 | type=int, 276 | default=128, 277 | help='Batch size for training.') 278 | parser.add_argument( 279 | '--optimizer', 280 | type=str, 281 | default='adam', help='Optimizer to use: adam, sgd, or rmsprop.') 282 | parser.add_argument( 283 | '--momentum', 284 | type=float, 285 | default=0.9, help='Momentum for the SGD optimizer.') 286 | args = parser.parse_args() 287 | main(args) 288 | --------------------------------------------------------------------------------