├── .gitignore ├── 01_groundtruth(optional) └── README.md ├── 02_preprocessing ├── README.md ├── data_preprocessing.ipynb ├── preprocessing.py └── statics │ └── birds.png ├── 03_training ├── README.md ├── pytorch │ ├── cv_hpo_pytorch_pipe.ipynb │ └── source_dir │ │ └── cifar10.py ├── statics │ ├── CIFAR-10.png │ ├── Experiments.png │ └── HPO_experiments.png └── tensorflow │ ├── cv_hpo_keras_pipe.ipynb │ └── source_dir │ └── keras_cifar10.py ├── 04_advanced_training ├── README.md ├── pytorch │ ├── code │ │ ├── inference.py │ │ ├── model_def.py │ │ └── train_pytorch_smdataparallel_mnist.py │ ├── pytorch_smdataparallel_mnist_demo.ipynb │ └── static │ │ └── ss.png └── tensorflow │ ├── code │ ├── train_param_server_debugger.py │ ├── train_param_server_debugger2.py │ └── train_sdp_debugger_3.py │ ├── distributed_training_tf.ipynb │ └── static │ └── debugger_output.png ├── 05_model_evaluation_and_model_explainability ├── 05a_model_evaluation │ ├── code │ │ ├── inference.py │ │ ├── requirements-gpu.txt │ │ ├── requirements.txt │ │ └── train-mobilenet.py │ ├── docker │ │ ├── Dockerfile │ │ └── requirements.txt │ ├── evaluation.py │ ├── model-evaluation-processing-job.ipynb │ └── optional-prepare-data-and-model.ipynb ├── 05b_model_explainability │ ├── code │ │ ├── inference.py │ │ ├── requirements-gpu.txt │ │ ├── requirements.txt │ │ └── train-mobilenet.py │ ├── optional-prepare-data-and-model-explainability-clarify.ipynb │ └── preprocessing.py └── README.md ├── 06_training_pipeline ├── README.md ├── code │ ├── inference.py │ ├── requirements-gpu.txt │ ├── requirements.txt │ └── train-mobilenet.py ├── evaluation.py ├── lambda │ ├── index.py │ └── lambda.zip ├── pipeline.ipynb ├── pipeline.py ├── preprocess.py ├── sagemaker-project.yaml └── statics │ ├── birds.png │ ├── compare-model.png │ ├── confussion_matrix.png │ ├── cost-explore.png │ ├── cv-training-pipeline.png │ ├── execute-pipeline.png │ ├── parameters-input.png │ ├── project-parameter.png │ ├── project-template.png │ └── studio-ui-pipeline.png ├── 07_deployment ├── README.md ├── cloud_deployment │ ├── cv_utils.py │ ├── sagemaker-deploy-model-for-inference.ipynb │ └── statics │ │ └── active-sagemaker-endpoints.png ├── edge_deployment │ ├── README.md │ ├── arm64-edge-manager-greengrass-v2.ipynb │ ├── build │ │ ├── image_classification │ │ │ ├── agent_pb2.py │ │ │ ├── agent_pb2_grpc.py │ │ │ └── inference.py │ │ └── installer.sh │ └── static │ │ ├── 1_select_AMI.png │ │ ├── 2_select_instance.png │ │ ├── 4_deploy_greengrass.png │ │ ├── 5_deployment_dashboard.png │ │ ├── 6_monitoring_core_device.png │ │ └── x_deployment_error.png └── optional-prepare-data-and-model.ipynb ├── 08_end-to-end ├── 1_training │ ├── 0.bird-end2end-ml-pipeline.ipynb │ ├── README.md │ ├── __init__.py │ ├── input.manifest │ ├── pipeline │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-37.pyc │ │ │ ├── pipeline.cpython-37.pyc │ │ │ └── pipeline_tuning.cpython-37.pyc │ │ ├── code │ │ │ ├── __init__.py │ │ │ ├── train.py │ │ │ └── train_debugger.py │ │ ├── evaluation.py │ │ ├── pipeline.py │ │ ├── pipeline_tuning.py │ │ └── preprocess.py │ └── statics │ │ ├── birds.png │ │ ├── debugger_access.png │ │ ├── debugger_node_profiler.png │ │ ├── debugger_profiler.png │ │ ├── debugger_summary.png │ │ ├── end2end.png │ │ ├── execute-pipeline.png │ │ ├── manual_approval.png │ │ ├── model_registry.png │ │ ├── panorama_test.png │ │ └── test_util_folder.png ├── 2_deployment │ ├── bird_demo.ipynb │ └── bird_demo_app │ │ ├── graphs │ │ └── bird_demo_app │ │ │ └── 
graph.json │ │ └── packages │ │ ├── .keep │ │ ├── 987720697751-BIRD_DEMO_CODE-1.0 │ │ ├── Dockerfile │ │ ├── descriptor.json │ │ ├── package.json │ │ └── src │ │ │ ├── .keep │ │ │ └── app.py │ │ ├── 987720697751-BIRD_DEMO_TF_MODEL-1.0 │ │ ├── descriptor.json │ │ └── package.json │ │ └── 987720697751-RTSP_STREAM-1.0 │ │ └── package.json └── README.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── statics ├── cv-workshop-overview.png └── workshop_overview.png /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .ipynb_checkpoints 3 | inference-test-data/ 4 | *.mp4 5 | *.jpg 6 | -------------------------------------------------------------------------------- /01_groundtruth(optional)/README.md: -------------------------------------------------------------------------------- 1 | ## Data Labeling with SageMaker Ground Truth 2 | 3 | SageMaker Ground Truth (SMGT) is a fully managed data labeling service that lets you launch a labeling job with just a few clicks in the console or a single AWS SDK API call. It provides 30+ labeling workflows for computer vision and NLP use cases, and also allows you to tap into different workforce options. 4 | 5 | ![SMGT](https://docs.aws.amazon.com/sagemaker/latest/dg/images/image-classification-example.png) 6 | 7 | This module is optional. If you want to get some hands-on labeling practice, here are all the [SMGT Examples](https://github.com/aws/amazon-sagemaker-examples/tree/main/ground_truth_labeling_jobs). 8 | 9 | For Computer Vision (CV) specific examples, the [Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification.ipynb) and [Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/2c2cd35c8ed389e638fe5c912e24cd00d0874874/ground_truth_labeling_jobs/ground_truth_object_detection_tutorial/object_detection_tutorial.ipynb) tutorials are good starting points. 10 | 11 | ## Prerequisites 12 | 13 | To run this notebook, simply download the provided Jupyter notebook and execute each cell one by one. To understand what's happening, you'll need: 14 | 15 | - An S3 bucket you can write to -- please provide its name in the following cell. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the EXP_NAME to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket. 16 | - The S3 bucket that you use for this demo must have a CORS policy attached. To learn more about this requirement, and how to attach a CORS policy to an S3 bucket, see CORS Permission Requirement (a minimal example is sketched after this list). 17 | - Familiarity with Python and numpy. 18 | - Basic familiarity with Amazon S3. 19 | - Basic understanding of Amazon SageMaker. 20 | - Basic familiarity with AWS Command Line Interface (CLI) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.
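The CORS prerequisite above can be satisfied with a short boto3 call. The sketch below is illustrative rather than the notebook's own code: the bucket name is a placeholder, and you should confirm the exact rule set against the CORS Permission Requirement documentation for your labeling task type.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder -- replace with the bucket you plan to use for the labeling job.
bucket = "your-labeling-bucket"

# A minimal CORS rule commonly used for Ground Truth labeling jobs:
# allow GET from any origin and expose the Access-Control-Allow-Origin header.
s3.put_bucket_cors(
    Bucket=bucket,
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    },
)

# Verify what is now attached to the bucket.
print(s3.get_bucket_cors(Bucket=bucket)["CORSRules"])
```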
-------------------------------------------------------------------------------- /02_preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Data Preparation using SageMaker Processing 2 | 3 | Amazon SageMaker Processing is a managed solution for running data processing and model evaluation workloads. Data processing tasks such as feature engineering, data validation, model evaluation, and model interpretation are essential steps performed by engineers and data scientists in the machine learning workflow. 4 | 5 | Your custom processing scripts are containerized and run on infrastructure that is created on demand and terminated automatically. You can choose SageMaker-optimized containers for popular data processing and model evaluation frameworks such as scikit-learn and PySpark, or Bring Your Own Container (BYOC). 6 | 7 | ![Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png) 8 | 9 | --- 10 | # Introduction 11 | Preprocessing data before model training is an important step in the overall MLOps process. In this lab you will learn how to use [SKLearnProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html), a type of SageMaker Processing job that runs scikit-learn scripts in a container image provided and maintained by SageMaker, to preprocess data or evaluate models. 12 | 13 | The example script will first load the bird dataset, then split the data into train, validation, and test channels, and finally export the data and annotation files to S3. 14 | 15 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 16 | 17 | --- 18 | ## Prerequisites 19 | 20 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 21 | 22 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 23 | - Familiarity with Python and numpy. 24 | - Basic familiarity with Amazon S3. 25 | - Basic understanding of Amazon SageMaker. 26 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 27 | - SageMaker Studio is preferred for the full UI integration. 28 | 29 | --- 30 | 31 | ## Dataset 32 | We are using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 33 | 34 | ![Bird Dataset](statics/birds.png) 35 | 36 | Run the notebook to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-downloading and re-processing the data. 37 | 38 | --- 39 | 40 | # Review Outputs 41 | 42 | At the end of the lab, your dataset will be randomly split into train, valid, and test folders. You will also have a CSV manifest file for each channel. **If you plan to complete other modules in this workshop, please keep this data.
Otherwise, you can clean up after this lab.** -------------------------------------------------------------------------------- /02_preprocessing/preprocessing.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | 65 | full_df.reset_index(inplace=True, drop=True) 66 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 67 | 68 | # grab a small subset of species for testing 69 | criteria = full_df['orig_class_id'].isin(classes) 70 | full_df = full_df[criteria] 71 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 72 | 73 | unique_classes = full_df['orig_class_id'].drop_duplicates() 74 | sorted_unique_classes = sorted(unique_classes) 75 | id_to_one_based = {} 76 | i = 1 77 | for c in sorted_unique_classes: 78 | id_to_one_based[c] = str(i) 79 | i += 1 80 | 81 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 82 | full_df.reset_index(inplace=True, drop=True) 83 | 84 | def get_class_name(fn): 85 | return fn.split('/')[0] 86 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 87 | full_df = 
full_df.drop(['image_pretty_name'], axis=1) 88 | 89 | train_df = [] 90 | test_df = [] 91 | val_df = [] 92 | 93 | # split into training and validation sets 94 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 95 | 96 | print('num images total: ' + str(images_df.shape[0])) 97 | print('\nnum train: ' + str(train_df.shape[0])) 98 | print('num val: ' + str(val_df.shape[0])) 99 | print('num test: ' + str(test_df.shape[0])) 100 | return train_df, val_df, test_df 101 | 102 | # this function copy images by channel to its destination folder 103 | def copy_files_for_channel(df, channel_name, verbose=False): 104 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 105 | for i in range(df.shape[0]): 106 | target_fname = df.iloc[i]['image_file_name'] 107 | # if verbose: 108 | # print(target_fname) 109 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 110 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 111 | shutil.copyfile(src, dst) 112 | 113 | if __name__ == "__main__": 114 | parser = argparse.ArgumentParser() 115 | parser.add_argument("--classes", type=str, default="") 116 | parser.add_argument("--input-data", type=str, default="classes.txt") 117 | args, _ = parser.parse_known_args() 118 | 119 | 120 | c_list = args.classes.split(',') 121 | input_data = args.input_data 122 | 123 | CLASSES_FILE = os.path.join(input_path, input_data) 124 | 125 | CLASS_COLS = ['class_number','class_id'] 126 | 127 | if len(c_list)==0: 128 | # Otherwise, you can use the full set of species 129 | CLASSES = [] 130 | for c in range(200): 131 | CLASSES += [c + 1] 132 | prefix = prefix + '-full' 133 | else: 134 | CLASSES = list(map(int, c_list)) 135 | 136 | 137 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 138 | 139 | criteria = classes_df['class_number'].isin(CLASSES) 140 | classes_df = classes_df[criteria] 141 | 142 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 143 | print(class_name_list) 144 | 145 | 146 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, SPLIT_RATIOS) 147 | 148 | for c in class_name_list: 149 | os.mkdir('{}/{}/{}'.format(output_path, 'valid', c)) 150 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 151 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 152 | 153 | copy_files_for_channel(val_df, 'valid') 154 | copy_files_for_channel(test_df, 'test') 155 | copy_files_for_channel(train_df, 'train') 156 | 157 | # export manifest file for validation 158 | train_m_file = "{}/manifest/train.csv".format(output_path) 159 | train_df.to_csv(train_m_file, index=False) 160 | test_m_file = "{}/manifest/test.csv".format(output_path) 161 | test_df.to_csv(test_m_file, index=False) 162 | val_m_file = "{}/manifest/valid.csv".format(output_path) 163 | val_df.to_csv(val_m_file, index=False) 164 | 165 | print("Finished running processing job") 166 | -------------------------------------------------------------------------------- /02_preprocessing/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/02_preprocessing/statics/birds.png -------------------------------------------------------------------------------- /03_training/README.md: -------------------------------------------------------------------------------- 1 | ## Training a CV model on 
Amazon SageMaker 2 | 3 | Training models is easy on Amazon SageMaker. You simply specify the location of your data in Amazon S3, define the type and number of ML instances you need, and start training a model with just a few lines of code. Amazon SageMaker sets up a distributed compute cluster, performs the training, outputs the result to Amazon S3, and tears down the cluster when complete. 4 | 5 | --- 6 | ## Introduction 7 | This module is focused on SageMaker training for Computer Vision models. You will go through examples of Bring Your Own (BYO) script training, hyperparameter tuning, and experiment tracking. In future modules, you will see how experiment tracking can be automated through SageMaker Pipelines' native integration. 8 | 9 | At the end of this lab, you should have hands-on experience 1) training custom CV models on Amazon SageMaker, 2) building automatic model tuning jobs, and 3) organizing your ML experiments. 10 | 11 | ### Keras 12 | The model used for the tensorflow/cv_hpo_keras_pipe notebook is a simple deep CNN that is based on the [Keras examples](https://www.tensorflow.org/tutorials/images/cnn). 13 | 14 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 15 | 16 | ### PyTorch 17 | The model used for the pytorch/cv_hpo_pytorch_pipe notebook is a simple deep CNN that is based on the [PyTorch CNN example](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_cnn_cifar10/source/cifar10.py). 18 | 19 | **Note: This notebook was tested on the 'Python3 (Pytorch 1.8, Python 3.6)' kernel in SageMaker Studio.** 20 | 21 | --- 22 | ## Prerequisites 23 | 24 | To get started, download the provided Jupyter notebook and associated files to your SageMaker Studio environment. To run the notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 25 | 26 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 27 | - Familiarity with Python and numpy. 28 | - Basic familiarity with Amazon S3. 29 | - Basic understanding of Amazon SageMaker. 30 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 31 | - SageMaker Studio is preferred for the full UI integration. 32 | 33 | --- 34 | 35 | ## Dataset 36 | 37 | We are using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 38 | 39 | ![Bird Dataset](statics/birds.png) 40 | 41 | The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). The classes in the dataset, along with 10 random images from each, are shown in the preview image below (after the training sketch).
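To make the BYO-script and tuning flow concrete, here is a minimal sketch of how a script such as `tensorflow/source_dir/keras_cifar10.py` can be wrapped in a SageMaker TensorFlow estimator and an Automatic Model Tuning job. This is not the exact notebook code: the execution role, S3 channel prefixes, instance type, and framework/Python versions below are assumptions you would adapt to your environment.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, HyperparameterTuner

role = sagemaker.get_execution_role()          # assumes a Studio/notebook execution role
bucket = sagemaker.Session().default_bucket()  # SageMaker default bucket, per the prerequisites

# Matches the "Test accuracy:<value>" line logged by keras_cifar10.py.
metric_definitions = [{"Name": "test_accuracy", "Regex": "Test accuracy:([0-9\\.]+)"}]

estimator = TensorFlow(
    entry_point="keras_cifar10.py",
    source_dir="tensorflow/source_dir",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",   # assumption: any GPU training instance works
    framework_version="2.4.1",       # assumption: match the version the notebook pins
    py_version="py37",
    hyperparameters={"epochs": 10, "batch-size": 128},
    metric_definitions=metric_definitions,
)

# Tune hyperparameters that the script already exposes as CLI arguments.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="test_accuracy",
    hyperparameter_ranges={
        "learning-rate": ContinuousParameter(1e-4, 1e-1),
        "optimizer": CategoricalParameter(["sgd", "adam", "rmsprop"]),
    },
    metric_definitions=metric_definitions,
    objective_type="Maximize",
    max_jobs=4,
    max_parallel_jobs=2,
)

# keras_cifar10.py reads SM_CHANNEL_TRAIN / SM_CHANNEL_VALIDATION, so the
# channel names must be "train" and "validation"; the prefixes are assumed.
tuner.fit({
    "train": f"s3://{bucket}/cv_keras_cifar10/train",
    "validation": f"s3://{bucket}/cv_keras_cifar10/validation",
})
```

The PyTorch notebook follows the same pattern with the `sagemaker.pytorch.PyTorch` estimator, and the resulting training jobs can be compared in SageMaker Experiments, as this module's experiment-tracking examples show.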
42 | 43 | ![cifar10](statics/CIFAR-10.png) 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /03_training/pytorch/source_dir/cifar10.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import os 4 | 5 | import torch 6 | import torch.distributed as dist 7 | import torch.nn as nn 8 | import torch.nn.functional as F 9 | import torch.nn.parallel 10 | import torch.optim 11 | import torch.utils.data 12 | import torch.utils.data.distributed 13 | import torchvision 14 | import torchvision.models 15 | import torchvision.transforms as transforms 16 | 17 | try: 18 | from sagemaker_inference import environment 19 | except: 20 | from sagemaker_training import environment 21 | 22 | logger = logging.getLogger() 23 | logger.setLevel(logging.DEBUG) 24 | 25 | classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck") 26 | 27 | 28 | # https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/cifar10_tutorial.py#L118 29 | class Net(nn.Module): 30 | def __init__(self): 31 | super(Net, self).__init__() 32 | self.conv1 = nn.Conv2d(3, 6, 5) 33 | self.pool = nn.MaxPool2d(2, 2) 34 | self.conv2 = nn.Conv2d(6, 16, 5) 35 | self.fc1 = nn.Linear(16 * 5 * 5, 120) 36 | self.fc2 = nn.Linear(120, 84) 37 | self.fc3 = nn.Linear(84, 10) 38 | 39 | def forward(self, x): 40 | x = self.pool(F.relu(self.conv1(x))) 41 | x = self.pool(F.relu(self.conv2(x))) 42 | x = x.view(-1, 16 * 5 * 5) 43 | x = F.relu(self.fc1(x)) 44 | x = F.relu(self.fc2(x)) 45 | x = self.fc3(x) 46 | return x 47 | 48 | def validation_step(self, batch): 49 | images, labels = batch 50 | out = self(images) # Generate predictions 51 | loss = F.cross_entropy(out, labels) # Calculate loss 52 | acc = accuracy(out, labels) # Calculate accuracy 53 | return {'val_loss': loss.detach(), 'val_acc': acc} 54 | 55 | def validation_epoch_end(self, outputs): 56 | batch_losses = [x['val_loss'] for x in outputs] 57 | epoch_loss = torch.stack(batch_losses).mean() # Combine losses 58 | batch_accs = [x['val_acc'] for x in outputs] 59 | epoch_acc = torch.stack(batch_accs).mean() # Combine accuracies 60 | logging.info("val_loss: {:.4f}, val_acc: {:4f}".format(epoch_loss.item(), epoch_acc.item())) 61 | print("val_loss: {:.4f}, val_acc: {:4f}".format(epoch_loss.item(), epoch_acc.item())) 62 | return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()} 63 | 64 | def epoch_end(self, epoch, result): 65 | print('epoch ended') 66 | print("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format( 67 | epoch, result['train_loss'], result['val_loss'], result['val_acc'])) 68 | logging.info("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format( 69 | epoch, result['train_loss'], result['val_loss'], result['val_acc'])) 70 | 71 | def accuracy(outputs, labels): 72 | _, preds = torch.max(outputs, dim=1) 73 | return torch.tensor(torch.sum(preds == labels).item() / len(preds)) 74 | 75 | def _train(args): 76 | is_distributed = len(args.hosts) > 1 and args.dist_backend is not None 77 | logger.debug("Distributed training - {}".format(is_distributed)) 78 | 79 | if is_distributed: 80 | # Initialize the distributed environment. 
81 | world_size = len(args.hosts) 82 | os.environ["WORLD_SIZE"] = str(world_size) 83 | host_rank = args.hosts.index(args.current_host) 84 | os.environ["RANK"] = str(host_rank) 85 | dist.init_process_group(backend=args.dist_backend, rank=host_rank, world_size=world_size) 86 | logger.info( 87 | "Initialized the distributed environment: '{}' backend on {} nodes. ".format( 88 | args.dist_backend, dist.get_world_size() 89 | ) 90 | + "Current host rank is {}. Using cuda: {}. Number of gpus: {}".format( 91 | dist.get_rank(), torch.cuda.is_available(), args.num_gpus 92 | ) 93 | ) 94 | 95 | device = "cuda" if torch.cuda.is_available() else "cpu" 96 | logger.info("Device Type: {}".format(device)) 97 | 98 | logger.info("Loading Cifar10 dataset") 99 | transform = transforms.Compose( 100 | [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))] 101 | ) 102 | 103 | trainset = torchvision.datasets.CIFAR10( 104 | root=args.data_dir, train=True, download=False, transform=transform 105 | ) 106 | train_loader = torch.utils.data.DataLoader( 107 | trainset, batch_size=args.batch_size, shuffle=True, num_workers=args.workers 108 | ) 109 | 110 | testset = torchvision.datasets.CIFAR10( 111 | root=args.data_dir, train=False, download=False, transform=transform 112 | ) 113 | test_loader = torch.utils.data.DataLoader( 114 | testset, batch_size=args.batch_size, shuffle=False, num_workers=args.workers 115 | ) 116 | 117 | logger.info("Model loaded") 118 | model = Net() 119 | 120 | if torch.cuda.device_count() > 1: 121 | logger.info("Gpu count: {}".format(torch.cuda.device_count())) 122 | model = nn.DataParallel(model) 123 | 124 | model = model.to(device) 125 | 126 | criterion = nn.CrossEntropyLoss().to(device) 127 | optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum) 128 | 129 | for epoch in range(0, args.epochs): 130 | running_loss = 0.0 131 | train_losses = [] 132 | for i, data in enumerate(train_loader): 133 | # get the inputs 134 | inputs, labels = data 135 | inputs, labels = inputs.to(device), labels.to(device) 136 | 137 | # zero the parameter gradients 138 | optimizer.zero_grad() 139 | 140 | # forward + backward + optimize 141 | outputs = model(inputs) 142 | loss = criterion(outputs, labels) 143 | train_losses.append(loss) 144 | loss.backward() 145 | optimizer.step() 146 | 147 | # print statistics 148 | running_loss += loss.item() 149 | if i % 2000 == 1999: # print every 2000 mini-batches 150 | print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1, running_loss / 2000)) 151 | running_loss = 0.0 152 | 153 | # Validation phase 154 | result = evaluate(model, test_loader) 155 | result['train_loss'] = torch.stack(train_losses).mean().item() 156 | model.epoch_end(epoch, result) 157 | 158 | print("Finished Training") 159 | 160 | print(model.eval()) 161 | outputs = [model.validation_step(batch) for batch in test_loader] 162 | 163 | print(model.validation_epoch_end(outputs)) 164 | 165 | return _save_model(model, args.model_dir) 166 | 167 | def evaluate(model, val_loader): 168 | model.eval() 169 | outputs = [model.validation_step(batch) for batch in val_loader] 170 | return model.validation_epoch_end(outputs) 171 | 172 | def _save_model(model, model_dir): 173 | logger.info("Saving the model.") 174 | path = os.path.join(model_dir, "model.pth") 175 | # recommended way from http://pytorch.org/docs/master/notes/serialization.html 176 | torch.save(model.cpu().state_dict(), path) 177 | 178 | 179 | def model_fn(model_dir): 180 | logger.info("model_fn") 181 | device = "cuda" if 
torch.cuda.is_available() else "cpu" 182 | model = Net() 183 | if torch.cuda.device_count() > 1: 184 | logger.info("Gpu count: {}".format(torch.cuda.device_count())) 185 | model = nn.DataParallel(model) 186 | 187 | with open(os.path.join(model_dir, "model.pth"), "rb") as f: 188 | model.load_state_dict(torch.load(f)) 189 | return model.to(device) 190 | 191 | 192 | if __name__ == "__main__": 193 | parser = argparse.ArgumentParser() 194 | 195 | parser.add_argument( 196 | "--workers", 197 | type=int, 198 | default=2, 199 | metavar="W", 200 | help="number of data loading workers (default: 2)", 201 | ) 202 | parser.add_argument( 203 | "--epochs", 204 | type=int, 205 | default=2, 206 | metavar="E", 207 | help="number of total epochs to run (default: 2)", 208 | ) 209 | parser.add_argument( 210 | "--batch_size", type=int, default=4, metavar="BS", help="batch size (default: 4)" 211 | ) 212 | parser.add_argument( 213 | "--lr", 214 | type=float, 215 | default=0.001, 216 | metavar="LR", 217 | help="initial learning rate (default: 0.001)", 218 | ) 219 | parser.add_argument( 220 | "--momentum", type=float, default=0.9, metavar="M", help="momentum (default: 0.9)" 221 | ) 222 | parser.add_argument( 223 | "--dist_backend", type=str, default="gloo", help="distributed backend (default: gloo)" 224 | ) 225 | 226 | env = environment.Environment() 227 | parser.add_argument("--hosts", type=list, default=env.hosts) 228 | parser.add_argument("--current-host", type=str, default=env.current_host) 229 | parser.add_argument("--model-dir", type=str, default=env.model_dir) 230 | parser.add_argument("--data-dir", type=str, default=env.channel_input_dirs.get("training")) 231 | parser.add_argument("--num-gpus", type=int, default=env.num_gpus) 232 | 233 | _train(parser.parse_args()) -------------------------------------------------------------------------------- /03_training/statics/CIFAR-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/CIFAR-10.png -------------------------------------------------------------------------------- /03_training/statics/Experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/Experiments.png -------------------------------------------------------------------------------- /03_training/statics/HPO_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/HPO_experiments.png -------------------------------------------------------------------------------- /03_training/tensorflow/source_dir/keras_cifar10.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # https://aws.amazon.com/apache-2-0/ 8 | # 9 | # or in the "license" file accompanying this file. 
This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 13 | 14 | import argparse 15 | import io 16 | import itertools 17 | import json 18 | import logging 19 | import os 20 | import re 21 | 22 | import numpy as np 23 | import sklearn.metrics 24 | import tensorflow as tf 25 | from tensorflow import keras 26 | from tensorflow.keras import backend as K 27 | from tensorflow.keras.constraints import max_norm 28 | from tensorflow.keras.layers import ( 29 | Activation, 30 | BatchNormalization, 31 | Conv2D, 32 | Dense, 33 | Dropout, 34 | Flatten, 35 | MaxPooling2D, 36 | ) 37 | from tensorflow.keras.models import Sequential 38 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 39 | 40 | logging.getLogger().setLevel(logging.INFO) 41 | tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) 42 | 43 | HEIGHT = 32 44 | WIDTH = 32 45 | DEPTH = 3 46 | NUM_CLASSES = 10 47 | 48 | 49 | METRIC_ACCURACY = "accuracy" 50 | 51 | 52 | def keras_model_fn(learning_rate, optimizer): 53 | model = Sequential() 54 | model.add(Conv2D(32, (3, 3), padding="same", name="inputs", input_shape=(HEIGHT, WIDTH, DEPTH))) 55 | model.add(BatchNormalization()) 56 | model.add(Activation("relu")) 57 | model.add(Conv2D(32, (3, 3))) 58 | model.add(BatchNormalization()) 59 | model.add(Activation("relu")) 60 | model.add(MaxPooling2D(pool_size=(2, 2))) 61 | model.add(Dropout(0.2)) 62 | 63 | model.add(Conv2D(64, (3, 3), padding="same")) 64 | model.add(BatchNormalization()) 65 | model.add(Activation("relu")) 66 | model.add(Conv2D(64, (3, 3))) 67 | model.add(BatchNormalization()) 68 | model.add(Activation("relu")) 69 | model.add(MaxPooling2D(pool_size=(2, 2))) 70 | model.add(Dropout(0.3)) 71 | 72 | model.add(Conv2D(128, (3, 3), padding="same")) 73 | model.add(BatchNormalization()) 74 | model.add(Activation("relu")) 75 | model.add(Conv2D(128, (3, 3))) 76 | model.add(BatchNormalization()) 77 | model.add(Activation("relu")) 78 | model.add(MaxPooling2D(pool_size=(2, 2))) 79 | model.add(Dropout(0.4)) 80 | 81 | model.add(Flatten()) 82 | model.add(Dense(512, kernel_constraint=max_norm(2.0))) 83 | model.add(Activation("relu")) 84 | model.add(Dropout(0.5)) 85 | model.add(Dense(NUM_CLASSES)) 86 | model.add(Activation("softmax")) 87 | 88 | if optimizer == "sgd": 89 | opt = SGD(learning_rate=learning_rate) 90 | elif optimizer == "rmsprop": 91 | opt = RMSprop(learning_rate=learning_rate) 92 | elif optimizer == "adam": 93 | opt = Adam(learning_rate=learning_rate) 94 | else: 95 | raise Exception("Unknown optimizer", optimizer) 96 | 97 | model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"]) 98 | return model 99 | 100 | 101 | def read_dataset(epochs, batch_size, channel, channel_name): 102 | mode = args.data_config[channel_name]["TrainingInputMode"] 103 | 104 | logging.info("Running {} in {} mode".format(channel_name, mode)) 105 | if mode == "Pipe": 106 | from sagemaker_tensorflow import PipeModeDataset 107 | 108 | dataset = PipeModeDataset(channel=channel_name, record_format="TFRecord") 109 | else: 110 | filenames = [os.path.join(channel, channel_name + ".tfrecords")] 111 | dataset = tf.data.TFRecordDataset(filenames) 112 | 113 | image_feature_description = { 114 | "image": tf.io.FixedLenFeature([], tf.string), 115 | "label": tf.io.FixedLenFeature([], tf.int64), 116 | } 117 | 118 | def _parse_image_function(example_proto): 119 | # 
Parse the input tf.Example proto using the dictionary above. 120 | features = tf.io.parse_single_example(example_proto, image_feature_description) 121 | image = tf.io.decode_raw(features["image"], tf.uint8) 122 | image.set_shape([3 * 32 * 32]) 123 | image = tf.reshape(image, [32, 32, 3]) 124 | 125 | label = tf.cast(features["label"], tf.int32) 126 | label = tf.one_hot(label, 10) 127 | 128 | return image, label 129 | 130 | dataset = dataset.map(_parse_image_function, num_parallel_calls=10) 131 | dataset = dataset.prefetch(10) 132 | dataset = dataset.repeat(epochs) 133 | dataset = dataset.shuffle(buffer_size=10 * batch_size) 134 | dataset = dataset.batch(batch_size, drop_remainder=True) 135 | 136 | return dataset 137 | 138 | 139 | 140 | def main(args): 141 | # Initializing TensorFlow summary writer 142 | job_name = json.loads(os.environ.get("SM_TRAINING_ENV"))["job_name"] 143 | 144 | # Importing datasets 145 | train_dataset = read_dataset(args.epochs, args.batch_size, args.train, "train") 146 | validation_dataset = read_dataset(args.epochs, args.batch_size, args.validation, "validation") 147 | 148 | # Initializing and compiling the model 149 | model = keras_model_fn(args.learning_rate, args.optimizer) 150 | 151 | # Train the model 152 | model.fit( 153 | x=train_dataset, 154 | epochs=args.epochs, 155 | batch_size=args.batch_size, 156 | validation_data=validation_dataset, 157 | ) 158 | 159 | # Saving trained model 160 | model.save(args.model_output + "/1") 161 | 162 | # Converting validation dataset to numpy array 163 | validation_array = np.array(list(validation_dataset.unbatch().take(-1).as_numpy_iterator())) 164 | test_x = np.stack(validation_array[:, 0]) 165 | test_y = np.stack(validation_array[:, 1]) 166 | 167 | # Use the model to predict the labels 168 | test_predictions = model.predict(test_x) 169 | test_y_pred = np.argmax(test_predictions, axis=1) 170 | test_y_true = np.argmax(test_y, axis=1) 171 | 172 | # Evaluating model accuracy and logging it as a scalar for TensorBoard hyperparameter visualization. 173 | accuracy = sklearn.metrics.accuracy_score(test_y_true, test_y_pred) 174 | tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1) 175 | logging.info("Test accuracy:{}".format(accuracy)) 176 | 177 | 178 | if __name__ == "__main__": 179 | parser = argparse.ArgumentParser() 180 | 181 | parser.add_argument( 182 | "--train", 183 | type=str, 184 | required=False, 185 | default=os.environ.get("SM_CHANNEL_TRAIN"), 186 | help="The directory where the CIFAR-10 input data is stored.", 187 | ) 188 | parser.add_argument( 189 | "--validation", 190 | type=str, 191 | required=False, 192 | default=os.environ.get("SM_CHANNEL_VALIDATION"), 193 | help="The directory where the CIFAR-10 input data is stored.", 194 | ) 195 | parser.add_argument( 196 | "--model-output", 197 | type=str, 198 | default=os.environ.get("SM_MODEL_DIR"), 199 | help="The directory where the trained model will be stored.", 200 | ) 201 | parser.add_argument("--learning-rate", type=float, default=0.01, help="Initial learning rate.") 202 | parser.add_argument( 203 | "--epochs", type=int, default=10, help="The number of steps to use for training." 
204 | ) 205 | parser.add_argument("--batch-size", type=int, default=128, help="Batch size for training.") 206 | parser.add_argument( 207 | "--data-config", type=json.loads, default=os.environ.get("SM_INPUT_DATA_CONFIG") 208 | ) 209 | parser.add_argument("--optimizer", type=str.lower, default="adam") 210 | parser.add_argument("--model_dir", type=str) 211 | 212 | args = parser.parse_args() 213 | main(args) 214 | -------------------------------------------------------------------------------- /04_advanced_training/README.md: -------------------------------------------------------------------------------- 1 | ## Advanced Training on Amazon SageMaker 2 | Machine Learning (ML) practitioners commonly face performance and scalability challenges when training Computer Vision (CV) models. This is because model size and complexity grow quickly as the dataset size increases. While you can always scale up to use bigger instances with more CPUs and GPUs, there will eventually be capacity limits beyond which you cannot scale up anymore. 3 | 4 | This is when we need to leverage advanced training techniques like distributed training, debugging, and monitoring to help us overcome these challenges. 5 | 6 | --- 7 | 8 | ## Introduction 9 | 10 | This module covers advanced training topics like debugging and distributed training. You will get exposure to SageMaker features like [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) and Amazon SageMaker's distributed training library. SageMaker Debugger allows you to attach a debug process to your training job; this helps you monitor your training at a much more granular time interval and automatically profiles the instances to help you identify performance bottlenecks. 11 | 12 | 13 | [Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) helps you train deep learning models faster and more cheaply. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet. This module provides two examples demonstrating how to use the SageMaker distributed data parallel library to train a TensorFlow model and a PyTorch model using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) and MNIST datasets. 14 | 15 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 16 | 17 | --- 18 | ## Prerequisites 19 | 20 | To get started, download the provided Jupyter notebook and associated files to your SageMaker Studio environment. To run the notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 21 | 22 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 23 | - Access to 2 p3.16xlarge GPU instances; this is a SageMaker distributed training library requirement, so you may need to request a service limit increase in your account. 24 | - Familiarity with distributed training concepts. 25 | - Familiarity with training on SageMaker. 26 | - Basic familiarity with Amazon S3. 27 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.
28 | - SageMaker Studio is preferred for the full UI integration 29 | 30 | --- 31 | 32 | ## Dataset 33 | For Tensorflow framework, we are using [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset containing 11,788 images across 200 bird species (the original technical report can be found here). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 34 | 35 | For Pytorch, we are using [MNIST database](http://yann.lecun.com/exdb/mnist/). It has a training set of 60,000 examples, and a test set of 10,000 examples of handwritten digits. 36 | 37 | --- 38 | 39 | ## Additional Resource: 40 | 41 | 1. [SageMaker distributed data parallel PyTorch API Specification](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel_pytorch.html) 42 | 1. [Getting started with SageMaker distributed data parallel](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html) 43 | 1. [PyTorch in SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/inference.py: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | 18 | from __future__ import print_function 19 | 20 | import os 21 | 22 | import torch 23 | 24 | # Network definition 25 | from model_def import Net 26 | 27 | 28 | def model_fn(model_dir): 29 | print("In model_fn. Model directory is -") 30 | print(model_dir) 31 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 32 | model = Net() 33 | with open(os.path.join(model_dir, "model.pth"), "rb") as f: 34 | print("Loading the mnist model") 35 | model.load_state_dict(torch.load(f, map_location=device)) 36 | return model 37 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/model_def.py: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | 18 | import torch 19 | import torch.nn as nn 20 | import torch.nn.functional as F 21 | 22 | 23 | class Net(nn.Module): 24 | def __init__(self): 25 | super(Net, self).__init__() 26 | self.conv1 = nn.Conv2d(1, 32, 3, 1) 27 | self.conv2 = nn.Conv2d(32, 64, 3, 1) 28 | self.dropout1 = nn.Dropout2d(0.25) 29 | self.dropout2 = nn.Dropout2d(0.5) 30 | self.fc1 = nn.Linear(9216, 128) 31 | self.fc2 = nn.Linear(128, 10) 32 | 33 | def forward(self, x): 34 | x = self.conv1(x) 35 | x = F.relu(x) 36 | x = self.conv2(x) 37 | x = F.relu(x) 38 | x = F.max_pool2d(x, 2) 39 | x = self.dropout1(x) 40 | x = torch.flatten(x, 1) 41 | x = self.fc1(x) 42 | x = F.relu(x) 43 | x = self.dropout2(x) 44 | x = self.fc2(x) 45 | output = F.log_softmax(x, dim=1) 46 | return output 47 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/train_pytorch_smdataparallel_mnist.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 
13 | 14 | from __future__ import print_function 15 | 16 | import argparse 17 | from packaging.version import Version 18 | import os 19 | import time 20 | 21 | import smdistributed.dataparallel.torch.distributed as dist 22 | import torch 23 | import torchvision 24 | import torch.nn as nn 25 | import torch.nn.functional as F 26 | import torch.optim as optim 27 | 28 | # Network definition 29 | from model_def import Net 30 | 31 | # Import SMDataParallel PyTorch Modules 32 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 33 | from torch.optim.lr_scheduler import StepLR 34 | from torchvision import datasets, transforms 35 | 36 | # override dependency on mirrors provided by torch vision package 37 | # from torchvision 0.9.1, 2 candidate mirror website links will be added before "resources" items automatically 38 | # Reference PR: https://github.com/pytorch/vision/pull/3559 39 | TORCHVISION_VERSION = "0.9.1" 40 | if Version(torchvision.__version__) < Version(TORCHVISION_VERSION): 41 | datasets.MNIST.resources = [ 42 | ( 43 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/train-images-idx3-ubyte.gz", 44 | "f68b3c2dcbeaaa9fbdd348bbdeb94873", 45 | ), 46 | ( 47 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/train-labels-idx1-ubyte.gz", 48 | "d53e105ee54ea40749a09fcbcd1e9432", 49 | ), 50 | ( 51 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz", 52 | "9fb629c4189551a2d022fa330f9573f3", 53 | ), 54 | ( 55 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz", 56 | "ec29112dd5afa0611ce80d1b7f02629c", 57 | ), 58 | ] 59 | else: 60 | datasets.MNIST.mirrors = ["https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/"] 61 | 62 | 63 | class CUDANotFoundException(Exception): 64 | pass 65 | 66 | 67 | dist.init_process_group() 68 | 69 | 70 | def train(args, model, device, train_loader, optimizer, epoch): 71 | model.train() 72 | for batch_idx, (data, target) in enumerate(train_loader): 73 | data, target = data.to(device), target.to(device) 74 | optimizer.zero_grad() 75 | output = model(data) 76 | loss = F.nll_loss(output, target) 77 | loss.backward() 78 | optimizer.step() 79 | if batch_idx % args.log_interval == 0 and args.rank == 0: 80 | print( 81 | "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format( 82 | epoch, 83 | batch_idx * len(data) * args.world_size, 84 | len(train_loader.dataset), 85 | 100.0 * batch_idx / len(train_loader), 86 | loss.item(), 87 | ) 88 | ) 89 | if args.verbose: 90 | print("Batch", batch_idx, "from rank", args.rank) 91 | 92 | 93 | def test(model, device, test_loader): 94 | model.eval() 95 | test_loss = 0 96 | correct = 0 97 | with torch.no_grad(): 98 | for data, target in test_loader: 99 | data, target = data.to(device), target.to(device) 100 | output = model(data) 101 | test_loss += F.nll_loss(output, target, reduction="sum").item() # sum up batch loss 102 | pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability 103 | correct += pred.eq(target.view_as(pred)).sum().item() 104 | 105 | test_loss /= len(test_loader.dataset) 106 | 107 | print( 108 | "\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format( 109 | test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset) 110 | ) 111 | ) 112 | 113 | 114 | def main(): 115 | # Training settings 116 | parser = argparse.ArgumentParser(description="PyTorch MNIST Example") 117 | parser.add_argument( 118 
| "--batch-size", 119 | type=int, 120 | default=64, 121 | metavar="N", 122 | help="input batch size for training (default: 64)", 123 | ) 124 | parser.add_argument( 125 | "--test-batch-size", 126 | type=int, 127 | default=1000, 128 | metavar="N", 129 | help="input batch size for testing (default: 1000)", 130 | ) 131 | parser.add_argument( 132 | "--epochs", 133 | type=int, 134 | default=14, 135 | metavar="N", 136 | help="number of epochs to train (default: 14)", 137 | ) 138 | parser.add_argument( 139 | "--lr", type=float, default=1.0, metavar="LR", help="learning rate (default: 1.0)" 140 | ) 141 | parser.add_argument( 142 | "--gamma", 143 | type=float, 144 | default=0.7, 145 | metavar="M", 146 | help="Learning rate step gamma (default: 0.7)", 147 | ) 148 | parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)") 149 | parser.add_argument( 150 | "--log-interval", 151 | type=int, 152 | default=10, 153 | metavar="N", 154 | help="how many batches to wait before logging training status", 155 | ) 156 | parser.add_argument( 157 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 158 | ) 159 | parser.add_argument( 160 | "--verbose", 161 | action="store_true", 162 | default=False, 163 | help="For displaying smdistributed.dataparallel-specific logs", 164 | ) 165 | parser.add_argument( 166 | "--data-path", 167 | type=str, 168 | default="/tmp/data", 169 | help="Path for downloading " "the MNIST dataset", 170 | ) 171 | 172 | args = parser.parse_args() 173 | args.world_size = dist.get_world_size() 174 | args.rank = rank = dist.get_rank() 175 | args.local_rank = local_rank = dist.get_local_rank() 176 | args.lr = 1.0 177 | args.batch_size //= args.world_size // 8 178 | args.batch_size = max(args.batch_size, 1) 179 | data_path = args.data_path 180 | 181 | if args.verbose: 182 | print( 183 | "Hello from rank", 184 | rank, 185 | "of local_rank", 186 | local_rank, 187 | "in world size of", 188 | args.world_size, 189 | ) 190 | 191 | if not torch.cuda.is_available(): 192 | raise CUDANotFoundException( 193 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 
194 | ) 195 | 196 | torch.manual_seed(args.seed) 197 | 198 | device = torch.device("cuda") 199 | 200 | # select a single rank per node to download data 201 | is_first_local_rank = local_rank == 0 202 | if is_first_local_rank: 203 | train_dataset = datasets.MNIST( 204 | data_path, 205 | train=True, 206 | download=True, 207 | transform=transforms.Compose( 208 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 209 | ), 210 | ) 211 | dist.barrier() # prevent other ranks from accessing the data early 212 | if not is_first_local_rank: 213 | train_dataset = datasets.MNIST( 214 | data_path, 215 | train=True, 216 | download=False, 217 | transform=transforms.Compose( 218 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 219 | ), 220 | ) 221 | 222 | train_sampler = torch.utils.data.distributed.DistributedSampler( 223 | train_dataset, num_replicas=args.world_size, rank=rank 224 | ) 225 | train_loader = torch.utils.data.DataLoader( 226 | train_dataset, 227 | batch_size=args.batch_size, 228 | shuffle=False, 229 | num_workers=0, 230 | pin_memory=True, 231 | sampler=train_sampler, 232 | ) 233 | if rank == 0: 234 | test_loader = torch.utils.data.DataLoader( 235 | datasets.MNIST( 236 | data_path, 237 | train=False, 238 | transform=transforms.Compose( 239 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 240 | ), 241 | ), 242 | batch_size=args.test_batch_size, 243 | shuffle=True, 244 | ) 245 | 246 | model = DDP(Net().to(device)) 247 | torch.cuda.set_device(local_rank) 248 | model.cuda(local_rank) 249 | optimizer = optim.Adadelta(model.parameters(), lr=args.lr) 250 | scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma) 251 | for epoch in range(1, args.epochs + 1): 252 | train(args, model, device, train_loader, optimizer, epoch) 253 | if rank == 0: 254 | test(model, device, test_loader) 255 | scheduler.step() 256 | 257 | if args.save_model: 258 | torch.save(model.state_dict(), "mnist_cnn.pt") 259 | 260 | 261 | if __name__ == "__main__": 262 | main() 263 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/static/ss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/04_advanced_training/pytorch/static/ss.png -------------------------------------------------------------------------------- /04_advanced_training/tensorflow/static/debugger_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/04_advanced_training/tensorflow/static/debugger_output.png -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | 
from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'application/x-image': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. 
(1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 121 | predicted_class_idx = argmax(output_np) 122 | print(f' Predicted class: {predicted_class_idx}') 123 | prediction_json = {'predictions': output_np.tolist()} 124 | return json.dumps(prediction_json) 125 | 126 | def _print_image_metadata(img): 127 | # Retrieve the attributes of the image 128 | fileFormat = img.format 129 | imageMode = img.mode 130 | imageSize = img.size # (width, height) 131 | colorPalette = img.palette 132 | 133 | print(f' File format: {fileFormat}') 134 | print(f' Image mode: {imageMode}') 135 | print(f' Image size: {imageSize}') 136 | print(f' Color pal: {colorPalette}') 137 | 138 | print(f' Keys from image.info dictionary:') 139 | for key, value in img.info.items(): 140 | print(f' {key}') -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 
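The `handler` above accepts raw image bytes when the request content type is `application/x-image`, resizes them to 224x224, and forwards the preprocessed tensor to TensorFlow Serving over gRPC or REST. As a minimal client-side sketch of calling an endpoint that uses this handler (the endpoint name and image path below are hypothetical, not values from this repository):

```python
# Minimal sketch, assuming an already-deployed endpoint named "bird-classifier"
# and a local JPEG at "test-bird.jpg"; both names are hypothetical.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("test-bird.jpg", "rb") as f:
    payload = f.read()

# The handler branches on context.request_content_type == 'application/x-image',
# so the raw image bytes are sent as the request body with that content type.
response = runtime.invoke_endpoint(
    EndpointName="bird-classifier",      # hypothetical endpoint name
    ContentType="application/x-image",
    Body=payload,
)

result = json.loads(response["Body"].read())
print(result["predictions"])             # softmax scores returned by the handler
```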
3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
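The comment above is why the training script bundles `inference.py` and `requirements.txt` under `code/` inside the model directory: a TensorFlow Serving endpoint created from the resulting `model.tar.gz` then picks up the custom handler automatically. A hedged deployment sketch follows; the artifact location, instance type, and framework version are assumptions, not values from this repository.

```python
# Sketch only: model_data, instance type, and framework_version are assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

role = sagemaker.get_execution_role()

model = TensorFlowModel(
    model_data="s3://<bucket>/bird/output/model.tar.gz",  # artifact produced by this training script
    role=role,
    framework_version="2.11",          # assumed; match the TF version used for training
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```

Because the handler and its `requirements.txt` ride inside the artifact under `code/`, no separate entry point needs to be supplied at deployment time.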
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/docker/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | FROM public.ecr.aws/docker/library/python:3.7 3 | 4 | ADD requirements.txt / 5 | 6 | RUN pip3 install -r requirements.txt 7 | 8 | ENV PYTHONUNBUFFERED=TRUE 9 | ENV TF_CPP_MIN_LOG_LEVEL="2" 10 | 11 | ENTRYPOINT ["python3"] 12 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/docker/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 
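The Dockerfile and requirements above build the custom image that this module's ScriptProcessor uses to run `evaluation.py`. A hedged sketch of wiring that up with the SageMaker Python SDK is shown below; the ECR image URI and S3 paths are placeholders, while the container-side destinations mirror the paths hard-coded in `evaluation.py`.

```python
# Rough sketch only: image_uri and the S3 URIs are placeholders to replace with
# your own; the /opt/ml/processing/... destinations match evaluation.py.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

role = sagemaker.get_execution_role()

evaluation_processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/bird-eval:latest",  # placeholder ECR image
    command=["python3"],               # matches the Dockerfile's ENTRYPOINT
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",      # assumed instance type
)

evaluation_processor.run(
    code="evaluation.py",
    inputs=[
        ProcessingInput(source="s3://<bucket>/bird/test/", destination="/opt/ml/processing/input/test"),
        ProcessingInput(source="s3://<bucket>/bird/manifest/", destination="/opt/ml/processing/input/manifest"),
        ProcessingInput(source="s3://<bucket>/bird/model/model.tar.gz", destination="/opt/ml/processing/model"),
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/evaluation", output_name="evaluation"),
    ],
)
```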
3 | Pillow 4 | scikit-learn 5 | pandas 6 | numpy==1.21.6 7 | tensorflow==2.11.1 8 | boto3==1.18.4 9 | sagemaker-experiments 10 | matplotlib==3.4.2 11 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import pandas as pd 4 | import argparse 5 | import pathlib 6 | import json 7 | import os 8 | import numpy as np 9 | import tarfile 10 | import uuid 11 | 12 | from PIL import Image 13 | 14 | from sklearn.metrics import ( 15 | accuracy_score, 16 | precision_score, 17 | recall_score, 18 | confusion_matrix, 19 | f1_score 20 | ) 21 | 22 | from tensorflow import keras 23 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 24 | from tensorflow.keras.preprocessing import image 25 | # from smexperiments import tracker 26 | 27 | logger = logging.getLogger() 28 | logger.setLevel(logging.INFO) 29 | logger.addHandler(logging.StreamHandler()) 30 | 31 | input_path = "/opt/ml/processing/input/test" #"output/test" # 32 | manifest_path = "/opt/ml/processing/input/manifest/test.csv"#"output/manifest/test.csv" 33 | model_path = "/opt/ml/processing/model" #"model" # 34 | output_path = '/opt/ml/processing/output' #"output" # 35 | 36 | HEIGHT=224; WIDTH=224 37 | 38 | def predict_bird_from_file_new(fn, model): 39 | 40 | img = Image.open(fn).convert('RGB') 41 | 42 | img = img.resize((WIDTH, HEIGHT)) 43 | img_array = image.img_to_array(img) #, data_format = "channels_first") 44 | 45 | x = img_array.reshape((1,) + img_array.shape) 46 | instance = preprocess_input(x) 47 | 48 | del x, img 49 | 50 | result = model.predict(instance) 51 | 52 | predicted_class_idx = np.argmax(result) 53 | confidence = result[0][predicted_class_idx] 54 | 55 | return predicted_class_idx, confidence 56 | 57 | if __name__ == "__main__": 58 | parser = argparse.ArgumentParser() 59 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 60 | args, _ = parser.parse_known_args() 61 | 62 | logger.debug("Extracting the model") 63 | 64 | model_file = os.path.join(model_path, args.model_file) 65 | file = tarfile.open(model_file) 66 | file.extractall(model_path) 67 | 68 | file.close() 69 | 70 | logger.debug("Load model") 71 | 72 | model = keras.models.load_model("{}/1".format(model_path)) 73 | 74 | logger.debug("Starting evaluation.") 75 | 76 | # load test data. 
this should be an argument 77 | df = pd.read_csv(manifest_path) 78 | 79 | num_images = df.shape[0] 80 | 81 | class_name_list = sorted(df['class_id'].unique().tolist()) 82 | 83 | class_name = pd.Series(df['class_name'].values,index=df['class_id']).to_dict() 84 | 85 | logger.debug('Testing {} images'.format(df.shape[0])) 86 | num_errors = 0 87 | preds = [] 88 | acts = [] 89 | for i in range(df.shape[0]): 90 | fname = df.iloc[i]['image_file_name'] 91 | act = int(df.iloc[i]['class_id']) - 1 92 | acts.append(act) 93 | 94 | pred, conf = predict_bird_from_file_new(input_path + '/' + fname, model) 95 | preds.append(pred) 96 | if (pred != act): 97 | num_errors += 1 98 | logger.debug('ERROR on image index {} -- Pred: {} {:.2f}, Actual: {}'.format(i, 99 | class_name_list[pred], conf, 100 | class_name_list[act])) 101 | precision = precision_score(acts, preds, average='micro') 102 | recall = recall_score(acts, preds, average='micro') 103 | accuracy = accuracy_score(acts, preds) 104 | cnf_matrix = confusion_matrix(acts, preds, labels=range(len(class_name_list))) 105 | f1 = f1_score(acts, preds, average='micro') 106 | 107 | logger.debug("Accuracy: {}".format(accuracy)) 108 | logger.debug("Precision: {}".format(precision)) 109 | logger.debug("Recall: {}".format(recall)) 110 | logger.debug("Confusion matrix: {}".format(cnf_matrix)) 111 | logger.debug("F1 score: {}".format(f1)) 112 | 113 | logger.debug(cnf_matrix) 114 | 115 | matrix_output = dict() 116 | 117 | for i in range(len(cnf_matrix)): 118 | matrix_row = dict() 119 | for j in range(len(cnf_matrix[0])): 120 | matrix_row[class_name[class_name_list[j]]] = int(cnf_matrix[i][j]) 121 | matrix_output[class_name[class_name_list[i]]] = matrix_row 122 | 123 | 124 | report_dict = { 125 | "multiclass_classification_metrics": { 126 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 127 | "precision": {"value": precision, "standard_deviation": "NaN"}, 128 | "recall": {"value": recall, "standard_deviation": "NaN"}, 129 | "f1": {"value": f1, "standard_deviation": "NaN"}, 130 | "confusion_matrix":matrix_output 131 | }, 132 | } 133 | 134 | output_dir = "/opt/ml/processing/evaluation" 135 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 136 | 137 | evaluation_path = f"{output_dir}/evaluation.json" 138 | with open(evaluation_path, "w") as f: 139 | f.write(json.dumps(report_dict)) 140 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if 
PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'image/jpeg': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. (1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported different content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | print ('output shape' , output_shape) 
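This 05b variant of the handler accepts `image/jpeg` requests and, a few lines below, flattens the probability vector into a plain JSON list, which is the shape SageMaker Clarify expects when it perturbs image segments and re-invokes the model. The following is a rough sketch of the kind of Clarify configuration the explainability notebook drives; every name, path, and numeric value here is an illustrative assumption, so check the notebook and the SageMaker SDK documentation for the exact arguments.

```python
# Illustrative only -- model name, S3 URIs, and numeric settings are assumptions.
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)

model_config = clarify.ModelConfig(
    model_name="bird-classifier-model",   # hypothetical SageMaker Model name
    instance_type="ml.m5.xlarge",
    instance_count=1,
    content_type="image/jpeg",            # matches the branch in the handler above
)

image_config = clarify.ImageConfig(
    model_type="IMAGE_CLASSIFICATION",
    num_segments=20,                      # number of superpixels to perturb (assumed value)
    segment_compactness=5,
)

shap_config = clarify.SHAPConfig(
    num_samples=500,                      # assumed sampling budget
    image_config=image_config,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://<bucket>/bird/test-images/",
    s3_output_path="s3://<bucket>/bird/clarify-output/",
    dataset_type="application/x-image",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```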
121 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 122 | print ('output_np ' , output_np) 123 | predicted_class_idx = argmax(output_np) 124 | print(f' Predicted class: {predicted_class_idx}') 125 | prediction_json = np.reshape (output_np, -1).tolist() 126 | print('this is what output predictions looks like ', np.reshape (output_np, -1).tolist() ) 127 | print('this is what output predictions looks like ', prediction_json) 128 | return json.dumps(prediction_json) 129 | 130 | def _print_image_metadata(img): 131 | # Retrieve the attributes of the image 132 | fileFormat = img.format 133 | imageMode = img.mode 134 | imageSize = img.size # (width, height) 135 | colorPalette = img.palette 136 | 137 | print(f' File format: {fileFormat}') 138 | print(f' Image mode: {imageMode}') 139 | print(f' Image size: {imageSize}') 140 | print(f' Color pal: {colorPalette}') 141 | 142 | print(f' Keys from image.info dictionary:') 143 | for key, value in img.info.items(): 144 | print(f' {key}') -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 
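The training script that continues below (it mirrors the 05a version) reads its hyperparameters from command-line arguments and its data from the `SM_CHANNEL_TRAIN`/`SM_CHANNEL_TEST`/`SM_CHANNEL_VALIDATION` locations that SageMaker populates. As a hedged sketch of how such a script is typically launched, bucket names, the instance type, and the framework/Python version strings below are assumptions rather than values taken from this repository.

```python
# Sketch only: S3 URIs, instance type, and framework/python versions are assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = TensorFlow(
    entry_point="train-mobilenet.py",
    source_dir="code",                     # ships inference.py and requirements.txt with the script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",              # assumed; align with the TF pin in requirements.txt
    py_version="py39",
    hyperparameters={                      # surfaced to the script as --<name> arguments
        "initial_epochs": 5,
        "fine_tuning_epochs": 30,
        "batch_size": 8,
        "dropout": 0.5,
        "s3_checkpoint_path": f"s3://{bucket}/bird/checkpoints",
    },
    checkpoint_s3_uri=f"s3://{bucket}/bird/checkpoints",  # lets SageMaker sync /opt/ml/checkpoints
)

# Each channel key becomes an SM_CHANNEL_<KEY> directory inside the container.
estimator.fit({
    "train": f"s3://{bucket}/bird/train",
    "test": f"s3://{bucket}/bird/test",
    "validation": f"s3://{bucket}/bird/validation",
})
```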
13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
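Earlier in this same script, `load_checkpoint_model` picks the checkpoint to resume from by pulling the zero-padded epoch number out of each filename with a regular expression; the filenames follow the `img-classifier.{epoch:03d}.h5` pattern set further down. A tiny standalone sketch of that convention, using made-up filenames:

```python
# Standalone illustration of the checkpoint-resume naming convention;
# the filenames are hypothetical examples.
import re

files = ["img-classifier.003.h5", "img-classifier.010.h5", "img-classifier.007.h5"]

epoch_numbers = [re.search(r"(?<=\.)(.*[0-9])(?=\.)", f).group() for f in files]
latest = max(epoch_numbers)                 # string compare works because epochs are zero-padded
latest_file = files[epoch_numbers.index(latest)]

print(latest_file, int(latest))             # -> img-classifier.010.h5 10
```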
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/preprocessing.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = 
lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | 65 | full_df.reset_index(inplace=True, drop=True) 66 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 67 | 68 | # grab a small subset of species for testing 69 | criteria = full_df['orig_class_id'].isin(classes) 70 | full_df = full_df[criteria] 71 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 72 | 73 | unique_classes = full_df['orig_class_id'].drop_duplicates() 74 | sorted_unique_classes = sorted(unique_classes) 75 | id_to_one_based = {} 76 | i = 1 77 | for c in sorted_unique_classes: 78 | id_to_one_based[c] = str(i) 79 | i += 1 80 | 81 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 82 | full_df.reset_index(inplace=True, drop=True) 83 | 84 | def get_class_name(fn): 85 | return fn.split('/')[0] 86 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 87 | full_df = full_df.drop(['image_pretty_name'], axis=1) 88 | 89 | train_df = [] 90 | test_df = [] 91 | val_df = [] 92 | 93 | # split into training and validation sets 94 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 95 | 96 | print('num images total: ' + str(images_df.shape[0])) 97 | print('\nnum train: ' + str(train_df.shape[0])) 98 | print('num val: ' + str(val_df.shape[0])) 99 | print('num test: ' + str(test_df.shape[0])) 100 | return train_df, val_df, test_df 101 | 102 | # this function copy images by channel to its destination folder 103 | def copy_files_for_channel(df, channel_name, verbose=False): 104 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 105 | for i in range(df.shape[0]): 106 | target_fname = df.iloc[i]['image_file_name'] 107 | # if verbose: 108 | # print(target_fname) 109 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 110 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 111 | shutil.copyfile(src, dst) 112 | 113 | if __name__ == "__main__": 114 | parser = argparse.ArgumentParser() 115 | parser.add_argument("--classes", type=str, default="") 116 | parser.add_argument("--input-data", type=str, default="classes.txt") 117 | 
args, _ = parser.parse_known_args() 118 | 119 | 120 | c_list = args.classes.split(',') 121 | input_data = args.input_data 122 | 123 | CLASSES_FILE = os.path.join(input_path, input_data) 124 | 125 | CLASS_COLS = ['class_number','class_id'] 126 | 127 | if len(c_list)==0: 128 | # Otherwise, you can use the full set of species 129 | CLASSES = [] 130 | for c in range(200): 131 | CLASSES += [c + 1] 132 | prefix = prefix + '-full' 133 | else: 134 | CLASSES = list(map(int, c_list)) 135 | 136 | 137 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 138 | 139 | criteria = classes_df['class_number'].isin(CLASSES) 140 | classes_df = classes_df[criteria] 141 | 142 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 143 | print(class_name_list) 144 | 145 | 146 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, SPLIT_RATIOS) 147 | 148 | for c in class_name_list: 149 | os.mkdir('{}/{}/{}'.format(output_path, 'valid', c)) 150 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 151 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 152 | 153 | copy_files_for_channel(val_df, 'valid') 154 | copy_files_for_channel(test_df, 'test') 155 | copy_files_for_channel(train_df, 'train') 156 | 157 | # export manifest file for validation 158 | train_m_file = "{}/manifest/train.csv".format(output_path) 159 | train_df.to_csv(train_m_file, index=False) 160 | test_m_file = "{}/manifest/test.csv".format(output_path) 161 | test_df.to_csv(test_m_file, index=False) 162 | val_m_file = "{}/manifest/valid.csv".format(output_path) 163 | val_df.to_csv(val_m_file, index=False) 164 | 165 | print("Finished running processing job") 166 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/README.md: -------------------------------------------------------------------------------- 1 | # Model Evaluation and Model Explainability 2 | 3 | Amazon SageMaker Processing is a managed solution for various machine learning(ML) processes. These include feature engineering, data validation, model evaluation, model explainability, and many other essential steps performed by engineers and data scientists during their ML workflow. 4 | 5 | Your custom processing scripts will be containerized running in infrastructure that is created on demand and terminated automatically. You can choose to use a SageMaker optimized containers for popular data processing or model evaluation frameworks like Scikit learn, PySpark etc. or Bring Your Own Containers (BYOC). 6 | 7 | ![Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png) 8 | 9 | 10 | # The notebook in Module 5a describes how to evaluate your ML model 11 | 12 | # Introduction 13 | 14 | Postprocess and model evaluation are important steps before deployment the model to production. In this module you will use ScriptProcessor from SageMaker Processing to evaluate the model performance after the training. 15 | 16 | To setup your ScriptProcessor, we will build a custom container for a model evaluation script which will Load the tensorflow model, load the test dataset and annotation (either from previous module or run the :code[optional-prepare-data-and-model.ipynb] notebook), and then run predicition and generate the confusion matrix. 17 | 18 | --- 19 | # Prerequisites 20 | 21 | Download the notebook into your environment, and you can run it by simply execute each cell in order. 
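The evaluation script shown earlier writes its metrics report to `/opt/ml/processing/evaluation/evaluation.json`. As a minimal sketch of consuming that report once it has been downloaded from the processing job's S3 output (the local file name is an assumption), before the prerequisites list continues below:

```python
# Minimal sketch: assumes evaluation.json was downloaded locally from the
# processing job's S3 output; the keys mirror report_dict in evaluation.py.
import json

with open("evaluation.json") as f:
    report = json.load(f)

metrics = report["multiclass_classification_metrics"]
print("accuracy  :", metrics["accuracy"]["value"])
print("f1 (micro):", metrics["f1"]["value"])

# The confusion matrix is nested as {actual_class: {predicted_class: count}}.
for actual, row in metrics["confusion_matrix"].items():
    count, predicted = max((cnt, pred) for pred, cnt in row.items() if pred != actual)
    print(f"{actual}: most often confused with {predicted} ({count} images)")
```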
To understand what's happening, you'll need: 22 | 23 | - Access to the SageMaker default S3 bucket. 24 | - Familiarity with Python and numpy. 25 | - Basic familiarity with AWS S3. 26 | - Basic understanding of Amazon SageMaker. 27 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 28 | - SageMaker Studio is preferred for the full UI integration. 29 | 30 | --- 31 | 32 | ### Dataset & Inputs 33 | 34 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). If you kept your artifacts from previous labs, then simply update the S3 locations below for your test images and test data annotation file. If you do not have them, just run the `optional-prepare-data-and-model.ipynb` notebook to generate the files, and then update the paths below. 35 | 36 | - S3 path for test image data 37 | - S3 path for test data annotation file 38 | - S3 path for the bird classification model 39 | 40 | --- 41 | # Review Outputs 42 | 43 | At the end of the lab, you will generate a JSON file containing the performance metrics (accuracy, precision, recall, F1, and confusion matrix) on your test dataset. 44 | 45 | 46 | # The notebook in Module 5b describes Image Classification ML model explainability using SageMaker Clarify 47 | 48 | Amazon SageMaker Clarify provides machine learning (ML) developers with purpose-built tools to gain greater insights into their ML training data and models. SageMaker Clarify detects and measures potential bias using a variety of metrics so that developers can address that bias and explain model predictions. 49 | 50 | Amazon SageMaker Clarify can detect potential bias during data preparation, after model training, and in your deployed model. For instance, you can check for bias related to age in your dataset or in your trained model and receive a detailed report that quantifies different types of potential bias. SageMaker Clarify also includes feature importance scores that help you explain how your model makes predictions, and it produces explainability reports in bulk or in real time via online explainability. You can use these reports to support customer or internal presentations, or to identify potential issues with your model. 51 | 52 | 53 | 54 | --- 55 | 56 | # Introduction 57 | 58 | # Explain model predictions 59 | Understand which features contributed the most to model predictions. 60 | 61 | SageMaker Clarify is integrated with SageMaker Experiments to provide scores detailing which features contributed the most to your model's prediction on a particular input for tabular, NLP, and computer vision models. For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart which provides insights into the overall prediction process of the model. These details can help determine if a particular model input has more influence than expected on overall model behavior. For tabular data, in addition to the feature importance scores, you can also use partial dependence plots (PDP) to show the dependence of the predicted target response on a set of input features of interest. 62 | 63 | --- 64 | # Prerequisites 65 | 66 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 67 | 68 | - Access to the SageMaker default S3 bucket. 69 | - Familiarity with Python and numpy. 70 | - Basic familiarity with AWS S3.
71 | - Basic understanding of Amazon SageMaker. 72 | - Basic understanding of Amazon SageMaker Clarify. 73 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 74 | - SageMaker Studio is preferred for the full UI integration. 75 | 76 | --- 77 | 78 | ### Dataset & Inputs 79 | 80 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). If you kept your artifacts from previous labs, then simply update the S3 locations below for your test images and test data annotation file. 81 | 82 | - S3 path for test image data 83 | - S3 path for test data annotation file 84 | - S3 path for the bird classification model 85 | 86 | --- 87 | # Review Outputs 88 | 89 | At the end of the lab, you will download the results of the Clarify explainability job and see the SHAP values as well as a heat map of the pixels that contributed to your image classification. 90 | -------------------------------------------------------------------------------- /06_training_pipeline/README.md: -------------------------------------------------------------------------------- 1 | # Training Pipeline 2 | 3 | Machine learning (ML) workflow orchestration is an important milestone in operationalizing ML solutions for production. This usually entails connecting and automating various ML steps, which include exploring and preparing data, experimenting with different algorithms and parameters, training and tuning models, and deploying models to production. This is why purpose-built tools like [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) and [SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html) are important to help you automate, manage, and standardize ML orchestration workflows, so you can quickly productionalize and scale your ML solutions. 4 | 5 | --- 6 | # Introduction 7 | 8 | This module demonstrates how to build a reusable computer vision (CV) workflow using SageMaker Pipelines. The pipeline goes through preprocessing, training, and evaluation steps for 2 different training jobs: 1) spot training and 2) on-demand training. If the accuracy meets certain requirements, the models are then registered with the SageMaker Model Registry. 9 | 10 | ![Training Pipeline](statics/cv-training-pipeline.png) 11 | 12 | [SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) is the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). It works on the concept of steps. The order steps are executed in is inferred from the dependencies each step has. If a step has a dependency on the output from a previous step, it's not executed until after that step has completed successfully. This also allows SageMaker to create a Directed Acyclic Graph (DAG) that can be visualized in Amazon SageMaker Studio (see diagram above). The DAG can be used to track pipeline executions, inputs/outputs, and metrics, giving users the full lineage of the model creation. 13 | 14 | ![Spot Training](statics/cost-explore.png) 15 | 16 | This module will also illustrate the concept of spot training. The pipeline trains two identical models in parallel, one using on-demand instances and the other using spot instances. Training workloads are tagged accordingly with `TrainingType: Spot` or `TrainingType: OnDemand`.
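As a hedged illustration of how the two training jobs typically differ (the actual step definitions live in `pipeline.py`; the instance type, time limits, and bucket name below are assumptions), the spot variant usually only adds the spot flags, checkpointing, and the cost-allocation tag on top of the on-demand configuration:

```python
# Hedged sketch: instance type, timing limits, and S3 paths are assumptions;
# see pipeline.py for the real step definitions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

common = dict(
    entry_point="train-mobilenet.py",
    source_dir="code",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)

on_demand_estimator = TensorFlow(
    **common,
    tags=[{"Key": "TrainingType", "Value": "OnDemand"}],
)

spot_estimator = TensorFlow(
    **common,
    use_spot_instances=True,            # use spare capacity instead of on-demand
    max_run=3600,                       # training time limit in seconds
    max_wait=7200,                      # must be >= max_run; includes time spent waiting for capacity
    checkpoint_s3_uri="s3://<bucket>/bird/spot-checkpoints",  # lets interrupted jobs resume
    tags=[{"Key": "TrainingType", "Value": "Spot"}],
)
```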
If you are interested and have permission to access the billing for your AWS account, you can visualize the cost savings from spot training in Cost Explorer. To enable custom cost allocation tags, please follow this [AWS documentation](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/activating-tags.html). It takes 12-48 hours for the new tag to show up in Cost Explorer. 17 | 18 | --- 19 | 20 | Optionally, you can also try SageMaker Projects and build a repeatable pattern from your pipeline. It is a great way to standardize and scale ML solutions for your teams and organizations. 21 | 22 | ![Studio Template](statics/project-template.png) 23 | 24 | ![Studio Parameters](statics/project-parameter.png) 25 | 26 | An Amazon SageMaker project template automates the setup and implementation of MLOps patterns. SageMaker already provides a set of project templates 27 | that create the infrastructure you need for an MLOps solution for CI/CD of your ML models ([more on the provided project templates](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html)). If the SageMaker-provided templates do not meet your needs, you can also create your own templates. 28 | 29 | For that, you need the pipeline definition file. It is a JSON file that defines a series of interconnected steps as a directed acyclic graph (DAG) and specifies the requirements of and relationships between each step of your pipeline. This JSON file, along with a CloudFormation template, also allows you to manage your pipeline and its underlying infrastructure as code, so you can establish reusable CI/CD patterns to standardize across your teams. 30 | 31 | A CloudFormation template, `sagemaker-project.yaml`, is supplied with this week's lab. 32 | 33 | It will generate the following resources: 34 | * An S3 bucket - with a Lambda PUT notification enabled 35 | * A Lambda function - when triggered, it creates/updates a SageMaker pipeline and executes it 36 | * A SageMaker pipeline - generated from the JSON definition file 37 | * A SageMaker Model Group - to house the completed models 38 | 39 | Note: for lab purposes, the role permissions are intentionally loose. You will likely use tighter permission controls than what's provided here. 40 | 41 | To get started, load the provided Jupyter notebook and associated files into your SageMaker Studio environment. 42 | 43 | --- 44 | 45 | ## Prerequisites 46 | 47 | To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 48 | 49 | - Access to the SageMaker default S3 bucket 50 | - Access to Elastic Container Registry (ECR) 51 | - For the optional portion of this lab, you will need access to CloudFormation, Service Catalog, and Cost Explorer 52 | - Familiarity with training on Amazon SageMaker 53 | - Familiarity with Python 54 | - Familiarity with AWS S3 55 | - Basic understanding of CloudFormation and the concept of deploying infrastructure as code 56 | - Basic understanding of tagging and cost governance 57 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 58 | - SageMaker Studio is preferred for the full UI integration 59 | 60 | --- 61 | ## Dataset 62 | The dataset we are using is the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels.
Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 63 | 64 | ![Bird Dataset](statics/birds.png) 65 | 66 | Run the notebook to download the full dataset or download manually [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB, and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-download and re-process the data. 67 | -------------------------------------------------------------------------------- /06_training_pipeline/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'application/x-image': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. 
(1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 121 | predicted_class_idx = argmax(output_np) 122 | print(f' Predicted class: {predicted_class_idx}') 123 | prediction_json = {'predictions': output_np.tolist()} 124 | return json.dumps(prediction_json) 125 | 126 | def _print_image_metadata(img): 127 | # Retrieve the attributes of the image 128 | fileFormat = img.format 129 | imageMode = img.mode 130 | imageSize = img.size # (width, height) 131 | colorPalette = img.palette 132 | 133 | print(f' File format: {fileFormat}') 134 | print(f' Image mode: {imageMode}') 135 | print(f' Image size: {imageSize}') 136 | print(f' Color pal: {colorPalette}') 137 | 138 | print(f' Keys from image.info dictionary:') 139 | for key, value in img.info.items(): 140 | print(f' {key}') -------------------------------------------------------------------------------- /06_training_pipeline/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /06_training_pipeline/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /06_training_pipeline/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. 
or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /06_training_pipeline/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import pandas as pd 4 | import argparse 5 | import pathlib 6 | import json 7 | import os 8 | import numpy as np 9 | import tarfile 10 | import uuid 11 | 12 | from PIL import Image 13 | 14 | from sklearn.metrics import ( 15 | accuracy_score, 16 | precision_score, 17 | recall_score, 18 | confusion_matrix, 19 | f1_score 20 | ) 21 | 22 | from tensorflow import keras 23 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 24 | from tensorflow.keras.preprocessing import image 25 | from tensorflow.keras.optimizers import SGD 26 | # from smexperiments import tracker 27 | 28 | logger = logging.getLogger() 29 | logger.setLevel(logging.INFO) 30 | logger.addHandler(logging.StreamHandler()) 31 | 32 | input_path = "/opt/ml/processing/input/test" #"output/test" # 33 | manifest_path = "/opt/ml/processing/input/manifest/test.csv"#"output/manifest/test.csv" 34 | model_path = "/opt/ml/processing/model" #"model" # 35 | output_path = '/opt/ml/processing/output' #"output" # 36 | 37 | HEIGHT=224; WIDTH=224 38 | 39 | def predict_bird_from_file_new(fn, model): 40 | 41 | img = 
Image.open(fn).convert('RGB') 42 | 43 | img = img.resize((WIDTH, HEIGHT)) 44 | img_array = image.img_to_array(img) #, data_format = "channels_first") 45 | 46 | x = img_array.reshape((1,) + img_array.shape) 47 | instance = preprocess_input(x) 48 | 49 | del x, img 50 | 51 | result = model.predict(instance) 52 | 53 | predicted_class_idx = np.argmax(result) 54 | confidence = result[0][predicted_class_idx] 55 | 56 | return predicted_class_idx, confidence 57 | 58 | if __name__ == "__main__": 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 61 | args, _ = parser.parse_known_args() 62 | 63 | print("Extracting the model") 64 | 65 | model_file = os.path.join(model_path, args.model_file) 66 | file = tarfile.open(model_file) 67 | file.extractall(model_path) 68 | 69 | file.close() 70 | 71 | print("Load model") 72 | 73 | model = keras.models.load_model("{}/1".format(model_path), compile=False) 74 | model.compile(optimizer=SGD(lr=0.00001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy']) 75 | 76 | print("Starting evaluation.") 77 | 78 | # load test data. this should be an argument 79 | df = pd.read_csv(manifest_path) 80 | 81 | num_images = df.shape[0] 82 | 83 | class_name_list = sorted(df['class_id'].unique().tolist()) 84 | 85 | class_name = pd.Series(df['class_name'].values,index=df['class_id']).to_dict() 86 | 87 | print('Testing {} images'.format(df.shape[0])) 88 | num_errors = 0 89 | preds = [] 90 | acts = [] 91 | for i in range(df.shape[0]): 92 | fname = df.iloc[i]['image_file_name'] 93 | act = int(df.iloc[i]['class_id']) - 1 94 | acts.append(act) 95 | 96 | pred, conf = predict_bird_from_file_new(input_path + '/' + fname, model) 97 | 98 | preds.append(pred) 99 | 100 | print(f'max range is {len(class_name_list)-1}, prediction is {pred}, and the actual is {act}============') 101 | if (pred != act): 102 | num_errors += 1 103 | logger.debug('ERROR on image index {} -- Pred: {} {:.2f}, Actual: {}'.format(i, 104 | class_name_list[pred], conf, 105 | class_name_list[act])) 106 | precision = precision_score(acts, preds, average='micro') 107 | recall = recall_score(acts, preds, average='micro') 108 | accuracy = accuracy_score(acts, preds) 109 | cnf_matrix = confusion_matrix(acts, preds, labels=range(len(class_name_list))) 110 | f1 = f1_score(acts, preds, average='micro') 111 | 112 | print("Accuracy: {}".format(accuracy)) 113 | logger.debug("Precision: {}".format(precision)) 114 | logger.debug("Recall: {}".format(recall)) 115 | logger.debug("Confusion matrix: {}".format(cnf_matrix)) 116 | logger.debug("F1 score: {}".format(f1)) 117 | 118 | print(cnf_matrix) 119 | 120 | matrix_output = dict() 121 | 122 | for i in range(len(cnf_matrix)): 123 | matrix_row = dict() 124 | for j in range(len(cnf_matrix[0])): 125 | matrix_row[class_name[class_name_list[j]]] = int(cnf_matrix[i][j]) 126 | matrix_output[class_name[class_name_list[i]]] = matrix_row 127 | 128 | 129 | report_dict = { 130 | "multiclass_classification_metrics": { 131 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 132 | "precision": {"value": precision, "standard_deviation": "NaN"}, 133 | "recall": {"value": recall, "standard_deviation": "NaN"}, 134 | "f1": {"value": f1, "standard_deviation": "NaN"}, 135 | "confusion_matrix":matrix_output 136 | }, 137 | } 138 | 139 | output_dir = "/opt/ml/processing/evaluation" 140 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 141 | 142 | evaluation_path = f"{output_dir}/evaluation.json" 143 | with 
open(evaluation_path, "w") as f: 144 | f.write(json.dumps(report_dict)) 145 | -------------------------------------------------------------------------------- /06_training_pipeline/lambda/index.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | import os 4 | import pprint as pp 5 | import uuid 6 | 7 | s3 = boto3.client('s3') 8 | sagemaker = boto3.client("sagemaker") 9 | 10 | pipeline_name = os.getenv('PipelineName') 11 | role = os.getenv('RoleArn') 12 | 13 | def lambda_handler(event, context): 14 | if len(event['Records']) >0: 15 | bucket = event['Records'][0]['s3']['bucket']['name'] 16 | key = event['Records'][0]['s3']['object']['key'] 17 | response = s3.get_object(Bucket=bucket, Key=key) 18 | 19 | definition_data = response["Body"].read().decode('utf-8') 20 | 21 | try: 22 | response = sagemaker.list_pipelines( 23 | PipelineNamePrefix=pipeline_name 24 | ) 25 | 26 | pipelines = response['PipelineSummaries'] 27 | except Exception as e: 28 | return e 29 | 30 | create = False 31 | 32 | if len(pipelines)>0: 33 | for p in pipelines: 34 | if p['PipelineName'] == pipeline_name: 35 | create = False 36 | else: 37 | create = True 38 | # Update the pipeline 39 | try: 40 | if create: 41 | response = sagemaker.create_pipeline( 42 | PipelineName=pipeline_name, 43 | PipelineDisplayName=pipeline_name, 44 | PipelineDefinition=definition_data, 45 | PipelineDescription=f'pipeline from {key}', 46 | ClientRequestToken=str(uuid.uuid4()), 47 | RoleArn=role 48 | ) 49 | print(f'created new pipeline: {pipeline_name}') 50 | 51 | else: 52 | response = sagemaker.update_pipeline( 53 | PipelineName=pipeline_name, 54 | PipelineDisplayName=pipeline_name, 55 | PipelineDefinition=definition_data, 56 | PipelineDescription=f'pipeline from {key}', 57 | RoleArn=role 58 | ) 59 | 60 | print(f'updated pipeline: {pipeline_name}') 61 | except Exception as e: 62 | return e 63 | 64 | # Execute the pipeline 65 | try: 66 | 67 | response = sagemaker.start_pipeline_execution( 68 | PipelineName=pipeline_name, 69 | PipelineExecutionDisplayName=pipeline_name, 70 | PipelineExecutionDescription=f'pipeline from {key}', 71 | ) 72 | 73 | except Exception as e: 74 | print(f'Failed starting SageMaker pipeline because: {e}') 75 | 76 | return { 77 | 'statusCode': 200, 78 | 'body': json.dumps(f"Launched SageMaker pipeline: {pipeline_name}") 79 | } 80 | 81 | return "Complete Pipeline Execution...." 
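# ---------------------------------------------------------------------------
# Hypothetical local smoke test for the handler above; it is NOT part of the
# deployed Lambda package and only runs when this file is executed directly.
# The bucket name and object key are placeholders for a pipeline definition
# JSON uploaded under the pipeline/ prefix, and the PipelineName / RoleArn
# environment variables must be exported beforehand, e.g.
#   PipelineName=my-cv-pipeline RoleArn=arn:aws:iam::<account-id>:role/<pipeline-role> python index.py
if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "my-sagemaker-project-bucket"},
                    "object": {"key": "pipeline/pipeline-definition.json"},
                }
            }
        ]
    }
    print(lambda_handler(sample_event, None))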
-------------------------------------------------------------------------------- /06_training_pipeline/lambda/lambda.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/lambda/lambda.zip -------------------------------------------------------------------------------- /06_training_pipeline/preprocess.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, images_to_exclude, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | full_df = full_df[~full_df.image_file_name.isin(images_to_exclude)] 65 | 66 | full_df.reset_index(inplace=True, drop=True) 67 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 68 | 69 | # grab a small subset of species for testing 70 | criteria = full_df['orig_class_id'].isin(classes) 71 | full_df = full_df[criteria] 72 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 73 | 74 | unique_classes = full_df['orig_class_id'].drop_duplicates() 75 | sorted_unique_classes = sorted(unique_classes) 76 
| id_to_one_based = {} 77 | i = 1 78 | for c in sorted_unique_classes: 79 | id_to_one_based[c] = str(i) 80 | i += 1 81 | 82 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 83 | full_df.reset_index(inplace=True, drop=True) 84 | 85 | def get_class_name(fn): 86 | return fn.split('/')[0] 87 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 88 | full_df = full_df.drop(['image_pretty_name'], axis=1) 89 | 90 | train_df = [] 91 | test_df = [] 92 | val_df = [] 93 | 94 | # split into training and validation sets 95 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 96 | 97 | print('num images total: ' + str(images_df.shape[0])) 98 | print('\nnum train: ' + str(train_df.shape[0])) 99 | print('num val: ' + str(val_df.shape[0])) 100 | print('num test: ' + str(test_df.shape[0])) 101 | return train_df, val_df, test_df 102 | 103 | # this function copy images by channel to its destination folder 104 | def copy_files_for_channel(df, channel_name, verbose=False): 105 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 106 | for i in range(df.shape[0]): 107 | target_fname = df.iloc[i]['image_file_name'] 108 | # if verbose: 109 | # print(target_fname) 110 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 111 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 112 | shutil.copyfile(src, dst) 113 | 114 | if __name__ == "__main__": 115 | parser = argparse.ArgumentParser() 116 | parser.add_argument("--classes", type=str, default="") 117 | parser.add_argument("--input-data", type=str, default="classes.txt") 118 | args, _ = parser.parse_known_args() 119 | 120 | 121 | c_list = args.classes.split(',') 122 | input_data = args.input_data 123 | 124 | print(c_list) 125 | 126 | CLASSES_FILE = os.path.join(input_path, input_data) 127 | 128 | EXCLUDE_IMAGE_LIST = ['087.Mallard/Mallard_0130_76836.jpg'] 129 | CLASS_COLS = ['class_number','class_id'] 130 | 131 | if len(c_list)==0: 132 | # Otherwise, you can use the full set of species 133 | CLASSES = [] 134 | for c in range(200): 135 | CLASSES += [c + 1] 136 | prefix = prefix + '-full' 137 | else: 138 | CLASSES = list(map(int, c_list)) 139 | 140 | 141 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 142 | 143 | criteria = classes_df['class_number'].isin(CLASSES) 144 | classes_df = classes_df[criteria] 145 | 146 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 147 | print(class_name_list) 148 | 149 | 150 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, EXCLUDE_IMAGE_LIST, SPLIT_RATIOS) 151 | 152 | for c in class_name_list: 153 | os.mkdir('{}/{}/{}'.format(output_path, 'validation', c)) 154 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 155 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 156 | 157 | copy_files_for_channel(val_df, 'validation') 158 | copy_files_for_channel(test_df, 'test') 159 | copy_files_for_channel(train_df, 'train') 160 | 161 | # export manifest file for validation 162 | train_m_file = "{}/manifest/train.csv".format(output_path) 163 | train_df.to_csv(train_m_file, index=False) 164 | test_m_file = "{}/manifest/test.csv".format(output_path) 165 | test_df.to_csv(test_m_file, index=False) 166 | val_m_file = "{}/manifest/validation.csv".format(output_path) 167 | val_df.to_csv(val_m_file, index=False) 168 | 169 | print("Finished running processing job") 170 | 
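# ---------------------------------------------------------------------------
# Optional, illustrative sanity check of split_to_train_val_test() above.
# It is not part of the processing job: it only runs if you import this module
# (e.g. in a notebook) with PREPROCESS_SPLIT_DEMO=1 set, and it assumes the
# pandas version used by the processing container (DataFrame.append was
# removed in pandas 2.x). The file names and class ids below are made up.
if os.environ.get("PREPROCESS_SPLIT_DEMO") == "1":
    toy_df = pd.DataFrame({
        "image_file_name": [f"017.Cardinal/img_{i}.jpg" for i in range(10)]
                         + [f"087.Mallard/img_{i}.jpg" for i in range(10)],
        "class_id": ["1"] * 10 + ["2"] * 10,
    })
    toy_train, toy_val, toy_test = split_to_train_val_test(
        toy_df, "class_id", splits=(0.6, 0.2, 0.2), verbose=True
    )
    # With 10 images per class this yields roughly 12 / 4 / 4 rows, shuffled.
    print("toy split sizes:", len(toy_train), len(toy_val), len(toy_test))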
-------------------------------------------------------------------------------- /06_training_pipeline/sagemaker-project.yaml: -------------------------------------------------------------------------------- 1 | Description: Upload an object to an S3 bucket, triggering a Lambda event, returning the object key as a Stack Output. 2 | Parameters: 3 | SageMakerProjectName: 4 | Type: String 5 | Description: The name of the SageMaker project. This name is alos used to create the S3 Bucket (most not already exist) 6 | SageMakerProjectId: 7 | Type: String 8 | Description: Service generated Id of the project. 9 | MaxLength: 16 10 | MinLength: 1 11 | LambdaBucket: 12 | Description: S3 Bucket of your lambda file 13 | Type: String 14 | LambdaKey: 15 | Description: S3 Key of your lambnda zip file 16 | Type: String 17 | PipelineDefinitionBucket: 18 | Description: S3 Bucket of your pipeline definition file 19 | Type: String 20 | PipelineDefinitionKey: 21 | Description: S3 Key of your pipeline definition file 22 | Type: String 23 | PipelineExecutionRoleArn: 24 | Description: Execution Role Arn for your SageMaker Pipeline (if use studio, use the same execution role as studio) 25 | Type: String 26 | Resources: 27 | Bucket: 28 | Type: AWS::S3::Bucket 29 | DependsOn: BucketPermission 30 | Properties: 31 | BucketName: !Ref SageMakerProjectName 32 | NotificationConfiguration: 33 | LambdaConfigurations: 34 | - Event: 's3:ObjectCreated:*' 35 | Function: !GetAtt BucketWatcher.Arn 36 | Filter: 37 | S3Key: 38 | Rules: 39 | - Name: prefix 40 | Value: pipeline/ 41 | - Name: suffix 42 | Value: .json 43 | BucketPermission: 44 | Type: AWS::Lambda::Permission 45 | Properties: 46 | Action: 'lambda:InvokeFunction' 47 | FunctionName: !Ref BucketWatcher 48 | Principal: s3.amazonaws.com 49 | SourceAccount: !Ref "AWS::AccountId" 50 | SourceArn: !Sub "arn:aws:s3:::${SageMakerProjectName}" 51 | BucketWatcher: 52 | Type: AWS::Lambda::Function 53 | Properties: 54 | Description: Sends a Wait Condition signal to Handle when invoked 55 | Handler: index.lambda_handler 56 | Role: !GetAtt LambdaExecutionRole.Arn 57 | Code: 58 | S3Bucket: !Ref LambdaBucket 59 | S3Key: !Ref LambdaKey 60 | Timeout: 5 61 | Runtime: python3.7 62 | Environment: 63 | Variables: 64 | PipelineName: !Sub "${SageMakerProjectName}-pipeline" 65 | RoleArn: !Ref PipelineExecutionRoleArn 66 | LambdaExecutionRole: 67 | Type: AWS::IAM::Role 68 | Properties: 69 | AssumeRolePolicyDocument: 70 | Version: '2012-10-17' 71 | Statement: 72 | - Effect: Allow 73 | Principal: {Service: [lambda.amazonaws.com, sagemaker.amazonaws.com]} 74 | Action: ['sts:AssumeRole'] 75 | Path: / 76 | ManagedPolicyArns: 77 | - "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" 78 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 79 | Policies: 80 | - PolicyName: S3Policy 81 | PolicyDocument: 82 | Version: '2012-10-17' 83 | Statement: 84 | - Effect: Allow 85 | Action: 86 | - 's3:*' 87 | Resource: 88 | - "arn:aws:s3:::*" 89 | MyPipeline: 90 | Type: AWS::SageMaker::Pipeline 91 | Properties: 92 | PipelineName: !Sub "${SageMakerProjectName}-pipeline" 93 | PipelineDisplayName: !Sub "${SageMakerProjectName}-pipeline" 94 | PipelineDescription: !Sub "${SageMakerProjectName}-pipeline" 95 | PipelineDefinition: 96 | PipelineDefinitionS3Location: 97 | Bucket: !Ref PipelineDefinitionBucket 98 | Key: !Ref PipelineDefinitionKey 99 | RoleArn: !Ref PipelineExecutionRoleArn 100 | Tags: 101 | - Key: sagemaker:project-id 102 | Value: 103 | Ref: SageMakerProjectId 104 | - Key: sagemaker:project-name 
105 | Value: 106 | Ref: SageMakerProjectName 107 | MyModelPackageGroup: 108 | Type: AWS::SageMaker::ModelPackageGroup 109 | Properties: 110 | ModelPackageGroupDescription: !Sub "${SageMakerProjectName}-model-group" 111 | ModelPackageGroupName: !Sub "${SageMakerProjectName}-model-group" 112 | Tags: 113 | - Key: sagemaker:project-id 114 | Value: 115 | Ref: SageMakerProjectId 116 | - Key: sagemaker:project-name 117 | Value: 118 | Ref: SageMakerProjectName 119 | -------------------------------------------------------------------------------- /06_training_pipeline/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/birds.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/compare-model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/compare-model.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/confussion_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/confussion_matrix.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/cost-explore.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/cost-explore.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/cv-training-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/cv-training-pipeline.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/execute-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/execute-pipeline.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/parameters-input.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/parameters-input.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/project-parameter.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/project-parameter.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/project-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/project-template.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/studio-ui-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/studio-ui-pipeline.png -------------------------------------------------------------------------------- /07_deployment/README.md: -------------------------------------------------------------------------------- 1 | # Model Deployment with SageMaker 2 | 3 | --- 4 | 5 | # Introduction 6 | 7 | Once your models have been built, trained, and evaluated such that you are satisfied with their performance, you would likely want to deploy them to get predictions. This lab focuses on the following deployment types: 8 | * [Cloud deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html) using Amazon SageMaker hosting services 9 | * [Real-time inference deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) to cloud endpoints with compute you have control over. 10 | * [Serverless inference deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) to cloud endpoints with compute provisioned and managed by SageMaker. 11 | * [Edge deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/edge.html) to devices using [AWS IoT Greengrass for ML Inference](https://aws.amazon.com/greengrass/ml/). 12 | 13 | --- 14 | # Prerequisites 15 | 16 | Download the notebook corresponding to the deployment type you want to try into your environment. Run it by executing each cell in order. 17 | 18 | To understand what's happening, you'll need: 19 | 20 | - Access to the SageMaker default S3 bucket. 21 | - Familiarity with Python and numpy 22 | - Basic familiarity with AWS S3. 23 | - Basic understanding of Amazon SageMaker. 24 | - SageMaker Studio is preferred for the full UI integration 25 | 26 | --- 27 | 28 | ### Dataset & Inputs 29 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). The main components needed for running the deployment notebooks are: 30 | 31 | - S3 path for test image data 32 | - S3 path for test data annotation file 33 | - S3 path for the bird classification model 34 | 35 | If you don't have model artifacts from previous modules, there is an `optional-prepare-data-and-model` notebook you can run to generate the necessary components. In the first cell, be sure to change the S3 prefix to the appropriate value for the deployment type you want to perform. 36 | The outputs from a single run of this notebook, or artifacts kept from previous modules, can be used for any of the deployments by simply updating the S3 locations in the first cell of the deployment notebook you run.
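To give a feel for the two cloud options before you open the notebook, here is a minimal sketch using the SageMaker Python SDK. It assumes you already have a TensorFlow SavedModel archive (`model.tar.gz`) in S3 and a SageMaker execution role; the S3 path, framework version, instance type, and serverless settings are placeholders, not the exact values used in the notebook.

```python
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.serverless import ServerlessInferenceConfig

model = TensorFlowModel(
    model_data="s3://<bucket>/<prefix>/model.tar.gz",  # placeholder model artifact
    role=role,                                         # assumed: your SageMaker execution role
    framework_version="2.11",                          # illustrative version
)

# Option 1 - real-time endpoint: you choose (and pay for) the instance.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# Option 2 - serverless endpoint: SageMaker provisions and scales the compute.
# predictor = model.deploy(
#     serverless_inference_config=ServerlessInferenceConfig(
#         memory_size_in_mb=4096, max_concurrency=5
#     )
# )
```

Pick one of the two `deploy` calls; either way the returned predictor can be invoked with the test images listed above, and the endpoint should be deleted when you are done to avoid ongoing charges.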
37 | 38 | 39 | 40 | --- 41 | # Review Outputs 42 | 43 | At the end of the deployment notbook, you will have used sample images from the test dataset to get predictions from the endpoints. -------------------------------------------------------------------------------- /07_deployment/cloud_deployment/cv_utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from numpy import argmax 3 | from IPython.display import Image, display 4 | import os 5 | import boto3 6 | import random 7 | 8 | s3 = boto3.client('s3') 9 | 10 | def get_n_random_images(bucket_name, prefix,n): 11 | keys = [] 12 | resp = s3.list_objects_v2(Bucket=f'{bucket_name}', MaxKeys=500,Prefix=prefix) 13 | for obj in resp['Contents']: 14 | keys.append(obj['Key']) 15 | random.shuffle(keys) 16 | del keys[n:] 17 | return keys 18 | 19 | def download_images_locally(bucket_name,image_keys): 20 | if not os.path.isdir('./inference-test-data'): 21 | os.mkdir('./inference-test-data') 22 | local_paths=[] 23 | for obj in image_keys: 24 | fname = obj.split('/')[-1] 25 | local_dl_path = './inference-test-data/'+fname 26 | 27 | s3.download_file(bucket_name,obj,local_dl_path) 28 | local_paths.append(local_dl_path) 29 | return local_paths 30 | 31 | def get_classes_as_list(cf, class_filter): 32 | classes_df = pd.read_csv(cf, sep=' ', header=None) 33 | criteria = classes_df.iloc[:,0].isin(class_filter) 34 | classes_df = classes_df[criteria] 35 | 36 | class_name_list = sorted(classes_df.iloc[:,1].unique().tolist()) 37 | return class_name_list 38 | 39 | 40 | def predict_bird_from_file(fn, predictor,possible_classes,verbose=True, height=224,width=224): 41 | with open(fn, 'rb') as img: 42 | f = img.read() 43 | x = bytearray(f) 44 | 45 | #class_selection = '13, 17, 35, 36, 47, 68, 73, 87' 46 | 47 | results = predictor.predict(x)['predictions'] 48 | predicted_class_idx = argmax(results) 49 | predicted_class = possible_classes[predicted_class_idx] 50 | confidence = results[0][predicted_class_idx] 51 | if verbose: 52 | display(Image(fn, height=height, width=width)) 53 | print('Class: {}, confidence: {:.2f}'.format(predicted_class, confidence)) 54 | del img, x 55 | return predicted_class_idx, confidence 56 | -------------------------------------------------------------------------------- /07_deployment/cloud_deployment/statics/active-sagemaker-endpoints.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/07_deployment/cloud_deployment/statics/active-sagemaker-endpoints.png -------------------------------------------------------------------------------- /07_deployment/edge_deployment/README.md: -------------------------------------------------------------------------------- 1 | ## Edge Deployment using SageMaker Edge Manager and IoT Greengrass v2 2 | 3 | --- 4 | 5 | # Introduction 6 | 7 | In this lab illustrate how you can optimize and deploy machine learning models for edge devices. 8 | 9 | 1. Use an ARM64 EC2 instance to mimic an typical ARM based edge device. 10 | 2. Config the edge device by installing Greengrass core to manage the edge deployment 11 | 3. Prepare our model for edge using SageMaker Neo and Edge Manager (either use the model from the previous modules or run the `optional-prepare-data-and-model.ipynb` notebook to create a new model) 12 | 4. 
Finally, we will deploy a simple application that runs predictions on a list of bird images and sends the results to the cloud 13 | 14 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio** 15 | 16 | --- 17 | # Prerequisites 18 | 19 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 20 | 21 | - The following policies attached to your **SageMaker Studio execution role** to properly execute this lab. Warning: the permissions set for this lab are very loose for simplicity purposes. Please follow the principle of least privilege when you work on your own projects. These permissions allow the user to interact with other AWS services like EC2, Systems Manager (SSM), IoT Core, and GreengrassV2. 22 | 23 | - AmazonEC2FullAccess 24 | - AmazonEC2RoleforSSM 25 | - AmazonSSMManagedInstanceCore 26 | - AmazonSSMFullAccess 27 | - AWSGreengrassFullAccess 28 | - AWSIoTFullAccess 29 | 30 | - Familiarity with Python 31 | - Basic understanding of SSM 32 | - Basic familiarity with AWS S3 33 | - Basic familiarity with IoT Core 34 | - Basic familiarity with GreengrassV2 35 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 36 | - SageMaker Studio is preferred for the full UI integration 37 | 38 | --- 39 | ## Prepare the Model for the Edge 40 | 41 | You can use the bird model you created in the previous modules or run the `optional-prepare-data-and-model.ipynb` notebook to create a new model. Update the path to your model below if necessary. 42 | 43 | --- 44 | ## View Your Deployment in the Greengrass Console 45 | 46 | 1. Go to the [Greengrass console](https://console.aws.amazon.com/greengrass/) 47 | 48 | 2. View the deployment you just made in the deployment dashboard 49 | 50 | - This shows you the deployment details: components, target devices, and deployment status 51 | - You can also see the device status, which indicates whether the device is running in a healthy state. 52 | 53 | 54 | 55 | 3.
Click inot Core device to get more detail on the device status 56 | - You can get status details on each components 57 | 58 | -------------------------------------------------------------------------------- /07_deployment/edge_deployment/build/image_classification/inference.py: -------------------------------------------------------------------------------- 1 | import grpc 2 | from PIL import Image 3 | import agent_pb2 as agent 4 | import agent_pb2_grpc as agent_grpc 5 | import os 6 | import time 7 | import numpy as np 8 | from numpy import argmax 9 | import uuid 10 | 11 | # This is a test script to check if my stuff are uploaded properly 12 | model_path = '/greengrass/v2/work/Bird-Model-ARM-TF2' 13 | model_name = 'bird-model' 14 | 15 | image_path = os.path.expandvars(os.environ.get("DEFAULT_SMEM_IC_IMAGE_DIR")) 16 | 17 | agent_socket = 'unix:///tmp/aws.greengrass.SageMakerEdgeManager.sock' 18 | 19 | channel = grpc.insecure_channel(agent_socket) 20 | client = agent_grpc.AgentStub(channel) 21 | 22 | print(f"Images directory is {image_path}================") 23 | 24 | print(f"our current directory is {os.getcwd()}==========") 25 | 26 | 27 | def list_models(cli): 28 | resp = cli.ListModels(agent.ListModelsRequest()) 29 | return { 30 | m.name: {"in": m.input_tensor_metadatas, "out": m.output_tensor_metadatas} 31 | for m in resp.models 32 | } 33 | 34 | 35 | def unload_model(cli, model_name): 36 | try: 37 | req = agent.UnLoadModelRequest() 38 | req.name = model_name 39 | return cli.UnLoadModel(req) 40 | except Exception as e: 41 | print(e) 42 | return None 43 | 44 | 45 | def load_model(cli, model_name, model_path): 46 | """ Load a new model into the Edge Agent if not loaded yet""" 47 | try: 48 | req = agent.LoadModelRequest() 49 | req.url = model_path 50 | req.name = model_name 51 | return cli.LoadModel(req) 52 | except Exception as e: 53 | print(e) 54 | return None 55 | 56 | 57 | def create_tensor(x, tensor_name): 58 | if x.dtype != np.float32: 59 | raise Exception("It only supports numpy float32 arrays for this tensor") 60 | tensor = agent.Tensor() 61 | tensor.tensor_metadata.name = tensor_name 62 | tensor.tensor_metadata.data_type = agent.FLOAT32 63 | for s in x.shape: 64 | tensor.tensor_metadata.shape.append(s) 65 | tensor.byte_data = x.tobytes() 66 | return tensor 67 | 68 | 69 | def predict(cli, model_name, x, shm=False): 70 | """ 71 | Invokes the model and get the predictions 72 | """ 73 | model_map = list_models(cli) 74 | if model_map.get(model_name) is None: 75 | raise Exception("Model %s not loaded" % model_name) 76 | # Create a request 77 | req = agent.PredictRequest() 78 | req.name = model_name 79 | # Then load the data into a temp Tensor 80 | tensor = agent.Tensor() 81 | meta = model_map[model_name]["in"][0] 82 | tensor.tensor_metadata.name = meta.name 83 | tensor.tensor_metadata.data_type = meta.data_type 84 | for s in meta.shape: 85 | tensor.tensor_metadata.shape.append(s) 86 | 87 | if shm: 88 | tensor.shared_memory_handle.offset = 0 89 | tensor.shared_memory_handle.segment_id = x 90 | else: 91 | tensor.byte_data = x.astype(np.float32).tobytes() 92 | 93 | req.tensors.append(tensor) 94 | 95 | # Invoke the model 96 | resp = cli.Predict(req) 97 | 98 | # Parse the output 99 | meta = model_map[model_name]["out"][0] 100 | tensor = resp.tensors[0] 101 | data = np.frombuffer(tensor.byte_data, dtype=np.float32) 102 | return data.reshape(tensor.tensor_metadata.shape) 103 | 104 | 105 | def main(): 106 | if list_models(client).get(model_name) is None: 107 | load_model(client, model_name, 
model_path) 108 | 109 | print("Loaded Models:") 110 | print(list_models(client)) 111 | 112 | start = time.time() 113 | 114 | count = 0 115 | 116 | if os.path.isdir(image_path): 117 | for images in os.listdir(image_path): 118 | if ".jpg" in images: 119 | im = Image.open(f"{image_path}/{images}") 120 | input_image = np.array(im).astype(np.float32) / 255.0 121 | print(f"Shape of the Image {input_image.shape}============") 122 | 123 | y = predict(client, model_name, input_image, False) 124 | 125 | print(f"output tensor is {y}===================") 126 | predicted_class_idx = argmax(y) 127 | 128 | total_time = time.time() - start 129 | print(f"Output Prediction: {predicted_class_idx}") 130 | 131 | req = agent.CaptureDataRequest() 132 | req.model_name = model_name 133 | req.capture_id = str(uuid.uuid4()) 134 | req.input_tensors.append(create_tensor(input_image, "input_image")) 135 | req.output_tensors.append(create_tensor(y, "output_image")) 136 | resp = client.CaptureData(req) 137 | 138 | count +=1 139 | else: 140 | print("Images directory not found ===============") 141 | 142 | print(f"Number of images predicted: {count}============") 143 | 144 | print("End of the script ==============") 145 | 146 | 147 | if __name__ == '__main__': 148 | main() -------------------------------------------------------------------------------- /07_deployment/edge_deployment/build/installer.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | #!/bin/bash 5 | install_lib() { 6 | py_lib="${1}" 7 | module="${2}" 8 | command=$( 9 | cat <The notebook use public workforce (Mechanical Turk) for the groundTruth labeling job. That will incur some cost at **$0.012 per image**. If want to use a prviate workforce, then you must have a private workforce already created and update the code with your private workforce ARN. Please follow this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html) on how to create a private workforce 27 | - Access to Elastic Container Registry (ECR) 28 | - Familiarity with Training on Amazon SageMaker 29 | - Familiarity with SageMaker Processing Job 30 | - Familiarity with SageMaker Pipeline 31 | - Familiarity with Python 32 | - Familiarity with AWS S3 33 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 34 | - SageMaker Studio is preferred for the full UI integration 35 | 36 | --- 37 | 38 | ## Dataset 39 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset contains 11,788 images across 200 bird species (the original technical report can be found here). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 40 | 41 | ![Bird Dataset](statics/birds.png) 42 | 43 | The dataset can be downloaded [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB, and can take a while to download. To make the download and training time reasonable, we will use a subset of the bird data. CUB_MINI.tar is attached in this workshop repo. 
It contains images and metadata for 8 of the 200 species, and can be used effectively for testing and learning how to use SageMaker and TensorFlow together on computer vision use cases. 44 | 45 | * 13 (Bobolink) 46 | * 17 (Cardinal) 47 | * 35 (Purple_Finch) 48 | * 36 (Noerhern_Flicker) 49 | * 47 (American_Golsdi..) 50 | * 68 (Ruby_throated_...) 51 | * 73 (Blue_Jay) 52 | * 87 (Mallard) 53 | 54 | --- 55 | To help you understand and test each step of the training pipeline: 1) preprocess, 2) training/tuning, and 3) model evaluation. 3 seperate notebooks are provided which you can step through each. 56 | 57 | - test-1-preprocess.ipynb 58 | - test-2-training.ipynb 59 | - test-3-evaluate.ipynb 60 | 61 | --- 62 | ## Edge Deployment -------------------------------------------------------------------------------- /08_end-to-end/1_training/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/pipeline.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/pipeline.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/pipeline_tuning.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/pipeline_tuning.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/code/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/code/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import argparse 4 | import pathlib 5 | import json 6 | import os 7 | import numpy as np 8 | import tarfile 9 | import glob 10 | 11 | 
from sklearn.metrics import ( 12 | accuracy_score, 13 | precision_score, 14 | recall_score, 15 | confusion_matrix, 16 | f1_score 17 | ) 18 | 19 | import tensorflow as tf 20 | from tensorflow import keras 21 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 22 | from tensorflow.keras.preprocessing import image 23 | from tensorflow.keras.optimizers import SGD 24 | # from smexperiments import tracker 25 | 26 | input_path = "/opt/ml/processing/input/test" #"output/test" # 27 | classes_path = "/opt/ml/processing/input/classes"#"output/manifest/test.csv" 28 | model_path = "/opt/ml/processing/model" #"model" # 29 | output_path = '/opt/ml/processing/output' #"output" # 30 | 31 | HEIGHT = 224 32 | WIDTH = 224 33 | DEPTH = 3 34 | NUM_CLASSES = 3 35 | 36 | def load_classes(file_name): 37 | 38 | classes_file = os.path.join(classes_path, file_name) 39 | 40 | with open(classes_file) as f: 41 | classes = json.load(f) 42 | 43 | return classes 44 | 45 | def get_filenames(input_data_dir): 46 | return glob.glob('{}/*.tfrecords'.format(input_data_dir)) 47 | 48 | def _parse_image_function(example): 49 | image_feature_description = { 50 | 'image': tf.io.FixedLenFeature([], tf.string), 51 | 'label': tf.io.FixedLenFeature([], tf.int64), 52 | } 53 | 54 | features = tf.io.parse_single_example(example, image_feature_description) 55 | 56 | image = tf.image.decode_jpeg(features['image'], channels=DEPTH) 57 | 58 | image = tf.image.resize(image, [WIDTH, HEIGHT]) 59 | 60 | label = tf.cast(features['label'], tf.int32) 61 | 62 | return image, label 63 | 64 | def predict_bird(model, img_array): 65 | 66 | x = img_array.reshape((1,) + img_array.shape) 67 | instance = preprocess_input(x) 68 | 69 | del x 70 | 71 | result = model.predict(instance) 72 | 73 | predicted_class_idx = np.argmax(result) 74 | confidence = result[0][predicted_class_idx] 75 | 76 | return predicted_class_idx, confidence 77 | 78 | if __name__ == "__main__": 79 | parser = argparse.ArgumentParser() 80 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 81 | parser.add_argument("--classes-file", type=str, default="classes.json") 82 | args, _ = parser.parse_known_args() 83 | 84 | print("Extracting the model") 85 | 86 | model_file = os.path.join(model_path, args.model_file) 87 | file = tarfile.open(model_file) 88 | file.extractall(model_path) 89 | 90 | file.close() 91 | 92 | print("Load model") 93 | 94 | model = keras.models.load_model("{}/1".format(model_path), compile=False) 95 | model.compile(optimizer=SGD(lr=0.00001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy']) 96 | 97 | print("Starting evaluation.") 98 | 99 | class_name = load_classes(args.classes_file) 100 | 101 | filenames = get_filenames(input_path)#args[channel]) 102 | dataset = tf.data.TFRecordDataset(filenames) 103 | 104 | test_dataset = dataset.map(_parse_image_function) 105 | 106 | preds = [] 107 | acts = [] 108 | num_errors = 0 109 | 110 | for image_features in test_dataset: 111 | act = image_features[1].numpy() 112 | acts.append(act) 113 | 114 | pred, cnf = predict_bird(model, image_features[0].numpy()) 115 | 116 | preds.append(pred) 117 | 118 | print(f'prediction is {pred}, and the actual is {act}============') 119 | if (pred != act): 120 | num_errors += 1 121 | print('ERROR - Pred: {} {:.2f}, Actual: {}'.format(pred, cnf, act)) 122 | 123 | precision = precision_score(acts, preds, average='micro') 124 | recall = recall_score(acts, preds, average='micro') 125 | accuracy = accuracy_score(acts, preds) 126 | cnf_matrix = 
confusion_matrix(acts, preds, labels=range(len(class_name))) 127 | f1 = f1_score(acts, preds, average='micro') 128 | 129 | print("Accuracy: {}".format(accuracy)) 130 | print("Precision: {}".format(precision)) 131 | print("Recall: {}".format(recall)) 132 | print("Confusion matrix: {}".format(cnf_matrix)) 133 | print("F1 score: {}".format(f1)) 134 | 135 | matrix_output = dict() 136 | 137 | for i in range(len(cnf_matrix)): 138 | matrix_row = dict() 139 | for j in range(len(cnf_matrix[0])): 140 | matrix_row[class_name[str(j)]] = int(cnf_matrix[i][j]) 141 | matrix_output[class_name[str(i)]] = matrix_row 142 | 143 | 144 | report_dict = { 145 | "multiclass_classification_metrics": { 146 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 147 | "precision": {"value": precision, "standard_deviation": "NaN"}, 148 | "recall": {"value": recall, "standard_deviation": "NaN"}, 149 | "f1": {"value": f1, "standard_deviation": "NaN"}, 150 | "confusion_matrix":matrix_output 151 | }, 152 | } 153 | 154 | pathlib.Path(output_path).mkdir(parents=True, exist_ok=True) 155 | 156 | evaluation_path = f"{output_path}/evaluation.json" 157 | 158 | print(f"Saving the result json file to {evaluation_path} ....") 159 | 160 | with open(evaluation_path, "w") as f: 161 | f.write(json.dumps(report_dict)) 162 | 163 | print("End of the evaluation process....") 164 | -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/preprocess.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import tensorflow as tf 4 | import argparse 5 | import boto3 6 | import random 7 | import os 8 | import pathlib 9 | import glob 10 | import json 11 | 12 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 13 | output_path = '/opt/ml/processing/output' #"output" # 14 | 15 | def serialize_example(image, label): 16 | 17 | feature = { 18 | 'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])), 19 | 'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])) 20 | } 21 | 22 | example_proto = tf.train.Example(features=tf.train.Features(feature=feature)) 23 | return example_proto.SerializeToString() 24 | 25 | def parse_manifest(manifest_file, labels, classes): 26 | 27 | f = open(manifest_file, 'r') 28 | 29 | lines = f.readlines() 30 | 31 | for l in lines: 32 | img_data = dict() 33 | img_raw = json.loads(l) 34 | 35 | img_data['image-path'] = img_raw['source-ref'].split('/')[-1] 36 | img_data['label'] = img_raw['label'] 37 | class_name = img_raw['label-metadata']['class-name'] 38 | 39 | 40 | if img_data['label'] not in labels: 41 | labels[img_data['label']] = [] 42 | 43 | if img_data['label'] not in classes: 44 | classes[img_data['label']] = class_name 45 | 46 | labels[img_data['label']].append(img_data) 47 | 48 | return labels, classes 49 | 50 | def split_dataset(labels): 51 | channels ={ 52 | "train":[], 53 | "valid":[], 54 | "test":[] 55 | } 56 | 57 | for l in labels: 58 | images = labels[l] 59 | random.shuffle(images) 60 | 61 | splits = [0.7, 0.9] 62 | 63 | channels["train"] += images[0: int(splits[0] * len(images))] 64 | channels["valid"] += images[int(splits[0] * len(images)):int(splits[1] * len(images))] 65 | channels["test"] += images[int(splits[1] * len(images)):] 66 | 67 | return channels 68 | 69 | 70 | # def get_filenames(input_data_dir): 71 | # return glob.glob('{}/*.tfrecords'.format(input_data_dir)) 72 | 73 | def building_tfrecord(channels, images_dir): 74 | for c in channels: 
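# For each channel (train/valid/test), the loop below creates an output folder and writes every image/label pair into a single TFRecord file for that channel.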
75 | 76 | pathlib.Path(f'{output_path}/{c}').mkdir(parents=True, exist_ok=True) 77 | 78 | 79 | tfrecord_file = f'{output_path}/{c}/bird-{c}.tfrecords' 80 | 81 | count = 0 82 | 83 | with tf.io.TFRecordWriter(tfrecord_file) as writer: 84 | for img in channels[c]: 85 | 86 | image_string = open(os.path.join(images_dir, 87 | img['image-path']), 'rb').read() 88 | 89 | tf_example = serialize_example(image_string, img['label']) 90 | writer.write(tf_example) 91 | count +=1 92 | 93 | print(f"number of images processed for {c} is {count}") 94 | 95 | if __name__ == "__main__": 96 | parser = argparse.ArgumentParser() 97 | parser.add_argument("--manifest", type=str, default="manifest") 98 | parser.add_argument("--images", type=str, default="images") 99 | args, _ = parser.parse_known_args() 100 | 101 | manifest_dir = os.path.join(input_path, args.manifest) 102 | images_dir = os.path.join(input_path, args.images) 103 | 104 | print(f"manifest file location: {manifest_dir}...") 105 | 106 | if len(os.listdir(manifest_dir)) < 1: 107 | print(f"No manifest files at this location: {manifest_dir}...") 108 | sys.exit() 109 | 110 | labels = dict() 111 | classes = dict() 112 | 113 | # looping through all the manifest files and parse the values by class 114 | for m in os.listdir(manifest_dir): 115 | manifest_file = os.path.join(manifest_dir, m) 116 | labels, classes = parse_manifest(manifest_file, labels, classes) 117 | 118 | # split the dataset by channel 119 | channels = split_dataset(labels) 120 | 121 | building_tfrecord(channels, images_dir) 122 | 123 | # save the classes json file 124 | classes_path = f"{output_path}/classes/classes.json" 125 | with open(classes_path, "w") as f: 126 | f.write(json.dumps(classes)) 127 | 128 | print("Finished running processing job") 129 | 130 | 131 | -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/birds.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_access.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_access.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_node_profiler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_node_profiler.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_profiler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_profiler.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_summary.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_summary.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/end2end.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/end2end.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/execute-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/execute-pipeline.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/manual_approval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/manual_approval.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/model_registry.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/model_registry.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/panorama_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/panorama_test.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/test_util_folder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/test_util_folder.png -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/graphs/bird_demo_app/graph.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodeGraph": { 3 | "envelopeVersion": "2021-01-01", 4 | "packages": [ 5 | { 6 | "name": "987720697751::BIRD_DEMO_CODE", 7 | "version": "1.0" 8 | }, 9 | { 10 | "name": "987720697751::BIRD_DEMO_TF_MODEL", 11 | "version": "1.0" 12 | }, 13 | { 14 | "name": "987720697751::RTSP_STREAM", 15 | "version": "1.0" 16 | }, 17 | { 18 | "name": "panorama::hdmi_data_sink", 19 | "version": "1.0" 20 | } 21 | ], 22 | "nodes": [ 23 | { 24 | "name": "code_node", 25 | "interface": "987720697751::BIRD_DEMO_CODE.interface", 26 | "overridable": false, 27 | "launch": "onAppStart" 28 | }, 29 | { 30 | "name": "model_node", 31 | "interface": "987720697751::BIRD_DEMO_TF_MODEL.interface", 32 | 
"overridable": false, 33 | "launch": "onAppStart" 34 | }, 35 | { 36 | "name": "camera_node", 37 | "interface": "987720697751::RTSP_STREAM.interface", 38 | "overridable": true, 39 | "launch": "onAppStart", 40 | "decorator": { 41 | "title": "IP camera", 42 | "description": "Choose a camera stream." 43 | } 44 | }, 45 | { 46 | "name": "output_node", 47 | "interface": "panorama::hdmi_data_sink.hdmi0", 48 | "overridable": true, 49 | "launch": "onAppStart" 50 | }, 51 | { 52 | "name": "detection_threshold", 53 | "interface": "float32", 54 | "value": 60.0, 55 | "overridable": true, 56 | "decorator": { 57 | "title": "Threshold", 58 | "description": "The minimum confidence percentage for a positive classification." 59 | } 60 | } 61 | ], 62 | "edges": [ 63 | { 64 | "producer": "camera_node.video_out", 65 | "consumer": "code_node.video_in" 66 | }, 67 | { 68 | "producer": "code_node.video_out", 69 | "consumer": "output_node.video_in" 70 | }, 71 | { 72 | "producer": "detection_threshold", 73 | "consumer": "code_node.threshold" 74 | } 75 | ] 76 | } 77 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/2_deployment/bird_demo_app/packages/.keep -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/panorama/panorama-application 2 | COPY src /panorama 3 | 4 | ARG DEBIAN_FRONTEND=noninteractive 5 | RUN python3 -m pip install -U scipy 6 | RUN pip3 install opencv-python boto3 -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/descriptor.json: -------------------------------------------------------------------------------- 1 | {"runtimeDescriptor": {"envelopeVersion": "2021-01-01", "entry": {"path": "python3", "name": "/panorama/app.py"}}} -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage": { 3 | "envelopeVersion": "2021-01-01", 4 | "name": "BIRD_DEMO_CODE", 5 | "version": "1.0", 6 | "description": "Default description for package ssd_tf_latest", 7 | "assets": [ 8 | { 9 | "name": "code_asset", 10 | "implementations": [ 11 | { 12 | "type": "container", 13 | "assetUri": "4f1fdec28a45df7b4cee58e9ec1e4c9440fd3d85cf36c15a3183920b9692cd32.tar.gz", 14 | "descriptorUri": "577ccf8b6cff24dc1f90cc2af22644ee9d0c81b5a22e771287bf3630e1b7b8fb.json" 15 | } 16 | ] 17 | } 18 | ], 19 | "interfaces": [ 20 | { 21 | "name": "interface", 22 | "category": "business_logic", 23 | "asset": "code_asset", 24 | "inputs": [ 25 | { 26 | "name": "video_in", 27 | "type": "media" 28 | }, 29 | { 30 | "name": "threshold", 31 | "type": "float32" 32 | } 33 | ], 34 | "outputs": [ 35 | { 36 | "description": "Video stream output", 37 | "name": "video_out", 38 | "type": "media" 39 | } 40 | ] 41 | } 42 | ] 43 | } 44 | } 
-------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/.keep -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/app.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import logging 4 | import time 5 | from logging.handlers import RotatingFileHandler 6 | 7 | import boto3 8 | from botocore.exceptions import ClientError 9 | import cv2 10 | import numpy as np 11 | import panoramasdk 12 | import datetime 13 | 14 | class Application(panoramasdk.node): 15 | def __init__(self): 16 | """Initializes the application's attributes with parameters from the interface, and default values.""" 17 | self.MODEL_NODE = "model_node" 18 | self.MODEL_DIM = 224 19 | self.frame_num = 0 20 | self.tracked_objects = [] 21 | self.tracked_objects_start_time = dict() 22 | self.tracked_objects_duration = dict() 23 | 24 | self.classes = { 25 | 0: "Bobolink", 26 | 1: "Cardinal", 27 | 2: "Purple_Finch", 28 | 3: "Northern_Flicker", 29 | 4:"American_Goldfinch", 30 | 5:"Ruby_throated_Hummingbird", 31 | 6:"Blue_Jay", 32 | 7:"Mallard" 33 | } 34 | 35 | def process_streams(self): 36 | """Processes one frame of video from one or more video streams.""" 37 | self.frame_num += 1 38 | logger.debug(self.frame_num) 39 | 40 | # Loop through attached video streams 41 | streams = self.inputs.video_in.get() 42 | for stream in streams: 43 | self.process_media(stream) 44 | 45 | self.outputs.video_out.put(streams) 46 | 47 | def process_media(self, stream): 48 | """Runs inference on a frame of video.""" 49 | image_data = preprocess(stream.image, self.MODEL_DIM) 50 | logger.debug(image_data.shape) 51 | 52 | # Run inference 53 | inference_results = self.call({"input_1":image_data}, self.MODEL_NODE) 54 | 55 | # Process results (object deteciton) 56 | self.process_results(inference_results, stream) 57 | 58 | def process_results(self, inference_results, stream): 59 | """Processes output tensors from a computer vision model and annotates a video frame.""" 60 | if inference_results is None: 61 | logger.warning("Inference results are None.") 62 | return 63 | 64 | logger.debug('Inference results: {}'.format(inference_results)) 65 | count = 0 66 | for det in inference_results: 67 | if count == 0: 68 | first_output = det 69 | count += 1 70 | 71 | # first_output = inference_results[0] 72 | logger.debug('Output one type: {}'.format(type(first_output))) 73 | probabilities = first_output[0] 74 | # 1000 values for 1000 classes 75 | logger.debug('Result one shape: {}'.format(probabilities.shape)) 76 | top_result = probabilities.argmax() 77 | 78 | self.detected_class = self.classes[top_result] 79 | self.detected_frame = self.frame_num 80 | # persist for up to 5 seconds 81 | # if self.frame_num - self.detected_frame < 75: 82 | label = '{} ({}%)'.format(self.detected_class, int(probabilities[top_result]*100)) 83 | stream.add_label(label, 0.1, 0.1) 84 | 85 | def preprocess(img, size): 86 | """Resizes and normalizes a frame of video.""" 87 | resized = cv2.resize(img, (size, size)) 88 | x1 
= np.asarray(resized) 89 | x1 = np.expand_dims(x1, 0) 90 | return x1 91 | 92 | def get_logger(name=__name__,level=logging.INFO): 93 | logger = logging.getLogger(name) 94 | logger.setLevel(level) 95 | handler = RotatingFileHandler("/opt/aws/panorama/logs/app.log", maxBytes=100000000, backupCount=2) 96 | formatter = logging.Formatter(fmt='%(asctime)s %(levelname)-8s %(message)s', 97 | datefmt='%Y-%m-%d %H:%M:%S') 98 | handler.setFormatter(formatter) 99 | logger.addHandler(handler) 100 | return logger 101 | 102 | def main(): 103 | try: 104 | logger.info("INITIALIZING APPLICATION") 105 | app = Application() 106 | logger.info("PROCESSING STREAMS") 107 | while True: 108 | app.process_streams() 109 | except Exception as e: 110 | logger.warning(e) 111 | 112 | logger = get_logger(level=logging.INFO) 113 | main() 114 | -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_TF_MODEL-1.0/descriptor.json: -------------------------------------------------------------------------------- 1 | { 2 | "mlModelDescriptor": { 3 | "envelopeVersion": "2021-01-01", 4 | "framework": "TENSORFLOW", 5 | "inputs": [ 6 | { 7 | "name": "image_tensor", 8 | "shape": [1,224,224,3] 9 | } 10 | ] 11 | } 12 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_TF_MODEL-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage": { 3 | "envelopeVersion": "2021-01-01", 4 | "name": "BIRD_DEMO_TF_MODEL", 5 | "version": "1.0", 6 | "description": "Default description for package ", 7 | "assets": [ 8 | { 9 | "name": "model_asset", 10 | "implementations": [ 11 | { 12 | "type": "model", 13 | "assetUri": "ae6c7df9e9aaff6e5001f88c82f5cc938ab84b7a01a38247083f2542f30e65cf.tar.gz", 14 | "descriptorUri": "1d13ccfae4b1b8cde03bfa404c7115538d511320db773a24fed30a2b9d78f355.json" 15 | } 16 | ] 17 | } 18 | ], 19 | "interfaces": [ 20 | { 21 | "name": "interface", 22 | "category": "ml_model", 23 | "asset": "model_asset", 24 | "inputs": [ 25 | { 26 | "description": "Video stream output", 27 | "name": "video_in", 28 | "type": "media" 29 | } 30 | ], 31 | "outputs": [ 32 | { 33 | "description": "Video stream output", 34 | "name": "video_out", 35 | "type": "media" 36 | } 37 | ] 38 | }, 39 | { 40 | "name": "model_asset_interface", 41 | "category": "ml_model", 42 | "asset": "model_asset", 43 | "inputs": [ 44 | { 45 | "name": "video_in", 46 | "type": "media" 47 | } 48 | ], 49 | "outputs": [ 50 | { 51 | "name": "video_out", 52 | "type": "media" 53 | } 54 | ] 55 | } 56 | ] 57 | } 58 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-RTSP_STREAM-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage":{ 3 | "envelopeVersion":"2021-01-01", 4 | "description":"RTSP/ONVIF camera data source node package", 5 | "assets":[ 6 | { 7 | "name":"rtsp_camera", 8 | "implementations":[ 9 | { 10 | "type":"system", 11 | "assetUri":"source/video/camera/rtsp/source_rtsp" 12 | } 13 | ] 14 | } 15 | ], 16 | "interfaces":[ 17 | { 18 | "name":"interface", 19 | "category":"media_source", 20 | "asset":"rtsp_camera", 21 | "inputs":[ 22 | { 23 | "description":"Camera username", 24 | "name":"username", 25 | "type":"string", 26 | "default":"admin" 27 | }, 28 | { 29 | 
"description":"Camera password", 30 | "name":"password", 31 | "type":"string", 32 | "default":"123456" 33 | }, 34 | { 35 | "description":"Camera streamUrl", 36 | "name":"streamUrl", 37 | "type":"string", 38 | "default":"rtsp://10.0.0.212:554/stream0" 39 | } 40 | ], 41 | "outputs":[ 42 | { 43 | "description":"Video stream output", 44 | "name":"video_out", 45 | "type":"media" 46 | } 47 | ] 48 | } 49 | ] 50 | } 51 | } -------------------------------------------------------------------------------- /08_end-to-end/README.md: -------------------------------------------------------------------------------- 1 | # End-to-end Edge Computer Vision (CV) Machine Learning (ML) Orchistration 2 | 3 | ## Introduction 4 | 5 | In this module, you will walk through an end-to-end process of developing an enterprise Computer Vision(CV) solution. The module starts with data label, then move to training orchestration where you will prepare, train, and evaluate the models. If model performance meets your expectation, you will deploy it to an AWS panorama for edge inference. The CV example is predicting different species of birds, but technique and concepts are applicable to a wide range of industrial CV use cases. 6 | 7 | --- 8 | ## Folder Structure 9 | - 1_training - End-to-end MLOps pipeline example. 10 | - 2_deployment - Panorama application code to deploy to AWS Panorama device or simulate using the test untility. 11 | 12 | ## Architecture 13 | --- 14 | ![MLOps Pattern](1_training/statics/end2end.png) 15 | 16 | The architecture we will build during this workshop is illustrated on the right. Several key components can be highlighted: 17 | 18 | 1. **Data labeling w/ SageMaker GroundTruth**: We are using the CUB_MINI.tar data which is a subset of [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset. The code below will spin up a GoundTruth labeling job for private/public workforce to label. The output is a fully labeled manifest file that can be ingested by the atutomated training pipeline. 19 | 20 | 2. **Automated training Pipeline** Once the dataset is fully labelled, we have prepared a SageMaker pipeline that will go through steps: preprocessing, model training, model evaluation, and storing the model to model registry. At the same time, the example code will provide advance topics like streaming data with pipe mode and fast file mode, monitor training with SageMaker debugger, and distributed training to increase training effeciency and reduce training time. 21 | 22 | 3. **Model Deployment & Edge Inferencing w/ AWS Panorama** With manual approval of the model in Model registry, we can kick off the edge deploymnet with a deployment lambda. That can be accomplish using the AWS Panorama API. We will also illustration the edge application development and testing using the AWS Panorama Test Utility (Emulator) 23 | 24 | --- 25 | ### Setup Test Utility 26 | 27 | Panorama Test Utility is a set of python libraries and commandline commands, which allows you to test-run Panorama applications without Panorama appliance device. With Test Utility, you can start running sample applications and developing your own Panorama applications before preparing real Panorama appliance. Sample applications in this repository also use Test Utility. 28 | 29 | For more about the Test Utility and its current capabilities, please refer to [Introducing AWS Panorama Test Utility document](https://github.com/aws-samples/aws-panorama-samples/blob/main/docs/AboutTestUtility.md). 
30 | 31 | To set up your environment for the Test Utility, please refer to **[Test Utility environment setup](https://github.com/aws-samples/aws-panorama-samples/blob/main/docs/EnvironmentSetup.md)**. 32 | 33 | ![Panorama Test Utility](1_training/statics/panorama_test.png) 34 | 35 | --- 36 | 37 | ## Prerequisites 38 | 39 | To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 40 | 41 | - Access to the SageMaker default S3 bucket, or use your own 42 | - The S3 bucket that you use for this demo must have a CORS policy attached. To learn more about this requirement, and how to attach a CORS policy to an S3 bucket, see [CORS Permission Requirement](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html) 43 | - The notebook uses a public workforce (Mechanical Turk) for the Ground Truth labeling job, which will incur a cost of **$0.012 per image**. If you want to use a private workforce, you must have a private workforce already created and update the code with your private workforce ARN. Please follow this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html) on how to create a private workforce 44 | - Access to Elastic Container Registry (ECR) 45 | - Familiarity with training on Amazon SageMaker 46 | - Familiarity with SageMaker Processing Jobs 47 | - Familiarity with SageMaker Pipelines 48 | - Familiarity with Python 49 | - Familiarity with AWS S3 50 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 51 | - SageMaker Studio is preferred for the full UI integration 52 | 53 | --- 54 | 55 | ## Dataset 56 | The dataset we are using is the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 57 | 58 | ![Bird Dataset](1_training/statics/birds.png) 59 | 60 | Run the code in the notebook to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-downloading and re-processing the data. -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## End-to-end Computer Vision (CV) Training Workshop 2 | --- 3 | 4 | This is an end-to-end CV MLOps workshop aimed at helping Machine Learning (ML) and Data Science (DS) teams build relevant AWS and SageMaker competencies for an enterprise-scale solution. The content is derived from a real-world CV use case where an image classification model is developed and trained on SageMaker and then deployed to an edge computing device. Here is a diagram overview of the workshop and the learning outcomes for each module. 5 | 6 | ![Workshop Overview](statics/cv-workshop-overview.png) 7 | 8 | 9 | The curriculum consists of the following modules: 10 | 11 | 1. [Data Labeling (Optional)](01_groundtruth(optional)/README.md) 12 | 2. [Preprocessing](02_preprocessing/README.md) 13 | 3. [Training on SageMaker](03_training/README.md) 14 | 4. [Advanced Training on SageMaker](04_advanced_training/README.md) 15 | 5. [Model Evaluation and Model Explainability](05_model_evaluation_and_model_explainability/README.md) 16 | 6. [SageMaker Training Pipeline](06_training_pipeline/README.md) 17 | 7. [Edge Deployment](07_deployment/README.md) 18 | 8. [End-to-end](08_end-to-end/README.md) 19 | 20 | To get started, load the provided Jupyter notebooks and associated files to your SageMaker Studio environment. 21 | 22 | ## Security 23 | 24 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 25 | 26 | ## License 27 | 28 | This library is licensed under the MIT-0 License. See the LICENSE file. 29 | -------------------------------------------------------------------------------- /statics/cv-workshop-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/statics/cv-workshop-overview.png -------------------------------------------------------------------------------- /statics/workshop_overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/statics/workshop_overview.png --------------------------------------------------------------------------------