├── .gitignore ├── 01_groundtruth(optional) └── README.md ├── 02_preprocessing ├── README.md ├── data_preprocessing.ipynb ├── preprocessing.py └── statics │ └── birds.png ├── 03_training ├── README.md ├── pytorch │ ├── cv_hpo_pytorch_pipe.ipynb │ └── source_dir │ │ └── cifar10.py ├── statics │ ├── CIFAR-10.png │ ├── Experiments.png │ └── HPO_experiments.png └── tensorflow │ ├── cv_hpo_keras_pipe.ipynb │ └── source_dir │ └── keras_cifar10.py ├── 04_advanced_training ├── README.md ├── pytorch │ ├── code │ │ ├── inference.py │ │ ├── model_def.py │ │ └── train_pytorch_smdataparallel_mnist.py │ ├── pytorch_smdataparallel_mnist_demo.ipynb │ └── static │ │ └── ss.png └── tensorflow │ ├── code │ ├── train_param_server_debugger.py │ ├── train_param_server_debugger2.py │ └── train_sdp_debugger_3.py │ ├── distributed_training_tf.ipynb │ └── static │ └── debugger_output.png ├── 05_model_evaluation_and_model_explainability ├── 05a_model_evaluation │ ├── code │ │ ├── inference.py │ │ ├── requirements-gpu.txt │ │ ├── requirements.txt │ │ └── train-mobilenet.py │ ├── docker │ │ ├── Dockerfile │ │ └── requirements.txt │ ├── evaluation.py │ ├── model-evaluation-processing-job.ipynb │ └── optional-prepare-data-and-model.ipynb ├── 05b_model_explainability │ ├── code │ │ ├── inference.py │ │ ├── requirements-gpu.txt │ │ ├── requirements.txt │ │ └── train-mobilenet.py │ ├── optional-prepare-data-and-model-explainability-clarify.ipynb │ └── preprocessing.py └── README.md ├── 06_training_pipeline ├── README.md ├── code │ ├── inference.py │ ├── requirements-gpu.txt │ ├── requirements.txt │ └── train-mobilenet.py ├── evaluation.py ├── lambda │ ├── index.py │ └── lambda.zip ├── pipeline.ipynb ├── pipeline.py ├── preprocess.py ├── sagemaker-project.yaml └── statics │ ├── birds.png │ ├── compare-model.png │ ├── confussion_matrix.png │ ├── cost-explore.png │ ├── cv-training-pipeline.png │ ├── execute-pipeline.png │ ├── parameters-input.png │ ├── project-parameter.png │ ├── project-template.png │ └── studio-ui-pipeline.png ├── 07_deployment ├── README.md ├── cloud_deployment │ ├── cv_utils.py │ ├── sagemaker-deploy-model-for-inference.ipynb │ └── statics │ │ └── active-sagemaker-endpoints.png ├── edge_deployment │ ├── README.md │ ├── arm64-edge-manager-greengrass-v2.ipynb │ ├── build │ │ ├── image_classification │ │ │ ├── agent_pb2.py │ │ │ ├── agent_pb2_grpc.py │ │ │ └── inference.py │ │ └── installer.sh │ └── static │ │ ├── 1_select_AMI.png │ │ ├── 2_select_instance.png │ │ ├── 4_deploy_greengrass.png │ │ ├── 5_deployment_dashboard.png │ │ ├── 6_monitoring_core_device.png │ │ └── x_deployment_error.png └── optional-prepare-data-and-model.ipynb ├── 08_end-to-end ├── 1_training │ ├── 0.bird-end2end-ml-pipeline.ipynb │ ├── README.md │ ├── __init__.py │ ├── input.manifest │ ├── pipeline │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-37.pyc │ │ │ ├── pipeline.cpython-37.pyc │ │ │ └── pipeline_tuning.cpython-37.pyc │ │ ├── code │ │ │ ├── __init__.py │ │ │ ├── train.py │ │ │ └── train_debugger.py │ │ ├── evaluation.py │ │ ├── pipeline.py │ │ ├── pipeline_tuning.py │ │ └── preprocess.py │ └── statics │ │ ├── birds.png │ │ ├── debugger_access.png │ │ ├── debugger_node_profiler.png │ │ ├── debugger_profiler.png │ │ ├── debugger_summary.png │ │ ├── end2end.png │ │ ├── execute-pipeline.png │ │ ├── manual_approval.png │ │ ├── model_registry.png │ │ ├── panorama_test.png │ │ └── test_util_folder.png ├── 2_deployment │ ├── bird_demo.ipynb │ └── bird_demo_app │ │ ├── graphs │ │ └── bird_demo_app │ │ │ └── 
graph.json │ │ └── packages │ │ ├── .keep │ │ ├── 987720697751-BIRD_DEMO_CODE-1.0 │ │ ├── Dockerfile │ │ ├── descriptor.json │ │ ├── package.json │ │ └── src │ │ │ ├── .keep │ │ │ └── app.py │ │ ├── 987720697751-BIRD_DEMO_TF_MODEL-1.0 │ │ ├── descriptor.json │ │ └── package.json │ │ └── 987720697751-RTSP_STREAM-1.0 │ │ └── package.json └── README.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── statics ├── cv-workshop-overview.png └── workshop_overview.png /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .ipynb_checkpoints 3 | inference-test-data/ 4 | *.mp4 5 | *.jpg 6 | -------------------------------------------------------------------------------- /01_groundtruth(optional)/README.md: -------------------------------------------------------------------------------- 1 | ## Data Labeling with SageMaker Ground Truth 2 | 3 | SageMaker Ground Truth (SMGT) is a fully managed data labeling service that lets you launch a labeling job with just a few clicks in the console or a single AWS SDK API call. It provides 30+ labeling workflows for computer vision and NLP use cases, and also allows you to tap into different workforce options. 4 | 5 | ![SMGT](https://docs.aws.amazon.com/sagemaker/latest/dg/images/image-classification-example.png) 6 | 7 | This module is optional. If you want to get some hands-on labeling practice, here are all the [SMGT Examples](https://github.com/aws/amazon-sagemaker-examples/tree/main/ground_truth_labeling_jobs). 8 | 9 | For Computer Vision (CV) specific examples, the [Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification.ipynb) and [Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/2c2cd35c8ed389e638fe5c912e24cd00d0874874/ground_truth_labeling_jobs/ground_truth_object_detection_tutorial/object_detection_tutorial.ipynb) tutorials are good starting points. 10 | 11 | ## Prerequisites 12 | 13 | To run this notebook, simply download the provided Jupyter notebook and execute each cell one by one. To understand what's happening, you'll need: 14 | 15 | - An S3 bucket you can write to -- please provide its name in the following cell. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the EXP_NAME to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket. 16 | - The S3 bucket that you use for this demo must have a CORS policy attached. To learn more about this requirement, and how to attach a CORS policy to an S3 bucket, see CORS Permission Requirement (a minimal example is sketched after this list). 17 | - Familiarity with Python and numpy. 18 | - Basic familiarity with Amazon S3. 19 | - Basic understanding of Amazon SageMaker. 20 | - Basic familiarity with AWS Command Line Interface (CLI) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.
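The CORS prerequisite above can be satisfied with a short boto3 call. The sketch below is illustrative rather than the notebook's own code: the bucket name is a placeholder, and you should confirm the exact rule set against the CORS Permission Requirement documentation for your labeling task type.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder -- replace with the bucket you plan to use for the labeling job.
bucket = "your-labeling-bucket"

# A minimal CORS rule commonly used for Ground Truth labeling jobs:
# allow GET from any origin and expose the Access-Control-Allow-Origin header.
s3.put_bucket_cors(
    Bucket=bucket,
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    },
)

# Verify what is now attached to the bucket.
print(s3.get_bucket_cors(Bucket=bucket)["CORSRules"])
```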
-------------------------------------------------------------------------------- /02_preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Data Preparation using SageMaker Processing 2 | 3 | Amazon SageMaker Processing is a managed solution for running data processing and model evaluation workloads. Data processing tasks such as feature engineering, data validation, model evaluation, and model interpretation are essential steps performed by engineers and data scientists in the machine learning workflow. 4 | 5 | Your custom processing scripts are containerized and run on infrastructure that is created on demand and terminated automatically. You can choose SageMaker-optimized containers for popular data processing and model evaluation frameworks such as scikit-learn and PySpark, or Bring Your Own Container (BYOC). 6 | 7 | ![Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png) 8 | 9 | --- 10 | # Introduction 11 | Preprocessing data before model training is an important step in the overall MLOps process. In this lab you will learn how to use [SKLearnProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html), a type of SageMaker Processing job that runs scikit-learn scripts in a container image provided and maintained by SageMaker, to preprocess data or evaluate models. 12 | 13 | The example script will first load the bird dataset, then split the data into train, validation, and test channels, and finally export the data and annotation files to S3. 14 | 15 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 16 | 17 | --- 18 | ## Prerequisites 19 | 20 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 21 | 22 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 23 | - Familiarity with Python and numpy. 24 | - Basic familiarity with Amazon S3. 25 | - Basic understanding of Amazon SageMaker. 26 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 27 | - SageMaker Studio is preferred for the full UI integration. 28 | 29 | --- 30 | 31 | ## Dataset 32 | We are using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 33 | 34 | ![Bird Dataset](statics/birds.png) 35 | 36 | Run the notebook to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-downloading and re-processing the data. 37 | 38 | --- 39 | 40 | # Review Outputs 41 | 42 | At the end of the lab, your dataset will be randomly split into train, valid, and test folders. You will also have a CSV manifest file for each channel. **If you plan to complete other modules in this workshop, please keep this data.
Otherwise, you can clean up after this lab.** -------------------------------------------------------------------------------- /02_preprocessing/preprocessing.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | 65 | full_df.reset_index(inplace=True, drop=True) 66 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 67 | 68 | # grab a small subset of species for testing 69 | criteria = full_df['orig_class_id'].isin(classes) 70 | full_df = full_df[criteria] 71 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 72 | 73 | unique_classes = full_df['orig_class_id'].drop_duplicates() 74 | sorted_unique_classes = sorted(unique_classes) 75 | id_to_one_based = {} 76 | i = 1 77 | for c in sorted_unique_classes: 78 | id_to_one_based[c] = str(i) 79 | i += 1 80 | 81 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 82 | full_df.reset_index(inplace=True, drop=True) 83 | 84 | def get_class_name(fn): 85 | return fn.split('/')[0] 86 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 87 | full_df = 
full_df.drop(['image_pretty_name'], axis=1) 88 | 89 | train_df = [] 90 | test_df = [] 91 | val_df = [] 92 | 93 | # split into training and validation sets 94 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 95 | 96 | print('num images total: ' + str(images_df.shape[0])) 97 | print('\nnum train: ' + str(train_df.shape[0])) 98 | print('num val: ' + str(val_df.shape[0])) 99 | print('num test: ' + str(test_df.shape[0])) 100 | return train_df, val_df, test_df 101 | 102 | # this function copy images by channel to its destination folder 103 | def copy_files_for_channel(df, channel_name, verbose=False): 104 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 105 | for i in range(df.shape[0]): 106 | target_fname = df.iloc[i]['image_file_name'] 107 | # if verbose: 108 | # print(target_fname) 109 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 110 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 111 | shutil.copyfile(src, dst) 112 | 113 | if __name__ == "__main__": 114 | parser = argparse.ArgumentParser() 115 | parser.add_argument("--classes", type=str, default="") 116 | parser.add_argument("--input-data", type=str, default="classes.txt") 117 | args, _ = parser.parse_known_args() 118 | 119 | 120 | c_list = args.classes.split(',') 121 | input_data = args.input_data 122 | 123 | CLASSES_FILE = os.path.join(input_path, input_data) 124 | 125 | CLASS_COLS = ['class_number','class_id'] 126 | 127 | if len(c_list)==0: 128 | # Otherwise, you can use the full set of species 129 | CLASSES = [] 130 | for c in range(200): 131 | CLASSES += [c + 1] 132 | prefix = prefix + '-full' 133 | else: 134 | CLASSES = list(map(int, c_list)) 135 | 136 | 137 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 138 | 139 | criteria = classes_df['class_number'].isin(CLASSES) 140 | classes_df = classes_df[criteria] 141 | 142 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 143 | print(class_name_list) 144 | 145 | 146 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, SPLIT_RATIOS) 147 | 148 | for c in class_name_list: 149 | os.mkdir('{}/{}/{}'.format(output_path, 'valid', c)) 150 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 151 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 152 | 153 | copy_files_for_channel(val_df, 'valid') 154 | copy_files_for_channel(test_df, 'test') 155 | copy_files_for_channel(train_df, 'train') 156 | 157 | # export manifest file for validation 158 | train_m_file = "{}/manifest/train.csv".format(output_path) 159 | train_df.to_csv(train_m_file, index=False) 160 | test_m_file = "{}/manifest/test.csv".format(output_path) 161 | test_df.to_csv(test_m_file, index=False) 162 | val_m_file = "{}/manifest/valid.csv".format(output_path) 163 | val_df.to_csv(val_m_file, index=False) 164 | 165 | print("Finished running processing job") 166 | -------------------------------------------------------------------------------- /02_preprocessing/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/02_preprocessing/statics/birds.png -------------------------------------------------------------------------------- /03_training/README.md: -------------------------------------------------------------------------------- 1 | ## Training a CV model on 
Amazon SageMaker 2 | 3 | Training models is easy on Amazon SageMaker. You simply specify the location of your data in Amazon S3, define the type and number of ML instances you need, and start training a model with just a few lines of code. Amazon SageMaker sets up a distributed compute cluster, performs the training, outputs the result to Amazon S3, and tears down the cluster when complete. 4 | 5 | --- 6 | ## Introduction 7 | This module is focused on SageMaker training for Computer Vision models. You will go through examples of Bring Your Own (BYO) script training, hyperparameter tuning, and experiment tracking. In future modules, you will see how experiment tracking can be automated through SageMaker Pipelines' native integration. 8 | 9 | At the end of this lab, you should have hands-on experience 1) training custom CV models on Amazon SageMaker, 2) building automatic model tuning jobs, and 3) organizing your ML experiments. 10 | 11 | ### Keras 12 | The model used for the tensorflow/cv_hpo_keras_pipe notebook is a simple deep CNN that is based on the [Keras examples](https://www.tensorflow.org/tutorials/images/cnn). 13 | 14 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 15 | 16 | ### PyTorch 17 | The model used for the pytorch/cv_hpo_pytorch_pipe notebook is a simple deep CNN that is based on the [PyTorch CNN example](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_cnn_cifar10/source/cifar10.py). 18 | 19 | **Note: This notebook was tested on the 'Python3 (Pytorch 1.8, Python 3.6)' kernel in SageMaker Studio.** 20 | 21 | --- 22 | ## Prerequisites 23 | 24 | To get started, download the provided Jupyter notebook and associated files to your SageMaker Studio environment. To run the notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 25 | 26 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 27 | - Familiarity with Python and numpy. 28 | - Basic familiarity with Amazon S3. 29 | - Basic understanding of Amazon SageMaker. 30 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 31 | - SageMaker Studio is preferred for the full UI integration. 32 | 33 | --- 34 | 35 | ## Dataset 36 | 37 | We are using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 38 | 39 | ![Bird Dataset](statics/birds.png) 40 | 41 | The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). The classes in the dataset, along with 10 random images from each, are shown in the preview image below (after the training sketch).
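To make the BYO-script and tuning flow concrete, here is a minimal sketch of how a script such as `tensorflow/source_dir/keras_cifar10.py` can be wrapped in a SageMaker TensorFlow estimator and an Automatic Model Tuning job. This is not the exact notebook code: the execution role, S3 channel prefixes, instance type, and framework/Python versions below are assumptions you would adapt to your environment.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, HyperparameterTuner

role = sagemaker.get_execution_role()          # assumes a Studio/notebook execution role
bucket = sagemaker.Session().default_bucket()  # SageMaker default bucket, per the prerequisites

# Matches the "Test accuracy:<value>" line logged by keras_cifar10.py.
metric_definitions = [{"Name": "test_accuracy", "Regex": "Test accuracy:([0-9\\.]+)"}]

estimator = TensorFlow(
    entry_point="keras_cifar10.py",
    source_dir="tensorflow/source_dir",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",   # assumption: any GPU training instance works
    framework_version="2.4.1",       # assumption: match the version the notebook pins
    py_version="py37",
    hyperparameters={"epochs": 10, "batch-size": 128},
    metric_definitions=metric_definitions,
)

# Tune hyperparameters that the script already exposes as CLI arguments.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="test_accuracy",
    hyperparameter_ranges={
        "learning-rate": ContinuousParameter(1e-4, 1e-1),
        "optimizer": CategoricalParameter(["sgd", "adam", "rmsprop"]),
    },
    metric_definitions=metric_definitions,
    objective_type="Maximize",
    max_jobs=4,
    max_parallel_jobs=2,
)

# keras_cifar10.py reads SM_CHANNEL_TRAIN / SM_CHANNEL_VALIDATION, so the
# channel names must be "train" and "validation"; the prefixes are assumed.
tuner.fit({
    "train": f"s3://{bucket}/cv_keras_cifar10/train",
    "validation": f"s3://{bucket}/cv_keras_cifar10/validation",
})
```

The PyTorch notebook follows the same pattern with the `sagemaker.pytorch.PyTorch` estimator, and the resulting training jobs can be compared in SageMaker Experiments, as this module's experiment-tracking examples show.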
42 | 43 | ![cifar10](statics/CIFAR-10.png) 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /03_training/pytorch/source_dir/cifar10.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import os 4 | 5 | import torch 6 | import torch.distributed as dist 7 | import torch.nn as nn 8 | import torch.nn.functional as F 9 | import torch.nn.parallel 10 | import torch.optim 11 | import torch.utils.data 12 | import torch.utils.data.distributed 13 | import torchvision 14 | import torchvision.models 15 | import torchvision.transforms as transforms 16 | 17 | try: 18 | from sagemaker_inference import environment 19 | except: 20 | from sagemaker_training import environment 21 | 22 | logger = logging.getLogger() 23 | logger.setLevel(logging.DEBUG) 24 | 25 | classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck") 26 | 27 | 28 | # https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/cifar10_tutorial.py#L118 29 | class Net(nn.Module): 30 | def __init__(self): 31 | super(Net, self).__init__() 32 | self.conv1 = nn.Conv2d(3, 6, 5) 33 | self.pool = nn.MaxPool2d(2, 2) 34 | self.conv2 = nn.Conv2d(6, 16, 5) 35 | self.fc1 = nn.Linear(16 * 5 * 5, 120) 36 | self.fc2 = nn.Linear(120, 84) 37 | self.fc3 = nn.Linear(84, 10) 38 | 39 | def forward(self, x): 40 | x = self.pool(F.relu(self.conv1(x))) 41 | x = self.pool(F.relu(self.conv2(x))) 42 | x = x.view(-1, 16 * 5 * 5) 43 | x = F.relu(self.fc1(x)) 44 | x = F.relu(self.fc2(x)) 45 | x = self.fc3(x) 46 | return x 47 | 48 | def validation_step(self, batch): 49 | images, labels = batch 50 | out = self(images) # Generate predictions 51 | loss = F.cross_entropy(out, labels) # Calculate loss 52 | acc = accuracy(out, labels) # Calculate accuracy 53 | return {'val_loss': loss.detach(), 'val_acc': acc} 54 | 55 | def validation_epoch_end(self, outputs): 56 | batch_losses = [x['val_loss'] for x in outputs] 57 | epoch_loss = torch.stack(batch_losses).mean() # Combine losses 58 | batch_accs = [x['val_acc'] for x in outputs] 59 | epoch_acc = torch.stack(batch_accs).mean() # Combine accuracies 60 | logging.info("val_loss: {:.4f}, val_acc: {:4f}".format(epoch_loss.item(), epoch_acc.item())) 61 | print("val_loss: {:.4f}, val_acc: {:4f}".format(epoch_loss.item(), epoch_acc.item())) 62 | return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()} 63 | 64 | def epoch_end(self, epoch, result): 65 | print('epoch ended') 66 | print("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format( 67 | epoch, result['train_loss'], result['val_loss'], result['val_acc'])) 68 | logging.info("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format( 69 | epoch, result['train_loss'], result['val_loss'], result['val_acc'])) 70 | 71 | def accuracy(outputs, labels): 72 | _, preds = torch.max(outputs, dim=1) 73 | return torch.tensor(torch.sum(preds == labels).item() / len(preds)) 74 | 75 | def _train(args): 76 | is_distributed = len(args.hosts) > 1 and args.dist_backend is not None 77 | logger.debug("Distributed training - {}".format(is_distributed)) 78 | 79 | if is_distributed: 80 | # Initialize the distributed environment. 
81 | world_size = len(args.hosts) 82 | os.environ["WORLD_SIZE"] = str(world_size) 83 | host_rank = args.hosts.index(args.current_host) 84 | os.environ["RANK"] = str(host_rank) 85 | dist.init_process_group(backend=args.dist_backend, rank=host_rank, world_size=world_size) 86 | logger.info( 87 | "Initialized the distributed environment: '{}' backend on {} nodes. ".format( 88 | args.dist_backend, dist.get_world_size() 89 | ) 90 | + "Current host rank is {}. Using cuda: {}. Number of gpus: {}".format( 91 | dist.get_rank(), torch.cuda.is_available(), args.num_gpus 92 | ) 93 | ) 94 | 95 | device = "cuda" if torch.cuda.is_available() else "cpu" 96 | logger.info("Device Type: {}".format(device)) 97 | 98 | logger.info("Loading Cifar10 dataset") 99 | transform = transforms.Compose( 100 | [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))] 101 | ) 102 | 103 | trainset = torchvision.datasets.CIFAR10( 104 | root=args.data_dir, train=True, download=False, transform=transform 105 | ) 106 | train_loader = torch.utils.data.DataLoader( 107 | trainset, batch_size=args.batch_size, shuffle=True, num_workers=args.workers 108 | ) 109 | 110 | testset = torchvision.datasets.CIFAR10( 111 | root=args.data_dir, train=False, download=False, transform=transform 112 | ) 113 | test_loader = torch.utils.data.DataLoader( 114 | testset, batch_size=args.batch_size, shuffle=False, num_workers=args.workers 115 | ) 116 | 117 | logger.info("Model loaded") 118 | model = Net() 119 | 120 | if torch.cuda.device_count() > 1: 121 | logger.info("Gpu count: {}".format(torch.cuda.device_count())) 122 | model = nn.DataParallel(model) 123 | 124 | model = model.to(device) 125 | 126 | criterion = nn.CrossEntropyLoss().to(device) 127 | optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum) 128 | 129 | for epoch in range(0, args.epochs): 130 | running_loss = 0.0 131 | train_losses = [] 132 | for i, data in enumerate(train_loader): 133 | # get the inputs 134 | inputs, labels = data 135 | inputs, labels = inputs.to(device), labels.to(device) 136 | 137 | # zero the parameter gradients 138 | optimizer.zero_grad() 139 | 140 | # forward + backward + optimize 141 | outputs = model(inputs) 142 | loss = criterion(outputs, labels) 143 | train_losses.append(loss) 144 | loss.backward() 145 | optimizer.step() 146 | 147 | # print statistics 148 | running_loss += loss.item() 149 | if i % 2000 == 1999: # print every 2000 mini-batches 150 | print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1, running_loss / 2000)) 151 | running_loss = 0.0 152 | 153 | # Validation phase 154 | result = evaluate(model, test_loader) 155 | result['train_loss'] = torch.stack(train_losses).mean().item() 156 | model.epoch_end(epoch, result) 157 | 158 | print("Finished Training") 159 | 160 | print(model.eval()) 161 | outputs = [model.validation_step(batch) for batch in test_loader] 162 | 163 | print(model.validation_epoch_end(outputs)) 164 | 165 | return _save_model(model, args.model_dir) 166 | 167 | def evaluate(model, val_loader): 168 | model.eval() 169 | outputs = [model.validation_step(batch) for batch in val_loader] 170 | return model.validation_epoch_end(outputs) 171 | 172 | def _save_model(model, model_dir): 173 | logger.info("Saving the model.") 174 | path = os.path.join(model_dir, "model.pth") 175 | # recommended way from http://pytorch.org/docs/master/notes/serialization.html 176 | torch.save(model.cpu().state_dict(), path) 177 | 178 | 179 | def model_fn(model_dir): 180 | logger.info("model_fn") 181 | device = "cuda" if 
torch.cuda.is_available() else "cpu" 182 | model = Net() 183 | if torch.cuda.device_count() > 1: 184 | logger.info("Gpu count: {}".format(torch.cuda.device_count())) 185 | model = nn.DataParallel(model) 186 | 187 | with open(os.path.join(model_dir, "model.pth"), "rb") as f: 188 | model.load_state_dict(torch.load(f)) 189 | return model.to(device) 190 | 191 | 192 | if __name__ == "__main__": 193 | parser = argparse.ArgumentParser() 194 | 195 | parser.add_argument( 196 | "--workers", 197 | type=int, 198 | default=2, 199 | metavar="W", 200 | help="number of data loading workers (default: 2)", 201 | ) 202 | parser.add_argument( 203 | "--epochs", 204 | type=int, 205 | default=2, 206 | metavar="E", 207 | help="number of total epochs to run (default: 2)", 208 | ) 209 | parser.add_argument( 210 | "--batch_size", type=int, default=4, metavar="BS", help="batch size (default: 4)" 211 | ) 212 | parser.add_argument( 213 | "--lr", 214 | type=float, 215 | default=0.001, 216 | metavar="LR", 217 | help="initial learning rate (default: 0.001)", 218 | ) 219 | parser.add_argument( 220 | "--momentum", type=float, default=0.9, metavar="M", help="momentum (default: 0.9)" 221 | ) 222 | parser.add_argument( 223 | "--dist_backend", type=str, default="gloo", help="distributed backend (default: gloo)" 224 | ) 225 | 226 | env = environment.Environment() 227 | parser.add_argument("--hosts", type=list, default=env.hosts) 228 | parser.add_argument("--current-host", type=str, default=env.current_host) 229 | parser.add_argument("--model-dir", type=str, default=env.model_dir) 230 | parser.add_argument("--data-dir", type=str, default=env.channel_input_dirs.get("training")) 231 | parser.add_argument("--num-gpus", type=int, default=env.num_gpus) 232 | 233 | _train(parser.parse_args()) -------------------------------------------------------------------------------- /03_training/statics/CIFAR-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/CIFAR-10.png -------------------------------------------------------------------------------- /03_training/statics/Experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/Experiments.png -------------------------------------------------------------------------------- /03_training/statics/HPO_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/03_training/statics/HPO_experiments.png -------------------------------------------------------------------------------- /03_training/tensorflow/source_dir/keras_cifar10.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # https://aws.amazon.com/apache-2-0/ 8 | # 9 | # or in the "license" file accompanying this file. 
This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 13 | 14 | import argparse 15 | import io 16 | import itertools 17 | import json 18 | import logging 19 | import os 20 | import re 21 | 22 | import numpy as np 23 | import sklearn.metrics 24 | import tensorflow as tf 25 | from tensorflow import keras 26 | from tensorflow.keras import backend as K 27 | from tensorflow.keras.constraints import max_norm 28 | from tensorflow.keras.layers import ( 29 | Activation, 30 | BatchNormalization, 31 | Conv2D, 32 | Dense, 33 | Dropout, 34 | Flatten, 35 | MaxPooling2D, 36 | ) 37 | from tensorflow.keras.models import Sequential 38 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 39 | 40 | logging.getLogger().setLevel(logging.INFO) 41 | tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) 42 | 43 | HEIGHT = 32 44 | WIDTH = 32 45 | DEPTH = 3 46 | NUM_CLASSES = 10 47 | 48 | 49 | METRIC_ACCURACY = "accuracy" 50 | 51 | 52 | def keras_model_fn(learning_rate, optimizer): 53 | model = Sequential() 54 | model.add(Conv2D(32, (3, 3), padding="same", name="inputs", input_shape=(HEIGHT, WIDTH, DEPTH))) 55 | model.add(BatchNormalization()) 56 | model.add(Activation("relu")) 57 | model.add(Conv2D(32, (3, 3))) 58 | model.add(BatchNormalization()) 59 | model.add(Activation("relu")) 60 | model.add(MaxPooling2D(pool_size=(2, 2))) 61 | model.add(Dropout(0.2)) 62 | 63 | model.add(Conv2D(64, (3, 3), padding="same")) 64 | model.add(BatchNormalization()) 65 | model.add(Activation("relu")) 66 | model.add(Conv2D(64, (3, 3))) 67 | model.add(BatchNormalization()) 68 | model.add(Activation("relu")) 69 | model.add(MaxPooling2D(pool_size=(2, 2))) 70 | model.add(Dropout(0.3)) 71 | 72 | model.add(Conv2D(128, (3, 3), padding="same")) 73 | model.add(BatchNormalization()) 74 | model.add(Activation("relu")) 75 | model.add(Conv2D(128, (3, 3))) 76 | model.add(BatchNormalization()) 77 | model.add(Activation("relu")) 78 | model.add(MaxPooling2D(pool_size=(2, 2))) 79 | model.add(Dropout(0.4)) 80 | 81 | model.add(Flatten()) 82 | model.add(Dense(512, kernel_constraint=max_norm(2.0))) 83 | model.add(Activation("relu")) 84 | model.add(Dropout(0.5)) 85 | model.add(Dense(NUM_CLASSES)) 86 | model.add(Activation("softmax")) 87 | 88 | if optimizer == "sgd": 89 | opt = SGD(learning_rate=learning_rate) 90 | elif optimizer == "rmsprop": 91 | opt = RMSprop(learning_rate=learning_rate) 92 | elif optimizer == "adam": 93 | opt = Adam(learning_rate=learning_rate) 94 | else: 95 | raise Exception("Unknown optimizer", optimizer) 96 | 97 | model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"]) 98 | return model 99 | 100 | 101 | def read_dataset(epochs, batch_size, channel, channel_name): 102 | mode = args.data_config[channel_name]["TrainingInputMode"] 103 | 104 | logging.info("Running {} in {} mode".format(channel_name, mode)) 105 | if mode == "Pipe": 106 | from sagemaker_tensorflow import PipeModeDataset 107 | 108 | dataset = PipeModeDataset(channel=channel_name, record_format="TFRecord") 109 | else: 110 | filenames = [os.path.join(channel, channel_name + ".tfrecords")] 111 | dataset = tf.data.TFRecordDataset(filenames) 112 | 113 | image_feature_description = { 114 | "image": tf.io.FixedLenFeature([], tf.string), 115 | "label": tf.io.FixedLenFeature([], tf.int64), 116 | } 117 | 118 | def _parse_image_function(example_proto): 119 | # 
Parse the input tf.Example proto using the dictionary above. 120 | features = tf.io.parse_single_example(example_proto, image_feature_description) 121 | image = tf.io.decode_raw(features["image"], tf.uint8) 122 | image.set_shape([3 * 32 * 32]) 123 | image = tf.reshape(image, [32, 32, 3]) 124 | 125 | label = tf.cast(features["label"], tf.int32) 126 | label = tf.one_hot(label, 10) 127 | 128 | return image, label 129 | 130 | dataset = dataset.map(_parse_image_function, num_parallel_calls=10) 131 | dataset = dataset.prefetch(10) 132 | dataset = dataset.repeat(epochs) 133 | dataset = dataset.shuffle(buffer_size=10 * batch_size) 134 | dataset = dataset.batch(batch_size, drop_remainder=True) 135 | 136 | return dataset 137 | 138 | 139 | 140 | def main(args): 141 | # Initializing TensorFlow summary writer 142 | job_name = json.loads(os.environ.get("SM_TRAINING_ENV"))["job_name"] 143 | 144 | # Importing datasets 145 | train_dataset = read_dataset(args.epochs, args.batch_size, args.train, "train") 146 | validation_dataset = read_dataset(args.epochs, args.batch_size, args.validation, "validation") 147 | 148 | # Initializing and compiling the model 149 | model = keras_model_fn(args.learning_rate, args.optimizer) 150 | 151 | # Train the model 152 | model.fit( 153 | x=train_dataset, 154 | epochs=args.epochs, 155 | batch_size=args.batch_size, 156 | validation_data=validation_dataset, 157 | ) 158 | 159 | # Saving trained model 160 | model.save(args.model_output + "/1") 161 | 162 | # Converting validation dataset to numpy array 163 | validation_array = np.array(list(validation_dataset.unbatch().take(-1).as_numpy_iterator())) 164 | test_x = np.stack(validation_array[:, 0]) 165 | test_y = np.stack(validation_array[:, 1]) 166 | 167 | # Use the model to predict the labels 168 | test_predictions = model.predict(test_x) 169 | test_y_pred = np.argmax(test_predictions, axis=1) 170 | test_y_true = np.argmax(test_y, axis=1) 171 | 172 | # Evaluating model accuracy and logging it as a scalar for TensorBoard hyperparameter visualization. 173 | accuracy = sklearn.metrics.accuracy_score(test_y_true, test_y_pred) 174 | tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1) 175 | logging.info("Test accuracy:{}".format(accuracy)) 176 | 177 | 178 | if __name__ == "__main__": 179 | parser = argparse.ArgumentParser() 180 | 181 | parser.add_argument( 182 | "--train", 183 | type=str, 184 | required=False, 185 | default=os.environ.get("SM_CHANNEL_TRAIN"), 186 | help="The directory where the CIFAR-10 input data is stored.", 187 | ) 188 | parser.add_argument( 189 | "--validation", 190 | type=str, 191 | required=False, 192 | default=os.environ.get("SM_CHANNEL_VALIDATION"), 193 | help="The directory where the CIFAR-10 input data is stored.", 194 | ) 195 | parser.add_argument( 196 | "--model-output", 197 | type=str, 198 | default=os.environ.get("SM_MODEL_DIR"), 199 | help="The directory where the trained model will be stored.", 200 | ) 201 | parser.add_argument("--learning-rate", type=float, default=0.01, help="Initial learning rate.") 202 | parser.add_argument( 203 | "--epochs", type=int, default=10, help="The number of steps to use for training." 
204 | ) 205 | parser.add_argument("--batch-size", type=int, default=128, help="Batch size for training.") 206 | parser.add_argument( 207 | "--data-config", type=json.loads, default=os.environ.get("SM_INPUT_DATA_CONFIG") 208 | ) 209 | parser.add_argument("--optimizer", type=str.lower, default="adam") 210 | parser.add_argument("--model_dir", type=str) 211 | 212 | args = parser.parse_args() 213 | main(args) 214 | -------------------------------------------------------------------------------- /04_advanced_training/README.md: -------------------------------------------------------------------------------- 1 | ## Advanced Training on Amazon SageMaker 2 | Machine Learning (ML) practitioners commonly face performance and scalability challenges when training Computer Vision (CV) models. This is because model size and complexity grow quickly as the dataset size increases. While you can always scale up to use bigger instances with more CPUs and GPUs, there will eventually be capacity limits beyond which you cannot scale up anymore. 3 | 4 | This is when we need to leverage advanced training techniques like distributed training, debugging, and monitoring to help us overcome these challenges. 5 | 6 | --- 7 | 8 | ## Introduction 9 | 10 | This module covers advanced training topics like debugging and distributed training. You will get exposure to SageMaker features like [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) and Amazon SageMaker's distributed training library. SageMaker Debugger allows you to attach a debug process to your training job; this helps you monitor your training at a much more granular time interval and automatically profiles the instances to help you identify performance bottlenecks. 11 | 12 | 13 | [Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) helps you train deep learning models faster and more cheaply. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet. This module provides two examples demonstrating how to use the SageMaker distributed data parallel library to train a TensorFlow model and a PyTorch model using the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) and MNIST datasets. 14 | 15 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio.** 16 | 17 | --- 18 | ## Prerequisites 19 | 20 | To get started, download the provided Jupyter notebook and associated files to your SageMaker Studio environment. To run the notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 21 | 22 | - Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "cv_keras_cifar10" prefix of the bucket. 23 | - Access to 2 p3.16xlarge GPU instances; this is a SageMaker distributed training library requirement, so you may need to request a service limit increase in your account. 24 | - Familiarity with distributed training concepts. 25 | - Familiarity with training on SageMaker. 26 | - Basic familiarity with Amazon S3. 27 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.
28 | - SageMaker Studio is preferred for the full UI integration 29 | 30 | --- 31 | 32 | ## Dataset 33 | For Tensorflow framework, we are using [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset containing 11,788 images across 200 bird species (the original technical report can be found here). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 34 | 35 | For Pytorch, we are using [MNIST database](http://yann.lecun.com/exdb/mnist/). It has a training set of 60,000 examples, and a test set of 10,000 examples of handwritten digits. 36 | 37 | --- 38 | 39 | ## Additional Resource: 40 | 41 | 1. [SageMaker distributed data parallel PyTorch API Specification](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel_pytorch.html) 42 | 1. [Getting started with SageMaker distributed data parallel](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html) 43 | 1. [PyTorch in SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/inference.py: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | 18 | from __future__ import print_function 19 | 20 | import os 21 | 22 | import torch 23 | 24 | # Network definition 25 | from model_def import Net 26 | 27 | 28 | def model_fn(model_dir): 29 | print("In model_fn. Model directory is -") 30 | print(model_dir) 31 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 32 | model = Net() 33 | with open(os.path.join(model_dir, "model.pth"), "rb") as f: 34 | print("Loading the mnist model") 35 | model.load_state_dict(torch.load(f, map_location=device)) 36 | return model 37 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/model_def.py: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | 18 | import torch 19 | import torch.nn as nn 20 | import torch.nn.functional as F 21 | 22 | 23 | class Net(nn.Module): 24 | def __init__(self): 25 | super(Net, self).__init__() 26 | self.conv1 = nn.Conv2d(1, 32, 3, 1) 27 | self.conv2 = nn.Conv2d(32, 64, 3, 1) 28 | self.dropout1 = nn.Dropout2d(0.25) 29 | self.dropout2 = nn.Dropout2d(0.5) 30 | self.fc1 = nn.Linear(9216, 128) 31 | self.fc2 = nn.Linear(128, 10) 32 | 33 | def forward(self, x): 34 | x = self.conv1(x) 35 | x = F.relu(x) 36 | x = self.conv2(x) 37 | x = F.relu(x) 38 | x = F.max_pool2d(x, 2) 39 | x = self.dropout1(x) 40 | x = torch.flatten(x, 1) 41 | x = self.fc1(x) 42 | x = F.relu(x) 43 | x = self.dropout2(x) 44 | x = self.fc2(x) 45 | output = F.log_softmax(x, dim=1) 46 | return output 47 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/code/train_pytorch_smdataparallel_mnist.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 
13 | 14 | from __future__ import print_function 15 | 16 | import argparse 17 | from packaging.version import Version 18 | import os 19 | import time 20 | 21 | import smdistributed.dataparallel.torch.distributed as dist 22 | import torch 23 | import torchvision 24 | import torch.nn as nn 25 | import torch.nn.functional as F 26 | import torch.optim as optim 27 | 28 | # Network definition 29 | from model_def import Net 30 | 31 | # Import SMDataParallel PyTorch Modules 32 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 33 | from torch.optim.lr_scheduler import StepLR 34 | from torchvision import datasets, transforms 35 | 36 | # override dependency on mirrors provided by torch vision package 37 | # from torchvision 0.9.1, 2 candidate mirror website links will be added before "resources" items automatically 38 | # Reference PR: https://github.com/pytorch/vision/pull/3559 39 | TORCHVISION_VERSION = "0.9.1" 40 | if Version(torchvision.__version__) < Version(TORCHVISION_VERSION): 41 | datasets.MNIST.resources = [ 42 | ( 43 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/train-images-idx3-ubyte.gz", 44 | "f68b3c2dcbeaaa9fbdd348bbdeb94873", 45 | ), 46 | ( 47 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/train-labels-idx1-ubyte.gz", 48 | "d53e105ee54ea40749a09fcbcd1e9432", 49 | ), 50 | ( 51 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz", 52 | "9fb629c4189551a2d022fa330f9573f3", 53 | ), 54 | ( 55 | "https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz", 56 | "ec29112dd5afa0611ce80d1b7f02629c", 57 | ), 58 | ] 59 | else: 60 | datasets.MNIST.mirrors = ["https://dlinfra-mnist-dataset.s3-us-west-2.amazonaws.com/mnist/"] 61 | 62 | 63 | class CUDANotFoundException(Exception): 64 | pass 65 | 66 | 67 | dist.init_process_group() 68 | 69 | 70 | def train(args, model, device, train_loader, optimizer, epoch): 71 | model.train() 72 | for batch_idx, (data, target) in enumerate(train_loader): 73 | data, target = data.to(device), target.to(device) 74 | optimizer.zero_grad() 75 | output = model(data) 76 | loss = F.nll_loss(output, target) 77 | loss.backward() 78 | optimizer.step() 79 | if batch_idx % args.log_interval == 0 and args.rank == 0: 80 | print( 81 | "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format( 82 | epoch, 83 | batch_idx * len(data) * args.world_size, 84 | len(train_loader.dataset), 85 | 100.0 * batch_idx / len(train_loader), 86 | loss.item(), 87 | ) 88 | ) 89 | if args.verbose: 90 | print("Batch", batch_idx, "from rank", args.rank) 91 | 92 | 93 | def test(model, device, test_loader): 94 | model.eval() 95 | test_loss = 0 96 | correct = 0 97 | with torch.no_grad(): 98 | for data, target in test_loader: 99 | data, target = data.to(device), target.to(device) 100 | output = model(data) 101 | test_loss += F.nll_loss(output, target, reduction="sum").item() # sum up batch loss 102 | pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability 103 | correct += pred.eq(target.view_as(pred)).sum().item() 104 | 105 | test_loss /= len(test_loader.dataset) 106 | 107 | print( 108 | "\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format( 109 | test_loss, correct, len(test_loader.dataset), 100.0 * correct / len(test_loader.dataset) 110 | ) 111 | ) 112 | 113 | 114 | def main(): 115 | # Training settings 116 | parser = argparse.ArgumentParser(description="PyTorch MNIST Example") 117 | parser.add_argument( 118 
| "--batch-size", 119 | type=int, 120 | default=64, 121 | metavar="N", 122 | help="input batch size for training (default: 64)", 123 | ) 124 | parser.add_argument( 125 | "--test-batch-size", 126 | type=int, 127 | default=1000, 128 | metavar="N", 129 | help="input batch size for testing (default: 1000)", 130 | ) 131 | parser.add_argument( 132 | "--epochs", 133 | type=int, 134 | default=14, 135 | metavar="N", 136 | help="number of epochs to train (default: 14)", 137 | ) 138 | parser.add_argument( 139 | "--lr", type=float, default=1.0, metavar="LR", help="learning rate (default: 1.0)" 140 | ) 141 | parser.add_argument( 142 | "--gamma", 143 | type=float, 144 | default=0.7, 145 | metavar="M", 146 | help="Learning rate step gamma (default: 0.7)", 147 | ) 148 | parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)") 149 | parser.add_argument( 150 | "--log-interval", 151 | type=int, 152 | default=10, 153 | metavar="N", 154 | help="how many batches to wait before logging training status", 155 | ) 156 | parser.add_argument( 157 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 158 | ) 159 | parser.add_argument( 160 | "--verbose", 161 | action="store_true", 162 | default=False, 163 | help="For displaying smdistributed.dataparallel-specific logs", 164 | ) 165 | parser.add_argument( 166 | "--data-path", 167 | type=str, 168 | default="/tmp/data", 169 | help="Path for downloading " "the MNIST dataset", 170 | ) 171 | 172 | args = parser.parse_args() 173 | args.world_size = dist.get_world_size() 174 | args.rank = rank = dist.get_rank() 175 | args.local_rank = local_rank = dist.get_local_rank() 176 | args.lr = 1.0 177 | args.batch_size //= args.world_size // 8 178 | args.batch_size = max(args.batch_size, 1) 179 | data_path = args.data_path 180 | 181 | if args.verbose: 182 | print( 183 | "Hello from rank", 184 | rank, 185 | "of local_rank", 186 | local_rank, 187 | "in world size of", 188 | args.world_size, 189 | ) 190 | 191 | if not torch.cuda.is_available(): 192 | raise CUDANotFoundException( 193 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 
194 | ) 195 | 196 | torch.manual_seed(args.seed) 197 | 198 | device = torch.device("cuda") 199 | 200 | # select a single rank per node to download data 201 | is_first_local_rank = local_rank == 0 202 | if is_first_local_rank: 203 | train_dataset = datasets.MNIST( 204 | data_path, 205 | train=True, 206 | download=True, 207 | transform=transforms.Compose( 208 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 209 | ), 210 | ) 211 | dist.barrier() # prevent other ranks from accessing the data early 212 | if not is_first_local_rank: 213 | train_dataset = datasets.MNIST( 214 | data_path, 215 | train=True, 216 | download=False, 217 | transform=transforms.Compose( 218 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 219 | ), 220 | ) 221 | 222 | train_sampler = torch.utils.data.distributed.DistributedSampler( 223 | train_dataset, num_replicas=args.world_size, rank=rank 224 | ) 225 | train_loader = torch.utils.data.DataLoader( 226 | train_dataset, 227 | batch_size=args.batch_size, 228 | shuffle=False, 229 | num_workers=0, 230 | pin_memory=True, 231 | sampler=train_sampler, 232 | ) 233 | if rank == 0: 234 | test_loader = torch.utils.data.DataLoader( 235 | datasets.MNIST( 236 | data_path, 237 | train=False, 238 | transform=transforms.Compose( 239 | [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] 240 | ), 241 | ), 242 | batch_size=args.test_batch_size, 243 | shuffle=True, 244 | ) 245 | 246 | model = DDP(Net().to(device)) 247 | torch.cuda.set_device(local_rank) 248 | model.cuda(local_rank) 249 | optimizer = optim.Adadelta(model.parameters(), lr=args.lr) 250 | scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma) 251 | for epoch in range(1, args.epochs + 1): 252 | train(args, model, device, train_loader, optimizer, epoch) 253 | if rank == 0: 254 | test(model, device, test_loader) 255 | scheduler.step() 256 | 257 | if args.save_model: 258 | torch.save(model.state_dict(), "mnist_cnn.pt") 259 | 260 | 261 | if __name__ == "__main__": 262 | main() 263 | -------------------------------------------------------------------------------- /04_advanced_training/pytorch/static/ss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/04_advanced_training/pytorch/static/ss.png -------------------------------------------------------------------------------- /04_advanced_training/tensorflow/static/debugger_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/04_advanced_training/tensorflow/static/debugger_output.png -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | 
from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'application/x-image': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. 
(1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 121 | predicted_class_idx = argmax(output_np) 122 | print(f' Predicted class: {predicted_class_idx}') 123 | prediction_json = {'predictions': output_np.tolist()} 124 | return json.dumps(prediction_json) 125 | 126 | def _print_image_metadata(img): 127 | # Retrieve the attributes of the image 128 | fileFormat = img.format 129 | imageMode = img.mode 130 | imageSize = img.size # (width, height) 131 | colorPalette = img.palette 132 | 133 | print(f' File format: {fileFormat}') 134 | print(f' Image mode: {imageMode}') 135 | print(f' Image size: {imageSize}') 136 | print(f' Color pal: {colorPalette}') 137 | 138 | print(f' Keys from image.info dictionary:') 139 | for key, value in img.info.items(): 140 | print(f' {key}') -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 
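The `handler` above accepts raw image bytes when the request content type is `application/x-image`, resizes them to 224x224, and forwards the preprocessed tensor to TensorFlow Serving over gRPC or REST. As a minimal client-side sketch of calling an endpoint that uses this handler (the endpoint name and image path below are hypothetical, not values from this repository):

```python
# Minimal sketch, assuming an already-deployed endpoint named "bird-classifier"
# and a local JPEG at "test-bird.jpg"; both names are hypothetical.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("test-bird.jpg", "rb") as f:
    payload = f.read()

# The handler branches on context.request_content_type == 'application/x-image',
# so the raw image bytes are sent as the request body with that content type.
response = runtime.invoke_endpoint(
    EndpointName="bird-classifier",      # hypothetical endpoint name
    ContentType="application/x-image",
    Body=payload,
)

result = json.loads(response["Body"].read())
print(result["predictions"])             # softmax scores returned by the handler
```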
3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
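The comment above is why the training script bundles `inference.py` and `requirements.txt` under `code/` inside the model directory: a TensorFlow Serving endpoint created from the resulting `model.tar.gz` then picks up the custom handler automatically. A hedged deployment sketch follows; the artifact location, instance type, and framework version are assumptions, not values from this repository.

```python
# Sketch only: model_data, instance type, and framework_version are assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

role = sagemaker.get_execution_role()

model = TensorFlowModel(
    model_data="s3://<bucket>/bird/output/model.tar.gz",  # artifact produced by this training script
    role=role,
    framework_version="2.11",          # assumed; match the TF version used for training
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```

Because the handler and its `requirements.txt` ride inside the artifact under `code/`, no separate entry point needs to be supplied at deployment time.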
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/docker/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | FROM public.ecr.aws/docker/library/python:3.7 3 | 4 | ADD requirements.txt / 5 | 6 | RUN pip3 install -r requirements.txt 7 | 8 | ENV PYTHONUNBUFFERED=TRUE 9 | ENV TF_CPP_MIN_LOG_LEVEL="2" 10 | 11 | ENTRYPOINT ["python3"] 12 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/docker/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 
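The Dockerfile and requirements above build the custom image that this module's ScriptProcessor uses to run `evaluation.py`. A hedged sketch of wiring that up with the SageMaker Python SDK is shown below; the ECR image URI and S3 paths are placeholders, while the container-side destinations mirror the paths hard-coded in `evaluation.py`.

```python
# Rough sketch only: image_uri and the S3 URIs are placeholders to replace with
# your own; the /opt/ml/processing/... destinations match evaluation.py.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

role = sagemaker.get_execution_role()

evaluation_processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/bird-eval:latest",  # placeholder ECR image
    command=["python3"],               # matches the Dockerfile's ENTRYPOINT
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",      # assumed instance type
)

evaluation_processor.run(
    code="evaluation.py",
    inputs=[
        ProcessingInput(source="s3://<bucket>/bird/test/", destination="/opt/ml/processing/input/test"),
        ProcessingInput(source="s3://<bucket>/bird/manifest/", destination="/opt/ml/processing/input/manifest"),
        ProcessingInput(source="s3://<bucket>/bird/model/model.tar.gz", destination="/opt/ml/processing/model"),
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/evaluation", output_name="evaluation"),
    ],
)
```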
3 | Pillow 4 | scikit-learn 5 | pandas 6 | numpy==1.21.6 7 | tensorflow==2.11.1 8 | boto3==1.18.4 9 | sagemaker-experiments 10 | matplotlib==3.4.2 11 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05a_model_evaluation/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import pandas as pd 4 | import argparse 5 | import pathlib 6 | import json 7 | import os 8 | import numpy as np 9 | import tarfile 10 | import uuid 11 | 12 | from PIL import Image 13 | 14 | from sklearn.metrics import ( 15 | accuracy_score, 16 | precision_score, 17 | recall_score, 18 | confusion_matrix, 19 | f1_score 20 | ) 21 | 22 | from tensorflow import keras 23 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 24 | from tensorflow.keras.preprocessing import image 25 | # from smexperiments import tracker 26 | 27 | logger = logging.getLogger() 28 | logger.setLevel(logging.INFO) 29 | logger.addHandler(logging.StreamHandler()) 30 | 31 | input_path = "/opt/ml/processing/input/test" #"output/test" # 32 | manifest_path = "/opt/ml/processing/input/manifest/test.csv"#"output/manifest/test.csv" 33 | model_path = "/opt/ml/processing/model" #"model" # 34 | output_path = '/opt/ml/processing/output' #"output" # 35 | 36 | HEIGHT=224; WIDTH=224 37 | 38 | def predict_bird_from_file_new(fn, model): 39 | 40 | img = Image.open(fn).convert('RGB') 41 | 42 | img = img.resize((WIDTH, HEIGHT)) 43 | img_array = image.img_to_array(img) #, data_format = "channels_first") 44 | 45 | x = img_array.reshape((1,) + img_array.shape) 46 | instance = preprocess_input(x) 47 | 48 | del x, img 49 | 50 | result = model.predict(instance) 51 | 52 | predicted_class_idx = np.argmax(result) 53 | confidence = result[0][predicted_class_idx] 54 | 55 | return predicted_class_idx, confidence 56 | 57 | if __name__ == "__main__": 58 | parser = argparse.ArgumentParser() 59 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 60 | args, _ = parser.parse_known_args() 61 | 62 | logger.debug("Extracting the model") 63 | 64 | model_file = os.path.join(model_path, args.model_file) 65 | file = tarfile.open(model_file) 66 | file.extractall(model_path) 67 | 68 | file.close() 69 | 70 | logger.debug("Load model") 71 | 72 | model = keras.models.load_model("{}/1".format(model_path)) 73 | 74 | logger.debug("Starting evaluation.") 75 | 76 | # load test data. 
this should be an argument 77 | df = pd.read_csv(manifest_path) 78 | 79 | num_images = df.shape[0] 80 | 81 | class_name_list = sorted(df['class_id'].unique().tolist()) 82 | 83 | class_name = pd.Series(df['class_name'].values,index=df['class_id']).to_dict() 84 | 85 | logger.debug('Testing {} images'.format(df.shape[0])) 86 | num_errors = 0 87 | preds = [] 88 | acts = [] 89 | for i in range(df.shape[0]): 90 | fname = df.iloc[i]['image_file_name'] 91 | act = int(df.iloc[i]['class_id']) - 1 92 | acts.append(act) 93 | 94 | pred, conf = predict_bird_from_file_new(input_path + '/' + fname, model) 95 | preds.append(pred) 96 | if (pred != act): 97 | num_errors += 1 98 | logger.debug('ERROR on image index {} -- Pred: {} {:.2f}, Actual: {}'.format(i, 99 | class_name_list[pred], conf, 100 | class_name_list[act])) 101 | precision = precision_score(acts, preds, average='micro') 102 | recall = recall_score(acts, preds, average='micro') 103 | accuracy = accuracy_score(acts, preds) 104 | cnf_matrix = confusion_matrix(acts, preds, labels=range(len(class_name_list))) 105 | f1 = f1_score(acts, preds, average='micro') 106 | 107 | logger.debug("Accuracy: {}".format(accuracy)) 108 | logger.debug("Precision: {}".format(precision)) 109 | logger.debug("Recall: {}".format(recall)) 110 | logger.debug("Confusion matrix: {}".format(cnf_matrix)) 111 | logger.debug("F1 score: {}".format(f1)) 112 | 113 | logger.debug(cnf_matrix) 114 | 115 | matrix_output = dict() 116 | 117 | for i in range(len(cnf_matrix)): 118 | matrix_row = dict() 119 | for j in range(len(cnf_matrix[0])): 120 | matrix_row[class_name[class_name_list[j]]] = int(cnf_matrix[i][j]) 121 | matrix_output[class_name[class_name_list[i]]] = matrix_row 122 | 123 | 124 | report_dict = { 125 | "multiclass_classification_metrics": { 126 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 127 | "precision": {"value": precision, "standard_deviation": "NaN"}, 128 | "recall": {"value": recall, "standard_deviation": "NaN"}, 129 | "f1": {"value": f1, "standard_deviation": "NaN"}, 130 | "confusion_matrix":matrix_output 131 | }, 132 | } 133 | 134 | output_dir = "/opt/ml/processing/evaluation" 135 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 136 | 137 | evaluation_path = f"{output_dir}/evaluation.json" 138 | with open(evaluation_path, "w") as f: 139 | f.write(json.dumps(report_dict)) 140 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if 
PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'image/jpeg': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. (1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported different content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | print ('output shape' , output_shape) 
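This 05b variant of the handler accepts `image/jpeg` requests and, a few lines below, flattens the probability vector into a plain JSON list, which is the shape SageMaker Clarify expects when it perturbs image segments and re-invokes the model. The following is a rough sketch of the kind of Clarify configuration the explainability notebook drives; every name, path, and numeric value here is an illustrative assumption, so check the notebook and the SageMaker SDK documentation for the exact arguments.

```python
# Illustrative only -- model name, S3 URIs, and numeric settings are assumptions.
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)

model_config = clarify.ModelConfig(
    model_name="bird-classifier-model",   # hypothetical SageMaker Model name
    instance_type="ml.m5.xlarge",
    instance_count=1,
    content_type="image/jpeg",            # matches the branch in the handler above
)

image_config = clarify.ImageConfig(
    model_type="IMAGE_CLASSIFICATION",
    num_segments=20,                      # number of superpixels to perturb (assumed value)
    segment_compactness=5,
)

shap_config = clarify.SHAPConfig(
    num_samples=500,                      # assumed sampling budget
    image_config=image_config,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://<bucket>/bird/test-images/",
    s3_output_path="s3://<bucket>/bird/clarify-output/",
    dataset_type="application/x-image",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```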
121 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 122 | print ('output_np ' , output_np) 123 | predicted_class_idx = argmax(output_np) 124 | print(f' Predicted class: {predicted_class_idx}') 125 | prediction_json = np.reshape (output_np, -1).tolist() 126 | print('this is what output predictions looks like ', np.reshape (output_np, -1).tolist() ) 127 | print('this is what output predictions looks like ', prediction_json) 128 | return json.dumps(prediction_json) 129 | 130 | def _print_image_metadata(img): 131 | # Retrieve the attributes of the image 132 | fileFormat = img.format 133 | imageMode = img.mode 134 | imageSize = img.size # (width, height) 135 | colorPalette = img.palette 136 | 137 | print(f' File format: {fileFormat}') 138 | print(f' Image mode: {imageMode}') 139 | print(f' Image size: {imageSize}') 140 | print(f' Color pal: {colorPalette}') 141 | 142 | print(f' Keys from image.info dictionary:') 143 | for key, value in img.info.items(): 144 | print(f' {key}') -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 
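The training script that continues below (it mirrors the 05a version) reads its hyperparameters from command-line arguments and its data from the `SM_CHANNEL_TRAIN`/`SM_CHANNEL_TEST`/`SM_CHANNEL_VALIDATION` locations that SageMaker populates. As a hedged sketch of how such a script is typically launched, bucket names, the instance type, and the framework/Python version strings below are assumptions rather than values taken from this repository.

```python
# Sketch only: S3 URIs, instance type, and framework/python versions are assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = TensorFlow(
    entry_point="train-mobilenet.py",
    source_dir="code",                     # ships inference.py and requirements.txt with the script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",              # assumed; align with the TF pin in requirements.txt
    py_version="py39",
    hyperparameters={                      # surfaced to the script as --<name> arguments
        "initial_epochs": 5,
        "fine_tuning_epochs": 30,
        "batch_size": 8,
        "dropout": 0.5,
        "s3_checkpoint_path": f"s3://{bucket}/bird/checkpoints",
    },
    checkpoint_s3_uri=f"s3://{bucket}/bird/checkpoints",  # lets SageMaker sync /opt/ml/checkpoints
)

# Each channel key becomes an SM_CHANNEL_<KEY> directory inside the container.
estimator.fit({
    "train": f"s3://{bucket}/bird/train",
    "test": f"s3://{bucket}/bird/test",
    "validation": f"s3://{bucket}/bird/validation",
})
```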
13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
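Earlier in this same script, `load_checkpoint_model` picks the checkpoint to resume from by pulling the zero-padded epoch number out of each filename with a regular expression; the filenames follow the `img-classifier.{epoch:03d}.h5` pattern set further down. A tiny standalone sketch of that convention, using made-up filenames:

```python
# Standalone illustration of the checkpoint-resume naming convention;
# the filenames are hypothetical examples.
import re

files = ["img-classifier.003.h5", "img-classifier.010.h5", "img-classifier.007.h5"]

epoch_numbers = [re.search(r"(?<=\.)(.*[0-9])(?=\.)", f).group() for f in files]
latest = max(epoch_numbers)                 # string compare works because epochs are zero-padded
latest_file = files[epoch_numbers.index(latest)]

print(latest_file, int(latest))             # -> img-classifier.010.h5 10
```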
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/05b_model_explainability/preprocessing.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = 
lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | 65 | full_df.reset_index(inplace=True, drop=True) 66 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 67 | 68 | # grab a small subset of species for testing 69 | criteria = full_df['orig_class_id'].isin(classes) 70 | full_df = full_df[criteria] 71 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 72 | 73 | unique_classes = full_df['orig_class_id'].drop_duplicates() 74 | sorted_unique_classes = sorted(unique_classes) 75 | id_to_one_based = {} 76 | i = 1 77 | for c in sorted_unique_classes: 78 | id_to_one_based[c] = str(i) 79 | i += 1 80 | 81 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 82 | full_df.reset_index(inplace=True, drop=True) 83 | 84 | def get_class_name(fn): 85 | return fn.split('/')[0] 86 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 87 | full_df = full_df.drop(['image_pretty_name'], axis=1) 88 | 89 | train_df = [] 90 | test_df = [] 91 | val_df = [] 92 | 93 | # split into training and validation sets 94 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 95 | 96 | print('num images total: ' + str(images_df.shape[0])) 97 | print('\nnum train: ' + str(train_df.shape[0])) 98 | print('num val: ' + str(val_df.shape[0])) 99 | print('num test: ' + str(test_df.shape[0])) 100 | return train_df, val_df, test_df 101 | 102 | # this function copy images by channel to its destination folder 103 | def copy_files_for_channel(df, channel_name, verbose=False): 104 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 105 | for i in range(df.shape[0]): 106 | target_fname = df.iloc[i]['image_file_name'] 107 | # if verbose: 108 | # print(target_fname) 109 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 110 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 111 | shutil.copyfile(src, dst) 112 | 113 | if __name__ == "__main__": 114 | parser = argparse.ArgumentParser() 115 | parser.add_argument("--classes", type=str, default="") 116 | parser.add_argument("--input-data", type=str, default="classes.txt") 117 | 
args, _ = parser.parse_known_args() 118 | 119 | 120 | c_list = args.classes.split(',') 121 | input_data = args.input_data 122 | 123 | CLASSES_FILE = os.path.join(input_path, input_data) 124 | 125 | CLASS_COLS = ['class_number','class_id'] 126 | 127 | if len(c_list)==0: 128 | # Otherwise, you can use the full set of species 129 | CLASSES = [] 130 | for c in range(200): 131 | CLASSES += [c + 1] 132 | prefix = prefix + '-full' 133 | else: 134 | CLASSES = list(map(int, c_list)) 135 | 136 | 137 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 138 | 139 | criteria = classes_df['class_number'].isin(CLASSES) 140 | classes_df = classes_df[criteria] 141 | 142 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 143 | print(class_name_list) 144 | 145 | 146 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, SPLIT_RATIOS) 147 | 148 | for c in class_name_list: 149 | os.mkdir('{}/{}/{}'.format(output_path, 'valid', c)) 150 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 151 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 152 | 153 | copy_files_for_channel(val_df, 'valid') 154 | copy_files_for_channel(test_df, 'test') 155 | copy_files_for_channel(train_df, 'train') 156 | 157 | # export manifest file for validation 158 | train_m_file = "{}/manifest/train.csv".format(output_path) 159 | train_df.to_csv(train_m_file, index=False) 160 | test_m_file = "{}/manifest/test.csv".format(output_path) 161 | test_df.to_csv(test_m_file, index=False) 162 | val_m_file = "{}/manifest/valid.csv".format(output_path) 163 | val_df.to_csv(val_m_file, index=False) 164 | 165 | print("Finished running processing job") 166 | -------------------------------------------------------------------------------- /05_model_evaluation_and_model_explainability/README.md: -------------------------------------------------------------------------------- 1 | # Model Evaluation and Model Explainability 2 | 3 | Amazon SageMaker Processing is a managed solution for various machine learning(ML) processes. These include feature engineering, data validation, model evaluation, model explainability, and many other essential steps performed by engineers and data scientists during their ML workflow. 4 | 5 | Your custom processing scripts will be containerized running in infrastructure that is created on demand and terminated automatically. You can choose to use a SageMaker optimized containers for popular data processing or model evaluation frameworks like Scikit learn, PySpark etc. or Bring Your Own Containers (BYOC). 6 | 7 | ![Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png) 8 | 9 | 10 | # The notebook in Module 5a describes how to evaluate your ML model 11 | 12 | # Introduction 13 | 14 | Postprocess and model evaluation are important steps before deployment the model to production. In this module you will use ScriptProcessor from SageMaker Processing to evaluate the model performance after the training. 15 | 16 | To setup your ScriptProcessor, we will build a custom container for a model evaluation script which will Load the tensorflow model, load the test dataset and annotation (either from previous module or run the :code[optional-prepare-data-and-model.ipynb] notebook), and then run predicition and generate the confusion matrix. 17 | 18 | --- 19 | # Prerequisites 20 | 21 | Download the notebook into your environment, and you can run it by simply execute each cell in order. 
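The evaluation script shown earlier writes its metrics report to `/opt/ml/processing/evaluation/evaluation.json`. As a minimal sketch of consuming that report once it has been downloaded from the processing job's S3 output (the local file name is an assumption), before the prerequisites list continues below:

```python
# Minimal sketch: assumes evaluation.json was downloaded locally from the
# processing job's S3 output; the keys mirror report_dict in evaluation.py.
import json

with open("evaluation.json") as f:
    report = json.load(f)

metrics = report["multiclass_classification_metrics"]
print("accuracy  :", metrics["accuracy"]["value"])
print("f1 (micro):", metrics["f1"]["value"])

# The confusion matrix is nested as {actual_class: {predicted_class: count}}.
for actual, row in metrics["confusion_matrix"].items():
    count, predicted = max((cnt, pred) for pred, cnt in row.items() if pred != actual)
    print(f"{actual}: most often confused with {predicted} ({count} images)")
```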
To understand what's happening, you'll need: 22 | 23 | - Access to the SageMaker default S3 bucket. 24 | - Familiarity with Python and numpy. 25 | - Basic familiarity with AWS S3. 26 | - Basic understanding of Amazon SageMaker. 27 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 28 | - SageMaker Studio is preferred for the full UI integration. 29 | 30 | --- 31 | 32 | ### Dataset & Inputs 33 | 34 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). If you kept your artifacts from previous labs, then simply update the S3 locations below for your test images and test data annotation file. If you do not have them, just run the `optional-prepare-data-and-model.ipynb` notebook to generate the files, and then update the paths below. 35 | 36 | - S3 path for test image data 37 | - S3 path for test data annotation file 38 | - S3 path for the bird classification model 39 | 40 | --- 41 | # Review Outputs 42 | 43 | At the end of the lab, you will generate a JSON file containing the performance metrics (accuracy, precision, recall, F1, and confusion matrix) on your test dataset. 44 | 45 | 46 | # The notebook in Module 5b describes Image Classification ML model explainability using SageMaker Clarify 47 | 48 | Amazon SageMaker Clarify provides machine learning (ML) developers with purpose-built tools to gain greater insights into their ML training data and models. SageMaker Clarify detects and measures potential bias using a variety of metrics so that developers can address that bias and explain model predictions. 49 | 50 | Amazon SageMaker Clarify can detect potential bias during data preparation, after model training, and in your deployed model. For instance, you can check for bias related to age in your dataset or in your trained model and receive a detailed report that quantifies different types of potential bias. SageMaker Clarify also includes feature importance scores that help you explain how your model makes predictions, and it produces explainability reports in bulk or in real time via online explainability. You can use these reports to support customer or internal presentations, or to identify potential issues with your model. 51 | 52 | 53 | 54 | --- 55 | 56 | # Introduction 57 | 58 | # Explain model predictions 59 | Understand which features contributed the most to model predictions. 60 | 61 | SageMaker Clarify is integrated with SageMaker Experiments to provide scores detailing which features contributed the most to your model's prediction on a particular input for tabular, NLP, and computer vision models. For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart which provides insights into the overall prediction process of the model. These details can help determine if a particular model input has more influence than expected on overall model behavior. For tabular data, in addition to the feature importance scores, you can also use partial dependence plots (PDP) to show the dependence of the predicted target response on a set of input features of interest. 62 | 63 | --- 64 | # Prerequisites 65 | 66 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 67 | 68 | - Access to the SageMaker default S3 bucket. 69 | - Familiarity with Python and numpy. 70 | - Basic familiarity with AWS S3.
71 | - Basic understanding of Amazon SageMaker. 72 | - Basic understanding of Amazon SageMaker Clarify. 73 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 74 | - SageMaker Studio is preferred for the full UI integration. 75 | 76 | --- 77 | 78 | ### Dataset & Inputs 79 | 80 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). If you kept your artifacts from previous labs, then simply update the S3 locations below for your test images and test data annotation file. 81 | 82 | - S3 path for test image data 83 | - S3 path for test data annotation file 84 | - S3 path for the bird classification model 85 | 86 | --- 87 | # Review Outputs 88 | 89 | At the end of the lab, you will download the results of the Clarify explainability job and see the SHAP values as well as a heat map of the pixels that contributed to your image classification. 90 | -------------------------------------------------------------------------------- /06_training_pipeline/README.md: -------------------------------------------------------------------------------- 1 | # Training Pipeline 2 | 3 | Machine learning (ML) workflow orchestration is an important milestone in operationalizing ML solutions for production. This usually entails connecting and automating various ML steps, which include exploring and preparing data, experimenting with different algorithms and parameters, training and tuning models, and deploying models to production. This is why purpose-built tools like [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) and [SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html) are important to help you automate, manage, and standardize ML orchestration workflows, so you can quickly productionalize and scale your ML solutions. 4 | 5 | --- 6 | # Introduction 7 | 8 | This module demonstrates how to build a reusable computer vision (CV) workflow using SageMaker Pipelines. The pipeline goes through preprocessing, training, and evaluation steps for 2 different training jobs: 1) spot training and 2) on-demand training. If the accuracy meets certain requirements, the models are then registered with the SageMaker Model Registry. 9 | 10 | ![Training Pipeline](statics/cv-training-pipeline.png) 11 | 12 | [SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) is the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). It works on the concept of steps. The order steps are executed in is inferred from the dependencies each step has. If a step has a dependency on the output from a previous step, it's not executed until after that step has completed successfully. This also allows SageMaker to create a Directed Acyclic Graph (DAG) that can be visualized in Amazon SageMaker Studio (see diagram above). The DAG can be used to track pipeline executions, inputs/outputs, and metrics, giving users the full lineage of the model creation. 13 | 14 | ![Spot Training](statics/cost-explore.png) 15 | 16 | This module will also illustrate the concept of spot training. The pipeline trains two identical models in parallel, one using on-demand instances and the other using spot instances. Training workloads are tagged accordingly with `TrainingType: Spot` or `TrainingType: OnDemand`.
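As a hedged illustration of how the two training jobs typically differ (the actual step definitions live in `pipeline.py`; the instance type, time limits, and bucket name below are assumptions), the spot variant usually only adds the spot flags, checkpointing, and the cost-allocation tag on top of the on-demand configuration:

```python
# Hedged sketch: instance type, timing limits, and S3 paths are assumptions;
# see pipeline.py for the real step definitions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

common = dict(
    entry_point="train-mobilenet.py",
    source_dir="code",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)

on_demand_estimator = TensorFlow(
    **common,
    tags=[{"Key": "TrainingType", "Value": "OnDemand"}],
)

spot_estimator = TensorFlow(
    **common,
    use_spot_instances=True,            # use spare capacity instead of on-demand
    max_run=3600,                       # training time limit in seconds
    max_wait=7200,                      # must be >= max_run; includes time spent waiting for capacity
    checkpoint_s3_uri="s3://<bucket>/bird/spot-checkpoints",  # lets interrupted jobs resume
    tags=[{"Key": "TrainingType", "Value": "Spot"}],
)
```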
If you are interested and have permission to access the billing for your AWS account, you can visualize the cost savings from spot training in Cost Explorer. To enable custom cost allocation tags, please follow this [AWS documentation](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/activating-tags.html). It takes 12-48 hours for the new tag to show up in Cost Explorer. 17 | 18 | --- 19 | 20 | Optionally, you can also try SageMaker Projects and build a repeatable pattern from your pipeline. It is a great way to standardize and scale ML solutions for your teams and organizations. 21 | 22 | ![Studio Template](statics/project-template.png) 23 | 24 | ![Studio Parameters](statics/project-parameter.png) 25 | 26 | An Amazon SageMaker project template automates the setup and implementation of MLOps patterns. SageMaker already provides a set of project templates 27 | that create the infrastructure you need for an MLOps solution for CI/CD of your ML models ([more on the provided project templates](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html)). If the SageMaker-provided templates do not meet your needs, you can also create your own templates. 28 | 29 | For that, you need the pipeline definition file. It is a JSON file that defines a series of interconnected steps as a directed acyclic graph (DAG) and specifies the requirements of and relationships between each step of your pipeline. This JSON file, along with a CloudFormation template, also allows you to manage your pipeline and its underlying infrastructure as code, so you can establish reusable CI/CD patterns to standardize across your teams. 30 | 31 | A CloudFormation template, `sagemaker-project.yaml`, is supplied with this week's lab. 32 | 33 | It will generate the following resources: 34 | * An S3 bucket - with a Lambda PUT notification enabled 35 | * A Lambda function - when triggered, it creates/updates a SageMaker pipeline and executes it 36 | * A SageMaker pipeline - generated from the JSON definition file 37 | * A SageMaker Model Group - to house the completed models 38 | 39 | Note: for lab purposes, the role permissions are intentionally loose. You will likely use tighter permission controls than what's provided here. 40 | 41 | To get started, load the provided Jupyter notebook and associated files into your SageMaker Studio environment. 42 | 43 | --- 44 | 45 | ## Prerequisites 46 | 47 | To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 48 | 49 | - Access to the SageMaker default S3 bucket 50 | - Access to Elastic Container Registry (ECR) 51 | - For the optional portion of this lab, you will need access to CloudFormation, Service Catalog, and Cost Explorer 52 | - Familiarity with training on Amazon SageMaker 53 | - Familiarity with Python 54 | - Familiarity with AWS S3 55 | - Basic understanding of CloudFormation and the concept of deploying infrastructure as code 56 | - Basic understanding of tagging and cost governance 57 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 58 | - SageMaker Studio is preferred for the full UI integration 59 | 60 | --- 61 | ## Dataset 62 | The dataset we are using is the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels.
Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 63 | 64 | ![Bird Dataset](statics/birds.png) 65 | 66 | Run the notebook to download the full dataset or download manually [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB, and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-download and re-process the data. 67 | -------------------------------------------------------------------------------- /06_training_pipeline/code/inference.py: -------------------------------------------------------------------------------- 1 | print('******* in inference.py *******') 2 | import tensorflow as tf 3 | print(f'TensorFlow version is: {tf.version.VERSION}') 4 | 5 | from tensorflow.keras.preprocessing import image 6 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 7 | print(f'Keras version is: {tf.keras.__version__}') 8 | 9 | import io 10 | import base64 11 | import json 12 | import numpy as np 13 | from numpy import argmax 14 | from collections import namedtuple 15 | from PIL import Image 16 | import time 17 | import requests 18 | 19 | # Imports for GRPC invoke on TFS 20 | import grpc 21 | from tensorflow.compat.v1 import make_tensor_proto 22 | from tensorflow_serving.apis import predict_pb2 23 | from tensorflow_serving.apis import prediction_service_pb2_grpc 24 | 25 | import os 26 | # default to use of GRPC 27 | PREDICT_USING_GRPC = os.environ.get('PREDICT_USING_GRPC', 'true') 28 | if PREDICT_USING_GRPC == 'true': 29 | USE_GRPC = True 30 | else: 31 | USE_GRPC = False 32 | 33 | MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024 34 | 35 | HEIGHT = 224 36 | WIDTH = 224 37 | 38 | # Restrict memory growth on GPU's 39 | physical_gpus = tf.config.experimental.list_physical_devices('GPU') 40 | if physical_gpus: 41 | try: 42 | # Currently, memory growth needs to be the same across GPUs 43 | for gpu in physical_gpus: 44 | tf.config.experimental.set_memory_growth(gpu, True) 45 | logical_gpus = tf.config.experimental.list_logical_devices('GPU') 46 | print(len(physical_gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs') 47 | except RuntimeError as e: 48 | # Memory growth must be set before GPUs have been initialized 49 | print(e) 50 | else: 51 | print('**** NO physical GPUs') 52 | 53 | 54 | num_inferences = 0 55 | print(f'num_inferences: {num_inferences}') 56 | 57 | Context = namedtuple('Context', 58 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 59 | 'custom_attributes, request_content_type, accept_header') 60 | 61 | def handler(data, context): 62 | 63 | global num_inferences 64 | num_inferences += 1 65 | 66 | print(f'\n************ inference #: {num_inferences}') 67 | if context.request_content_type == 'application/x-image': 68 | stream = io.BytesIO(data.read()) 69 | img = Image.open(stream).convert('RGB') 70 | _print_image_metadata(img) 71 | 72 | img = img.resize((WIDTH, HEIGHT)) 73 | img_array = image.img_to_array(img) #, data_format = "channels_first") 74 | # the image is now in an array of shape (224, 224, 3) or (3, 224, 224) based on data_format 75 | # need to expand it to add dim for num samples, e.g. 
(1, 224, 224, 3) 76 | x = img_array.reshape((1,) + img_array.shape) 77 | instance = preprocess_input(x) 78 | print(f' final image shape: {instance.shape}') 79 | del x, img 80 | else: 81 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 82 | 83 | start_time = time.time() 84 | 85 | if USE_GRPC: 86 | prediction = _predict_using_grpc(context, instance) 87 | 88 | else: # use TFS REST API 89 | inst_json = json.dumps({'instances': instance.tolist()}) 90 | response = requests.post(context.rest_uri, data=inst_json) 91 | if response.status_code != 200: 92 | raise Exception(response.content.decode('utf-8')) 93 | prediction = response.content 94 | 95 | end_time = time.time() 96 | latency = int((end_time - start_time) * 1000) 97 | print(f'=== TFS invoke took: {latency} ms') 98 | 99 | response_content_type = context.accept_header 100 | return prediction, response_content_type 101 | 102 | def _return_error(code, message): 103 | raise ValueError('Error: {}, {}'.format(str(code), message)) 104 | 105 | def _predict_using_grpc(context, instance): 106 | request = predict_pb2.PredictRequest() 107 | request.model_spec.name = 'model' 108 | request.model_spec.signature_name = 'serving_default' 109 | 110 | request.inputs['input_1'].CopyFrom(make_tensor_proto(instance)) 111 | options = [ 112 | ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH), 113 | ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH) 114 | ] 115 | channel = grpc.insecure_channel(f'0.0.0.0:{context.grpc_port}', options=options) 116 | stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) 117 | result_future = stub.Predict.future(request, 30) # 5 seconds 118 | output_tensor_proto = result_future.result().outputs['output'] 119 | output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim] 120 | output_np = np.array(output_tensor_proto.float_val).reshape(output_shape) 121 | predicted_class_idx = argmax(output_np) 122 | print(f' Predicted class: {predicted_class_idx}') 123 | prediction_json = {'predictions': output_np.tolist()} 124 | return json.dumps(prediction_json) 125 | 126 | def _print_image_metadata(img): 127 | # Retrieve the attributes of the image 128 | fileFormat = img.format 129 | imageMode = img.mode 130 | imageSize = img.size # (width, height) 131 | colorPalette = img.palette 132 | 133 | print(f' File format: {fileFormat}') 134 | print(f' Image mode: {imageMode}') 135 | print(f' Image size: {imageSize}') 136 | print(f' Color pal: {colorPalette}') 137 | 138 | print(f' Keys from image.info dictionary:') 139 | for key, value in img.info.items(): 140 | print(f' {key}') -------------------------------------------------------------------------------- /06_training_pipeline/code/requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | tensorflow-gpu -------------------------------------------------------------------------------- /06_training_pipeline/code/requirements.txt: -------------------------------------------------------------------------------- 1 | # This is the set of Python packages that will get pip installed 2 | # at startup of the Amazon SageMaker endpoint or batch transformation. 3 | Pillow 4 | numpy==1.21.6 5 | tensorflow==2.11.1 -------------------------------------------------------------------------------- /06_training_pipeline/code/train-mobilenet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. 
or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import tensorflow as tf 15 | import tensorflow.keras 16 | 17 | TF_VERSION = tf.version.VERSION 18 | print('TF version: {}'.format(tf.__version__)) 19 | print('Keras version: {}'.format(tensorflow.keras.__version__)) 20 | 21 | from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input 22 | LAST_FROZEN_LAYER = 20 23 | 24 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 25 | from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, GlobalAveragePooling2D 26 | from tensorflow.keras.models import Sequential, Model, load_model 27 | from tensorflow.keras.optimizers import SGD, Adam, RMSprop 28 | from tensorflow.keras.callbacks import ModelCheckpoint 29 | from tensorflow.keras import backend as K 30 | 31 | import numpy as np 32 | import os 33 | import re 34 | import json 35 | import argparse 36 | import glob 37 | import boto3 38 | 39 | HEIGHT = 224 40 | WIDTH = 224 41 | 42 | class PurgeCheckpointsCallback(tensorflow.keras.callbacks.Callback): 43 | def __init__(self, args): 44 | """ Save params in constructor 45 | """ 46 | self.args = args 47 | 48 | def on_epoch_end(self, epoch, logs=None): 49 | files = sorted([f for f in os.listdir(self.args.checkpoint_path) if f.endswith('.' + 'h5')]) 50 | keep = 3 51 | print(f' End of epoch {epoch + 1}. Removing old checkpoints...') 52 | for i in range(0, len(files) - keep - 1): 53 | print(f' local: {files[i]}') 54 | os.remove(f'{self.args.checkpoint_path}/{files[i]}') 55 | 56 | s3_uri_parts = self.args.s3_checkpoint_path.split('/') 57 | bucket = s3_uri_parts[2] 58 | key = '/'.join(s3_uri_parts[3:]) + '/' + files[i] 59 | 60 | print(f' s3 bucket: {bucket}, key: {key}') 61 | s3_client = boto3.client('s3') 62 | s3_client.delete_object(Bucket=bucket, Key=key) 63 | print(f' Done\n') 64 | 65 | def load_checkpoint_model(checkpoint_path): 66 | files = sorted([f for f in os.listdir(checkpoint_path) if f.endswith('.' 
+ 'h5')]) 67 | epoch_numbers = [re.search('(?<=\.)(.*[0-9])(?=\.)',f).group() for f in files] 68 | 69 | max_epoch_number = max(epoch_numbers) 70 | max_epoch_index = epoch_numbers.index(max_epoch_number) 71 | max_epoch_filename = files[max_epoch_index] 72 | 73 | print('\nList of available checkpoints:') 74 | print('------------------------------------') 75 | [print(f) for f in files] 76 | print('------------------------------------') 77 | print(f'Checkpoint file for latest epoch: {max_epoch_filename}') 78 | print(f'Resuming training from epoch: {max_epoch_number}') 79 | print('------------------------------------') 80 | 81 | resume_model = load_model(f'{checkpoint_path}/{max_epoch_filename}') 82 | return resume_model, int(max_epoch_number) 83 | 84 | def build_finetune_model(base_model, dropout, fc_layers, num_classes): 85 | # Freeze all base layers 86 | for layer in base_model.layers: 87 | layer.trainable = False 88 | 89 | x = base_model.output 90 | x = Flatten()(x) 91 | for fc in fc_layers: 92 | x = Dense(fc, activation='relu')(x) 93 | if (dropout != 0.0): 94 | x = Dropout(dropout)(x) 95 | 96 | # New softmax layer 97 | predictions = Dense(num_classes, activation='softmax', name='output')(x) 98 | 99 | finetune_model = Model(inputs=base_model.input, outputs=predictions) 100 | 101 | return finetune_model 102 | 103 | def create_data_generators(args): 104 | train_datagen = ImageDataGenerator( 105 | preprocessing_function=preprocess_input, 106 | rotation_range=70, 107 | brightness_range=(0.6, 1.0), 108 | width_shift_range=0.3, 109 | height_shift_range=0.3, 110 | shear_range=0.3, 111 | zoom_range=0.3, 112 | horizontal_flip=True, 113 | vertical_flip=False) 114 | val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 115 | test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) 116 | 117 | train_gen = train_datagen.flow_from_directory('/opt/ml/input/data/train', 118 | target_size=(HEIGHT, WIDTH), 119 | batch_size=args.batch_size) 120 | test_gen = train_datagen.flow_from_directory('/opt/ml/input/data/test', 121 | target_size=(HEIGHT, WIDTH), 122 | batch_size=args.batch_size) 123 | val_gen = train_datagen.flow_from_directory('/opt/ml/input/data/validation', 124 | target_size=(HEIGHT, WIDTH), 125 | batch_size=args.batch_size) 126 | return train_gen, test_gen, val_gen 127 | 128 | def save_model_artifacts(model, model_dir): 129 | print(f'Saving model to {model_dir}...') 130 | # Note that this method of saving does produce a warning about not containing the train and evaluate graphs. 131 | # The resulting saved model works fine for inference. It will simply not support incremental training. If that 132 | # is needed, one can use model checkpoints and save those. 133 | print('Model directory files BEFORE save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 134 | if tf.version.VERSION[0] == '2': 135 | model.save(f'{model_dir}/1', save_format='tf') 136 | else: 137 | tf.contrib.saved_model.save_keras_model(model, f'{model_dir}/1') 138 | print('Model directory files AFTER save: {}'.format(glob.glob(f'{model_dir}/*/*'))) 139 | print('...DONE saving model!') 140 | 141 | # Need to copy these files to the code directory, else the SageMaker endpoint will not use them. 
142 | print('Copying inference source files...') 143 | 144 | if not os.path.exists(f'{model_dir}/code'): 145 | os.system(f'mkdir {model_dir}/code') 146 | os.system(f'cp inference.py {model_dir}/code') 147 | os.system(f'cp requirements.txt {model_dir}/code') 148 | print('Files after copying custom inference handler files: {}'.format(glob.glob(f'{model_dir}/code/*'))) 149 | 150 | def main(args): 151 | sm_training_env_json = json.loads(os.environ.get('SM_TRAINING_ENV')) 152 | is_master = sm_training_env_json['is_master'] 153 | print('is_master {}'.format(is_master)) 154 | 155 | # Create data generators for feeding training and evaluation based on data provided to us 156 | # by the SageMaker TensorFlow container 157 | train_gen, test_gen, val_gen = create_data_generators(args) 158 | 159 | base_model = MobileNetV2(weights='imagenet', 160 | include_top=False, 161 | input_shape=(HEIGHT, WIDTH, 3)) 162 | 163 | # Here we extend the base model with additional fully connected layers, dropout for avoiding 164 | # overfitting to the training dataset, and a classification layer 165 | fully_connected_layers = [] 166 | for i in range(args.num_fully_connected_layers): 167 | fully_connected_layers.append(1024) 168 | 169 | num_classes = len(glob.glob('/opt/ml/input/data/train/*')) 170 | model = build_finetune_model(base_model, 171 | dropout=args.dropout, 172 | fc_layers=fully_connected_layers, 173 | num_classes=num_classes) 174 | 175 | opt = RMSprop(lr=args.initial_lr) 176 | model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy']) 177 | 178 | print('\nBeginning training...') 179 | 180 | NUM_EPOCHS = args.fine_tuning_epochs 181 | 182 | num_train_images = len(train_gen.filepaths) 183 | num_val_images = len(val_gen.filepaths) 184 | 185 | num_hosts = len(args.hosts) 186 | train_steps = num_train_images // args.batch_size // num_hosts 187 | val_steps = num_val_images // args.batch_size // num_hosts 188 | 189 | print('Batch size: {}, Train Steps: {}, Val Steps: {}'.format(args.batch_size, train_steps, val_steps)) 190 | 191 | # If there are no checkpoints found, must be starting from scratch, so load the pretrained model 192 | checkpoints_avail = os.path.isdir(args.checkpoint_path) and os.listdir(args.checkpoint_path) 193 | if not checkpoints_avail: 194 | if not os.path.isdir(args.checkpoint_path): 195 | os.mkdir(args.checkpoint_path) 196 | 197 | # Train for a few epochs 198 | model.fit_generator(train_gen, epochs=args.initial_epochs, workers=8, 199 | steps_per_epoch=train_steps, 200 | validation_data=val_gen, validation_steps=val_steps, 201 | shuffle=True) 202 | 203 | # Now fine tune the last set of layers in the model 204 | for layer in model.layers[LAST_FROZEN_LAYER:]: 205 | layer.trainable = True 206 | 207 | initial_epoch_number = 0 208 | 209 | # Otherwise, start from the latest checkpoint 210 | else: 211 | model, initial_epoch_number = load_checkpoint_model(args.checkpoint_path) 212 | 213 | fine_tuning_lr = args.fine_tuning_lr 214 | model.compile(optimizer=SGD(lr=fine_tuning_lr, momentum=0.9), 215 | loss='categorical_crossentropy', metrics=['accuracy']) 216 | 217 | callbacks = [] 218 | if not (args.s3_checkpoint_path == ''): 219 | checkpoint_names = 'img-classifier.{epoch:03d}.h5' 220 | checkpoint_callback = ModelCheckpoint(filepath=f'{args.checkpoint_path}/{checkpoint_names}', 221 | save_weights_only=False, 222 | monitor='val_accuracy') 223 | callbacks.append(checkpoint_callback) 224 | callbacks.append(PurgeCheckpointsCallback(args)) 225 | 226 | history = model.fit_generator(train_gen, 
epochs=NUM_EPOCHS, workers=8, 227 | steps_per_epoch=train_steps, 228 | initial_epoch=initial_epoch_number, 229 | validation_data=val_gen, validation_steps=val_steps, 230 | shuffle=True, callbacks=callbacks) 231 | print('Model has been fit.') 232 | 233 | # Save the model if we are executing on the master host 234 | if is_master: 235 | print('Saving model, since we are master host') 236 | save_model_artifacts(model, os.environ.get('SM_MODEL_DIR')) 237 | else: 238 | print('NOT saving model, will leave that up to master host') 239 | 240 | checkpoint_files = [f for f in os.listdir(args.checkpoint_path) if f.endswith('.' + 'h5')] 241 | print(f'\nCheckpoints: {sorted(checkpoint_files)}') 242 | 243 | print('\nExiting training script.\n') 244 | 245 | if __name__=='__main__': 246 | parser = argparse.ArgumentParser() 247 | # hyperparameters sent by the client are passed as command-line arguments to the script. 248 | parser.add_argument('--initial_epochs', type=int, default=5) 249 | parser.add_argument('--fine_tuning_epochs', type=int, default=30) 250 | parser.add_argument('--batch_size', type=int, default=8) 251 | parser.add_argument('--initial_lr', type=float, default=0.0001) 252 | parser.add_argument('--fine_tuning_lr', type=float, default=0.00001) 253 | parser.add_argument('--dropout', type=float, default=0.5) 254 | parser.add_argument('--num_fully_connected_layers', type=int, default=1) 255 | parser.add_argument('--s3_checkpoint_path', type=str, default='') 256 | # input data and model directories 257 | parser.add_argument('--model_dir', type=str) 258 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 259 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 260 | parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION')) 261 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST')) 262 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS'))) 263 | parser.add_argument('--checkpoint_path', type=str, default='/opt/ml/checkpoints') 264 | 265 | args, _ = parser.parse_known_args() 266 | print('args: {}'.format(args)) 267 | 268 | main(args) -------------------------------------------------------------------------------- /06_training_pipeline/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import pandas as pd 4 | import argparse 5 | import pathlib 6 | import json 7 | import os 8 | import numpy as np 9 | import tarfile 10 | import uuid 11 | 12 | from PIL import Image 13 | 14 | from sklearn.metrics import ( 15 | accuracy_score, 16 | precision_score, 17 | recall_score, 18 | confusion_matrix, 19 | f1_score 20 | ) 21 | 22 | from tensorflow import keras 23 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 24 | from tensorflow.keras.preprocessing import image 25 | from tensorflow.keras.optimizers import SGD 26 | # from smexperiments import tracker 27 | 28 | logger = logging.getLogger() 29 | logger.setLevel(logging.INFO) 30 | logger.addHandler(logging.StreamHandler()) 31 | 32 | input_path = "/opt/ml/processing/input/test" #"output/test" # 33 | manifest_path = "/opt/ml/processing/input/manifest/test.csv"#"output/manifest/test.csv" 34 | model_path = "/opt/ml/processing/model" #"model" # 35 | output_path = '/opt/ml/processing/output' #"output" # 36 | 37 | HEIGHT=224; WIDTH=224 38 | 39 | def predict_bird_from_file_new(fn, model): 40 | 41 | img = 
Image.open(fn).convert('RGB') 42 | 43 | img = img.resize((WIDTH, HEIGHT)) 44 | img_array = image.img_to_array(img) #, data_format = "channels_first") 45 | 46 | x = img_array.reshape((1,) + img_array.shape) 47 | instance = preprocess_input(x) 48 | 49 | del x, img 50 | 51 | result = model.predict(instance) 52 | 53 | predicted_class_idx = np.argmax(result) 54 | confidence = result[0][predicted_class_idx] 55 | 56 | return predicted_class_idx, confidence 57 | 58 | if __name__ == "__main__": 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 61 | args, _ = parser.parse_known_args() 62 | 63 | print("Extracting the model") 64 | 65 | model_file = os.path.join(model_path, args.model_file) 66 | file = tarfile.open(model_file) 67 | file.extractall(model_path) 68 | 69 | file.close() 70 | 71 | print("Load model") 72 | 73 | model = keras.models.load_model("{}/1".format(model_path), compile=False) 74 | model.compile(optimizer=SGD(lr=0.00001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy']) 75 | 76 | print("Starting evaluation.") 77 | 78 | # load test data. this should be an argument 79 | df = pd.read_csv(manifest_path) 80 | 81 | num_images = df.shape[0] 82 | 83 | class_name_list = sorted(df['class_id'].unique().tolist()) 84 | 85 | class_name = pd.Series(df['class_name'].values,index=df['class_id']).to_dict() 86 | 87 | print('Testing {} images'.format(df.shape[0])) 88 | num_errors = 0 89 | preds = [] 90 | acts = [] 91 | for i in range(df.shape[0]): 92 | fname = df.iloc[i]['image_file_name'] 93 | act = int(df.iloc[i]['class_id']) - 1 94 | acts.append(act) 95 | 96 | pred, conf = predict_bird_from_file_new(input_path + '/' + fname, model) 97 | 98 | preds.append(pred) 99 | 100 | print(f'max range is {len(class_name_list)-1}, prediction is {pred}, and the actual is {act}============') 101 | if (pred != act): 102 | num_errors += 1 103 | logger.debug('ERROR on image index {} -- Pred: {} {:.2f}, Actual: {}'.format(i, 104 | class_name_list[pred], conf, 105 | class_name_list[act])) 106 | precision = precision_score(acts, preds, average='micro') 107 | recall = recall_score(acts, preds, average='micro') 108 | accuracy = accuracy_score(acts, preds) 109 | cnf_matrix = confusion_matrix(acts, preds, labels=range(len(class_name_list))) 110 | f1 = f1_score(acts, preds, average='micro') 111 | 112 | print("Accuracy: {}".format(accuracy)) 113 | logger.debug("Precision: {}".format(precision)) 114 | logger.debug("Recall: {}".format(recall)) 115 | logger.debug("Confusion matrix: {}".format(cnf_matrix)) 116 | logger.debug("F1 score: {}".format(f1)) 117 | 118 | print(cnf_matrix) 119 | 120 | matrix_output = dict() 121 | 122 | for i in range(len(cnf_matrix)): 123 | matrix_row = dict() 124 | for j in range(len(cnf_matrix[0])): 125 | matrix_row[class_name[class_name_list[j]]] = int(cnf_matrix[i][j]) 126 | matrix_output[class_name[class_name_list[i]]] = matrix_row 127 | 128 | 129 | report_dict = { 130 | "multiclass_classification_metrics": { 131 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 132 | "precision": {"value": precision, "standard_deviation": "NaN"}, 133 | "recall": {"value": recall, "standard_deviation": "NaN"}, 134 | "f1": {"value": f1, "standard_deviation": "NaN"}, 135 | "confusion_matrix":matrix_output 136 | }, 137 | } 138 | 139 | output_dir = "/opt/ml/processing/evaluation" 140 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 141 | 142 | evaluation_path = f"{output_dir}/evaluation.json" 143 | with 
open(evaluation_path, "w") as f: 144 | f.write(json.dumps(report_dict)) 145 | -------------------------------------------------------------------------------- /06_training_pipeline/lambda/index.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | import os 4 | import pprint as pp 5 | import uuid 6 | 7 | s3 = boto3.client('s3') 8 | sagemaker = boto3.client("sagemaker") 9 | 10 | pipeline_name = os.getenv('PipelineName') 11 | role = os.getenv('RoleArn') 12 | 13 | def lambda_handler(event, context): 14 | if len(event['Records']) >0: 15 | bucket = event['Records'][0]['s3']['bucket']['name'] 16 | key = event['Records'][0]['s3']['object']['key'] 17 | response = s3.get_object(Bucket=bucket, Key=key) 18 | 19 | definition_data = response["Body"].read().decode('utf-8') 20 | 21 | try: 22 | response = sagemaker.list_pipelines( 23 | PipelineNamePrefix=pipeline_name 24 | ) 25 | 26 | pipelines = response['PipelineSummaries'] 27 | except Exception as e: 28 | return e 29 | 30 | create = False 31 | 32 | if len(pipelines)>0: 33 | for p in pipelines: 34 | if p['PipelineName'] == pipeline_name: 35 | create = False 36 | else: 37 | create = True 38 | # Update the pipeline 39 | try: 40 | if create: 41 | response = sagemaker.create_pipeline( 42 | PipelineName=pipeline_name, 43 | PipelineDisplayName=pipeline_name, 44 | PipelineDefinition=definition_data, 45 | PipelineDescription=f'pipeline from {key}', 46 | ClientRequestToken=str(uuid.uuid4()), 47 | RoleArn=role 48 | ) 49 | print(f'created new pipeline: {pipeline_name}') 50 | 51 | else: 52 | response = sagemaker.update_pipeline( 53 | PipelineName=pipeline_name, 54 | PipelineDisplayName=pipeline_name, 55 | PipelineDefinition=definition_data, 56 | PipelineDescription=f'pipeline from {key}', 57 | RoleArn=role 58 | ) 59 | 60 | print(f'updated pipeline: {pipeline_name}') 61 | except Exception as e: 62 | return e 63 | 64 | # Execute the pipeline 65 | try: 66 | 67 | response = sagemaker.start_pipeline_execution( 68 | PipelineName=pipeline_name, 69 | PipelineExecutionDisplayName=pipeline_name, 70 | PipelineExecutionDescription=f'pipeline from {key}', 71 | ) 72 | 73 | except Exception as e: 74 | print(f'Failed starting SageMaker pipeline because: {e}') 75 | 76 | return { 77 | 'statusCode': 200, 78 | 'body': json.dumps(f"Launched SageMaker pipeline: {pipeline_name}") 79 | } 80 | 81 | return "Complete Pipeline Execution...." 
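# ---------------------------------------------------------------------------
# Hypothetical local smoke test for the handler above; it is NOT part of the
# deployed Lambda package and only runs when this file is executed directly.
# The bucket name and object key are placeholders for a pipeline definition
# JSON uploaded under the pipeline/ prefix, and the PipelineName / RoleArn
# environment variables must be exported beforehand, e.g.
#   PipelineName=my-cv-pipeline RoleArn=arn:aws:iam::<account-id>:role/<pipeline-role> python index.py
if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "my-sagemaker-project-bucket"},
                    "object": {"key": "pipeline/pipeline-definition.json"},
                }
            }
        ]
    }
    print(lambda_handler(sample_event, None))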
-------------------------------------------------------------------------------- /06_training_pipeline/lambda/lambda.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/lambda/lambda.zip -------------------------------------------------------------------------------- /06_training_pipeline/preprocess.py: -------------------------------------------------------------------------------- 1 | 2 | import logging 3 | 4 | import pandas as pd 5 | import argparse 6 | import boto3 7 | import json 8 | import os 9 | import shutil 10 | 11 | logger = logging.getLogger() 12 | logger.setLevel(logging.INFO) 13 | logger.addHandler(logging.StreamHandler()) 14 | 15 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 16 | output_path = '/opt/ml/processing/output' #"output" # 17 | IMAGES_DIR = os.path.join(input_path, 'images') 18 | SPLIT_RATIOS = (0.6, 0.2, 0.2) 19 | 20 | 21 | # this function is used to split a dataframe into 3 seperate dataframes 22 | # one of each: train, validate, test 23 | 24 | def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False): 25 | train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame() 26 | 27 | labels = df[label_column].unique() 28 | for lbl in labels: 29 | lbl_df = df[df[label_column] == lbl] 30 | 31 | lbl_train_df = lbl_df.sample(frac=splits[0]) 32 | lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index) 33 | lbl_test_df = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2])) 34 | lbl_val_df = lbl_val_and_test_df.drop(lbl_test_df.index) 35 | 36 | if verbose: 37 | print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl, 38 | len(lbl_df), 39 | len(lbl_train_df), 40 | len(lbl_val_df), 41 | len(lbl_test_df))) 42 | train_df = train_df.append(lbl_train_df) 43 | val_df = val_df.append(lbl_val_df) 44 | test_df = test_df.append(lbl_test_df) 45 | 46 | # shuffle them on the way out using .sample(frac=1) 47 | return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1) 48 | 49 | # This function grabs the manifest files and build a dataframe, then call the split_to_train_val_test 50 | # function above and return the 3 dataframes 51 | def get_train_val_dataframes(BASE_DIR, classes, images_to_exclude, split_ratios): 52 | CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt') 53 | IMAGE_FILE = os.path.join(BASE_DIR, 'images.txt') 54 | LABEL_FILE = os.path.join(BASE_DIR, 'image_class_labels.txt') 55 | 56 | images_df = pd.read_csv(IMAGE_FILE, sep=' ', 57 | names=['image_pretty_name', 'image_file_name'], 58 | header=None) 59 | image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ', 60 | names=['image_pretty_name', 'orig_class_id'], header=None) 61 | 62 | # Merge the metadata into a single flat dataframe for easier processing 63 | full_df = pd.DataFrame(images_df) 64 | full_df = full_df[~full_df.image_file_name.isin(images_to_exclude)] 65 | 66 | full_df.reset_index(inplace=True, drop=True) 67 | full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name') 68 | 69 | # grab a small subset of species for testing 70 | criteria = full_df['orig_class_id'].isin(classes) 71 | full_df = full_df[criteria] 72 | print('Using {} images from {} classes'.format(full_df.shape[0], len(classes))) 73 | 74 | unique_classes = full_df['orig_class_id'].drop_duplicates() 75 | sorted_unique_classes = sorted(unique_classes) 76 
| id_to_one_based = {} 77 | i = 1 78 | for c in sorted_unique_classes: 79 | id_to_one_based[c] = str(i) 80 | i += 1 81 | 82 | full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based) 83 | full_df.reset_index(inplace=True, drop=True) 84 | 85 | def get_class_name(fn): 86 | return fn.split('/')[0] 87 | full_df['class_name'] = full_df['image_file_name'].apply(get_class_name) 88 | full_df = full_df.drop(['image_pretty_name'], axis=1) 89 | 90 | train_df = [] 91 | test_df = [] 92 | val_df = [] 93 | 94 | # split into training and validation sets 95 | train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios) 96 | 97 | print('num images total: ' + str(images_df.shape[0])) 98 | print('\nnum train: ' + str(train_df.shape[0])) 99 | print('num val: ' + str(val_df.shape[0])) 100 | print('num test: ' + str(test_df.shape[0])) 101 | return train_df, val_df, test_df 102 | 103 | # this function copy images by channel to its destination folder 104 | def copy_files_for_channel(df, channel_name, verbose=False): 105 | print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name)) 106 | for i in range(df.shape[0]): 107 | target_fname = df.iloc[i]['image_file_name'] 108 | # if verbose: 109 | # print(target_fname) 110 | src = "{}/{}".format(IMAGES_DIR, target_fname) #f"{IMAGES_DIR}/{target_fname}" 111 | dst = "{}/{}/{}".format(output_path,channel_name,target_fname) 112 | shutil.copyfile(src, dst) 113 | 114 | if __name__ == "__main__": 115 | parser = argparse.ArgumentParser() 116 | parser.add_argument("--classes", type=str, default="") 117 | parser.add_argument("--input-data", type=str, default="classes.txt") 118 | args, _ = parser.parse_known_args() 119 | 120 | 121 | c_list = args.classes.split(',') 122 | input_data = args.input_data 123 | 124 | print(c_list) 125 | 126 | CLASSES_FILE = os.path.join(input_path, input_data) 127 | 128 | EXCLUDE_IMAGE_LIST = ['087.Mallard/Mallard_0130_76836.jpg'] 129 | CLASS_COLS = ['class_number','class_id'] 130 | 131 | if len(c_list)==0: 132 | # Otherwise, you can use the full set of species 133 | CLASSES = [] 134 | for c in range(200): 135 | CLASSES += [c + 1] 136 | prefix = prefix + '-full' 137 | else: 138 | CLASSES = list(map(int, c_list)) 139 | 140 | 141 | classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None) 142 | 143 | criteria = classes_df['class_number'].isin(CLASSES) 144 | classes_df = classes_df[criteria] 145 | 146 | class_name_list = sorted(classes_df['class_id'].unique().tolist()) 147 | print(class_name_list) 148 | 149 | 150 | train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, EXCLUDE_IMAGE_LIST, SPLIT_RATIOS) 151 | 152 | for c in class_name_list: 153 | os.mkdir('{}/{}/{}'.format(output_path, 'validation', c)) 154 | os.mkdir('{}/{}/{}'.format(output_path, 'test', c)) 155 | os.mkdir('{}/{}/{}'.format(output_path, 'train', c)) 156 | 157 | copy_files_for_channel(val_df, 'validation') 158 | copy_files_for_channel(test_df, 'test') 159 | copy_files_for_channel(train_df, 'train') 160 | 161 | # export manifest file for validation 162 | train_m_file = "{}/manifest/train.csv".format(output_path) 163 | train_df.to_csv(train_m_file, index=False) 164 | test_m_file = "{}/manifest/test.csv".format(output_path) 165 | test_df.to_csv(test_m_file, index=False) 166 | val_m_file = "{}/manifest/validation.csv".format(output_path) 167 | val_df.to_csv(val_m_file, index=False) 168 | 169 | print("Finished running processing job") 170 | 
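# ---------------------------------------------------------------------------
# Optional, illustrative sanity check of split_to_train_val_test() above.
# It is not part of the processing job: it only runs if you import this module
# (e.g. in a notebook) with PREPROCESS_SPLIT_DEMO=1 set, and it assumes the
# pandas version used by the processing container (DataFrame.append was
# removed in pandas 2.x). The file names and class ids below are made up.
if os.environ.get("PREPROCESS_SPLIT_DEMO") == "1":
    toy_df = pd.DataFrame({
        "image_file_name": [f"017.Cardinal/img_{i}.jpg" for i in range(10)]
                         + [f"087.Mallard/img_{i}.jpg" for i in range(10)],
        "class_id": ["1"] * 10 + ["2"] * 10,
    })
    toy_train, toy_val, toy_test = split_to_train_val_test(
        toy_df, "class_id", splits=(0.6, 0.2, 0.2), verbose=True
    )
    # With 10 images per class this yields roughly 12 / 4 / 4 rows, shuffled.
    print("toy split sizes:", len(toy_train), len(toy_val), len(toy_test))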
-------------------------------------------------------------------------------- /06_training_pipeline/sagemaker-project.yaml: -------------------------------------------------------------------------------- 1 | Description: Upload an object to an S3 bucket, triggering a Lambda event, returning the object key as a Stack Output. 2 | Parameters: 3 | SageMakerProjectName: 4 | Type: String 5 | Description: The name of the SageMaker project. This name is alos used to create the S3 Bucket (most not already exist) 6 | SageMakerProjectId: 7 | Type: String 8 | Description: Service generated Id of the project. 9 | MaxLength: 16 10 | MinLength: 1 11 | LambdaBucket: 12 | Description: S3 Bucket of your lambda file 13 | Type: String 14 | LambdaKey: 15 | Description: S3 Key of your lambnda zip file 16 | Type: String 17 | PipelineDefinitionBucket: 18 | Description: S3 Bucket of your pipeline definition file 19 | Type: String 20 | PipelineDefinitionKey: 21 | Description: S3 Key of your pipeline definition file 22 | Type: String 23 | PipelineExecutionRoleArn: 24 | Description: Execution Role Arn for your SageMaker Pipeline (if use studio, use the same execution role as studio) 25 | Type: String 26 | Resources: 27 | Bucket: 28 | Type: AWS::S3::Bucket 29 | DependsOn: BucketPermission 30 | Properties: 31 | BucketName: !Ref SageMakerProjectName 32 | NotificationConfiguration: 33 | LambdaConfigurations: 34 | - Event: 's3:ObjectCreated:*' 35 | Function: !GetAtt BucketWatcher.Arn 36 | Filter: 37 | S3Key: 38 | Rules: 39 | - Name: prefix 40 | Value: pipeline/ 41 | - Name: suffix 42 | Value: .json 43 | BucketPermission: 44 | Type: AWS::Lambda::Permission 45 | Properties: 46 | Action: 'lambda:InvokeFunction' 47 | FunctionName: !Ref BucketWatcher 48 | Principal: s3.amazonaws.com 49 | SourceAccount: !Ref "AWS::AccountId" 50 | SourceArn: !Sub "arn:aws:s3:::${SageMakerProjectName}" 51 | BucketWatcher: 52 | Type: AWS::Lambda::Function 53 | Properties: 54 | Description: Sends a Wait Condition signal to Handle when invoked 55 | Handler: index.lambda_handler 56 | Role: !GetAtt LambdaExecutionRole.Arn 57 | Code: 58 | S3Bucket: !Ref LambdaBucket 59 | S3Key: !Ref LambdaKey 60 | Timeout: 5 61 | Runtime: python3.7 62 | Environment: 63 | Variables: 64 | PipelineName: !Sub "${SageMakerProjectName}-pipeline" 65 | RoleArn: !Ref PipelineExecutionRoleArn 66 | LambdaExecutionRole: 67 | Type: AWS::IAM::Role 68 | Properties: 69 | AssumeRolePolicyDocument: 70 | Version: '2012-10-17' 71 | Statement: 72 | - Effect: Allow 73 | Principal: {Service: [lambda.amazonaws.com, sagemaker.amazonaws.com]} 74 | Action: ['sts:AssumeRole'] 75 | Path: / 76 | ManagedPolicyArns: 77 | - "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" 78 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 79 | Policies: 80 | - PolicyName: S3Policy 81 | PolicyDocument: 82 | Version: '2012-10-17' 83 | Statement: 84 | - Effect: Allow 85 | Action: 86 | - 's3:*' 87 | Resource: 88 | - "arn:aws:s3:::*" 89 | MyPipeline: 90 | Type: AWS::SageMaker::Pipeline 91 | Properties: 92 | PipelineName: !Sub "${SageMakerProjectName}-pipeline" 93 | PipelineDisplayName: !Sub "${SageMakerProjectName}-pipeline" 94 | PipelineDescription: !Sub "${SageMakerProjectName}-pipeline" 95 | PipelineDefinition: 96 | PipelineDefinitionS3Location: 97 | Bucket: !Ref PipelineDefinitionBucket 98 | Key: !Ref PipelineDefinitionKey 99 | RoleArn: !Ref PipelineExecutionRoleArn 100 | Tags: 101 | - Key: sagemaker:project-id 102 | Value: 103 | Ref: SageMakerProjectId 104 | - Key: sagemaker:project-name 
105 | Value: 106 | Ref: SageMakerProjectName 107 | MyModelPackageGroup: 108 | Type: AWS::SageMaker::ModelPackageGroup 109 | Properties: 110 | ModelPackageGroupDescription: !Sub "${SageMakerProjectName}-model-group" 111 | ModelPackageGroupName: !Sub "${SageMakerProjectName}-model-group" 112 | Tags: 113 | - Key: sagemaker:project-id 114 | Value: 115 | Ref: SageMakerProjectId 116 | - Key: sagemaker:project-name 117 | Value: 118 | Ref: SageMakerProjectName 119 | -------------------------------------------------------------------------------- /06_training_pipeline/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/birds.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/compare-model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/compare-model.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/confussion_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/confussion_matrix.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/cost-explore.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/cost-explore.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/cv-training-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/cv-training-pipeline.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/execute-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/execute-pipeline.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/parameters-input.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/parameters-input.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/project-parameter.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/project-parameter.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/project-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/project-template.png -------------------------------------------------------------------------------- /06_training_pipeline/statics/studio-ui-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/06_training_pipeline/statics/studio-ui-pipeline.png -------------------------------------------------------------------------------- /07_deployment/README.md: -------------------------------------------------------------------------------- 1 | # Model Deployment with SageMaker 2 | 3 | --- 4 | 5 | # Introduction 6 | 7 | Once your models have been built, trained, and evaluated such that you are satisfied with their performance, you would likely want to deploy them to get predictions. This lab focuses on the following deployment types: 8 | * [Cloud deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html) using Amazon SageMaker hosting services 9 | * [Real-time inference deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) to cloud endpoints with compute you have control over. 10 | * [Serverless inference deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) to cloud endpoints with compute provisioned and managed by SageMaker. 11 | * [Edge deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/edge.html) to devices using [AWS IoT Greengrass for ML Inference](https://aws.amazon.com/greengrass/ml/). 12 | 13 | --- 14 | # Prerequisites 15 | 16 | Download the notebook corresponding to the deployment type you want to try into your environment. Run it by executing each cell in order. 17 | 18 | To understand what's happening, you'll need: 19 | 20 | - Access to the SageMaker default S3 bucket. 21 | - Familiarity with Python and numpy 22 | - Basic familiarity with AWS S3. 23 | - Basic understanding of Amazon SageMaker. 24 | - SageMaker Studio is preferred for the full UI integration 25 | 26 | --- 27 | 28 | ### Dataset & Inputs 29 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). The main components needed for running the deployment notebooks are: 30 | 31 | - S3 path for test image data 32 | - S3 path for test data annotation file 33 | - S3 path for the bird classification model 34 | 35 | If you don't have model artifacts from previous modules, there is an `optional-prepare-data-and-model` notebook you can run to generate the necessary components. In the first cell, be sure to change the S3 prefix to the appropriate value for the deployment type you want to perform. 36 | The outputs from a single run of this notebook, or artifacts kept from previous modules, can be used for any of the deployments by simply updating the S3 locations in the first cell of the deployment notebook you run.
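To give a feel for the two cloud options before you open the notebook, here is a minimal sketch using the SageMaker Python SDK. It assumes you already have a TensorFlow SavedModel archive (`model.tar.gz`) in S3 and a SageMaker execution role; the S3 path, framework version, instance type, and serverless settings are placeholders, not the exact values used in the notebook.

```python
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.serverless import ServerlessInferenceConfig

model = TensorFlowModel(
    model_data="s3://<bucket>/<prefix>/model.tar.gz",  # placeholder model artifact
    role=role,                                         # assumed: your SageMaker execution role
    framework_version="2.11",                          # illustrative version
)

# Option 1 - real-time endpoint: you choose (and pay for) the instance.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# Option 2 - serverless endpoint: SageMaker provisions and scales the compute.
# predictor = model.deploy(
#     serverless_inference_config=ServerlessInferenceConfig(
#         memory_size_in_mb=4096, max_concurrency=5
#     )
# )
```

Pick one of the two `deploy` calls; either way the returned predictor can be invoked with the test images listed above, and the endpoint should be deleted when you are done to avoid ongoing charges.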
37 | 38 | 39 | 40 | --- 41 | # Review Outputs 42 | 43 | At the end of the deployment notbook, you will have used sample images from the test dataset to get predictions from the endpoints. -------------------------------------------------------------------------------- /07_deployment/cloud_deployment/cv_utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from numpy import argmax 3 | from IPython.display import Image, display 4 | import os 5 | import boto3 6 | import random 7 | 8 | s3 = boto3.client('s3') 9 | 10 | def get_n_random_images(bucket_name, prefix,n): 11 | keys = [] 12 | resp = s3.list_objects_v2(Bucket=f'{bucket_name}', MaxKeys=500,Prefix=prefix) 13 | for obj in resp['Contents']: 14 | keys.append(obj['Key']) 15 | random.shuffle(keys) 16 | del keys[n:] 17 | return keys 18 | 19 | def download_images_locally(bucket_name,image_keys): 20 | if not os.path.isdir('./inference-test-data'): 21 | os.mkdir('./inference-test-data') 22 | local_paths=[] 23 | for obj in image_keys: 24 | fname = obj.split('/')[-1] 25 | local_dl_path = './inference-test-data/'+fname 26 | 27 | s3.download_file(bucket_name,obj,local_dl_path) 28 | local_paths.append(local_dl_path) 29 | return local_paths 30 | 31 | def get_classes_as_list(cf, class_filter): 32 | classes_df = pd.read_csv(cf, sep=' ', header=None) 33 | criteria = classes_df.iloc[:,0].isin(class_filter) 34 | classes_df = classes_df[criteria] 35 | 36 | class_name_list = sorted(classes_df.iloc[:,1].unique().tolist()) 37 | return class_name_list 38 | 39 | 40 | def predict_bird_from_file(fn, predictor,possible_classes,verbose=True, height=224,width=224): 41 | with open(fn, 'rb') as img: 42 | f = img.read() 43 | x = bytearray(f) 44 | 45 | #class_selection = '13, 17, 35, 36, 47, 68, 73, 87' 46 | 47 | results = predictor.predict(x)['predictions'] 48 | predicted_class_idx = argmax(results) 49 | predicted_class = possible_classes[predicted_class_idx] 50 | confidence = results[0][predicted_class_idx] 51 | if verbose: 52 | display(Image(fn, height=height, width=width)) 53 | print('Class: {}, confidence: {:.2f}'.format(predicted_class, confidence)) 54 | del img, x 55 | return predicted_class_idx, confidence 56 | -------------------------------------------------------------------------------- /07_deployment/cloud_deployment/statics/active-sagemaker-endpoints.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/07_deployment/cloud_deployment/statics/active-sagemaker-endpoints.png -------------------------------------------------------------------------------- /07_deployment/edge_deployment/README.md: -------------------------------------------------------------------------------- 1 | ## Edge Deployment using SageMaker Edge Manager and IoT Greengrass v2 2 | 3 | --- 4 | 5 | # Introduction 6 | 7 | In this lab illustrate how you can optimize and deploy machine learning models for edge devices. 8 | 9 | 1. Use an ARM64 EC2 instance to mimic an typical ARM based edge device. 10 | 2. Config the edge device by installing Greengrass core to manage the edge deployment 11 | 3. Prepare our model for edge using SageMaker Neo and Edge Manager (either use the model from the previous modules or run the `optional-prepare-data-and-model.ipynb` notebook to create a new model) 12 | 4. 
Finally, we will deploy a simple application that runs predictions on a list of bird images and sends the results to the cloud 13 | 14 | **Note: This notebook was tested on the Data Science kernel in SageMaker Studio** 15 | 16 | --- 17 | # Prerequisites 18 | 19 | Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need: 20 | 21 | - The following policies attached to your **SageMaker Studio execution role** to properly execute this lab. Warning: the permissions set for this lab are very loose for simplicity purposes. Please follow the principle of least privilege when you work on your own projects. These permissions allow the user to interact with other AWS services like EC2, Systems Manager (SSM), IoT Core, and GreengrassV2. 22 | 23 | - AmazonEC2FullAccess 24 | - AmazonEC2RoleforSSM 25 | - AmazonSSMManagedInstanceCore 26 | - AmazonSSMFullAccess 27 | - AWSGreengrassFullAccess 28 | - AWSIoTFullAccess 29 | 30 | - Familiarity with Python 31 | - Basic understanding of SSM 32 | - Basic familiarity with AWS S3 33 | - Basic familiarity with IoT Core 34 | - Basic familiarity with GreengrassV2 35 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 36 | - SageMaker Studio is preferred for the full UI integration 37 | 38 | --- 39 | ## Prepare the Model for the Edge 40 | 41 | You can use the bird model you created in the previous modules or run the `optional-prepare-data-and-model.ipynb` notebook to create a new model. Update the path to your model below if necessary. 42 | 43 | --- 44 | ## View Your Deployment in the Greengrass Console 45 | 46 | 1. Go to the [Greengrass console](https://console.aws.amazon.com/greengrass/) 47 | 48 | 2. View the deployment you just made in the deployment dashboard 49 | 50 | - This shows you the deployment details: components, target devices, and deployment status 51 | - You can also see the device status, which indicates whether the device is running in a healthy state. 52 | 53 | 54 | 55 | 3.
Click inot Core device to get more detail on the device status 56 | - You can get status details on each components 57 | 58 | -------------------------------------------------------------------------------- /07_deployment/edge_deployment/build/image_classification/inference.py: -------------------------------------------------------------------------------- 1 | import grpc 2 | from PIL import Image 3 | import agent_pb2 as agent 4 | import agent_pb2_grpc as agent_grpc 5 | import os 6 | import time 7 | import numpy as np 8 | from numpy import argmax 9 | import uuid 10 | 11 | # This is a test script to check if my stuff are uploaded properly 12 | model_path = '/greengrass/v2/work/Bird-Model-ARM-TF2' 13 | model_name = 'bird-model' 14 | 15 | image_path = os.path.expandvars(os.environ.get("DEFAULT_SMEM_IC_IMAGE_DIR")) 16 | 17 | agent_socket = 'unix:///tmp/aws.greengrass.SageMakerEdgeManager.sock' 18 | 19 | channel = grpc.insecure_channel(agent_socket) 20 | client = agent_grpc.AgentStub(channel) 21 | 22 | print(f"Images directory is {image_path}================") 23 | 24 | print(f"our current directory is {os.getcwd()}==========") 25 | 26 | 27 | def list_models(cli): 28 | resp = cli.ListModels(agent.ListModelsRequest()) 29 | return { 30 | m.name: {"in": m.input_tensor_metadatas, "out": m.output_tensor_metadatas} 31 | for m in resp.models 32 | } 33 | 34 | 35 | def unload_model(cli, model_name): 36 | try: 37 | req = agent.UnLoadModelRequest() 38 | req.name = model_name 39 | return cli.UnLoadModel(req) 40 | except Exception as e: 41 | print(e) 42 | return None 43 | 44 | 45 | def load_model(cli, model_name, model_path): 46 | """ Load a new model into the Edge Agent if not loaded yet""" 47 | try: 48 | req = agent.LoadModelRequest() 49 | req.url = model_path 50 | req.name = model_name 51 | return cli.LoadModel(req) 52 | except Exception as e: 53 | print(e) 54 | return None 55 | 56 | 57 | def create_tensor(x, tensor_name): 58 | if x.dtype != np.float32: 59 | raise Exception("It only supports numpy float32 arrays for this tensor") 60 | tensor = agent.Tensor() 61 | tensor.tensor_metadata.name = tensor_name 62 | tensor.tensor_metadata.data_type = agent.FLOAT32 63 | for s in x.shape: 64 | tensor.tensor_metadata.shape.append(s) 65 | tensor.byte_data = x.tobytes() 66 | return tensor 67 | 68 | 69 | def predict(cli, model_name, x, shm=False): 70 | """ 71 | Invokes the model and get the predictions 72 | """ 73 | model_map = list_models(cli) 74 | if model_map.get(model_name) is None: 75 | raise Exception("Model %s not loaded" % model_name) 76 | # Create a request 77 | req = agent.PredictRequest() 78 | req.name = model_name 79 | # Then load the data into a temp Tensor 80 | tensor = agent.Tensor() 81 | meta = model_map[model_name]["in"][0] 82 | tensor.tensor_metadata.name = meta.name 83 | tensor.tensor_metadata.data_type = meta.data_type 84 | for s in meta.shape: 85 | tensor.tensor_metadata.shape.append(s) 86 | 87 | if shm: 88 | tensor.shared_memory_handle.offset = 0 89 | tensor.shared_memory_handle.segment_id = x 90 | else: 91 | tensor.byte_data = x.astype(np.float32).tobytes() 92 | 93 | req.tensors.append(tensor) 94 | 95 | # Invoke the model 96 | resp = cli.Predict(req) 97 | 98 | # Parse the output 99 | meta = model_map[model_name]["out"][0] 100 | tensor = resp.tensors[0] 101 | data = np.frombuffer(tensor.byte_data, dtype=np.float32) 102 | return data.reshape(tensor.tensor_metadata.shape) 103 | 104 | 105 | def main(): 106 | if list_models(client).get(model_name) is None: 107 | load_model(client, model_name, 
model_path) 108 | 109 | print("Loaded Models:") 110 | print(list_models(client)) 111 | 112 | start = time.time() 113 | 114 | count = 0 115 | 116 | if os.path.isdir(image_path): 117 | for images in os.listdir(image_path): 118 | if ".jpg" in images: 119 | im = Image.open(f"{image_path}/{images}") 120 | input_image = np.array(im).astype(np.float32) / 255.0 121 | print(f"Shape of the Image {input_image.shape}============") 122 | 123 | y = predict(client, model_name, input_image, False) 124 | 125 | print(f"output tensor is {y}===================") 126 | predicted_class_idx = argmax(y) 127 | 128 | total_time = time.time() - start 129 | print(f"Output Prediction: {predicted_class_idx}") 130 | 131 | req = agent.CaptureDataRequest() 132 | req.model_name = model_name 133 | req.capture_id = str(uuid.uuid4()) 134 | req.input_tensors.append(create_tensor(input_image, "input_image")) 135 | req.output_tensors.append(create_tensor(y, "output_image")) 136 | resp = client.CaptureData(req) 137 | 138 | count +=1 139 | else: 140 | print("Images directory not found ===============") 141 | 142 | print(f"Number of images predicted: {count}============") 143 | 144 | print("End of the script ==============") 145 | 146 | 147 | if __name__ == '__main__': 148 | main() -------------------------------------------------------------------------------- /07_deployment/edge_deployment/build/installer.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | #!/bin/bash 5 | install_lib() { 6 | py_lib="${1}" 7 | module="${2}" 8 | command=$( 9 | cat <The notebook use public workforce (Mechanical Turk) for the groundTruth labeling job. That will incur some cost at **$0.012 per image**. If want to use a prviate workforce, then you must have a private workforce already created and update the code with your private workforce ARN. Please follow this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html) on how to create a private workforce 27 | - Access to Elastic Container Registry (ECR) 28 | - Familiarity with Training on Amazon SageMaker 29 | - Familiarity with SageMaker Processing Job 30 | - Familiarity with SageMaker Pipeline 31 | - Familiarity with Python 32 | - Familiarity with AWS S3 33 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 34 | - SageMaker Studio is preferred for the full UI integration 35 | 36 | --- 37 | 38 | ## Dataset 39 | The dataset we are using is from [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset contains 11,788 images across 200 bird species (the original technical report can be found here). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 40 | 41 | ![Bird Dataset](statics/birds.png) 42 | 43 | The dataset can be downloaded [here](https://course.fast.ai/datasets). Note that the file size is around 1.2 GB, and can take a while to download. To make the download and training time reasonable, we will use a subset of the bird data. CUB_MINI.tar is attached in this workshop repo. 
It contains images and metadata for 8 of the 200 species, and can be used effectively for testing and learning how to use SageMaker and TensorFlow together on computer vision use cases. 44 | 45 | * 13 (Bobolink) 46 | * 17 (Cardinal) 47 | * 35 (Purple_Finch) 48 | * 36 (Noerhern_Flicker) 49 | * 47 (American_Golsdi..) 50 | * 68 (Ruby_throated_...) 51 | * 73 (Blue_Jay) 52 | * 87 (Mallard) 53 | 54 | --- 55 | To help you understand and test each step of the training pipeline: 1) preprocess, 2) training/tuning, and 3) model evaluation. 3 seperate notebooks are provided which you can step through each. 56 | 57 | - test-1-preprocess.ipynb 58 | - test-2-training.ipynb 59 | - test-3-evaluate.ipynb 60 | 61 | --- 62 | ## Edge Deployment -------------------------------------------------------------------------------- /08_end-to-end/1_training/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/pipeline.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/pipeline.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/__pycache__/pipeline_tuning.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/__pycache__/pipeline_tuning.cpython-37.pyc -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/code/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/pipeline/code/__init__.py -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/evaluation.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import argparse 4 | import pathlib 5 | import json 6 | import os 7 | import numpy as np 8 | import tarfile 9 | import glob 10 | 11 | 
from sklearn.metrics import ( 12 | accuracy_score, 13 | precision_score, 14 | recall_score, 15 | confusion_matrix, 16 | f1_score 17 | ) 18 | 19 | import tensorflow as tf 20 | from tensorflow import keras 21 | from tensorflow.keras.applications.mobilenet_v2 import preprocess_input 22 | from tensorflow.keras.preprocessing import image 23 | from tensorflow.keras.optimizers import SGD 24 | # from smexperiments import tracker 25 | 26 | input_path = "/opt/ml/processing/input/test" #"output/test" # 27 | classes_path = "/opt/ml/processing/input/classes"#"output/manifest/test.csv" 28 | model_path = "/opt/ml/processing/model" #"model" # 29 | output_path = '/opt/ml/processing/output' #"output" # 30 | 31 | HEIGHT = 224 32 | WIDTH = 224 33 | DEPTH = 3 34 | NUM_CLASSES = 3 35 | 36 | def load_classes(file_name): 37 | 38 | classes_file = os.path.join(classes_path, file_name) 39 | 40 | with open(classes_file) as f: 41 | classes = json.load(f) 42 | 43 | return classes 44 | 45 | def get_filenames(input_data_dir): 46 | return glob.glob('{}/*.tfrecords'.format(input_data_dir)) 47 | 48 | def _parse_image_function(example): 49 | image_feature_description = { 50 | 'image': tf.io.FixedLenFeature([], tf.string), 51 | 'label': tf.io.FixedLenFeature([], tf.int64), 52 | } 53 | 54 | features = tf.io.parse_single_example(example, image_feature_description) 55 | 56 | image = tf.image.decode_jpeg(features['image'], channels=DEPTH) 57 | 58 | image = tf.image.resize(image, [WIDTH, HEIGHT]) 59 | 60 | label = tf.cast(features['label'], tf.int32) 61 | 62 | return image, label 63 | 64 | def predict_bird(model, img_array): 65 | 66 | x = img_array.reshape((1,) + img_array.shape) 67 | instance = preprocess_input(x) 68 | 69 | del x 70 | 71 | result = model.predict(instance) 72 | 73 | predicted_class_idx = np.argmax(result) 74 | confidence = result[0][predicted_class_idx] 75 | 76 | return predicted_class_idx, confidence 77 | 78 | if __name__ == "__main__": 79 | parser = argparse.ArgumentParser() 80 | parser.add_argument("--model-file", type=str, default="model.tar.gz") 81 | parser.add_argument("--classes-file", type=str, default="classes.json") 82 | args, _ = parser.parse_known_args() 83 | 84 | print("Extracting the model") 85 | 86 | model_file = os.path.join(model_path, args.model_file) 87 | file = tarfile.open(model_file) 88 | file.extractall(model_path) 89 | 90 | file.close() 91 | 92 | print("Load model") 93 | 94 | model = keras.models.load_model("{}/1".format(model_path), compile=False) 95 | model.compile(optimizer=SGD(lr=0.00001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy']) 96 | 97 | print("Starting evaluation.") 98 | 99 | class_name = load_classes(args.classes_file) 100 | 101 | filenames = get_filenames(input_path)#args[channel]) 102 | dataset = tf.data.TFRecordDataset(filenames) 103 | 104 | test_dataset = dataset.map(_parse_image_function) 105 | 106 | preds = [] 107 | acts = [] 108 | num_errors = 0 109 | 110 | for image_features in test_dataset: 111 | act = image_features[1].numpy() 112 | acts.append(act) 113 | 114 | pred, cnf = predict_bird(model, image_features[0].numpy()) 115 | 116 | preds.append(pred) 117 | 118 | print(f'prediction is {pred}, and the actual is {act}============') 119 | if (pred != act): 120 | num_errors += 1 121 | print('ERROR - Pred: {} {:.2f}, Actual: {}'.format(pred, cnf, act)) 122 | 123 | precision = precision_score(acts, preds, average='micro') 124 | recall = recall_score(acts, preds, average='micro') 125 | accuracy = accuracy_score(acts, preds) 126 | cnf_matrix = 
confusion_matrix(acts, preds, labels=range(len(class_name))) 127 | f1 = f1_score(acts, preds, average='micro') 128 | 129 | print("Accuracy: {}".format(accuracy)) 130 | print("Precision: {}".format(precision)) 131 | print("Recall: {}".format(recall)) 132 | print("Confusion matrix: {}".format(cnf_matrix)) 133 | print("F1 score: {}".format(f1)) 134 | 135 | matrix_output = dict() 136 | 137 | for i in range(len(cnf_matrix)): 138 | matrix_row = dict() 139 | for j in range(len(cnf_matrix[0])): 140 | matrix_row[class_name[str(j)]] = int(cnf_matrix[i][j]) 141 | matrix_output[class_name[str(i)]] = matrix_row 142 | 143 | 144 | report_dict = { 145 | "multiclass_classification_metrics": { 146 | "accuracy": {"value": accuracy, "standard_deviation": "NaN"}, 147 | "precision": {"value": precision, "standard_deviation": "NaN"}, 148 | "recall": {"value": recall, "standard_deviation": "NaN"}, 149 | "f1": {"value": f1, "standard_deviation": "NaN"}, 150 | "confusion_matrix":matrix_output 151 | }, 152 | } 153 | 154 | pathlib.Path(output_path).mkdir(parents=True, exist_ok=True) 155 | 156 | evaluation_path = f"{output_path}/evaluation.json" 157 | 158 | print(f"Saving the result json file to {evaluation_path} ....") 159 | 160 | with open(evaluation_path, "w") as f: 161 | f.write(json.dumps(report_dict)) 162 | 163 | print("End of the evaluation process....") 164 | -------------------------------------------------------------------------------- /08_end-to-end/1_training/pipeline/preprocess.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import tensorflow as tf 4 | import argparse 5 | import boto3 6 | import random 7 | import os 8 | import pathlib 9 | import glob 10 | import json 11 | 12 | input_path = "/opt/ml/processing/input" #"CUB_200_2011" # 13 | output_path = '/opt/ml/processing/output' #"output" # 14 | 15 | def serialize_example(image, label): 16 | 17 | feature = { 18 | 'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])), 19 | 'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])) 20 | } 21 | 22 | example_proto = tf.train.Example(features=tf.train.Features(feature=feature)) 23 | return example_proto.SerializeToString() 24 | 25 | def parse_manifest(manifest_file, labels, classes): 26 | 27 | f = open(manifest_file, 'r') 28 | 29 | lines = f.readlines() 30 | 31 | for l in lines: 32 | img_data = dict() 33 | img_raw = json.loads(l) 34 | 35 | img_data['image-path'] = img_raw['source-ref'].split('/')[-1] 36 | img_data['label'] = img_raw['label'] 37 | class_name = img_raw['label-metadata']['class-name'] 38 | 39 | 40 | if img_data['label'] not in labels: 41 | labels[img_data['label']] = [] 42 | 43 | if img_data['label'] not in classes: 44 | classes[img_data['label']] = class_name 45 | 46 | labels[img_data['label']].append(img_data) 47 | 48 | return labels, classes 49 | 50 | def split_dataset(labels): 51 | channels ={ 52 | "train":[], 53 | "valid":[], 54 | "test":[] 55 | } 56 | 57 | for l in labels: 58 | images = labels[l] 59 | random.shuffle(images) 60 | 61 | splits = [0.7, 0.9] 62 | 63 | channels["train"] += images[0: int(splits[0] * len(images))] 64 | channels["valid"] += images[int(splits[0] * len(images)):int(splits[1] * len(images))] 65 | channels["test"] += images[int(splits[1] * len(images)):] 66 | 67 | return channels 68 | 69 | 70 | # def get_filenames(input_data_dir): 71 | # return glob.glob('{}/*.tfrecords'.format(input_data_dir)) 72 | 73 | def building_tfrecord(channels, images_dir): 74 | for c in channels: 
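# For each channel (train/valid/test), the loop below creates an output folder and writes every image/label pair into a single TFRecord file for that channel.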
75 | 76 | pathlib.Path(f'{output_path}/{c}').mkdir(parents=True, exist_ok=True) 77 | 78 | 79 | tfrecord_file = f'{output_path}/{c}/bird-{c}.tfrecords' 80 | 81 | count = 0 82 | 83 | with tf.io.TFRecordWriter(tfrecord_file) as writer: 84 | for img in channels[c]: 85 | 86 | image_string = open(os.path.join(images_dir, 87 | img['image-path']), 'rb').read() 88 | 89 | tf_example = serialize_example(image_string, img['label']) 90 | writer.write(tf_example) 91 | count +=1 92 | 93 | print(f"number of images processed for {c} is {count}") 94 | 95 | if __name__ == "__main__": 96 | parser = argparse.ArgumentParser() 97 | parser.add_argument("--manifest", type=str, default="manifest") 98 | parser.add_argument("--images", type=str, default="images") 99 | args, _ = parser.parse_known_args() 100 | 101 | manifest_dir = os.path.join(input_path, args.manifest) 102 | images_dir = os.path.join(input_path, args.images) 103 | 104 | print(f"manifest file location: {manifest_dir}...") 105 | 106 | if len(os.listdir(manifest_dir)) < 1: 107 | print(f"No manifest files at this location: {manifest_dir}...") 108 | sys.exit() 109 | 110 | labels = dict() 111 | classes = dict() 112 | 113 | # looping through all the manifest files and parse the values by class 114 | for m in os.listdir(manifest_dir): 115 | manifest_file = os.path.join(manifest_dir, m) 116 | labels, classes = parse_manifest(manifest_file, labels, classes) 117 | 118 | # split the dataset by channel 119 | channels = split_dataset(labels) 120 | 121 | building_tfrecord(channels, images_dir) 122 | 123 | # save the classes json file 124 | classes_path = f"{output_path}/classes/classes.json" 125 | with open(classes_path, "w") as f: 126 | f.write(json.dumps(classes)) 127 | 128 | print("Finished running processing job") 129 | 130 | 131 | -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/birds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/birds.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_access.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_access.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_node_profiler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_node_profiler.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_profiler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_profiler.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/debugger_summary.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/debugger_summary.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/end2end.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/end2end.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/execute-pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/execute-pipeline.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/manual_approval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/manual_approval.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/model_registry.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/model_registry.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/panorama_test.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/panorama_test.png -------------------------------------------------------------------------------- /08_end-to-end/1_training/statics/test_util_folder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/1_training/statics/test_util_folder.png -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/graphs/bird_demo_app/graph.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodeGraph": { 3 | "envelopeVersion": "2021-01-01", 4 | "packages": [ 5 | { 6 | "name": "987720697751::BIRD_DEMO_CODE", 7 | "version": "1.0" 8 | }, 9 | { 10 | "name": "987720697751::BIRD_DEMO_TF_MODEL", 11 | "version": "1.0" 12 | }, 13 | { 14 | "name": "987720697751::RTSP_STREAM", 15 | "version": "1.0" 16 | }, 17 | { 18 | "name": "panorama::hdmi_data_sink", 19 | "version": "1.0" 20 | } 21 | ], 22 | "nodes": [ 23 | { 24 | "name": "code_node", 25 | "interface": "987720697751::BIRD_DEMO_CODE.interface", 26 | "overridable": false, 27 | "launch": "onAppStart" 28 | }, 29 | { 30 | "name": "model_node", 31 | "interface": "987720697751::BIRD_DEMO_TF_MODEL.interface", 32 | 
"overridable": false, 33 | "launch": "onAppStart" 34 | }, 35 | { 36 | "name": "camera_node", 37 | "interface": "987720697751::RTSP_STREAM.interface", 38 | "overridable": true, 39 | "launch": "onAppStart", 40 | "decorator": { 41 | "title": "IP camera", 42 | "description": "Choose a camera stream." 43 | } 44 | }, 45 | { 46 | "name": "output_node", 47 | "interface": "panorama::hdmi_data_sink.hdmi0", 48 | "overridable": true, 49 | "launch": "onAppStart" 50 | }, 51 | { 52 | "name": "detection_threshold", 53 | "interface": "float32", 54 | "value": 60.0, 55 | "overridable": true, 56 | "decorator": { 57 | "title": "Threshold", 58 | "description": "The minimum confidence percentage for a positive classification." 59 | } 60 | } 61 | ], 62 | "edges": [ 63 | { 64 | "producer": "camera_node.video_out", 65 | "consumer": "code_node.video_in" 66 | }, 67 | { 68 | "producer": "code_node.video_out", 69 | "consumer": "output_node.video_in" 70 | }, 71 | { 72 | "producer": "detection_threshold", 73 | "consumer": "code_node.threshold" 74 | } 75 | ] 76 | } 77 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/2_deployment/bird_demo_app/packages/.keep -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/panorama/panorama-application 2 | COPY src /panorama 3 | 4 | ARG DEBIAN_FRONTEND=noninteractive 5 | RUN python3 -m pip install -U scipy 6 | RUN pip3 install opencv-python boto3 -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/descriptor.json: -------------------------------------------------------------------------------- 1 | {"runtimeDescriptor": {"envelopeVersion": "2021-01-01", "entry": {"path": "python3", "name": "/panorama/app.py"}}} -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage": { 3 | "envelopeVersion": "2021-01-01", 4 | "name": "BIRD_DEMO_CODE", 5 | "version": "1.0", 6 | "description": "Default description for package ssd_tf_latest", 7 | "assets": [ 8 | { 9 | "name": "code_asset", 10 | "implementations": [ 11 | { 12 | "type": "container", 13 | "assetUri": "4f1fdec28a45df7b4cee58e9ec1e4c9440fd3d85cf36c15a3183920b9692cd32.tar.gz", 14 | "descriptorUri": "577ccf8b6cff24dc1f90cc2af22644ee9d0c81b5a22e771287bf3630e1b7b8fb.json" 15 | } 16 | ] 17 | } 18 | ], 19 | "interfaces": [ 20 | { 21 | "name": "interface", 22 | "category": "business_logic", 23 | "asset": "code_asset", 24 | "inputs": [ 25 | { 26 | "name": "video_in", 27 | "type": "media" 28 | }, 29 | { 30 | "name": "threshold", 31 | "type": "float32" 32 | } 33 | ], 34 | "outputs": [ 35 | { 36 | "description": "Video stream output", 37 | "name": "video_out", 38 | "type": "media" 39 | } 40 | ] 41 | } 42 | ] 43 | } 44 | } 
-------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/.keep -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_CODE-1.0/src/app.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import logging 4 | import time 5 | from logging.handlers import RotatingFileHandler 6 | 7 | import boto3 8 | from botocore.exceptions import ClientError 9 | import cv2 10 | import numpy as np 11 | import panoramasdk 12 | import datetime 13 | 14 | class Application(panoramasdk.node): 15 | def __init__(self): 16 | """Initializes the application's attributes with parameters from the interface, and default values.""" 17 | self.MODEL_NODE = "model_node" 18 | self.MODEL_DIM = 224 19 | self.frame_num = 0 20 | self.tracked_objects = [] 21 | self.tracked_objects_start_time = dict() 22 | self.tracked_objects_duration = dict() 23 | 24 | self.classes = { 25 | 0: "Bobolink", 26 | 1: "Cardinal", 27 | 2: "Purple_Finch", 28 | 3: "Northern_Flicker", 29 | 4:"American_Goldfinch", 30 | 5:"Ruby_throated_Hummingbird", 31 | 6:"Blue_Jay", 32 | 7:"Mallard" 33 | } 34 | 35 | def process_streams(self): 36 | """Processes one frame of video from one or more video streams.""" 37 | self.frame_num += 1 38 | logger.debug(self.frame_num) 39 | 40 | # Loop through attached video streams 41 | streams = self.inputs.video_in.get() 42 | for stream in streams: 43 | self.process_media(stream) 44 | 45 | self.outputs.video_out.put(streams) 46 | 47 | def process_media(self, stream): 48 | """Runs inference on a frame of video.""" 49 | image_data = preprocess(stream.image, self.MODEL_DIM) 50 | logger.debug(image_data.shape) 51 | 52 | # Run inference 53 | inference_results = self.call({"input_1":image_data}, self.MODEL_NODE) 54 | 55 | # Process results (object deteciton) 56 | self.process_results(inference_results, stream) 57 | 58 | def process_results(self, inference_results, stream): 59 | """Processes output tensors from a computer vision model and annotates a video frame.""" 60 | if inference_results is None: 61 | logger.warning("Inference results are None.") 62 | return 63 | 64 | logger.debug('Inference results: {}'.format(inference_results)) 65 | count = 0 66 | for det in inference_results: 67 | if count == 0: 68 | first_output = det 69 | count += 1 70 | 71 | # first_output = inference_results[0] 72 | logger.debug('Output one type: {}'.format(type(first_output))) 73 | probabilities = first_output[0] 74 | # 1000 values for 1000 classes 75 | logger.debug('Result one shape: {}'.format(probabilities.shape)) 76 | top_result = probabilities.argmax() 77 | 78 | self.detected_class = self.classes[top_result] 79 | self.detected_frame = self.frame_num 80 | # persist for up to 5 seconds 81 | # if self.frame_num - self.detected_frame < 75: 82 | label = '{} ({}%)'.format(self.detected_class, int(probabilities[top_result]*100)) 83 | stream.add_label(label, 0.1, 0.1) 84 | 85 | def preprocess(img, size): 86 | """Resizes and normalizes a frame of video.""" 87 | resized = cv2.resize(img, (size, size)) 88 | x1 
= np.asarray(resized) 89 | x1 = np.expand_dims(x1, 0) 90 | return x1 91 | 92 | def get_logger(name=__name__,level=logging.INFO): 93 | logger = logging.getLogger(name) 94 | logger.setLevel(level) 95 | handler = RotatingFileHandler("/opt/aws/panorama/logs/app.log", maxBytes=100000000, backupCount=2) 96 | formatter = logging.Formatter(fmt='%(asctime)s %(levelname)-8s %(message)s', 97 | datefmt='%Y-%m-%d %H:%M:%S') 98 | handler.setFormatter(formatter) 99 | logger.addHandler(handler) 100 | return logger 101 | 102 | def main(): 103 | try: 104 | logger.info("INITIALIZING APPLICATION") 105 | app = Application() 106 | logger.info("PROCESSING STREAMS") 107 | while True: 108 | app.process_streams() 109 | except Exception as e: 110 | logger.warning(e) 111 | 112 | logger = get_logger(level=logging.INFO) 113 | main() 114 | -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_TF_MODEL-1.0/descriptor.json: -------------------------------------------------------------------------------- 1 | { 2 | "mlModelDescriptor": { 3 | "envelopeVersion": "2021-01-01", 4 | "framework": "TENSORFLOW", 5 | "inputs": [ 6 | { 7 | "name": "image_tensor", 8 | "shape": [1,224,224,3] 9 | } 10 | ] 11 | } 12 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-BIRD_DEMO_TF_MODEL-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage": { 3 | "envelopeVersion": "2021-01-01", 4 | "name": "BIRD_DEMO_TF_MODEL", 5 | "version": "1.0", 6 | "description": "Default description for package ", 7 | "assets": [ 8 | { 9 | "name": "model_asset", 10 | "implementations": [ 11 | { 12 | "type": "model", 13 | "assetUri": "ae6c7df9e9aaff6e5001f88c82f5cc938ab84b7a01a38247083f2542f30e65cf.tar.gz", 14 | "descriptorUri": "1d13ccfae4b1b8cde03bfa404c7115538d511320db773a24fed30a2b9d78f355.json" 15 | } 16 | ] 17 | } 18 | ], 19 | "interfaces": [ 20 | { 21 | "name": "interface", 22 | "category": "ml_model", 23 | "asset": "model_asset", 24 | "inputs": [ 25 | { 26 | "description": "Video stream output", 27 | "name": "video_in", 28 | "type": "media" 29 | } 30 | ], 31 | "outputs": [ 32 | { 33 | "description": "Video stream output", 34 | "name": "video_out", 35 | "type": "media" 36 | } 37 | ] 38 | }, 39 | { 40 | "name": "model_asset_interface", 41 | "category": "ml_model", 42 | "asset": "model_asset", 43 | "inputs": [ 44 | { 45 | "name": "video_in", 46 | "type": "media" 47 | } 48 | ], 49 | "outputs": [ 50 | { 51 | "name": "video_out", 52 | "type": "media" 53 | } 54 | ] 55 | } 56 | ] 57 | } 58 | } -------------------------------------------------------------------------------- /08_end-to-end/2_deployment/bird_demo_app/packages/987720697751-RTSP_STREAM-1.0/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "nodePackage":{ 3 | "envelopeVersion":"2021-01-01", 4 | "description":"RTSP/ONVIF camera data source node package", 5 | "assets":[ 6 | { 7 | "name":"rtsp_camera", 8 | "implementations":[ 9 | { 10 | "type":"system", 11 | "assetUri":"source/video/camera/rtsp/source_rtsp" 12 | } 13 | ] 14 | } 15 | ], 16 | "interfaces":[ 17 | { 18 | "name":"interface", 19 | "category":"media_source", 20 | "asset":"rtsp_camera", 21 | "inputs":[ 22 | { 23 | "description":"Camera username", 24 | "name":"username", 25 | "type":"string", 26 | "default":"admin" 27 | }, 28 | { 29 | 
"description":"Camera password", 30 | "name":"password", 31 | "type":"string", 32 | "default":"123456" 33 | }, 34 | { 35 | "description":"Camera streamUrl", 36 | "name":"streamUrl", 37 | "type":"string", 38 | "default":"rtsp://10.0.0.212:554/stream0" 39 | } 40 | ], 41 | "outputs":[ 42 | { 43 | "description":"Video stream output", 44 | "name":"video_out", 45 | "type":"media" 46 | } 47 | ] 48 | } 49 | ] 50 | } 51 | } -------------------------------------------------------------------------------- /08_end-to-end/README.md: -------------------------------------------------------------------------------- 1 | # End-to-end Edge Computer Vision (CV) Machine Learning (ML) Orchistration 2 | 3 | ## Introduction 4 | 5 | In this module, you will walk through an end-to-end process of developing an enterprise Computer Vision(CV) solution. The module starts with data label, then move to training orchestration where you will prepare, train, and evaluate the models. If model performance meets your expectation, you will deploy it to an AWS panorama for edge inference. The CV example is predicting different species of birds, but technique and concepts are applicable to a wide range of industrial CV use cases. 6 | 7 | --- 8 | ## Folder Structure 9 | - 1_training - End-to-end MLOps pipeline example. 10 | - 2_deployment - Panorama application code to deploy to AWS Panorama device or simulate using the test untility. 11 | 12 | ## Architecture 13 | --- 14 | ![MLOps Pattern](1_training/statics/end2end.png) 15 | 16 | The architecture we will build during this workshop is illustrated on the right. Several key components can be highlighted: 17 | 18 | 1. **Data labeling w/ SageMaker GroundTruth**: We are using the CUB_MINI.tar data which is a subset of [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset. The code below will spin up a GoundTruth labeling job for private/public workforce to label. The output is a fully labeled manifest file that can be ingested by the atutomated training pipeline. 19 | 20 | 2. **Automated training Pipeline** Once the dataset is fully labelled, we have prepared a SageMaker pipeline that will go through steps: preprocessing, model training, model evaluation, and storing the model to model registry. At the same time, the example code will provide advance topics like streaming data with pipe mode and fast file mode, monitor training with SageMaker debugger, and distributed training to increase training effeciency and reduce training time. 21 | 22 | 3. **Model Deployment & Edge Inferencing w/ AWS Panorama** With manual approval of the model in Model registry, we can kick off the edge deploymnet with a deployment lambda. That can be accomplish using the AWS Panorama API. We will also illustration the edge application development and testing using the AWS Panorama Test Utility (Emulator) 23 | 24 | --- 25 | ### Setup Test Utility 26 | 27 | Panorama Test Utility is a set of python libraries and commandline commands, which allows you to test-run Panorama applications without Panorama appliance device. With Test Utility, you can start running sample applications and developing your own Panorama applications before preparing real Panorama appliance. Sample applications in this repository also use Test Utility. 28 | 29 | For more about the Test Utility and its current capabilities, please refer to [Introducing AWS Panorama Test Utility document](https://github.com/aws-samples/aws-panorama-samples/blob/main/docs/AboutTestUtility.md). 
30 | 31 | To set up your environment for the Test Utility, please refer to **[Test Utility environment setup](https://github.com/aws-samples/aws-panorama-samples/blob/main/docs/EnvironmentSetup.md)**. 32 | 33 | ![Panorama Test Utility](1_training/statics/panorama_test.png) 34 | 35 | --- 36 | 37 | ## Prerequisites 38 | 39 | To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need: 40 | 41 | - Access to the SageMaker default S3 bucket, or use your own 42 | - The S3 bucket that you use for this demo must have a CORS policy attached. To learn more about this requirement, and how to attach a CORS policy to an S3 bucket, see [CORS Permission Requirement](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html) 43 | - The notebook uses a public workforce (Mechanical Turk) for the Ground Truth labeling job, which will incur a cost of **$0.012 per image**. If you want to use a private workforce, you must have a private workforce already created and update the code with your private workforce ARN. Please follow this [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html) on how to create a private workforce 44 | - Access to Elastic Container Registry (ECR) 45 | - Familiarity with training on Amazon SageMaker 46 | - Familiarity with SageMaker Processing Jobs 47 | - Familiarity with SageMaker Pipelines 48 | - Familiarity with Python 49 | - Familiarity with AWS S3 50 | - Basic familiarity with AWS Command Line Interface (CLI) -- ideally, you should have it set up with credentials to access the AWS account you're running this notebook from. 51 | - SageMaker Studio is preferred for the full UI integration 52 | 53 | --- 54 | 55 | ## Dataset 56 | The dataset we are using is the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species. Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not. 57 | 58 | ![Bird Dataset](1_training/statics/birds.png) 59 | 60 | Run the code in the notebook to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, please keep the file to avoid re-downloading and re-processing the data. -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## End-to-end Computer Vision (CV) Training Workshop 2 | --- 3 | 4 | This is an end-to-end CV MLOps workshop aimed at helping Machine Learning (ML) and Data Science (DS) teams build relevant AWS and SageMaker competencies for an enterprise-scale solution. The content is derived from a real-world CV use case where an image classification model is developed and trained on SageMaker and then deployed to an edge computing device. Here is a diagram overview of the workshop and the learning outcomes for each module. 5 | 6 | ![Workshop Overview](statics/cv-workshop-overview.png) 7 | 8 | 9 | The curriculum consists of the following modules: 10 | 11 | 1. [Data Labeling (Optional)](01_groundtruth(optional)/README.md) 12 | 2. [Preprocessing](02_preprocessing/README.md) 13 | 3. [Training on SageMaker](03_training/README.md) 14 | 4. [Advanced Training on SageMaker](04_advanced_training/README.md) 15 | 5. [Model Evaluation and Model Explainability](05_model_evaluation_and_model_explainability/README.md) 16 | 6. [SageMaker Training Pipeline](06_training_pipeline/README.md) 17 | 7. [Edge Deployment](07_deployment/README.md) 18 | 8. [End-to-end](08_end-to-end/README.md) 19 | 20 | To get started, load the provided Jupyter notebooks and associated files to your SageMaker Studio environment. 21 | 22 | ## Security 23 | 24 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 25 | 26 | ## License 27 | 28 | This library is licensed under the MIT-0 License. See the LICENSE file. 29 | -------------------------------------------------------------------------------- /statics/cv-workshop-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/statics/cv-workshop-overview.png -------------------------------------------------------------------------------- /statics/workshop_overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/end-to-end-workshop-for-computer-vision/b441f75617feb9f2ca9354a4e889860926c590df/statics/workshop_overview.png --------------------------------------------------------------------------------