├── .github
    └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── Dockerfile_tf_serving
├── LICENSE
├── Makefile
├── README.md
├── coco_label_map.py
├── deploy.sh
├── images
    └── overview.png
├── k8s-daemonset.yml
├── requirements.txt
├── stack.cfn.yml
└── test.py


/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Issue #, if available:*
2 | 
3 | *Description of changes:*
4 | 
5 | 
6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
7 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp
2 | custom.mk
3 | *.pyc
4 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-elastic-inference-eks/issues), or [recently closed](https://github.com/aws-samples/amazon-elastic-inference-eks/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. Because our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), any ['help wanted'](https://github.com/aws-samples/amazon-elastic-inference-eks/labels/help%20wanted) issues are a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 | 


--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM python:3.6
 2 | 
 3 | WORKDIR /usr/src/app
 4 | 
 5 | COPY requirements.txt ./
 6 | 
 7 | RUN pip3 install --no-cache-dir -r requirements.txt
 8 | 
 9 | COPY . .
10 | 
11 | CMD [ "python", "-u", "./test.py" ]
12 | 


--------------------------------------------------------------------------------
/Dockerfile_tf_serving:
--------------------------------------------------------------------------------
 1 | FROM amazonlinux
 2 | 
 3 | # install missing packages
 4 | RUN yum install -y wget tar git
 5 | 
 6 | # install binary for TensorFlow Serving modified for Elastic Inference
 7 | RUN wget -q -O - https://s3.amazonaws.com/amazonei-tensorflow/tensorflow-serving/v1.12/amazonlinux/latest/tensorflow-serving-1-12-0-amazonlinux-ei-1-1.tar.gz | tar -xvz
 8 | 
 9 | RUN chmod +x /tensorflow-serving-1-12-0-amazonlinux-ei-1-1/amazonei_tensorflow_model_server
10 | 
11 | # install object detection model
12 | WORKDIR /models
13 | RUN wget -nv -O model.tar.gz \
14 | http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
15 | RUN tar -xvf model.tar.gz
16 | RUN mkdir -p object-detect/1
17 | RUN find -name saved_model -exec mv {}/saved_model.pb {}/variables object-detect/1/ \;
18 | 
19 | WORKDIR /
20 | 
21 | CMD ["./tensorflow-serving-1-12-0-amazonlinux-ei-1-1/amazonei_tensorflow_model_server", \
22 |     "--rest_api_port=8501", \
23 |     "--model_base_path=/models/object-detect"]
24 | 
25 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 4 | this software and associated documentation files (the "Software"), to deal in
 5 | the Software without restriction, including without limitation the rights to
 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 7 | the Software, and to permit persons to whom the Software is furnished to do so.
 8 | 
 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | 
 2 | # This file was heavily influenced by the AWS EKS Reference Architecture
 3 | # https://github.com/aws-samples/amazon-eks-refarch-cloudformation
 4 | 
 5 | CUSTOM_FILE ?= custom.mk
 6 | ifneq ("$(wildcard $(CUSTOM_FILE))","")
 7 | 	include $(CUSTOM_FILE)
 8 | endif
 9 | 
10 | ROOT ?= $(shell pwd)
11 | AWS_ACCOUNT_ID := $(shell aws sts get-caller-identity --query 'Account' --output text)
12 | CLUSTER_STACK_NAME ?= eks-ei-blog
13 | CLUSTER_NAME ?= $(CLUSTER_STACK_NAME)
14 | EKS_ADMIN_ROLE ?= arn:aws:iam::$(AWS_ACCOUNT_ID):role/EksEiBlogPostRole
15 | REGION ?= 'us-east-1'
16 | AZ_0 ?= 'us-east-1a'
17 | AZ_1 ?= 'us-east-1b'
18 | SSH_KEY_NAME ?= ''
19 | USER_ARN ?= $(shell aws sts get-caller-identity --output text --query 'Arn')
20 | EI_TYPE ?= 'eia1.medium'
21 | NODE_INSTANCE_TYPE ?= 'm5.large'
22 | INFERENCE_NODE_INSTANCE_TYPE ?= 'c5.large'
23 | NODE_ASG_MIN ?= 1
24 | NODE_ASG_MAX ?= 5
25 | NODE_ASG_DESIRED ?= 2
26 | INFERENCE_NODE_ASG_MAX ?= 6
27 | INFERENCE_NODE_ASG_MIN ?= 2
28 | INFERENCE_NODE_ASG_DESIRED ?= 2
29 | NODE_VOLUME_SIZE ?= 100
30 | INFERENCE_NODE_VOLUME_SIZE ?= 100
31 | LAMBDA_CR_BUCKET_PREFIX ?= 'pub-cfn-cust-res-pocs'
32 | DEFAULT_SQS_TASK_VISIBILITY ?= 7200
33 | DEFAULT_SQS_TASK_COMPLETED_VISIBILITY ?= 500
34 | INFERENCE_SCALE_PERIODS ?= 1
35 | INFERENCE_SCALE_OUT_THRESHOLD ?= 2
36 | INFERENCE_SCALE_IN_THRESHOLD ?= 2
37 | INFERENCE_NODE_GROUP_NAME ?= 'inference'
38 | NODE_GROUP_NAME ?= 'standard'
39 | INFERENCE_BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference
40 | BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=false,nodegroup=standard
41 | 
42 | ROLE_TRUST ?= '{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "cloudformation.amazonaws.com" }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Service": "lambda.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }'
43 | 
44 | .PHONY: create-role
45 | create-role:
46 | 	@aws iam create-role --role-name EksEiBlogPostRole --assume-role-policy-document $(ROLE_TRUST) --output text --query 'Role.Arn'
47 | 	@aws iam attach-role-policy --role-name EksEiBlogPostRole --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
48 | 
49 | .PHONY: update-kubeconfig
50 | update-kubeconfig:
51 | 	@aws eks update-kubeconfig --region $(REGION) --name $(CLUSTER_NAME)
52 | 
53 | .PHONY: deploy-daemonset
54 | deploy-daemonset:
55 | 	@kubectl apply -f k8s-daemonset.yml
56 | 
57 | .PHONY: create-cluster
58 | create-cluster:
59 | 	@aws --region $(REGION) cloudformation create-stack \
60 |   --template-body file://stack.cfn.yml  \
61 |   --stack-name  $(CLUSTER_STACK_NAME) \
62 |   --role-arn $(EKS_ADMIN_ROLE) \
63 |   --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
64 |   --parameters \
65 |   ParameterKey=EksClusterName,ParameterValue="$(CLUSTER_NAME)" \
66 |   ParameterKey=AvailabilityZone0,ParameterValue="$(AZ_0)" \
67 |   ParameterKey=AvailabilityZone1,ParameterValue="$(AZ_1)" \
68 |   ParameterKey=AdminUserArn,ParameterValue="$(USER_ARN)" \
69 |   ParameterKey=CreateRoleArn,ParameterValue="$(EKS_ADMIN_ROLE)" \
70 |   ParameterKey=KeyName,ParameterValue="$(SSH_KEY_NAME)" \
71 |   ParameterKey=InferenceNodeGroupName,ParameterValue="$(INFERENCE_NODE_GROUP_NAME)" \
72 |   ParameterKey=InferenceBootstrapArguments,ParameterValue="'$(INFERENCE_BOOTSTRAP)'" \
73 |   ParameterKey=BootstrapArguments,ParameterValue="'$(BOOTSTRAP)'" \
74 |   ParameterKey=NodeGroupName,ParameterValue="$(NODE_GROUP_NAME)" \
75 |   ParameterKey=NodeInstanceType,ParameterValue="$(NODE_INSTANCE_TYPE)" \
76 |   ParameterKey=InferenceNodeInstanceType,ParameterValue="$(INFERENCE_NODE_INSTANCE_TYPE)" \
77 |   ParameterKey=ElasticInferenceType,ParameterValue="$(EI_TYPE)" \
78 | 	ParameterKey=NodeAutoScalingGroupMinSize,ParameterValue="$(NODE_ASG_MIN)" \
79 |  	ParameterKey=NodeAutoScalingGroupMaxSize,ParameterValue="$(NODE_ASG_MAX)" \
80 | 	ParameterKey=NodeAutoScalingGroupDesiredCapacity,ParameterValue="$(NODE_ASG_DESIRED)" \
81 |   ParameterKey=InferenceNodeAutoScalingGroupMinSize,ParameterValue="$(INFERENCE_NODE_ASG_MIN)" \
82 |   ParameterKey=InferenceNodeAutoScalingGroupMaxSize,ParameterValue="$(INFERENCE_NODE_ASG_MAX)" \
83 |   ParameterKey=InferenceNodeAutoScalingGroupDesiredCapacity,ParameterValue="$(INFERENCE_NODE_ASG_DESIRED)" \
84 |   ParameterKey=NodeVolumeSize,ParameterValue="$(NODE_VOLUME_SIZE)" \
85 |   ParameterKey=InferenceNodeVolumeSize,ParameterValue="$(INFERENCE_NODE_VOLUME_SIZE)" \
86 |   ParameterKey=LambdaCustomResourceBucketPrefix,ParameterValue="$(LAMBDA_CR_BUCKET_PREFIX)" \
87 |   ParameterKey=DefaultTaskQueueVisibilityTimeout,ParameterValue="$(DEFAULT_SQS_TASK_VISIBILITY)" \
88 |   ParameterKey=DefaultTaskCompletedQueueVisibilityTimeout,ParameterValue="$(DEFAULT_SQS_TASK_COMPLETED_VISIBILITY)" \
89 |   ParameterKey=InferenceScaleEvaluationPeriods,ParameterValue="$(INFERENCE_SCALE_PERIODS)" \
90 |   ParameterKey=InferenceQueueDepthScaleOutThreshold,ParameterValue="$(INFERENCE_SCALE_OUT_THRESHOLD)" \
91 | 	ParameterKey=InferenceQueueDepthScaleInThreshold,ParameterValue="$(INFERENCE_SCALE_IN_THRESHOLD)"
92 | 	@echo open "https://console.aws.amazon.com/cloudformation/home?region=$(REGION)#/stacks to see the details"
93 | 
94 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | ## Amazon Elastic Inference with Amazon EKS
  2 | 
  3 | This repository contains resources demonstrating how to use [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) (EI) and [Amazon EKS](https://aws.amazon.com/eks/) together to deliver a cost-optimized, scalable solution for performing inference on video frames. More specifically, the solution runs containers in Amazon EKS that read a video from Amazon S3, preprocess its frames, then send the frames for object detection to a TensorFlow Serving container modified to work with Amazon EI. This computationally intensive use case showcases the advantages of using Amazon EI and Amazon EKS together to achieve accelerated inference at low cost within a scalable, containerized architecture.
  4 | 
  5 | ![overview](images/overview.png)
  6 | 
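The frame-in, detections-out flow described above can be sketched in a few lines. This is a minimal sketch, not the repository's `test.py`. It assumes that the serving container exposes the TensorFlow Serving REST predict endpoint on port 8501 (the port set in `Dockerfile_tf_serving`), that the model is served under the server's default model name `default` (no `--model_name` flag is passed), and that the response uses the usual TensorFlow Object Detection API output keys (`num_detections`, `detection_classes`, `detection_scores`). The helper names and the three-entry label map are illustrative only.

```python
import json

# Assumed endpoint: port 8501 comes from Dockerfile_tf_serving; "default" is
# the model name TensorFlow Serving uses when --model_name is not set.
PREDICT_URL = "http://localhost:8501/v1/models/default:predict"

# Small excerpt of the COCO label map in coco_label_map.py.
LABEL_MAP = {1: "person", 3: "car", 18: "dog"}


def build_predict_payload(frame):
    """Wrap one frame (nested lists of pixel values, H x W x 3) in the
    row-format JSON body accepted by the TensorFlow Serving REST API."""
    return json.dumps({"instances": [frame]})


def decode_detections(response_text, label_map, min_score=0.5):
    """Turn a predict response into (label, score) pairs above min_score."""
    prediction = json.loads(response_text)["predictions"][0]
    count = int(prediction["num_detections"])
    results = []
    for class_id, score in zip(prediction["detection_classes"][:count],
                               prediction["detection_scores"][:count]):
        if score >= min_score:
            # Class IDs arrive as floats; COCO IDs have gaps, so fall back
            # to "unknown" instead of raising KeyError.
            results.append((label_map.get(int(class_id), "unknown"), score))
    return results


if __name__ == "__main__":
    tiny_frame = [[[0, 0, 0], [255, 255, 255]]]  # a 1 x 2 RGB frame
    print(build_predict_payload(tiny_frame))
    fake_response = json.dumps({"predictions": [{
        "num_detections": 2.0,
        "detection_classes": [1.0, 18.0, 1.0],
        "detection_scores": [0.9, 0.4, 0.0],
    }]})
    print(decode_detections(fake_response, LABEL_MAP))
```

In a running pod, the payload could be posted with `requests.post(PREDICT_URL, data=payload)` (`requests` is pinned in `requirements.txt`), and the full `label_map` from `coco_label_map.py` would replace the excerpt.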
  7 | ## Deploy
  8 | 
  9 | The following steps require the [AWS Command Line Interface](https://aws.amazon.com/cli/) to be installed and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). Additionally, you must follow the AWS instructions for installing the [IAM compatible version of kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html). You must also install the [aws-iam-authenticator](https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html) to help with IAM integration.
 10 | 
 11 | Amazon EKS and Elastic Inference are currently not available in all AWS regions. Consult the [AWS Region Table](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) for more information.
 12 | 
 13 | If you would like to see the operations behind the make commands, [see the Makefile](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/Makefile). Likewise, if you would like to see the resources being created and how the EI accelerator is attached, see the [CloudFormation template](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/stack.cfn.yml).
 14 | 
 15 | The first step in the process is to clone this repository:
 16 | 
 17 | ```
 18 | git clone https://github.com/aws-samples/amazon-elastic-inference-eks.git && cd amazon-elastic-inference-eks
 19 | ```
 20 | 
 21 | In order to create both the Amazon EKS cluster and the node groups in a single step, first we must create an IAM role to be used by a Custom Resource in the CloudFormation template to programmatically modify the EKS aws-auth ConfigMap. Modifying the aws-auth ConfigMap is required so that the EC2 instances can register themselves with the cluster.
 22 | 
 23 | ```
 24 | make create-role
 25 | ```
 26 | 
 27 | After the IAM role is created, you can launch the cluster by running:
 28 | 
 29 | ```
 30 | make create-cluster
 31 | ```
 32 | 
 33 | This step takes 10-15 minutes to create all of the required resources. Once the command is executed,
 34 | an AWS Management Console link to the CloudFormation section is printed. You can monitor the resource creation
 35 | there. Once the stack status is CREATE_COMPLETE, you can move on to the next step.
 36 | 
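The status check described above can also be scripted instead of watched in the console. The following is a minimal sketch under stated assumptions: `wait_for_stack` is a hypothetical helper, the stack name defaults to the Makefile's `eks-ei-blog`, and `cfn` is expected to behave like a boto3 CloudFormation client (only `describe_stacks` is used).

```python
import time

# Terminal states that mean stack creation did not succeed.
FAILED_STATES = {
    "CREATE_FAILED",
    "ROLLBACK_IN_PROGRESS",
    "ROLLBACK_FAILED",
    "ROLLBACK_COMPLETE",
}


def wait_for_stack(cfn, stack_name="eks-ei-blog", poll_seconds=30,
                   max_polls=60, sleep=time.sleep):
    """Poll the stack until it reaches CREATE_COMPLETE.

    Raises RuntimeError if creation fails and TimeoutError if the stack is
    still in progress after max_polls checks.
    """
    for _ in range(max_polls):
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if status == "CREATE_COMPLETE":
            return status
        if status in FAILED_STATES:
            raise RuntimeError("stack creation failed: " + status)
        sleep(poll_seconds)
    raise TimeoutError("stack did not finish within the polling window")


def wait_for_default_stack(region="us-east-1"):
    """Convenience wrapper; boto3 is imported lazily so the helper above
    can be exercised without AWS credentials."""
    import boto3
    return wait_for_stack(boto3.client("cloudformation", region_name=region))
```

boto3 also ships a built-in waiter, `client.get_waiter("stack_create_complete")`, which covers the same case with less control over error reporting.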
 37 | If you are interested in overriding the default values in the Makefile, create a custom.mk file in the same directory and set the values. The following parameters can be overridden:
 38 | 
 39 | ```
 40 | CLUSTER_STACK_NAME ?= eks-ei-blog
 41 | CLUSTER_NAME ?= $(CLUSTER_STACK_NAME)
 42 | EKS_ADMIN_ROLE ?= arn:aws:iam::$(AWS_ACCOUNT_ID):role/EksEiBlogPostRole
 43 | REGION ?= 'us-east-1'
 44 | AZ_0 ?= 'us-east-1a'
 48 | AZ_1 ?= 'us-east-1b'
 49 | SSH_KEY_NAME ?= 'somekey'
 50 | USER_ARN ?= $(shell aws sts get-caller-identity --output text --query 'Arn')
 51 | EI_TYPE ?= 'eia1.medium'
 52 | NODE_INSTANCE_TYPE ?= 'm5.large'
 53 | INFERENCE_NODE_INSTANCE_TYPE ?= 'c5.large'
 54 | NODE_ASG_MIN ?= 1
 55 | NODE_ASG_MAX ?= 5
 56 | NODE_ASG_DESIRED ?= 2
 57 | INFERENCE_NODE_ASG_MAX ?= 6
 58 | INFERENCE_NODE_ASG_MIN ?= 2
 59 | INFERENCE_NODE_ASG_DESIRED ?= 2
 60 | NODE_VOLUME_SIZE ?= 100
 61 | INFERENCE_NODE_VOLUME_SIZE ?= 100
 62 | LAMBDA_CR_BUCKET_PREFIX ?= 'pub-cfn-cust-res-pocs'
 63 | DEFAULT_SQS_TASK_VISIBILITY ?= 7200
 64 | DEFAULT_SQS_TASK_COMPLETED_VISIBILITY ?= 500
 65 | INFERENCE_SCALE_PERIODS ?= 1
 66 | INFERENCE_SCALE_OUT_THRESHOLD ?= 2
 67 | INFERENCE_SCALE_IN_THRESHOLD ?= 2
 68 | INFERENCE_NODE_GROUP_NAME ?= 'inference'
 69 | NODE_GROUP_NAME ?= 'standard'
 70 | INFERENCE_BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference
 71 | BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=false,nodegroup=standard
 72 | ```
 73 | 
 74 | After the cluster is created, update your local [kubeconfig](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) file by running:
 75 | 
 76 | ```
 77 | make update-kubeconfig
 78 | ```
 79 | 
 80 | You can test your local Amazon EKS authentication by running:
 81 | 
 82 | ```
 83 | kubectl get nodes
 84 | ```
 85 | 
 86 | The output should provide high-level information about the four nodes (or a different number if you overrode the defaults). Next, we are going to deploy the [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) to the cluster:
 87 | 
 88 | ```
 89 | kubectl apply -f k8s-daemonset.yml
 90 | ```
 91 | 
 92 | You can check the status of the deployment by running:
 93 | 
 94 | ```
 95 | kubectl describe daemonset inference-daemon
 96 | ```
 97 | Wait until the "Pod Status" shows the total count as "Running".
 98 | 
 99 | ## Run
100 | 
101 | Next, upload a sample MOV file to the data bucket created by the stack. In the S3 console,
102 | you will see a bucket whose name has the following format (the default CLUSTER/STACK NAME is eks-ei-blog):
103 | 
104 | ```
105 | task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME
106 | ```
107 | 
108 | From the command line, you can use a command similar to the following to upload sample content to the bucket:
109 | 
110 | ```
111 | aws s3 cp --region REGION sample.mov s3://task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME
112 | ```
113 | 
114 | Once you have sample data in the bucket, in the SQS section of the AWS Management Console, [submit a sample task](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-send-message.html). The queue name resembles:
115 | 
116 | ```
117 | task-queue-CLUSTER/STACK NAME
118 | ```
119 | 
120 | The format of the message submitted in the console is:
121 | 
122 | ```
123 | { "bucket": "task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME", "object": "sample.mov" }
124 | ```
125 | 
126 | Change the bucket name and the object key to match your deployment.
127 | 
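As an alternative to the console, the task message can be built and submitted from Python. This is a minimal sketch assuming the naming formats shown above and the default stack name `eks-ei-blog`; the function names are hypothetical, and `submit_task` needs `boto3` (pinned in `requirements.txt`) plus credentials that can send to the queue.

```python
import json


def task_bucket_name(account_id, region, stack_name="eks-ei-blog"):
    """Data bucket name: task-data-bucket-ACCOUNT_ID-REGION-STACK_NAME."""
    return "task-data-bucket-{}-{}-{}".format(account_id, region, stack_name)


def task_queue_name(stack_name="eks-ei-blog"):
    """Task queue name: task-queue-STACK_NAME."""
    return "task-queue-{}".format(stack_name)


def build_task_message(bucket, key):
    """JSON body in the format shown above."""
    return json.dumps({"bucket": bucket, "object": key})


def submit_task(account_id, region, key, stack_name="eks-ei-blog"):
    """Send one task message to the queue and return its SQS message ID."""
    import boto3  # imported lazily so the helpers above work without it
    sqs = boto3.client("sqs", region_name=region)
    queue_url = sqs.get_queue_url(
        QueueName=task_queue_name(stack_name))["QueueUrl"]
    body = build_task_message(
        task_bucket_name(account_id, region, stack_name), key)
    return sqs.send_message(QueueUrl=queue_url, MessageBody=body)["MessageId"]
```

For example, `submit_task("123456789012", "us-east-1", "sample.mov")` would queue the video uploaded earlier, where the account ID is a placeholder.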
128 | To see the inference results, you can [view messages](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-receive-delete-message.html) on the task completed SQS queue in the AWS Management Console.
129 | 
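The results can also be read programmatically. The sketch below makes several assumptions: the completed queue is assumed to be named `task-completed-queue-STACK_NAME` (by analogy with `task-completed-queue-eks-a` in `k8s-daemonset.yml`), the message bodies are returned as raw strings because their schema is not documented here, and `sqs` is expected to behave like a boto3 SQS client. Note that, unlike viewing messages in the console, this helper deletes each message after reading it.

```python
def completed_queue_name(stack_name="eks-ei-blog"):
    # Assumed naming, inferred from the value in k8s-daemonset.yml.
    return "task-completed-queue-{}".format(stack_name)


def drain_completed_queue(sqs, queue_url, max_batches=10):
    """Collect message bodies from the queue, deleting each after reading.

    Stops early when a receive call returns no messages.
    """
    bodies = []
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5)
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            bodies.append(message["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message["ReceiptHandle"])
    return bodies


def print_results(region="us-east-1", stack_name="eks-ei-blog"):
    """Print every result currently on the completed queue."""
    import boto3  # imported lazily; requires AWS credentials at call time
    sqs = boto3.client("sqs", region_name=region)
    url = sqs.get_queue_url(
        QueueName=completed_queue_name(stack_name))["QueueUrl"]
    for body in drain_completed_queue(sqs, url):
        print(body)
```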
130 | 
131 | ## Cleanup
132 | 
133 | You must delete the DaemonSet before terminating the Amazon EKS cluster, or there will be resources that cannot be
134 | reclaimed by CloudFormation. To do so, run:
135 | 
136 | ```
137 | kubectl delete -f k8s-daemonset.yml
138 | ```
139 | 
140 | Once the DaemonSet is terminated, you can [delete the stack](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-delete-stack.html) in the CloudFormation section of the AWS Management Console.
141 | 
142 | ## License Summary
143 | 
144 | This sample code is made available under the MIT-0 license. See the LICENSE file.
145 | 


--------------------------------------------------------------------------------
/coco_label_map.py:
--------------------------------------------------------------------------------
 1 | label_map = {
 2 |     1: 'person',
 3 |     2: 'bicycle',
 4 |     3: 'car',
 5 |     4: 'motorcycle',
 6 |     5: 'airplane',
 7 |     6: 'bus',
 8 |     7: 'train',
 9 |     8: 'truck',
10 |     9: 'boat',
11 |     10: 'traffic light',
12 |     11: 'fire hydrant',
13 |     13: 'stop sign',
14 |     14: 'parking meter',
15 |     15: 'bench',
16 |     16: 'bird',
17 |     17: 'cat',
18 |     18: 'dog',
19 |     19: 'horse',
20 |     20: 'sheep',
21 |     21: 'cow',
22 |     22: 'elephant',
23 |     23: 'bear',
24 |     24: 'zebra',
25 |     25: 'giraffe',
26 |     27: 'backpack',
27 |     28: 'umbrella',
28 |     31: 'handbag',
29 |     32: 'tie',
30 |     33: 'suitcase',
31 |     34: 'frisbee',
32 |     35: 'skis',
33 |     36: 'snowboard',
34 |     37: 'sports ball',
35 |     38: 'kite',
36 |     39: 'baseball bat',
37 |     40: 'baseball glove',
38 |     41: 'skateboard',
39 |     42: 'surfboard',
40 |     43: 'tennis racket',
41 |     44: 'bottle',
42 |     46: 'wine glass',
43 |     47: 'cup',
44 |     48: 'fork',
45 |     49: 'knife',
46 |     50: 'spoon',
47 |     51: 'bowl',
48 |     52: 'banana',
49 |     53: 'apple',
50 |     54: 'sandwich',
51 |     55: 'orange',
52 |     56: 'broccoli',
53 |     57: 'carrot',
54 |     58: 'hot dog',
55 |     59: 'pizza',
56 |     60: 'donut',
57 |     61: 'cake',
58 |     62: 'chair',
59 |     63: 'couch',
60 |     64: 'potted plant',
61 |     65: 'bed',
62 |     67: 'dining table',
63 |     70: 'toilet',
64 |     72: 'tv',
65 |     73: 'laptop',
66 |     74: 'mouse',
67 |     75: 'remote',
68 |     76: 'keyboard',
69 |     77: 'cell phone',
70 |     78: 'microwave',
71 |     79: 'oven',
72 |     80: 'toaster',
73 |     81: 'sink',
74 |     82: 'refrigerator',
75 |     84: 'book',
76 |     85: 'clock',
77 |     86: 'vase',
78 |     87: 'scissors',
79 |     88: 'teddy bear',
80 |     89: 'hair drier',
81 |     90: 'toothbrush'
82 | }
83 | 


--------------------------------------------------------------------------------
/deploy.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | set -euo pipefail  # abort on the first failed step so a broken build is never pushed
 3 | TAG=$(git log -1 --pretty=%H)
 4 | 
 5 | echo EI APP --------------------------------------------------------------------
 6 | 
 7 | APP_REPOSITORY=rnzdocker1/eks-elastic-inference-app
 8 | 
 9 | python -m py_compile test.py
10 | 
11 | docker build --tag $APP_REPOSITORY:$TAG .
12 | 
13 | docker push $APP_REPOSITORY:$TAG
14 | 
15 | echo EI TensorFlow Serving -----------------------------------------------------
16 | 
17 | SERVING_REPOSITORY=rnzdocker1/eks-elastic-inference-serving
18 | 
19 | docker build --file Dockerfile_tf_serving --tag $SERVING_REPOSITORY:$TAG .
20 | 
21 | docker push $SERVING_REPOSITORY:$TAG
22 | 
23 | 
24 | 


--------------------------------------------------------------------------------
/images/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-elastic-inference-eks/475452ba42ebff295a482fb971012f8de21ac162/images/overview.png


--------------------------------------------------------------------------------
/k8s-daemonset.yml:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License").
  4 | # You may not use this file except in compliance with the License.
  5 | # A copy of the License is located at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # or in the "license" file accompanying this file. This file is distributed
 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 11 | # express or implied. See the License for the specific language governing
 12 | # permissions and limitations under the License.
 13 | 
 14 | apiVersion: v1
 15 | kind: ServiceAccount
 16 | metadata:
 17 |   labels:
 18 |     app: inference-daemon
 19 |   name: inference-daemon
 20 |   namespace: default
 21 | ---
 22 | apiVersion: rbac.authorization.k8s.io/v1beta1
 23 | kind: ClusterRoleBinding
 24 | metadata:
 25 |   name: inference-daemon
 26 |   labels:
 27 |     app: inference-daemon
 28 | roleRef:
 29 |   apiGroup: rbac.authorization.k8s.io
 30 |   kind: ClusterRole
 31 |   name: cluster-admin
 32 | subjects:
 33 | - kind: ServiceAccount
 34 |   name: inference-daemon
 35 |   namespace: default
 36 | ---
 37 | apiVersion: extensions/v1beta1
 38 | kind: DaemonSet
 39 | metadata:
 40 |   name: inference-daemon
 41 | spec:
 42 |   updateStrategy:
 43 |     type: RollingUpdate
 44 |   template:
 45 |     metadata:
 46 |       labels:
 47 |         app: inference-daemon
 48 |     spec:
 49 |       volumes:
 50 |       - name: config-volume
 51 |         configMap:
 52 |           name: inference-config
 53 |       hostNetwork: true
 54 |       nodeSelector:
 55 |         nodegroup: elastic-inference
 56 |       containers:
 57 |       - name: inference-daemon
 58 |         image: rnzdocker1/eks-elastic-inference-app:ea770a6a256bade77730501c9a71d5670f55887d
 59 |         imagePullPolicy: Always
 60 |         env:
 61 |         - name: SQS_TASK_QUEUE
 62 |           value: task-queue-eks-a
 63 |         - name: SQS_TASK_COMPLETED_QUEUE
 64 |           value: task-completed-queue-eks-a
 65 |         resources:
 66 |           requests:
 67 |             memory: 1024Mi
 68 |       - name: tensorflow-serving
 69 |         image: rnzdocker1/eks-elastic-inference-serving:8e6ac6052e2d45da060aa0d22e1945b3631c4c1b
 70 |         imagePullPolicy: Always
 71 |         resources:
 72 |           requests:
 73 |             memory: 1024Mi
 74 |         ports:
 75 |           - name: tf-serving-api
 76 |             containerPort: 8501
 77 |             hostPort: 8501
 78 |             protocol: TCP
 79 |         volumeMounts:
 80 |         - name: config-volume
 81 |           mountPath: /app/ei
 82 |           readOnly: true
 83 | ---
 84 | # Configuration for the application
 85 | apiVersion: v1
 86 | kind: ConfigMap
 87 | metadata:
 88 |   name: inference-config
 89 | data:
 90 |   config.yaml: |-
 91 |     Version: 1
 92 | ---
 93 | # k8s service definition
 94 | apiVersion: v1
 95 | kind: Service
 96 | metadata:
 97 |   name: inference-service
 98 | spec:
 99 |   selector:
100 |     app: inference-daemon
101 |   clusterIP: None
102 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | 
2 | boto3==1.9.188
3 | requests==2.21.0
4 | numpy==1.16.3
5 | opencv-python-headless==4.1.0.25
6 | 


--------------------------------------------------------------------------------
/stack.cfn.yml:
--------------------------------------------------------------------------------
   1 | ---
   2 | AWSTemplateFormatVersion: 2010-09-09
   3 | 
   4 | Description: EKS + Elastic Inference Blog Post
   5 | 
   6 | 
   7 | Parameters:
   8 | 
   9 |   KeyName:
  10 |     Type: String
  11 |     Default: ''
  12 |     Description: The EC2 Key Pair to allow SSH access to the instances
  13 | 
  14 |   ElasticInferenceType:
  15 |     Description: Elastic Inference type
  16 |     Type: String
  17 |     Default: eia1.medium
  18 | 
  19 |   EksClusterName:
  20 |     Description: EKS Cluster Name
  21 |     Default: eks-ei-cluster
  22 |     Type: String
  23 | 
  24 |   NodeInstanceType:
  25 |     Description: EC2 instance type for the node instances
  26 |     Type: String
  27 |     Default: m5.large
  28 |     ConstraintDescription: Must be a valid EC2 instance type
  29 |     AllowedValues:
  30 |       - t2.small
  31 |       - t2.medium
  32 |       - t2.large
  33 |       - t2.xlarge
  34 |       - t2.2xlarge
  35 |       - t3.nano
  36 |       - t3.micro
  37 |       - t3.small
  38 |       - t3.medium
  39 |       - t3.large
  40 |       - t3.xlarge
  41 |       - t3.2xlarge
  42 |       - m3.medium
  43 |       - m3.large
  44 |       - m3.xlarge
  45 |       - m3.2xlarge
  46 |       - m4.large
  47 |       - m4.xlarge
  48 |       - m4.2xlarge
  49 |       - m4.4xlarge
  50 |       - m4.10xlarge
  51 |       - m5.large
  52 |       - m5.xlarge
  53 |       - m5.2xlarge
  54 |       - m5.4xlarge
  55 |       - m5.12xlarge
  56 |       - m5.24xlarge
  57 |       - c4.large
  58 |       - c4.xlarge
  59 |       - c4.2xlarge
  60 |       - c4.4xlarge
  61 |       - c4.8xlarge
  62 |       - c5.large
  63 |       - c5.xlarge
  64 |       - c5.2xlarge
  65 |       - c5.4xlarge
  66 |       - c5.9xlarge
  67 |       - c5.18xlarge
  68 |       - i3.large
  69 |       - i3.xlarge
  70 |       - i3.2xlarge
  71 |       - i3.4xlarge
  72 |       - i3.8xlarge
  73 |       - i3.16xlarge
  74 |       - r3.xlarge
  75 |       - r3.2xlarge
  76 |       - r3.4xlarge
  77 |       - r3.8xlarge
  78 |       - r4.large
  79 |       - r4.xlarge
  80 |       - r4.2xlarge
  81 |       - r4.4xlarge
  82 |       - r4.8xlarge
  83 |       - r4.16xlarge
  84 |       - x1.16xlarge
  85 |       - x1.32xlarge
  86 |       - p2.xlarge
  87 |       - p2.8xlarge
  88 |       - p2.16xlarge
  89 |       - p3.2xlarge
  90 |       - p3.8xlarge
  91 |       - p3.16xlarge
  92 |       - p3dn.24xlarge
  93 |       - r5.large
  94 |       - r5.xlarge
  95 |       - r5.2xlarge
  96 |       - r5.4xlarge
  97 |       - r5.12xlarge
  98 |       - r5.24xlarge
  99 |       - r5d.large
 100 |       - r5d.xlarge
 101 |       - r5d.2xlarge
 102 |       - r5d.4xlarge
 103 |       - r5d.12xlarge
 104 |       - r5d.24xlarge
 105 |       - z1d.large
 106 |       - z1d.xlarge
 107 |       - z1d.2xlarge
 108 |       - z1d.3xlarge
 109 |       - z1d.6xlarge
 110 |       - z1d.12xlarge
 111 | 
 112 |   InferenceNodeInstanceType:
 113 |     Description: EC2 instance type for the node instances
 114 |     Type: String
 115 |     Default: c5.large
 116 |     ConstraintDescription: Must be a valid EC2 instance type
 117 |     AllowedValues:
 118 |       - t2.small
 119 |       - t2.medium
 120 |       - t2.large
 121 |       - t2.xlarge
 122 |       - t2.2xlarge
 123 |       - t3.nano
 124 |       - t3.micro
 125 |       - t3.small
 126 |       - t3.medium
 127 |       - t3.large
 128 |       - t3.xlarge
 129 |       - t3.2xlarge
 130 |       - m3.medium
 131 |       - m3.large
 132 |       - m3.xlarge
 133 |       - m3.2xlarge
 134 |       - m4.large
 135 |       - m4.xlarge
 136 |       - m4.2xlarge
 137 |       - m4.4xlarge
 138 |       - m4.10xlarge
 139 |       - m5.large
 140 |       - m5.xlarge
 141 |       - m5.2xlarge
 142 |       - m5.4xlarge
 143 |       - m5.12xlarge
 144 |       - m5.24xlarge
 145 |       - c4.large
 146 |       - c4.xlarge
 147 |       - c4.2xlarge
 148 |       - c4.4xlarge
 149 |       - c4.8xlarge
 150 |       - c5.large
 151 |       - c5.xlarge
 152 |       - c5.2xlarge
 153 |       - c5.4xlarge
 154 |       - c5.9xlarge
 155 |       - c5.18xlarge
 156 |       - i3.large
 157 |       - i3.xlarge
 158 |       - i3.2xlarge
 159 |       - i3.4xlarge
 160 |       - i3.8xlarge
 161 |       - i3.16xlarge
 162 |       - r3.xlarge
 163 |       - r3.2xlarge
 164 |       - r3.4xlarge
 165 |       - r3.8xlarge
 166 |       - r4.large
 167 |       - r4.xlarge
 168 |       - r4.2xlarge
 169 |       - r4.4xlarge
 170 |       - r4.8xlarge
 171 |       - r4.16xlarge
 172 |       - x1.16xlarge
 173 |       - x1.32xlarge
 174 |       - p2.xlarge
 175 |       - p2.8xlarge
 176 |       - p2.16xlarge
 177 |       - p3.2xlarge
 178 |       - p3.8xlarge
 179 |       - p3.16xlarge
 180 |       - p3dn.24xlarge
 181 |       - r5.large
 182 |       - r5.xlarge
 183 |       - r5.2xlarge
 184 |       - r5.4xlarge
 185 |       - r5.12xlarge
 186 |       - r5.24xlarge
 187 |       - r5d.large
 188 |       - r5d.xlarge
 189 |       - r5d.2xlarge
 190 |       - r5d.4xlarge
 191 |       - r5d.12xlarge
 192 |       - r5d.24xlarge
 193 |       - z1d.large
 194 |       - z1d.xlarge
 195 |       - z1d.2xlarge
 196 |       - z1d.3xlarge
 197 |       - z1d.6xlarge
 198 |       - z1d.12xlarge
 199 | 
 200 |   NodeAutoScalingGroupMinSize:
 201 |     Description: Minimum size of Node Group ASG
 202 |     Type: Number
 203 |     Default: 1
 204 | 
 205 |   NodeAutoScalingGroupMaxSize:
 206 |     Description: Maximum size of Node Group ASG. Set to at least 1 greater than NodeAutoScalingGroupDesiredCapacity.
 207 |     Type: Number
 208 |     Default: 4
 209 | 
 210 |   NodeAutoScalingGroupDesiredCapacity:
 211 |     Description: Desired capacity of Node Group ASG.
 212 |     Type: Number
 213 |     Default: 2
 214 | 
 215 |   InferenceNodeAutoScalingGroupMinSize:
 216 |     Description: Minimum size of the inference Node Group ASG.
 217 |     Type: Number
 218 |     Default: 2
 219 | 
 220 |   InferenceNodeAutoScalingGroupMaxSize:
 221 |     Description: Maximum size of the inference Node Group ASG. Set to at least 1 greater than InferenceNodeAutoScalingGroupDesiredCapacity.
 222 |     Type: Number
 223 |     Default: 6
 224 | 
 225 |   InferenceNodeAutoScalingGroupDesiredCapacity:
 226 |     Description: Desired capacity of the inference Node Group ASG.
 227 |     Type: Number
 228 |     Default: 2
 229 | 
 230 |   NodeVolumeSize:
 231 |     Description: Node volume size in GiB
 232 |     Type: Number
 233 |     Default: 100
 234 | 
 235 |   InferenceNodeVolumeSize:
 236 |     Description: Inference node volume size in GiB
 237 |     Type: Number
 238 |     Default: 100
 239 | 
 240 |   InferenceNodeGroupName:
 241 |     Description: Unique identifier for the inference Node Group.
 242 |     Default: inference
 243 |     Type: String
 244 | 
 245 |   NodeGroupName:
 246 |     Description: Unique identifier for the Node Group.
 247 |     Default: standard
 248 |     Type: String
 249 | 
 250 |   InferenceBootstrapArguments:
 251 |     Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami
 252 |     Default: --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference
 253 |     Type: String
 254 | 
 255 |   BootstrapArguments:
 256 |     Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami
 257 |     Default: --kubelet-extra-args --node-labels=inference=false,nodegroup=standard
 258 |     Type: String
 259 | 
 260 |   AvailabilityZone0:
 261 |     Description: The first availability zone in the region
 262 |     Type: AWS::EC2::AvailabilityZone::Name
 263 |     ConstraintDescription: Must be a valid availability zone
 264 | 
 265 |   AvailabilityZone1:
 266 |     Description: The second availability zone in the region
 267 |     Type: AWS::EC2::AvailabilityZone::Name
 268 |     ConstraintDescription: Must be a valid availability zone
 269 | 
 270 |   CreateRoleArn:
 271 |     Type: String
 272 | 
 273 |   AdminUserArn:
 274 |     Type: String
 275 | 
 276 |   LambdaCustomResourceBucketPrefix:
 277 |     Type: String
 278 |     Default: pub-cfn-cust-res-pocs
 279 | 
 280 |   DefaultTaskQueueVisibilityTimeout:
 281 |     Type: Number
 282 |     Description: Visibility timeout, in seconds, for the task queue
 283 |     MinValue: 0
 284 |     MaxValue: 43200
 285 |     Default: 7200 # Two hour default
 286 | 
 287 |   DefaultTaskCompletedQueueVisibilityTimeout:
 288 |     Type: Number
 289 |     Description: Visibility timeout, in seconds, for the task-completed queue
 290 |     MinValue: 0
 291 |     MaxValue: 43200
 292 |     Default: 500
 293 | 
 294 |   # Scaling params
 295 |   InferenceScaleEvaluationPeriods:
 296 |     Description: The number of periods over which data is compared to the specified threshold
 297 |     Type: Number
 298 |     Default: 1
 299 |     MinValue: 1
 300 | 
 301 |   InferenceQueueDepthScaleOutThreshold:
 302 |     Type: Number
 303 |     Description: Average queue depth value to trigger auto scaling out
 304 |     Default: 2
 305 | 
 306 |   InferenceQueueDepthScaleInThreshold:
 307 |     Type: Number
 308 |     Description: Average queue depth value to trigger auto scaling in
 309 |     Default: 2
 310 |     MinValue: 0
 311 |     ConstraintDescription: Value must be 0 or more
 312 | 
 313 | 
 314 | Mappings:
 315 | 
 316 |   # Maps CIDR blocks to VPC and various subnets
 317 |   CIDRMap:
 318 |     VPC:
 319 |       CIDR: 10.50.0.0/16
 320 |     Public0:
 321 |       CIDR: 10.50.0.0/19
 322 |     Public1:
 323 |       CIDR: 10.50.32.0/19
 324 | 
 325 |   # Amazon EKS Linux AMI - https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html
 326 |   AMIMap:
 327 |     us-east-2:
 328 |       AMI: ami-0485258c2d1c3608f
 329 |     us-east-1:
 330 |       AMI: ami-0f2e8e5663e16b436
 331 |     us-west-2:
 332 |       AMI: ami-03a55127c613349a7
 333 |     ap-northeast-1:
 334 |       AMI: ami-0fde798d17145fae1
 335 |     ap-northeast-2:
 336 |       AMI: ami-07fd7609df6c8e39b
 337 |     eu-west-1:
 338 |       AMI: ami-00ac2e6b3cb38a9b9
 339 | 
 340 | 
 341 | Resources:
 342 | 
 343 |   # VPC ------------------------------------------------------------------------
 344 | 
 345 |   Vpc:
 346 |     Type: AWS::EC2::VPC
 347 |     Properties:
 348 |       CidrBlock: !FindInMap [ CIDRMap, VPC, CIDR ]
 349 |       EnableDnsSupport: true
 350 |       EnableDnsHostnames: true
 351 |       Tags:
 352 |         - Key: Name
 353 |           Value: !Sub ${AWS::StackName}-vpc
 354 | 
 355 |   ElasticInferenceVpcEndpoint:
 356 |     Type: AWS::EC2::VPCEndpoint
 357 |     Properties:
 358 |       VpcId: !Ref Vpc
 359 |       ServiceName: !Sub com.amazonaws.${AWS::Region}.elastic-inference.runtime
 360 |       VpcEndpointType: Interface
 361 |       PrivateDnsEnabled: true
 362 |       SubnetIds:
 363 |         - !Ref PublicSubnet0
 364 |         - !Ref PublicSubnet1
 365 |       SecurityGroupIds:
 366 |         - !GetAtt ElasticInferenceVpcEndpointSecurityGroup.GroupId
 367 | 
 368 |   ElasticInferenceVpcEndpointSecurityGroup:
 369 |     Type: AWS::EC2::SecurityGroup
 370 |     Properties:
 371 |       GroupDescription: Elastic Inference Vpc Endpoint security group
 372 |       VpcId: !Ref Vpc
 373 |       SecurityGroupIngress:
 374 |         - SourceSecurityGroupId: !Ref NodeSecurityGroup
 375 |           IpProtocol: tcp
 376 |           ToPort: 443
 377 |           FromPort: 443
 378 |       SecurityGroupEgress:
 379 |         - CidrIp: 0.0.0.0/0
 380 |           IpProtocol: tcp
 381 |           ToPort: 443
 382 |           FromPort: 443
 383 | 
 384 |   InternetGateway:
 385 |     Type: AWS::EC2::InternetGateway
 386 |     Properties:
 387 |       Tags:
 388 |         - Key: Name
 389 |           Value: !Sub ${AWS::StackName}-igw
 390 | 
 391 |   VpcGatewayAttachment:
 392 |     Type: AWS::EC2::VPCGatewayAttachment
 393 |     Properties:
 394 |       InternetGatewayId: !Ref InternetGateway
 395 |       VpcId: !Ref Vpc
 396 | 
 397 |   RouteTable:
 398 |     Type: AWS::EC2::RouteTable
 399 |     Properties:
 400 |       VpcId: !Ref Vpc
 401 |       Tags:
 402 |         - Key: Name
 403 |           Value: Public Subnets
 404 |         - Key: Network
 405 |           Value: Public
 406 | 
 407 |   PublicSubnet0:
 408 |     Type: AWS::EC2::Subnet
 409 |     Properties:
 410 |       VpcId: !Ref Vpc
 411 |       CidrBlock: !FindInMap [ CIDRMap, Public0, CIDR ]
 412 |       AvailabilityZone: !Ref AvailabilityZone0
 413 |       Tags:
 414 |         - Key: Name
 415 |           Value: !Sub ${AWS::StackName}-PublicSubnet0
 416 | 
 417 |   PublicSubnet1:
 418 |     Type: AWS::EC2::Subnet
 419 |     Properties:
 420 |       VpcId: !Ref Vpc
 421 |       CidrBlock: !FindInMap [ CIDRMap, Public1, CIDR ]
 422 |       AvailabilityZone: !Ref AvailabilityZone1
 423 |       Tags:
 424 |         - Key: Name
 425 |           Value: !Sub ${AWS::StackName}-PublicSubnet1
 426 | 
 427 |   PublicRoute:
 428 |     Type: AWS::EC2::Route
 429 |     DependsOn: VpcGatewayAttachment
 430 |     Properties:
 431 |       RouteTableId: !Ref PublicRouteTable
 432 |       DestinationCidrBlock: 0.0.0.0/0
 433 |       GatewayId: !Ref InternetGateway
 434 | 
 435 |   PublicRouteTable:
 436 |     Type: AWS::EC2::RouteTable
 437 |     Properties:
 438 |       VpcId: !Ref Vpc
 439 |       Tags:
 440 |         - Key: Name
 441 |           Value: !Sub ${AWS::StackName}-public-igw
 442 | 
 443 |   PublicSubnetRouteTableAssociation0:
 444 |     Type: AWS::EC2::SubnetRouteTableAssociation
 445 |     Properties:
 446 |       SubnetId: !Ref PublicSubnet0
 447 |       RouteTableId: !Ref PublicRouteTable
 448 | 
 449 |   PublicSubnetRouteTableAssociation1:
 450 |     Type: AWS::EC2::SubnetRouteTableAssociation
 451 |     Properties:
 452 |       SubnetId: !Ref PublicSubnet1
 453 |       RouteTableId: !Ref PublicRouteTable
 454 | 
 455 |   # Queues --------------------------------------------------------------------
 456 | 
 457 |   TaskQueue:
 458 |     Type: AWS::SQS::Queue
 459 |     Properties:
 460 |       QueueName: !Sub task-queue-${AWS::StackName}
 461 |       VisibilityTimeout: !Ref DefaultTaskQueueVisibilityTimeout
 462 |       KmsMasterKeyId: alias/aws/sqs
 463 | 
 464 |   TaskCompletedQueue:
 465 |     Type: AWS::SQS::Queue
 466 |     Properties:
 467 |       QueueName: !Sub task-completed-queue-${AWS::StackName}
 468 |       VisibilityTimeout: !Ref DefaultTaskCompletedQueueVisibilityTimeout
 469 |       KmsMasterKeyId: alias/aws/sqs
 470 | 
 471 |   # Buckets --------------------------------------------------------------------
 472 | 
 473 |   TaskDataBucket:
 474 |     Type: AWS::S3::Bucket
 475 |     Properties:
 476 |       BucketName: !Sub task-data-bucket-${AWS::AccountId}-${AWS::Region}-${AWS::StackName}
 477 |       PublicAccessBlockConfiguration:
 478 |         BlockPublicAcls: true
 479 |         BlockPublicPolicy: true
 480 |         IgnorePublicAcls: true
 481 |         RestrictPublicBuckets: true
 482 |       BucketEncryption:
 483 |         ServerSideEncryptionConfiguration:
 484 |           - ServerSideEncryptionByDefault:
 485 |               SSEAlgorithm: AES256
 486 | 
 487 |   # EKS Cluster ----------------------------------------------------------------
 488 | 
 489 |   EksCluster:
 490 |     Type: AWS::EKS::Cluster
 491 |     Properties:
 492 |       Name: !Ref EksClusterName
 493 |       RoleArn: !GetAtt ClusterRole.Arn
 494 |       ResourcesVpcConfig:
 495 |         SecurityGroupIds:
 496 |           - !Ref ClusterControlPlaneSecurityGroup
 497 |         SubnetIds:
 498 |           - !Ref PublicSubnet0
 499 |           - !Ref PublicSubnet1
 500 |     DependsOn:
 501 |       - PublicSubnetRouteTableAssociation0
 502 |       - PublicSubnetRouteTableAssociation1
 503 | 
 504 |   EksAdminRole:
 505 |     Type: AWS::IAM::Role
 506 |     Properties:
 507 |       Path: /
 508 |       AssumeRolePolicyDocument:
 509 |         Version: 2012-10-17
 510 |         Statement:
 511 |           - Effect: Allow
 512 |             Principal:
 513 |               Service: codebuild.amazonaws.com
 514 |             Action: sts:AssumeRole
 515 |       Policies:
 516 |         - PolicyName: eks-describe
 517 |           PolicyDocument:
 518 |             Version: 2012-10-17
 519 |             Statement:
 520 |               - Resource: '*'
 521 |                 Effect: Allow
 522 |                 Action:
 523 |                   - eks:Describe*
 524 | 
 525 |   # https://github.com/rnzsgh/cfn-eks-custom-resource-aws-auth-configmap
 526 |   AwsAuthConfigMapConfigCustomResourceLambda:
 527 |     Type: AWS::Lambda::Function
 528 |     Properties:
 529 |       Code:
 530 |         S3Bucket: !Sub ${LambdaCustomResourceBucketPrefix}-${AWS::Region}
 531 |         S3Key: cfn-eks-custom-resource-aws-auth-configmap.zip
 532 |       Handler: main
 533 |       Role: !Ref CreateRoleArn
 534 |       Runtime: go1.x
 535 |       Timeout: 300
 536 | 
 537 |   AwsAuthConfigMapConfigCustomResource:
 538 |     Type: Custom::AwsAuthConfigMapConfigCustomResource
 539 |     Properties:
 540 |       ServiceToken: !GetAtt AwsAuthConfigMapConfigCustomResourceLambda.Arn
 541 |       CreateRoleArn: !Ref CreateRoleArn
 542 |       ClusterName: !Ref EksCluster
 543 |       ClusterEndpoint: !GetAtt EksCluster.Endpoint
 544 |       NodeInstanceRoleArn: !GetAtt NodeInstanceRole.Arn
 545 |       AdminUserArn: !Ref AdminUserArn
 546 |       AdminRoleArn: !GetAtt EksAdminRole.Arn
 547 | 
 548 |   ClusterControlPlaneSecurityGroup:
 549 |     Type: AWS::EC2::SecurityGroup
 550 |     Properties:
 551 |       GroupDescription: Security Group for the EKS control plane
 552 |       VpcId: !Ref Vpc
 553 | 
 554 |   ClusterRole:
 555 |     Type: AWS::IAM::Role
 556 |     Properties:
 557 |       AssumeRolePolicyDocument:
 558 |         Version: 2012-10-17
 559 |         Statement:
 560 |           - Effect: Allow
 561 |             Principal:
 562 |               Service:
 563 |                 - eks.amazonaws.com
 564 |             Action: sts:AssumeRole
 565 |       ManagedPolicyArns:
 566 |         - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
 567 |         - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
 568 | 
 569 |   # EKS Nodes ------------------------------------------------------------------
 570 | 
 571 |   NodeInstanceProfile:
 572 |     Type: AWS::IAM::InstanceProfile
 573 |     Properties:
 574 |       Path: /
 575 |       Roles:
 576 |         - !Ref NodeInstanceRole
 577 | 
 578 |   NodeInstanceRole:
 579 |     Type: AWS::IAM::Role
 580 |     Properties:
 581 |       AssumeRolePolicyDocument:
 582 |         Version: 2012-10-17
 583 |         Statement:
 584 |         - Effect: Allow
 585 |           Principal:
 586 |             Service:
 587 |               - ec2.amazonaws.com
 588 |           Action:
 589 |             - sts:AssumeRole
 590 |       Path: /
 591 |       Policies:
 592 |         - PolicyName: ei-access
 593 |           PolicyDocument:
 594 |             Version: 2012-10-17
 595 |             Statement:
 596 |               - Effect: Allow
 597 |                 Action:
 598 |                   - elastic-inference:Connect
 599 |                   - iam:List*
 600 |                   - iam:Get*
 601 |                   - ec2:Describe*
 602 |                   - ec2:Get*
 603 |                   - ec2:ModifyInstanceAttribute
 604 |                 Resource: '*'
 605 |         - PolicyName: sqs-access
 606 |           PolicyDocument:
 607 |             Version: 2012-10-17
 608 |             Statement:
 609 |               - Effect: Allow
 610 |                 Action:
 611 |                   - sqs:SendMessage
 612 |                   - sqs:ReceiveMessage
 613 |                   - sqs:DeleteMessage
 614 |                   - sqs:ChangeMessageVisibility
 615 |                   - sqs:GetQueueUrl
 616 |                   - sqs:GetQueueAttributes
 617 |                 Resource:
 618 |                   - !GetAtt TaskQueue.Arn
 619 |                   - !GetAtt TaskCompletedQueue.Arn
 620 |         - PolicyName: s3-access
 621 |           PolicyDocument:
 622 |             Version: 2012-10-17
 623 |             Statement:
 624 |               - Effect: Allow
 625 |                 Action:
 626 |                   - s3:Get*
 627 |                   - s3:List*
 628 |                   - s3:PutObjectAcl
 629 |                   - s3:PutObject
 630 |                   - s3:PutObjectTagging
 631 |                   - s3:PutObjectVersionAcl
 632 |                   - s3:PutObjectVersionTagging
 633 |                   - s3:PutObjectRetention
 634 |                   - s3:PutObjectLegalHold
 635 |                   - s3:PutBucketObjectLockConfiguration
 636 |                   - s3:AbortMultipartUpload
 637 |                   - s3:DeleteObject
 638 |                   - s3:DeleteObjectTagging
 639 |                   - s3:DeleteObjectVersion
 640 |                   - s3:DeleteObjectVersionTagging
 641 |                 Resource:
 642 |                   - !Sub arn:aws:s3:::${TaskDataBucket}/*
 643 |                   - !Sub arn:aws:s3:::${TaskDataBucket}
 644 |       ManagedPolicyArns:
 645 |         - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
 646 |         - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
 647 |         - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
 648 |         - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
 649 | 
 650 |   NodeSecurityGroup:
 651 |     Type: AWS::EC2::SecurityGroup
 652 |     Properties:
 653 |       GroupDescription: Security group for all nodes in the cluster
 654 |       VpcId: !Ref Vpc
 655 |       Tags:
 656 |         - Key: !Sub kubernetes.io/cluster/${EksCluster}
 657 |           Value: owned
 658 |     DependsOn: EksCluster
 659 | 
 660 |   NodeSecurityGroupIngress:
 661 |     Type: AWS::EC2::SecurityGroupIngress
 662 |     Properties:
 663 |       Description: Allow nodes to communicate with each other
 664 |       GroupId: !Ref NodeSecurityGroup
 665 |       SourceSecurityGroupId: !Ref NodeSecurityGroup
 666 |       IpProtocol: tcp
 667 |       FromPort: 0
 668 |       ToPort: 65535
 669 |     DependsOn: NodeSecurityGroup
 670 | 
 671 |   NodeSecurityGroupFromControlPlaneIngress:
 672 |     Type: AWS::EC2::SecurityGroupIngress
 673 |     Properties:
 674 |       Description: Allow worker Kubelets and pods to receive communication from the cluster control plane
 675 |       GroupId: !Ref NodeSecurityGroup
 676 |       SourceSecurityGroupId: !Ref ClusterControlPlaneSecurityGroup
 677 |       IpProtocol: tcp
 678 |       FromPort: 1025
 679 |       ToPort: 65535
 680 |     DependsOn: NodeSecurityGroup
 681 | 
 682 |   ControlPlaneEgressToNodeSecurityGroup:
 683 |     Type: AWS::EC2::SecurityGroupEgress
 684 |     DependsOn: NodeSecurityGroup
 685 |     Properties:
 686 |       Description: Allow the cluster control plane to communicate with worker Kubelet and pods
 687 |       GroupId: !Ref ClusterControlPlaneSecurityGroup
 688 |       DestinationSecurityGroupId: !Ref NodeSecurityGroup
 689 |       IpProtocol: tcp
 690 |       FromPort: 1025
 691 |       ToPort: 65535
 692 | 
 693 |   NodeSecurityGroupFromControlPlaneOn443Ingress:
 694 |     Type: AWS::EC2::SecurityGroupIngress
 695 |     DependsOn: NodeSecurityGroup
 696 |     Properties:
 697 |       Description: Allow pods running extension API servers on port 443 to receive communication from cluster control plane
 698 |       GroupId: !Ref NodeSecurityGroup
 699 |       SourceSecurityGroupId: !Ref ClusterControlPlaneSecurityGroup
 700 |       IpProtocol: tcp
 701 |       FromPort: 443
 702 |       ToPort: 443
 703 | 
 704 |   ControlPlaneEgressToNodeSecurityGroupOn443:
 705 |     Type: AWS::EC2::SecurityGroupEgress
 706 |     DependsOn: NodeSecurityGroup
 707 |     Properties:
 708 |       Description: Allow the cluster control plane to communicate with pods running extension API servers on port 443
 709 |       GroupId: !Ref ClusterControlPlaneSecurityGroup
 710 |       DestinationSecurityGroupId: !Ref NodeSecurityGroup
 711 |       IpProtocol: tcp
 712 |       FromPort: 443
 713 |       ToPort: 443
 714 | 
 715 |   ClusterControlPlaneSecurityGroupIngress:
 716 |     Type: AWS::EC2::SecurityGroupIngress
 717 |     Properties:
 718 |       Description: Allow pods to communicate with the cluster API Server
 719 |       GroupId: !Ref ClusterControlPlaneSecurityGroup
 720 |       SourceSecurityGroupId: !Ref NodeSecurityGroup
 721 |       IpProtocol: tcp
 722 |       ToPort: 443
 723 |       FromPort: 443
 724 |     DependsOn: NodeSecurityGroup
 725 | 
 726 |   InferenceNodeGroup:
 727 |     Type: AWS::AutoScaling::AutoScalingGroup
 728 |     Properties:
 729 |       DesiredCapacity: !Ref InferenceNodeAutoScalingGroupDesiredCapacity
 730 |       MinSize: !Ref InferenceNodeAutoScalingGroupMinSize
 731 |       MaxSize: !Ref InferenceNodeAutoScalingGroupMaxSize
 732 |       LaunchTemplate:
 733 |         LaunchTemplateId: !Ref InferenceNodeLaunchTemplate
 734 |         Version: !GetAtt InferenceNodeLaunchTemplate.LatestVersionNumber
 735 |       VPCZoneIdentifier:
 736 |         - !Ref PublicSubnet0
 737 |         - !Ref PublicSubnet1
 738 |       Tags:
 739 |         - Key: Name
 740 |           Value: !Sub ${EksCluster}-${InferenceNodeGroupName}-Node
 741 |           PropagateAtLaunch: true
 742 |         - Key: !Sub kubernetes.io/cluster/${EksCluster}
 743 |           Value: owned
 744 |           PropagateAtLaunch: true
 745 |       TerminationPolicies:
 746 |         - OldestInstance
 747 |         - OldestLaunchTemplate
 748 |     UpdatePolicy:
 749 |       AutoScalingRollingUpdate:
 750 |         MaxBatchSize: 1
 751 |         MinInstancesInService: !Ref InferenceNodeAutoScalingGroupDesiredCapacity
 752 |         PauseTime: PT5M
 753 |     DependsOn:
 754 |       - AwsAuthConfigMapConfigCustomResource
 755 | 
 756 |   NodeGroup:
 757 |     Type: AWS::AutoScaling::AutoScalingGroup
 758 |     Properties:
 759 |       DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
 760 |       LaunchTemplate:
 761 |         LaunchTemplateId: !Ref NodeLaunchTemplate
 762 |         Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
 763 |       MinSize: !Ref NodeAutoScalingGroupMinSize
 764 |       MaxSize: !Ref NodeAutoScalingGroupMaxSize
 765 |       VPCZoneIdentifier:
 766 |         - !Ref PublicSubnet0
 767 |         - !Ref PublicSubnet1
 768 |       Tags:
 769 |         - Key: Name
 770 |           Value: !Sub ${EksCluster}-${NodeGroupName}-Node
 771 |           PropagateAtLaunch: true
 772 |         - Key: !Sub kubernetes.io/cluster/${EksCluster}
 773 |           Value: owned
 774 |           PropagateAtLaunch: true
 775 |       TerminationPolicies:
 776 |         - OldestInstance
 777 |         - OldestLaunchTemplate
 778 |     UpdatePolicy:
 779 |       AutoScalingRollingUpdate:
 780 |         MaxBatchSize: 1
 781 |         MinInstancesInService: !Ref NodeAutoScalingGroupDesiredCapacity
 782 |         PauseTime: PT5M
 783 |     DependsOn:
 784 |       - AwsAuthConfigMapConfigCustomResource
 785 | 
 786 |   InferenceNodeLaunchTemplate:
 787 |     Type: AWS::EC2::LaunchTemplate
 788 |     Properties:
 789 |       LaunchTemplateData:
 790 |         IamInstanceProfile:
 791 |           Name: !Ref NodeInstanceProfile
 792 |         ImageId: !FindInMap [ AMIMap, !Ref "AWS::Region", AMI ]
 793 |         InstanceType: !Ref InferenceNodeInstanceType
 794 |         KeyName: !Ref KeyName
 795 |         ElasticInferenceAccelerators:
 796 |           - Type: !Ref ElasticInferenceType
 797 |         Monitoring:
 798 |           Enabled: true
 799 |         BlockDeviceMappings:
 800 |           - DeviceName: /dev/xvda
 801 |             Ebs:
 802 |               VolumeSize: !Ref InferenceNodeVolumeSize
 803 |               VolumeType: gp2
 804 |               DeleteOnTermination: true
 805 |         UserData:
 806 |           Fn::Base64:
 807 |             !Sub |
 808 |               #!/bin/bash
 809 |               set -o xtrace
 810 |               yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
 811 |               mv /etc/amazon/ssm/seelog.xml.template /etc/amazon/ssm/seelog.xml
 812 |               systemctl restart amazon-ssm-agent
 813 |               /etc/eks/bootstrap.sh ${EksCluster} ${InferenceBootstrapArguments}
 814 |               /opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} --resource InferenceNodeGroup --region ${AWS::Region}
 815 |         NetworkInterfaces:
 816 |           - DeviceIndex: 0
 817 |             AssociatePublicIpAddress: true
 818 |             DeleteOnTermination: true
 819 |             Groups:
 820 |               - !Ref NodeSecurityGroup
 821 | 
 822 |   NodeLaunchTemplate:
 823 |     Type: AWS::EC2::LaunchTemplate
 824 |     Properties:
 825 |       LaunchTemplateData:
 826 |         IamInstanceProfile:
 827 |           Name: !Ref NodeInstanceProfile
 828 |         ImageId: !FindInMap [ AMIMap, !Ref "AWS::Region", AMI ]
 829 |         InstanceType: !Ref NodeInstanceType
 830 |         KeyName: !Ref KeyName
 831 |         Monitoring:
 832 |           Enabled: true
 833 |         BlockDeviceMappings:
 834 |           - DeviceName: /dev/xvda
 835 |             Ebs:
 836 |               VolumeSize: !Ref NodeVolumeSize
 837 |               VolumeType: gp2
 838 |               DeleteOnTermination: true
 839 |         UserData:
 840 |           Fn::Base64:
 841 |             !Sub |
 842 |               #!/bin/bash
 843 |               set -o xtrace
 844 |               yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
 845 |               mv /etc/amazon/ssm/seelog.xml.template /etc/amazon/ssm/seelog.xml
 846 |               systemctl restart amazon-ssm-agent
 847 |               /etc/eks/bootstrap.sh ${EksCluster} ${BootstrapArguments}
 848 |               /opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} --resource NodeGroup --region ${AWS::Region}
 849 |         NetworkInterfaces:
 850 |           - DeviceIndex: 0
 851 |             AssociatePublicIpAddress: true
 852 |             DeleteOnTermination: true
 853 |             Groups:
 854 |               - !Ref NodeSecurityGroup
 855 | 
 856 |   TaskQueueDepthLambdaEventRule:
 857 |     Type: AWS::Events::Rule
 858 |     Properties:
 859 |       State: ENABLED
 860 |       ScheduleExpression: 'cron(* * * * ? *)'
 861 |       Targets:
 862 |         - Arn: !GetAtt TaskQueueDepthLambda.Arn
 863 |           Id: !Ref TaskQueueDepthLambda
 864 | 
 865 |   TaskQueueDepthLambdaPermission:
 866 |     Type: AWS::Lambda::Permission
 867 |     Properties:
 868 |       Action: lambda:InvokeFunction
 869 |       Principal: events.amazonaws.com
 870 |       FunctionName: !Ref TaskQueueDepthLambda
 871 |       SourceArn: !GetAtt TaskQueueDepthLambdaEventRule.Arn
 872 | 
 873 |   # https://github.com/rnzsgh/lambda-update-sqs-queue-depth
 874 |   TaskQueueDepthLambda:
 875 |     Type: AWS::Lambda::Function
 876 |     Properties:
 877 |       Environment:
 878 |         Variables:
 879 |           QueueUrl: !Ref TaskQueue
 880 |           CloudWatchMetricNamespace: eks-ei/sqs
 881 |           CloudWatchMetricName: TaskQueueApproximateNumberOfMessages
 882 |       Code:
 883 |         S3Bucket: !Sub ${LambdaCustomResourceBucketPrefix}-${AWS::Region}
 884 |         S3Key: lambda-update-sqs-queue-depth.zip
 885 |       Handler: main
 886 |       Role: !GetAtt TaskQueueDepthLambdaExecutionRole.Arn
 887 |       Runtime: go1.x
 888 |       Timeout: 300
 889 | 
 890 |   TaskQueueDepthLambdaExecutionRole:
 891 |     Type: AWS::IAM::Role
 892 |     Properties:
 893 |       ManagedPolicyArns:
 894 |         - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
 895 |       Path: /
 896 |       AssumeRolePolicyDocument:
 897 |         Version: 2012-10-17
 898 |         Statement:
 899 |           - Effect: Allow
 900 |             Action: sts:AssumeRole
 901 |             Principal:
 902 |               Service:
 903 |                 - lambda.amazonaws.com
 904 |       Policies:
 905 |         - PolicyName: sqs-attr
 906 |           PolicyDocument:
 907 |             Version: 2012-10-17
 908 |             Statement:
 909 |               - Effect: Allow
 910 |                 Action:
 911 |                   - sqs:GetQueueAttributes
 912 |                 Resource:
 913 |                   - !GetAtt TaskQueue.Arn
 914 |         - PolicyName: cloudwatch-put
 915 |           PolicyDocument:
 916 |             Version: 2012-10-17
 917 |             Statement:
 918 |               - Effect: Allow
 919 |                 Action:
 920 |                   - cloudwatch:PutMetricData
 921 |                 Resource: '*'
 922 | 
 923 |   InferenceScaleOutPolicy:
 924 |     Type: AWS::AutoScaling::ScalingPolicy
 925 |     Properties:
 926 |       AdjustmentType: ChangeInCapacity
 927 |       AutoScalingGroupName: !Ref InferenceNodeGroup
 928 |       ScalingAdjustment: 2
 929 |       Cooldown: 60
 930 | 
 931 |   InferenceScaleInPolicy:
 932 |     Type: AWS::AutoScaling::ScalingPolicy
 933 |     Properties:
 934 |       AdjustmentType: ChangeInCapacity
 935 |       AutoScalingGroupName: !Ref InferenceNodeGroup
 936 |       ScalingAdjustment: -1
 937 |       Cooldown: 60
 938 | 
 939 |   InferenceScaleOutAlarm:
 940 |     Type: AWS::CloudWatch::Alarm
 941 |     Properties:
 942 |       EvaluationPeriods: !Ref InferenceScaleEvaluationPeriods
 943 |       Statistic: Average
 944 |       TreatMissingData: notBreaching
 945 |       Threshold: !Ref InferenceQueueDepthScaleOutThreshold
 946 |       AlarmDescription: Alarm to add capacity if queue depth is high
 947 |       Period: 60
 948 |       AlarmActions:
 949 |         - !Ref InferenceScaleOutPolicy
 950 |       Namespace: eks-ei/sqs
 951 |       MetricName: TaskQueueApproximateNumberOfMessages
 952 |       ComparisonOperator: GreaterThanThreshold
 953 | 
 954 |   InferenceScaleInAlarm:
 955 |     Type: AWS::CloudWatch::Alarm
 956 |     Properties:
 957 |       EvaluationPeriods: !Ref InferenceScaleEvaluationPeriods
 958 |       Statistic: Average
 959 |       TreatMissingData: notBreaching
 960 |       Threshold: !Ref InferenceQueueDepthScaleInThreshold
 961 |       AlarmDescription: Alarm to reduce capacity if queue depth is low
 962 |       Period: 300
 963 |       AlarmActions:
 964 |         - !Ref InferenceScaleInPolicy
 965 |       Namespace: eks-ei/sqs
 966 |       MetricName: TaskQueueApproximateNumberOfMessages
 967 |       ComparisonOperator: LessThanThreshold
 968 | 
 969 | 
 970 | Outputs:
 971 | 
 972 |   Name:
 973 |     Value: !Ref AWS::StackName
 974 |     Export:
 975 |       Name: !Sub ${AWS::StackName}-Name
 976 | 
 977 |   VpcId:
 978 |     Value: !Ref Vpc
 979 |     Export:
 980 |       Name: !Sub ${AWS::StackName}-VpcId
 981 | 
 982 |   VpcCidr:
 983 |     Description: VPC CIDR block
 984 |     Value: !FindInMap [ CIDRMap, VPC, CIDR ]
 985 |     Export:
 986 |       Name: !Sub ${AWS::StackName}-vpc-cidr
 987 | 
 988 |   PublicSubnet0:
 989 |     Value: !Ref PublicSubnet0
 990 |     Export:
 991 |       Name: !Sub ${AWS::StackName}-PublicSubnet0Id
 992 | 
 993 |   PublicSubnet1:
 994 |     Value: !Ref PublicSubnet1
 995 |     Export:
 996 |       Name: !Sub ${AWS::StackName}-PublicSubnet1Id
 997 | 
 998 |   NodeInstanceRoleArn:
 999 |     Value: !GetAtt NodeInstanceRole.Arn
1000 |     Export:
1001 |       Name: !Sub ${AWS::StackName}-NodeInstanceRoleArn
1002 | 
1003 |   NodeInstanceRoleId:
1004 |     Value: !GetAtt NodeInstanceRole.RoleId
1005 |     Export:
1006 |       Name: !Sub ${AWS::StackName}-NodeInstanceRoleId
1007 | 
1008 |   EksClusterName:
1009 |     Value: !Ref EksCluster
1010 |     Export:
1011 |       Name: !Sub ${AWS::StackName}-EksClusterName
1012 | 
1013 |   EksClusterArn:
1014 |     Value: !GetAtt EksCluster.Arn
1015 |     Export:
1016 |       Name: !Sub ${AWS::StackName}-EksClusterArn
1017 | 
1018 |   EksClusterEndpoint:
1019 |     Value: !GetAtt EksCluster.Endpoint
1020 |     Export:
1021 |       Name: !Sub ${AWS::StackName}-EksClusterEndpoint
1022 | 
1023 |   ClusterControlPlaneSecurityGroup:
1024 |     Value: !Ref ClusterControlPlaneSecurityGroup
1025 |     Export:
1026 |       Name: !Sub ${AWS::StackName}-ClusterControlPlaneSecurityGroup
1027 | 
1028 |   TaskQueueUrl:
1029 |     Description: Task queue url
1030 |     Value: !Ref TaskQueue
1031 |     Export:
1032 |       Name: !Sub ${AWS::StackName}-TaskQueueUrl
1033 | 
1034 |   TaskQueueArn:
1035 |     Description: Task queue arn
1036 |     Value: !GetAtt TaskQueue.Arn
1037 |     Export:
1038 |       Name: !Sub ${AWS::StackName}-TaskQueueArn
1039 | 
1040 |   TaskQueueName:
1041 |     Description: Task queue name
1042 |     Value: !GetAtt TaskQueue.QueueName
1043 |     Export:
1044 |       Name: !Sub ${AWS::StackName}-TaskQueueName
1045 | 
1046 |   TaskCompletedQueueUrl:
1047 |     Description: TaskCompleted queue url
1048 |     Value: !Ref TaskCompletedQueue
1049 |     Export:
1050 |       Name: !Sub ${AWS::StackName}-TaskCompletedQueueUrl
1051 | 
1052 |   TaskCompletedQueueArn:
1053 |     Description: TaskCompleted queue arn
1054 |     Value: !GetAtt TaskCompletedQueue.Arn
1055 |     Export:
1056 |       Name: !Sub ${AWS::StackName}-TaskCompletedQueueArn
1057 | 
1058 |   TaskCompletedQueueName:
1059 |     Description: TaskCompleted queue name
1060 |     Value: !GetAtt TaskCompletedQueue.QueueName
1061 |     Export:
1062 |       Name: !Sub ${AWS::StackName}-TaskCompletedQueueName
1063 | 
1064 | 
1065 | 
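Note on the two alarms in stack.cfn.yml: both watch the same `TaskQueueApproximateNumberOfMessages` metric, with `Statistic: Average` and `TreatMissingData: notBreaching`. The sketch below is a simplified model of how such an alarm might decide to fire, not CloudWatch's actual evaluation logic. It assumes that every one of the last `EvaluationPeriods` datapoints must breach the threshold and that a missing datapoint (`None`) counts as not breaching. The function name and the threshold values are hypothetical; the real thresholds come from the stack parameters.

```python
from typing import Optional, Sequence


def alarm_breached(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    comparison: str,
) -> bool:
    """Simplified model: True when each of the most recent
    `evaluation_periods` datapoints breaches the threshold."""
    if evaluation_periods < 1:
        raise ValueError('evaluation_periods must be at least 1')
    if len(datapoints) < evaluation_periods:
        return False  # not enough history yet

    def breaches(value: Optional[float]) -> bool:
        if value is None:
            # Assumption mirroring TreatMissingData: notBreaching.
            return False
        if comparison == 'GreaterThanThreshold':
            return value > threshold
        if comparison == 'LessThanThreshold':
            return value < threshold
        raise ValueError(f'unsupported comparison: {comparison}')

    recent = datapoints[-evaluation_periods:]
    return all(breaches(v) for v in recent)


# Hypothetical thresholds: scale out above 100 messages, scale in below 2.
print(alarm_breached([120, 150, 130], 100, 3, 'GreaterThanThreshold'))   # True
print(alarm_breached([120, None, 130], 100, 3, 'GreaterThanThreshold'))  # False
print(alarm_breached([0, 1, 0], 2, 3, 'LessThanThreshold'))              # True
```

Under this model, a gap in the metric delays both scale-out and scale-in. The template also uses a longer `Period` for scale-in (300 seconds against 60 for scale-out), so capacity is removed more slowly than it is added.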


--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
  1 | import boto3
  2 | import os
  3 | import sys
  4 | import cv2
  5 | import numpy
  6 | import requests
  7 | import json
  8 | import logging
  9 | import threading
 10 | import queue
 11 | 
 12 | import coco_label_map
 13 | 
 14 | ENDPOINT = 'http://localhost:8501/v1/models/default:predict'
 15 | TMP_FILE = "./tmp.mov"
 16 | 
 17 | FRAME_BATCH = 5
 18 | 
 19 | FRAME_MAX = 20
 20 | 
 21 | logging.basicConfig(
 22 |     level=logging.INFO,
 23 |     format='%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s',
 24 |     handlers=[ logging.StreamHandler(sys.stdout) ],
 25 | )
 26 | 
 27 | log = logging.getLogger()
 28 | 
 29 | def get_predictions_from_image_array(batch):
 30 |     res = requests.post(ENDPOINT, json={ 'instances': batch })
 31 |     return res.json()['predictions']
 32 | 
 33 | def get_classes_with_scores(predictions):
 34 |     vals = []
 35 |     for p in predictions:
 36 |         num_detections = int(p['num_detections'])
 37 |         detected_classes = p['detection_classes'][:num_detections]
 38 |         detected_classes = [coco_label_map.label_map[int(x)] for x in detected_classes]
 39 |         detection_scores = p['detection_scores'][:num_detections]
 40 |         vals.append(list(zip(detected_classes, detection_scores)))
 41 | 
 42 |     return vals
 43 | 
 44 | def prepare(prepare_queue, inference_queue):
 45 |     while True:
 46 |         inference_queue.put(prepare_queue.get().tolist())
 47 | 
 48 | def add_to_prepare(prepare_queue, frames):
 49 |     for f in frames:
 50 |         prepare_queue.put(f)
 51 |     frames.clear()
 52 | 
 53 | def process_video_from_file(file_path, prepare_queue, inference_queue):
 54 | 
 55 |     log.info('process_video_from_file')
 56 | 
 57 |     frames = []
 58 |     vidcap = cv2.VideoCapture(file_path)
 59 |     # Keep the real result of the first read so an unreadable file yields no frames.
 60 |     success, frame = vidcap.read()
 61 | 
 62 |     log.info('start frame extraction')
 63 | 
 64 |     max_frame = 0
 65 |     while success:
 66 |         frames.append(frame)
 67 |         success, frame = vidcap.read()
 68 |         max_frame += 1
 69 |         if max_frame == FRAME_MAX:
 70 |             break
 71 | 
 72 |     log.info('end frame extraction')
 73 | 
 74 |     count = len(frames)
 75 | 
 76 |     add_worker = threading.Thread(target=add_to_prepare, args=(prepare_queue, frames,))
 77 |     add_worker.start()
 78 | 
 79 |     log.info('frame count: %d', count)
 80 |     batch = []
 81 |     predictions = []
 82 | 
 83 |     log.info('frame batch %d', FRAME_BATCH)
 84 | 
 85 |     for i in range(count):
 86 |         batch.append(inference_queue.get())
 87 | 
 88 |         if len(batch) == FRAME_BATCH or i == (count - 1):
 89 |             log.info('range: %d - batch: %d', i, len(batch))
 90 |             for v in get_classes_with_scores(get_predictions_from_image_array(batch)):
 91 |                 predictions.append(str(v))
 92 |                 predictions.append('\n')
 93 |             batch.clear()
 94 | 
 95 |     vidcap.release()
 96 |     #cv2.destroyAllWindows()
 97 | 
 98 |     return predictions
 99 | 
100 | def main():
101 | 
102 |     task_queue_name = None
103 |     task_completed_queue_name = None
104 | 
105 |     try:
106 |         task_queue_name = os.environ['SQS_TASK_QUEUE']
107 |         task_completed_queue_name = os.environ['SQS_TASK_COMPLETED_QUEUE']
108 |     except KeyError:
109 |         log.error('Please set the environment variables for SQS_TASK_QUEUE and SQS_TASK_COMPLETED_QUEUE')
110 |         sys.exit(1)
111 | 
112 |     # Get the instance information
113 |     r = requests.get("http://169.254.169.254/latest/dynamic/instance-identity/document", timeout=5)
114 |     r.raise_for_status()
115 |     response_json = r.json()
116 |     region = response_json.get('region')
117 |     instance_id = response_json.get('instanceId')
118 | 
119 |     ec2 = boto3.client('ec2', region_name=region)
120 |     s3 = boto3.client('s3', region_name=region)
121 | 
122 |     task_queue = boto3.resource('sqs', region_name=region).get_queue_by_name(QueueName=task_queue_name)
123 |     task_completed_queue = boto3.resource('sqs', region_name=region).get_queue_by_name(QueueName=task_completed_queue_name)
124 | 
125 |     log.info('Initialized - instance: %s', instance_id)
126 | 
127 |     prepare_queue = queue.Queue()
128 |     inference_queue = queue.Queue(maxsize=FRAME_BATCH)
129 | 
130 |     prepare_worker = threading.Thread(target=prepare, args=(prepare_queue, inference_queue,))
131 |     prepare_worker.start()
132 | 
133 |     while True:
134 |         for message in task_queue.receive_messages(WaitTimeSeconds=10):
135 |             try:
136 |                 log.info('Message received - instance: %s', instance_id)
137 | 
138 |                 ec2.modify_instance_attribute(
139 |                     InstanceId=instance_id,
140 |                     DisableApiTermination={ 'Value': True },
141 |                 )
142 |                 log.info('Termination protection engaged - instance: %s', instance_id)
143 | 
144 |                 message.change_visibility(VisibilityTimeout=600)
145 |                 log.info('Message visibility updated - instance: %s', instance_id)
146 | 
147 |                 # Process the message
148 |                 doc = json.loads(message.body)
149 |                 log.info('Message body is loaded - instance: %s', instance_id)
150 | 
151 |                 s3.download_file(doc['bucket'], doc['object'], TMP_FILE)
152 |                 log.info('File is downloaded - instance: %s', instance_id)
153 | 
154 |                 log.info('Starting predictions - instance: %s', instance_id)
155 |                 predictions_for_frames = process_video_from_file(TMP_FILE, prepare_queue, inference_queue)
156 |                 log.info('Predictions completed - instance: %s', instance_id)
157 | 
158 |                 log.info(''.join(predictions_for_frames))
159 | 
160 |                 task_completed_queue.send_message(MessageBody=''.join(predictions_for_frames))
161 |                 log.info('Task completed msg sent - instance: %s', instance_id)
162 |                 message.delete()
163 |                 log.info('Message deleted - instance: %s', instance_id)
164 | 
165 |                 ec2.modify_instance_attribute(
166 |                     InstanceId=instance_id,
167 |                     DisableApiTermination={ 'Value': False },
168 |                 )
169 |                 log.info('Termination protection disengaged - instance: %s', instance_id)
170 | 
171 |                 if os.path.exists(TMP_FILE):
172 |                     os.remove(TMP_FILE)
173 | 
174 |             except Exception:
175 |                 log.exception('Problem processing message - instance: %s', instance_id)
176 | 
177 | if __name__ == '__main__':
178 |     main()
179 | 
180 | 
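Note on test.py: the script sends frames to the model endpoint in groups of `FRAME_BATCH`, then trims each prediction to its first `num_detections` entries and maps class ids to labels. The sketch below reproduces those two steps without a video file, queue, or HTTP call so they can be checked in isolation. `LABEL_MAP` is a hypothetical three-entry stand-in for `coco_label_map.label_map`, the `chunk` helper is not part of the repository, and the sample prediction is made up to match the fields the script reads.

```python
# Hypothetical stand-in for coco_label_map.label_map (a few entries only).
LABEL_MAP = {1: 'person', 2: 'bicycle', 3: 'car'}


def chunk(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classes_with_scores(prediction, label_map):
    """Pair each detected class label with its score, keeping only
    the first `num_detections` entries of the padded output lists."""
    n = int(prediction['num_detections'])
    classes = [label_map[int(c)] for c in prediction['detection_classes'][:n]]
    scores = prediction['detection_scores'][:n]
    return list(zip(classes, scores))


# Made-up prediction shaped like the fields test.py reads from the response.
prediction = {
    'num_detections': 2.0,
    'detection_classes': [1.0, 3.0, 1.0],
    'detection_scores': [0.98, 0.75, 0.01],
}

print(classes_with_scores(prediction, LABEL_MAP))
# [('person', 0.98), ('car', 0.75)]

# 12 frames with a batch size of 5 give two full batches and one partial.
print([len(b) for b in chunk(list(range(12)), 5)])
# [5, 5, 2]
```

The last, partial batch corresponds to the `i == (count - 1)` condition in `process_video_from_file`, which flushes whatever frames remain.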


--------------------------------------------------------------------------------