├── .github └── PULL_REQUEST_TEMPLATE.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Dockerfile ├── Dockerfile_tf_serving ├── LICENSE ├── Makefile ├── README.md ├── coco_label_map.py ├── deploy.sh ├── images └── overview.png ├── k8s-daemonset.yml ├── requirements.txt ├── stack.cfn.yml └── test.py /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | custom.mk 3 | *.pyc 4 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 
5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-elastic-inference-eks/issues), or [recently closed](https://github.com/aws-samples/amazon-elastic-inference-eks/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. 
Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-elastic-inference-eks/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6 2 | 3 | WORKDIR /usr/src/app 4 | 5 | COPY requirements.txt ./ 6 | 7 | RUN pip3 install --no-cache-dir -r requirements.txt 8 | 9 | COPY . . 10 | 11 | CMD [ "python", "-u", "./test.py" ] 12 | -------------------------------------------------------------------------------- /Dockerfile_tf_serving: -------------------------------------------------------------------------------- 1 | FROM amazonlinux 2 | 3 | # install missing packages 4 | RUN yum install -y wget && yum install -y tar && yum install -y git 5 | 6 | # install binary for TensorFlow Serving modified for Elastic Inference 7 | RUN wget -q -O - https://s3.amazonaws.com/amazonei-tensorflow/tensorflow-serving/v1.12/amazonlinux/latest/tensorflow-serving-1-12-0-amazonlinux-ei-1-1.tar.gz | tar -xvz 8 | 9 | RUN chmod +x /tensorflow-serving-1-12-0-amazonlinux-ei-1-1/amazonei_tensorflow_model_server 10 | 11 | # install object detection model 12 | WORKDIR /models 13 | RUN wget -nv -O model.tar.gz \ 14 | http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz 15 | RUN tar -xvf model.tar.gz 16 | RUN mkdir -p object-detect/1 17 | RUN find -name saved_model -exec mv {}/saved_model.pb {}/variables object-detect/1/ \; 18 | 19 | WORKDIR / 20 | 21 | CMD ["./tensorflow-serving-1-12-0-amazonlinux-ei-1-1/amazonei_tensorflow_model_server", \ 22 | "--rest_api_port=8501", \ 23 | "--model_base_path=/models/object-detect"] 24 | 25 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | 2 | # This file was heavily influenced by the AWS EKS Reference Architecture 3 | # https://github.com/aws-samples/amazon-eks-refarch-cloudformation 4 | 5 | CUSTOM_FILE ?= custom.mk 6 | ifneq ("$(wildcard $(CUSTOM_FILE))","") 7 | include $(CUSTOM_FILE) 8 | endif 9 | 10 | ROOT ?= $(shell pwd) 11 | AWS_ACCOUNT_ID := $(shell aws sts get-caller-identity --query 'Account' --output text) 12 | CLUSTER_STACK_NAME ?= eks-ei-blog 13 | CLUSTER_NAME ?= $(CLUSTER_STACK_NAME) 14 | EKS_ADMIN_ROLE ?= arn:aws:iam::$(AWS_ACCOUNT_ID):role/EksEiBlogPostRole 15 | REGION ?= 'us-east-1' 16 | AZ_0 ?= 'us-east-1a' 17 | AZ_1 ?= 'us-east-1b' 18 | SSH_KEY_NAME ?= '' 19 | USER_ARN ?= $(shell aws sts get-caller-identity --output text --query 'Arn') 20 | EI_TYPE ?= 'eia1.medium' 21 | NODE_INSTANCE_TYPE ?= 'm5.large' 22 | INFERENCE_NODE_INSTANCE_TYPE ?= 'c5.large' 23 | NODE_ASG_MIN ?= 1 24 | NODE_ASG_MAX ?= 5 25 | 
NODE_ASG_DESIRED ?= 2 26 | INFERENCE_NODE_ASG_MAX ?= 6 27 | INFERENCE_NODE_ASG_MIN ?= 2 28 | INFERENCE_NODE_ASG_DESIRED ?= 2 29 | NODE_VOLUME_SIZE ?= 100 30 | INFERENCE_NODE_VOLUME_SIZE ?= 100 31 | LAMBDA_CR_BUCKET_PREFIX ?= 'pub-cfn-cust-res-pocs' 32 | DEFAULT_SQS_TASK_VISIBILITY ?= 7200 33 | DEFAULT_SQS_TASK_COMPLETED_VISIBILITY ?= 500 34 | INFERENCE_SCALE_PERIODS ?= 1 35 | INFERENCE_SCALE_OUT_THRESHOLD ?= 2 36 | INFERENCE_SCALE_IN_THRESHOLD ?= 2 37 | INFERENCE_NODE_GROUP_NAME ?= 'inference' 38 | NODE_GROUP_NAME ?= 'standard' 39 | INFERENCE_BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference 40 | BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=false,nodegroup=standard 41 | 42 | ROLE_TRUST ?= '{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "cloudformation.amazonaws.com" }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Service": "lambda.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }' 43 | 44 | .PHONY: create-role 45 | create-role: 46 | @aws iam create-role --role-name EksEiBlogPostRole --assume-role-policy-document $(ROLE_TRUST) --output text --query 'Role.Arn' 47 | @aws iam attach-role-policy --role-name EksEiBlogPostRole --policy-arn arn:aws:iam::aws:policy/AdministratorAccess 48 | 49 | .PHONY: update-kubeconfig 50 | update-kubeconfig: 51 | @aws eks update-kubeconfig --region $(REGION) --name $(CLUSTER_NAME) 52 | 53 | .PHONY: deploy-daemonset 54 | deploy-daemonset: 55 | @kubectl apply -f k8s-daemonset.yml 56 | 57 | .PHONY: create-cluster 58 | create-cluster: 59 | @aws --region $(REGION) cloudformation create-stack \ 60 | --template-body file://stack.cfn.yml \ 61 | --stack-name $(CLUSTER_STACK_NAME) \ 62 | --role-arn $(EKS_ADMIN_ROLE) \ 63 | --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \ 64 | --parameters \ 65 | ParameterKey=EksClusterName,ParameterValue="$(CLUSTER_NAME)" \ 66 | 
ParameterKey=AvailabilityZone0,ParameterValue="$(AZ_0)" \ 67 | ParameterKey=AvailabilityZone1,ParameterValue="$(AZ_1)" \ 68 | ParameterKey=AdminUserArn,ParameterValue="$(USER_ARN)" \ 69 | ParameterKey=CreateRoleArn,ParameterValue="$(EKS_ADMIN_ROLE)" \ 70 | ParameterKey=KeyName,ParameterValue="$(SSH_KEY_NAME)" \ 71 | ParameterKey=InferenceNodeGroupName,ParameterValue="$(INFERENCE_NODE_GROUP_NAME)" \ 72 | ParameterKey=InferenceBootstrapArguments,ParameterValue="'$(INFERENCE_BOOTSTRAP)'" \ 73 | ParameterKey=BootstrapArguments,ParameterValue="'$(BOOTSTRAP)'" \ 74 | ParameterKey=NodeGroupName,ParameterValue="$(NODE_GROUP_NAME)" \ 75 | ParameterKey=NodeInstanceType,ParameterValue="$(NODE_INSTANCE_TYPE)" \ 76 | ParameterKey=InferenceNodeInstanceType,ParameterValue="$(INFERENCE_NODE_INSTANCE_TYPE)" \ 77 | ParameterKey=ElasticInferenceType,ParameterValue="$(EI_TYPE)" \ 78 | ParameterKey=NodeAutoScalingGroupMinSize,ParameterValue="$(NODE_ASG_MIN)" \ 79 | ParameterKey=NodeAutoScalingGroupMaxSize,ParameterValue="$(NODE_ASG_MAX)" \ 80 | ParameterKey=NodeAutoScalingGroupDesiredCapacity,ParameterValue="$(NODE_ASG_DESIRED)" \ 81 | ParameterKey=InferenceNodeAutoScalingGroupMinSize,ParameterValue="$(INFERENCE_NODE_ASG_MIN)" \ 82 | ParameterKey=InferenceNodeAutoScalingGroupMaxSize,ParameterValue="$(INFERENCE_NODE_ASG_MAX)" \ 83 | ParameterKey=InferenceNodeAutoScalingGroupDesiredCapacity,ParameterValue="$(INFERENCE_NODE_ASG_DESIRED)" \ 84 | ParameterKey=NodeVolumeSize,ParameterValue="$(NODE_VOLUME_SIZE)" \ 85 | ParameterKey=InferenceNodeVolumeSize,ParameterValue="$(INFERENCE_NODE_VOLUME_SIZE)" \ 86 | ParameterKey=LambdaCustomResourceBucketPrefix,ParameterValue="$(LAMBDA_CR_BUCKET_PREFIX)" \ 87 | ParameterKey=DefaultTaskQueueVisibilityTimeout,ParameterValue="$(DEFAULT_SQS_TASK_VISIBILITY)" \ 88 | ParameterKey=DefaultTaskCompletedQueueVisibilityTimeout,ParameterValue="$(DEFAULT_SQS_TASK_COMPLETED_VISIBILITY)" \ 89 | 
ParameterKey=InferenceScaleEvaluationPeriods,ParameterValue="$(INFERENCE_SCALE_PERIODS)" \ 90 | ParameterKey=InferenceQueueDepthScaleOutThreshold,ParameterValue="$(INFERENCE_SCALE_OUT_THRESHOLD)" \ 91 | ParameterKey=InferenceQueueDepthScaleInThreshold,ParameterValue="$(INFERENCE_SCALE_IN_THRESHOLD)" 92 | @echo open "https://console.aws.amazon.com/cloudformation/home?region=$(REGION)#/stacks to see the details" 93 | 94 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Amazon Elastic Inference with Amazon EKS 2 | 3 | This repository contains resources demonstrating how to use [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) (EI) and [Amazon EKS](https://aws.amazon.com/eks/) together to deliver a cost optimized, scalable solution for performing inference on video frames. More specifically, the solution herein runs containers in Amazon EKS that read a video from Amazon S3, preprocess its frames, then send the frames for object detection to a TensorFlow Serving container modified to work with Amazon EI. This computationally intensive use case showcases the advantages of using Amazon EI and Amazon EKS together to achieve accelerated inference at low cost within a scalable, containerized architecture. 4 | 5 | ![overview](images/overview.png) 6 | 7 | ## Deploy 8 | 9 | The following steps require the [AWS Command Line Interface](https://aws.amazon.com/cli/) to be installed and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). Additionally, you must follow the AWS instructions for installing the [IAM compatible version of kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html). 
You must also install the [aws-iam-authenticator](https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html), which helps with IAM integration. 10 | 11 | Amazon EKS and Elastic Inference are currently not available in all AWS regions. Consult the [AWS Region Table](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) for more information. 12 | 13 | If you would like to see the operations behind the make commands, [see the Makefile](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/Makefile). Likewise, if you would like to see the resources being created and how the EI accelerator is attached, see the [CloudFormation template](https://github.com/aws-samples/amazon-elastic-inference-eks/blob/master/stack.cfn.yml). 14 | 15 | The first step in the process is to clone this repository: 16 | 17 | ``` 18 | git clone https://github.com/aws-samples/amazon-elastic-inference-eks.git && cd amazon-elastic-inference-eks 19 | ``` 20 | 21 | To create both the Amazon EKS cluster and the node groups in a single step, we must first create an IAM role to be used by a Custom Resource in the CloudFormation template to programmatically modify the EKS aws-auth ConfigMap. Modifying the aws-auth ConfigMap is required so that the EC2 instances can register themselves with the cluster. 22 | 23 | ``` 24 | make create-role 25 | ``` 26 | 27 | After the IAM role is created, you can launch the cluster by running: 28 | 29 | ``` 30 | make create-cluster 31 | ``` 32 | 33 | This step takes 10-15 minutes to create all of the required resources. Once the command is executed, 34 | an AWS Management Console link to the CloudFormation section is printed. You can monitor the resource creation 35 | there. Once the stack status is CREATE_COMPLETE, you can move on to the next step. 36 | 37 | If you want to override the default values in the Makefile, create a custom.mk file in the same directory and set the values there.
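As an illustration, a `custom.mk` that moves the deployment to another region might look like the sketch below. The values are assumptions chosen for the example, not recommendations: `my-eks-ei` is a made-up stack name, and the region must be one that has an entry in the template's AMI mapping and offers both Amazon EKS and Elastic Inference. Because the Makefile includes `custom.mk` before its own `?=` defaults, any variable set here takes precedence.

```make
# custom.mk: illustrative values only; adjust to your account and region.

# Hypothetical stack name; CLUSTER_NAME defaults to the same value.
CLUSTER_STACK_NAME = my-eks-ei

# Region and availability zones, quoted the same way as the Makefile defaults.
REGION = 'us-west-2'
AZ_0 = 'us-west-2a'
AZ_1 = 'us-west-2b'

# Start with a smaller standard node group (must stay >= NODE_ASG_MIN).
NODE_ASG_DESIRED = 1
```

Comments are kept on their own lines on purpose: in GNU Make, whitespace between a value and a trailing `#` comment becomes part of the value.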
The following parameters can be overridden: 38 | 39 | ``` 40 | CLUSTER_STACK_NAME ?= eks-ei-blog 41 | CLUSTER_NAME ?= $(CLUSTER_STACK_NAME) 42 | EKS_ADMIN_ROLE ?= arn:aws:iam::$(AWS_ACCOUNT_ID):role/EksEiBlogPostRole 43 | REGION ?= 'us-east-1' 44 | AZ_0 ?= 'us-east-1a' 48 | AZ_1 ?= 'us-east-1b' 49 | SSH_KEY_NAME ?= 'somekey' 50 | USER_ARN ?= $(shell aws sts get-caller-identity --output text --query 'Arn') 51 | EI_TYPE ?= 'eia1.medium' 52 | NODE_INSTANCE_TYPE ?= 'm5.large' 53 | INFERENCE_NODE_INSTANCE_TYPE ?= 'c5.large' 54 | NODE_ASG_MIN ?= 1 55 | NODE_ASG_MAX ?= 5 56 | NODE_ASG_DESIRED ?= 2 57 | INFERENCE_NODE_ASG_MAX ?= 6 58 | INFERENCE_NODE_ASG_MIN ?= 2 59 | INFERENCE_NODE_ASG_DESIRED ?= 2 60 | NODE_VOLUME_SIZE ?= 100 61 | INFERENCE_NODE_VOLUME_SIZE ?= 100 62 | LAMBDA_CR_BUCKET_PREFIX ?= 'pub-cfn-cust-res-pocs' 63 | DEFAULT_SQS_TASK_VISIBILITY ?= 7200 64 | DEFAULT_SQS_TASK_COMPLETED_VISIBILITY ?= 500 65 | INFERENCE_SCALE_PERIODS ?= 1 66 | INFERENCE_SCALE_OUT_THRESHOLD ?= 2 67 | INFERENCE_SCALE_IN_THRESHOLD ?= 2 68 | INFERENCE_NODE_GROUP_NAME ?= 'inference' 69 | NODE_GROUP_NAME ?= 'standard' 70 | INFERENCE_BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference 71 | BOOTSTRAP ?= --kubelet-extra-args --node-labels=inference=false,nodegroup=standard 72 | ``` 73 | 74 | After the cluster is created, update your local [kubeconfig](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) file by running: 75 | 76 | ``` 77 | make update-kubeconfig 78 | ``` 79 | 80 | You can test your local Amazon EKS authentication by running: 81 | 82 | ``` 83 | kubectl get nodes 84 | ``` 85 | 86 | The output should provide high-level information about the four nodes (or a different number if you overrode the defaults).
Next, we are going to deploy the [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) to the cluster: 87 | 88 | ``` 89 | kubectl apply -f k8s-daemonset.yml 90 | ``` 91 | 92 | You can check the status of the deployment by running: 93 | 94 | ``` 95 | kubectl describe daemonset inference-daemon 96 | ``` 97 | Wait until "Pods Status" reports all of the pods as "Running". 98 | 99 | ## Run 100 | 101 | Next, upload a sample MOV file to the data bucket created by the stack. In the S3 console, 102 | you will see a bucket whose name has the following format (the default CLUSTER/STACK NAME is eks-ei-blog): 103 | 104 | ``` 105 | task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME 106 | ``` 107 | 108 | From the command line, you can use a command similar to the following to upload sample content to the bucket: 109 | 110 | ``` 111 | aws s3 cp --region REGION sample.mov s3://task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME 112 | ``` 113 | 114 | Once you have sample data in the bucket, [submit a sample task](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-send-message.html) in the SQS section of the AWS Management Console. The queue name resembles: 115 | 116 | ``` 117 | task-queue-CLUSTER/STACK NAME 118 | ``` 119 | 120 | The format of the message submitted in the console is: 121 | 122 | ``` 123 | { "bucket": "task-data-bucket-ACCOUNT_ID-REGION-CLUSTER/STACK NAME", "object": "sample.mov" } 124 | ``` 125 | 126 | Change the bucket name and the object key to match your deployment. 127 | 128 | To see the inference results, you can [view messages](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-receive-delete-message.html) on the task-completed SQS queue in the AWS Management Console. 129 | 130 | 131 | ## Cleanup 132 | 133 | You must delete the DaemonSet before terminating the Amazon EKS cluster, or there will be resources that cannot be 134 | reclaimed by CloudFormation.
To do so, run: 135 | 136 | ``` 137 | kubectl delete -f k8s-daemonset.yml 138 | ``` 139 | 140 | Once the DaemonSet is terminated, you can [delete the stack](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-delete-stack.html) in the CloudFormation section of the AWS Management Console. 141 | 142 | ## License Summary 143 | 144 | This sample code is made available under the MIT-0 license. See the LICENSE file. 145 | -------------------------------------------------------------------------------- /coco_label_map.py: -------------------------------------------------------------------------------- 1 | label_map = { 2 | 1: 'person', 3 | 2: 'bicycle', 4 | 3: 'car', 5 | 4: 'motorcycle', 6 | 5: 'airplane', 7 | 6: 'bus', 8 | 7: 'train', 9 | 8: 'truck', 10 | 9: 'boat', 11 | 10: 'traffic light', 12 | 11: 'fire hydrant', 13 | 13: 'stop sign', 14 | 14: 'parking meter', 15 | 15: 'bench', 16 | 16: 'bird', 17 | 17: 'cat', 18 | 18: 'dog', 19 | 19: 'horse', 20 | 20: 'sheep', 21 | 21: 'cow', 22 | 22: 'elephant', 23 | 23: 'bear', 24 | 24: 'zebra', 25 | 25: 'giraffe', 26 | 27: 'backpack', 27 | 28: 'umbrella', 28 | 31: 'handbag', 29 | 32: 'tie', 30 | 33: 'suitcase', 31 | 34: 'frisbee', 32 | 35: 'skis', 33 | 36: 'snowboard', 34 | 37: 'sports ball', 35 | 38: 'kite', 36 | 39: 'baseball bat', 37 | 40: 'baseball glove', 38 | 41: 'skateboard', 39 | 42: 'surfboard', 40 | 43: 'tennis racket', 41 | 44: 'bottle', 42 | 46: 'wine glass', 43 | 47: 'cup', 44 | 48: 'fork', 45 | 49: 'knife', 46 | 50: 'spoon', 47 | 51: 'bowl', 48 | 52: 'banana', 49 | 53: 'apple', 50 | 54: 'sandwich', 51 | 55: 'orange', 52 | 56: 'broccoli', 53 | 57: 'carrot', 54 | 58: 'hot dog', 55 | 59: 'pizza', 56 | 60: 'donut', 57 | 61: 'cake', 58 | 62: 'chair', 59 | 63: 'couch', 60 | 64: 'potted plant', 61 | 65: 'bed', 62 | 67: 'dining table', 63 | 70: 'toilet', 64 | 72: 'tv', 65 | 73: 'laptop', 66 | 74: 'mouse', 67 | 75: 'remote', 68 | 76: 'keyboard', 69 | 77: 'cell phone', 70 | 78: 'microwave', 71 | 79: 
'oven', 72 | 80: 'toaster', 73 | 81: 'sink', 74 | 82: 'refrigerator', 75 | 84: 'book', 76 | 85: 'clock', 77 | 86: 'vase', 78 | 87: 'scissors', 79 | 88: 'teddy bear', 80 | 89: 'hair drier', 81 | 90: 'toothbrush' 82 | } 83 | -------------------------------------------------------------------------------- /deploy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | TAG=$(git log -1 --pretty=%H) 4 | 5 | echo EI APP -------------------------------------------------------------------- 6 | 7 | APP_REPOSITORY=rnzdocker1/eks-elastic-inference-app 8 | 9 | python -m py_compile test.py 10 | 11 | docker build --tag $APP_REPOSITORY:$TAG . 12 | 13 | docker push $APP_REPOSITORY:$TAG 14 | 15 | echo EI TensorFlow Serving ----------------------------------------------------- 16 | 17 | SERVING_REPOSITORY=rnzdocker1/eks-elastic-inference-serving 18 | 19 | docker build --file Dockerfile_tf_serving --tag $SERVING_REPOSITORY:$TAG . 20 | 21 | docker push $SERVING_REPOSITORY:$TAG 22 | 23 | 24 | -------------------------------------------------------------------------------- /images/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-elastic-inference-eks/475452ba42ebff295a482fb971012f8de21ac162/images/overview.png -------------------------------------------------------------------------------- /k8s-daemonset.yml: -------------------------------------------------------------------------------- 1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # or in the "license" file accompanying this file. 
This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 13 | 14 | apiVersion: v1 15 | kind: ServiceAccount 16 | metadata: 17 | labels: 18 | app: inference-daemon 19 | name: inference-daemon 20 | namespace: default 21 | --- 22 | apiVersion: rbac.authorization.k8s.io/v1beta1 23 | kind: ClusterRoleBinding 24 | metadata: 25 | name: inference-daemon 26 | labels: 27 | app: inference-daemon 28 | roleRef: 29 | apiGroup: rbac.authorization.k8s.io 30 | kind: ClusterRole 31 | name: cluster-admin 32 | subjects: 33 | - kind: ServiceAccount 34 | name: inference-daemon 35 | namespace: default 36 | --- 37 | apiVersion: extensions/v1beta1 38 | kind: DaemonSet 39 | metadata: 40 | name: inference-daemon 41 | spec: 42 | updateStrategy: 43 | type: RollingUpdate 44 | template: 45 | metadata: 46 | labels: 47 | app: inference-daemon 48 | spec: 49 | volumes: 50 | - name: config-volume 51 | configMap: 52 | name: inference-config 53 | hostNetwork: true 54 | nodeSelector: 55 | nodegroup: elastic-inference 56 | containers: 57 | - name: inference-daemon 58 | image: rnzdocker1/eks-elastic-inference-app:ea770a6a256bade77730501c9a71d5670f55887d 59 | imagePullPolicy: Always 60 | env: 61 | - name: SQS_TASK_QUEUE 62 | value: task-queue-eks-a 63 | - name: SQS_TASK_COMPLETED_QUEUE 64 | value: task-completed-queue-eks-a 65 | resources: 66 | requests: 67 | memory: 1024Mi 68 | - name: tensorflow-serving 69 | image: rnzdocker1/eks-elastic-inference-serving:8e6ac6052e2d45da060aa0d22e1945b3631c4c1b 70 | imagePullPolicy: Always 71 | resources: 72 | requests: 73 | memory: 1024Mi 74 | ports: 75 | - name: tf-serving-api 76 | containerPort: 8501 77 | hostPort: 8501 78 | protocol: TCP 79 | volumeMounts: 80 | - name: config-volume 81 | mountPath: /app/ei 82 | readOnly: true 83 | --- 84 | # Configuration for the application 85 | 
apiVersion: v1 86 | kind: ConfigMap 87 | metadata: 88 | name: inference-config 89 | data: 90 | config.yaml: |- 91 | Version: 1 92 | --- 93 | # k8s service definition 94 | apiVersion: v1 95 | kind: Service 96 | metadata: 97 | name: inference-service 98 | spec: 99 | selector: 100 | app: inference-daemon 101 | clusterIP: None 102 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | 2 | boto3==1.9.188 3 | requests==2.21.0 4 | numpy==1.16.3 5 | opencv-python-headless==4.1.0.25 6 | -------------------------------------------------------------------------------- /stack.cfn.yml: -------------------------------------------------------------------------------- 1 | --- 2 | AWSTemplateFormatVersion: 2010-09-09 3 | 4 | Description: EKS + Elastic Inference Blog Post 5 | 6 | 7 | Parameters: 8 | 9 | KeyName: 10 | Type: String 11 | Default: '' 12 | Description: The EC2 Key Pair to allow SSH access to the instances 13 | 14 | ElasticInferenceType: 15 | Description: Elastic Inference type 16 | Type: String 17 | Default: eia1.medium 18 | 19 | EksClusterName: 20 | Description: EKS Cluster Name 21 | Default: eks-ei-cluster 22 | Type: String 23 | 24 | NodeInstanceType: 25 | Description: EC2 instance type for the node instances 26 | Type: String 27 | Default: m5.large 28 | ConstraintDescription: Must be a valid EC2 instance type 29 | AllowedValues: 30 | - t2.small 31 | - t2.medium 32 | - t2.large 33 | - t2.xlarge 34 | - t2.2xlarge 35 | - t3.nano 36 | - t3.micro 37 | - t3.small 38 | - t3.medium 39 | - t3.large 40 | - t3.xlarge 41 | - t3.2xlarge 42 | - m3.medium 43 | - m3.large 44 | - m3.xlarge 45 | - m3.2xlarge 46 | - m4.large 47 | - m4.xlarge 48 | - m4.2xlarge 49 | - m4.4xlarge 50 | - m4.10xlarge 51 | - m5.large 52 | - m5.xlarge 53 | - m5.2xlarge 54 | - m5.4xlarge 55 | - m5.12xlarge 56 | - m5.24xlarge 57 | - c4.large 58 | - c4.xlarge 59 | - c4.2xlarge 60 | - 
c4.4xlarge 61 | - c4.8xlarge 62 | - c5.large 63 | - c5.xlarge 64 | - c5.2xlarge 65 | - c5.4xlarge 66 | - c5.9xlarge 67 | - c5.18xlarge 68 | - i3.large 69 | - i3.xlarge 70 | - i3.2xlarge 71 | - i3.4xlarge 72 | - i3.8xlarge 73 | - i3.16xlarge 74 | - r3.xlarge 75 | - r3.2xlarge 76 | - r3.4xlarge 77 | - r3.8xlarge 78 | - r4.large 79 | - r4.xlarge 80 | - r4.2xlarge 81 | - r4.4xlarge 82 | - r4.8xlarge 83 | - r4.16xlarge 84 | - x1.16xlarge 85 | - x1.32xlarge 86 | - p2.xlarge 87 | - p2.8xlarge 88 | - p2.16xlarge 89 | - p3.2xlarge 90 | - p3.8xlarge 91 | - p3.16xlarge 92 | - p3dn.24xlarge 93 | - r5.large 94 | - r5.xlarge 95 | - r5.2xlarge 96 | - r5.4xlarge 97 | - r5.12xlarge 98 | - r5.24xlarge 99 | - r5d.large 100 | - r5d.xlarge 101 | - r5d.2xlarge 102 | - r5d.4xlarge 103 | - r5d.12xlarge 104 | - r5d.24xlarge 105 | - z1d.large 106 | - z1d.xlarge 107 | - z1d.2xlarge 108 | - z1d.3xlarge 109 | - z1d.6xlarge 110 | - z1d.12xlarge 111 | 112 | InferenceNodeInstanceType: 113 | Description: EC2 instance type for the node instances 114 | Type: String 115 | Default: c5.large 116 | ConstraintDescription: Must be a valid EC2 instance type 117 | AllowedValues: 118 | - t2.small 119 | - t2.medium 120 | - t2.large 121 | - t2.xlarge 122 | - t2.2xlarge 123 | - t3.nano 124 | - t3.micro 125 | - t3.small 126 | - t3.medium 127 | - t3.large 128 | - t3.xlarge 129 | - t3.2xlarge 130 | - m3.medium 131 | - m3.large 132 | - m3.xlarge 133 | - m3.2xlarge 134 | - m4.large 135 | - m4.xlarge 136 | - m4.2xlarge 137 | - m4.4xlarge 138 | - m4.10xlarge 139 | - m5.large 140 | - m5.xlarge 141 | - m5.2xlarge 142 | - m5.4xlarge 143 | - m5.12xlarge 144 | - m5.24xlarge 145 | - c4.large 146 | - c4.xlarge 147 | - c4.2xlarge 148 | - c4.4xlarge 149 | - c4.8xlarge 150 | - c5.large 151 | - c5.xlarge 152 | - c5.2xlarge 153 | - c5.4xlarge 154 | - c5.9xlarge 155 | - c5.18xlarge 156 | - i3.large 157 | - i3.xlarge 158 | - i3.2xlarge 159 | - i3.4xlarge 160 | - i3.8xlarge 161 | - i3.16xlarge 162 | - r3.xlarge 163 | - r3.2xlarge 
164 | - r3.4xlarge 165 | - r3.8xlarge 166 | - r4.large 167 | - r4.xlarge 168 | - r4.2xlarge 169 | - r4.4xlarge 170 | - r4.8xlarge 171 | - r4.16xlarge 172 | - x1.16xlarge 173 | - x1.32xlarge 174 | - p2.xlarge 175 | - p2.8xlarge 176 | - p2.16xlarge 177 | - p3.2xlarge 178 | - p3.8xlarge 179 | - p3.16xlarge 180 | - p3dn.24xlarge 181 | - r5.large 182 | - r5.xlarge 183 | - r5.2xlarge 184 | - r5.4xlarge 185 | - r5.12xlarge 186 | - r5.24xlarge 187 | - r5d.large 188 | - r5d.xlarge 189 | - r5d.2xlarge 190 | - r5d.4xlarge 191 | - r5d.12xlarge 192 | - r5d.24xlarge 193 | - z1d.large 194 | - z1d.xlarge 195 | - z1d.2xlarge 196 | - z1d.3xlarge 197 | - z1d.6xlarge 198 | - z1d.12xlarge 199 | 200 | NodeAutoScalingGroupMinSize: 201 | Description: Minimum size of Node Group ASG. 202 | Type: Number 203 | Default: 1 204 | 205 | NodeAutoScalingGroupMaxSize: 206 | Description: Maximum size of Node Group ASG. Set to at least 1 greater than NodeAutoScalingGroupDesiredCapacity. 207 | Type: Number 208 | Default: 4 209 | 210 | NodeAutoScalingGroupDesiredCapacity: 211 | Description: Desired capacity of Node Group ASG. 212 | Type: Number 213 | Default: 2 214 | 215 | InferenceNodeAutoScalingGroupMinSize: 216 | Description: Minimum size of Inference Node Group ASG. 217 | Type: Number 218 | Default: 2 219 | 220 | InferenceNodeAutoScalingGroupMaxSize: 221 | Description: Maximum size of Inference Node Group ASG. Set to at least 1 greater than InferenceNodeAutoScalingGroupDesiredCapacity. 222 | Type: Number 223 | Default: 6 224 | 225 | InferenceNodeAutoScalingGroupDesiredCapacity: 226 | Description: Desired capacity of Inference Node Group ASG. 227 | Type: Number 228 | Default: 2 229 | 230 | NodeVolumeSize: 231 | Description: Node volume size 232 | Type: Number 233 | Default: 100 234 | 235 | InferenceNodeVolumeSize: 236 | Description: Inference node volume size 237 | Type: Number 238 | Default: 100 239 | 240 | InferenceNodeGroupName: 241 | Description: Unique identifier for the Inference Node Group.
242 | Default: inference 243 | Type: String 244 | 245 | NodeGroupName: 246 | Description: Unique identifier for the Node Group. 247 | Default: standard 248 | Type: String 249 | 250 | InferenceBootstrapArguments: 251 | Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami 252 | Default: --kubelet-extra-args --node-labels=inference=true,nodegroup=elastic-inference 253 | Type: String 254 | 255 | BootstrapArguments: 256 | Description: Arguments to pass to the bootstrap script. See files/bootstrap.sh in https://github.com/awslabs/amazon-eks-ami 257 | Default: --kubelet-extra-args --node-labels=inference=false,nodegroup=standard 258 | Type: String 259 | 260 | AvailabilityZone0: 261 | Description: The first availability zone in the region 262 | Type: AWS::EC2::AvailabilityZone::Name 263 | ConstraintDescription: Must be a valid availability zone 264 | 265 | AvailabilityZone1: 266 | Description: The second availability zone in the region 267 | Type: AWS::EC2::AvailabilityZone::Name 268 | ConstraintDescription: Must be a valid availability zone 269 | 270 | CreateRoleArn: 271 | Type: String 272 | 273 | AdminUserArn: 274 | Type: String 275 | 276 | LambdaCustomResourceBucketPrefix: 277 | Type: String 278 | Default: pub-cfn-cust-res-pocs 279 | 280 | DefaultTaskQueueVisibilityTimeout: 281 | Type: Number 282 | Description: The queue visibility timeout 283 | MinValue: 0 284 | MaxValue: 43200 285 | Default: 7200 # Two hour default 286 | 287 | DefaultTaskCompletedQueueVisibilityTimeout: 288 | Type: Number 289 | Description: The queue visibility timeout 290 | MinValue: 0 291 | MaxValue: 43200 292 | Default: 500 293 | 294 | # Scaling params 295 | InferenceScaleEvaluationPeriods: 296 | Description: The number of periods over which data is compared to the specified threshold 297 | Type: Number 298 | Default: 1 299 | MinValue: 1 300 | 301 | InferenceQueueDepthScaleOutThreshold: 302 | Type: Number 303 | Description: 
Average queue depth value to trigger auto scaling out 304 | Default: 2 305 | 306 | InferenceQueueDepthScaleInThreshold: 307 | Type: Number 308 | Description: Average queue depth value to trigger auto scaling in 309 | Default: 2 310 | MinValue: 0 311 | ConstraintDescription: Value must be 0 or more 312 | 313 | 314 | Mappings: 315 | 316 | # Maps CIDR blocks to VPC and various subnets 317 | CIDRMap: 318 | VPC: 319 | CIDR: 10.50.0.0/16 320 | Public0: 321 | CIDR: 10.50.0.0/19 322 | Public1: 323 | CIDR: 10.50.32.0/19 324 | 325 | # Amazon EKS Linux AMI - https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html 326 | AMIMap: 327 | us-east-2: 328 | AMI: ami-0485258c2d1c3608f 329 | us-east-1: 330 | AMI: ami-0f2e8e5663e16b436 331 | us-west-2: 332 | AMI: ami-03a55127c613349a7 333 | ap-northeast-1: 334 | AMI: ami-0fde798d17145fae1 335 | ap-northeast-2: 336 | AMI: ami-07fd7609df6c8e39b 337 | eu-west-1: 338 | AMI: ami-00ac2e6b3cb38a9b9 339 | 340 | 341 | Resources: 342 | 343 | # VPC ------------------------------------------------------------------------ 344 | 345 | Vpc: 346 | Type: AWS::EC2::VPC 347 | Properties: 348 | CidrBlock: !FindInMap [ CIDRMap, VPC, CIDR ] 349 | EnableDnsSupport: true 350 | EnableDnsHostnames: true 351 | Tags: 352 | - Key: Name 353 | Value: !Sub ${AWS::StackName}-vpc 354 | 355 | ElasticInferenceVpcEndpoint: 356 | Type: AWS::EC2::VPCEndpoint 357 | Properties: 358 | VpcId: !Ref Vpc 359 | ServiceName: !Sub com.amazonaws.${AWS::Region}.elastic-inference.runtime 360 | VpcEndpointType: Interface 361 | PrivateDnsEnabled: true 362 | SubnetIds: 363 | - !Ref PublicSubnet0 364 | - !Ref PublicSubnet1 365 | SecurityGroupIds: 366 | - !GetAtt ElasticInferenceVpcEndpointSecurityGroup.GroupId 367 | 368 | ElasticInferenceVpcEndpointSecurityGroup: 369 | Type: AWS::EC2::SecurityGroup 370 | Properties: 371 | GroupDescription: Elastic Inference Vpc Endpoint security group 372 | VpcId: !Ref Vpc 373 | SecurityGroupIngress: 374 | - SourceSecurityGroupId: 
!Ref NodeSecurityGroup 375 | IpProtocol: tcp 376 | ToPort: 443 377 | FromPort: 443 378 | SecurityGroupEgress: 379 | - CidrIp: 0.0.0.0/0 380 | IpProtocol: tcp 381 | ToPort: 443 382 | FromPort: 443 383 | 384 | InternetGateway: 385 | Type: AWS::EC2::InternetGateway 386 | Properties: 387 | Tags: 388 | - Key: Name 389 | Value: !Sub ${AWS::StackName}-igw 390 | 391 | VpcGatewayAttachment: 392 | Type: AWS::EC2::VPCGatewayAttachment 393 | Properties: 394 | InternetGatewayId: !Ref InternetGateway 395 | VpcId: !Ref Vpc 396 | 397 | RouteTable: 398 | Type: AWS::EC2::RouteTable 399 | Properties: 400 | VpcId: !Ref Vpc 401 | Tags: 402 | - Key: Name 403 | Value: Public Subnets 404 | - Key: Network 405 | Value: Public 406 | 407 | PublicSubnet0: 408 | Type: AWS::EC2::Subnet 409 | Properties: 410 | VpcId: !Ref Vpc 411 | CidrBlock: !FindInMap [ CIDRMap, Public0, CIDR ] 412 | AvailabilityZone: !Ref AvailabilityZone0 413 | Tags: 414 | - Key: Name 415 | Value: !Sub ${AWS::StackName}-PublicSubnet0 416 | 417 | PublicSubnet1: 418 | Type: AWS::EC2::Subnet 419 | Properties: 420 | VpcId: !Ref Vpc 421 | CidrBlock: !FindInMap [ CIDRMap, Public1, CIDR ] 422 | AvailabilityZone: !Ref AvailabilityZone1 423 | Tags: 424 | - Key: Name 425 | Value: !Sub ${AWS::StackName}-PublicSubnet1 426 | 427 | PublicRoute: 428 | Type: AWS::EC2::Route 429 | DependsOn: VpcGatewayAttachment 430 | Properties: 431 | RouteTableId: !Ref PublicRouteTable 432 | DestinationCidrBlock: 0.0.0.0/0 433 | GatewayId: !Ref InternetGateway 434 | 435 | PublicRouteTable: 436 | Type: AWS::EC2::RouteTable 437 | Properties: 438 | VpcId: !Ref Vpc 439 | Tags: 440 | - Key: Name 441 | Value: !Sub ${AWS::StackName}-public-igw 442 | 443 | PublicSubnetRouteTableAssociation0: 444 | Type: AWS::EC2::SubnetRouteTableAssociation 445 | Properties: 446 | SubnetId: !Ref PublicSubnet0 447 | RouteTableId: !Ref PublicRouteTable 448 | 449 | PublicSubnetRouteTableAssociation1: 450 | Type: AWS::EC2::SubnetRouteTableAssociation 451 | Properties: 452 | SubnetId: 
!Ref PublicSubnet1 453 | RouteTableId: !Ref PublicRouteTable 454 | 455 | # Queues -------------------------------------------------------------------- 456 | 457 | TaskQueue: 458 | Type: AWS::SQS::Queue 459 | Properties: 460 | QueueName: !Sub task-queue-${AWS::StackName} 461 | VisibilityTimeout: !Ref DefaultTaskQueueVisibilityTimeout 462 | KmsMasterKeyId: alias/aws/sqs 463 | 464 | TaskCompletedQueue: 465 | Type: AWS::SQS::Queue 466 | Properties: 467 | QueueName: !Sub task-completed-queue-${AWS::StackName} 468 | VisibilityTimeout: !Ref DefaultTaskCompletedQueueVisibilityTimeout 469 | KmsMasterKeyId: alias/aws/sqs 470 | 471 | # Buckets -------------------------------------------------------------------- 472 | 473 | TaskDataBucket: 474 | Type: AWS::S3::Bucket 475 | Properties: 476 | BucketName: !Sub task-data-bucket-${AWS::AccountId}-${AWS::Region}-${AWS::StackName} 477 | PublicAccessBlockConfiguration: 478 | BlockPublicAcls: true 479 | BlockPublicPolicy: true 480 | IgnorePublicAcls: true 481 | RestrictPublicBuckets: true 482 | BucketEncryption: 483 | ServerSideEncryptionConfiguration: 484 | - ServerSideEncryptionByDefault: 485 | SSEAlgorithm: AES256 486 | 487 | # EKS Cluster ---------------------------------------------------------------- 488 | 489 | EksCluster: 490 | Type: AWS::EKS::Cluster 491 | Properties: 492 | Name: !Ref EksClusterName 493 | RoleArn: !GetAtt ClusterRole.Arn 494 | ResourcesVpcConfig: 495 | SecurityGroupIds: 496 | - !Ref ClusterControlPlaneSecurityGroup 497 | SubnetIds: 498 | - !Ref PublicSubnet0 499 | - !Ref PublicSubnet1 500 | DependsOn: 501 | - PublicSubnetRouteTableAssociation0 502 | - PublicSubnetRouteTableAssociation1 503 | 504 | EksAdminRole: 505 | Type: AWS::IAM::Role 506 | Properties: 507 | Path: / 508 | AssumeRolePolicyDocument: 509 | Version: 2012-10-17 510 | Statement: 511 | - Effect: Allow 512 | Principal: 513 | Service: codebuild.amazonaws.com 514 | Action: sts:AssumeRole 515 | Policies: 516 | - PolicyName: eks-describe 517 | 
PolicyDocument: 518 | Version: 2012-10-17 519 | Statement: 520 | - Resource: '*' 521 | Effect: Allow 522 | Action: 523 | - eks:Describe* 524 | 525 | # https://github.com/rnzsgh/cfn-eks-custom-resource-aws-auth-configmap 526 | AwsAuthConfigMapConfigCustomResourceLambda: 527 | Type: AWS::Lambda::Function 528 | Properties: 529 | Code: 530 | S3Bucket: !Sub ${LambdaCustomResourceBucketPrefix}-${AWS::Region} 531 | S3Key: cfn-eks-custom-resource-aws-auth-configmap.zip 532 | Handler: main 533 | Role: !Ref CreateRoleArn 534 | Runtime: go1.x 535 | Timeout: 300 536 | 537 | AwsAuthConfigMapConfigCustomResource: 538 | Type: Custom::AwsAuthConfigMapConfigCustomResource 539 | Properties: 540 | ServiceToken: !GetAtt AwsAuthConfigMapConfigCustomResourceLambda.Arn 541 | CreateRoleArn: !Ref CreateRoleArn 542 | ClusterName: !Ref EksCluster 543 | ClusterEndpoint: !GetAtt EksCluster.Endpoint 544 | NodeInstanceRoleArn: !GetAtt NodeInstanceRole.Arn 545 | AdminUserArn: !Ref AdminUserArn 546 | AdminRoleArn: !GetAtt EksAdminRole.Arn 547 | 548 | ClusterControlPlaneSecurityGroup: 549 | Type: AWS::EC2::SecurityGroup 550 | Properties: 551 | GroupDescription: Security Group for the EKS control plane 552 | VpcId: !Ref Vpc 553 | 554 | ClusterRole: 555 | Type: AWS::IAM::Role 556 | Properties: 557 | AssumeRolePolicyDocument: 558 | Version: 2012-10-17 559 | Statement: 560 | Effect: Allow 561 | Principal: 562 | Service: 563 | - eks.amazonaws.com 564 | Action: sts:AssumeRole 565 | ManagedPolicyArns: 566 | - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy 567 | - arn:aws:iam::aws:policy/AmazonEKSServicePolicy 568 | 569 | # EKS Nodes ------------------------------------------------------------------ 570 | 571 | NodeInstanceProfile: 572 | Type: AWS::IAM::InstanceProfile 573 | Properties: 574 | Path: / 575 | Roles: 576 | - !Ref NodeInstanceRole 577 | 578 | NodeInstanceRole: 579 | Type: AWS::IAM::Role 580 | Properties: 581 | AssumeRolePolicyDocument: 582 | Version: 2012-10-17 583 | Statement: 584 | - Effect: 
Allow 585 | Principal: 586 | Service: 587 | - ec2.amazonaws.com 588 | Action: 589 | - sts:AssumeRole 590 | Path: / 591 | Policies: 592 | - PolicyName: ei-access 593 | PolicyDocument: 594 | Version: 2012-10-17 595 | Statement: 596 | - Effect: Allow 597 | Action: 598 | - elastic-inference:Connect 599 | - iam:List* 600 | - iam:Get* 601 | - ec2:Describe* 602 | - ec2:Get* 603 | - ec2:ModifyInstanceAttribute 604 | Resource: '*' 605 | - PolicyName: sqs-access 606 | PolicyDocument: 607 | Version: 2012-10-17 608 | Statement: 609 | - Effect: Allow 610 | Action: 611 | - sqs:SendMessage 612 | - sqs:ReceiveMessage 613 | - sqs:DeleteMessage 614 | - sqs:ChangeMessageVisibility 615 | - sqs:GetQueueUrl 616 | - sqs:GetQueueAttributes 617 | Resource: 618 | - !GetAtt TaskQueue.Arn 619 | - !GetAtt TaskCompletedQueue.Arn 620 | - PolicyName: s3-access 621 | PolicyDocument: 622 | Version: 2012-10-17 623 | Statement: 624 | - Effect: Allow 625 | Action: 626 | - s3:Get* 627 | - s3:List* 628 | - s3:PutObjectAcl 629 | - s3:PutObject 630 | - s3:PutObjectTagging 631 | - s3:PutObjectVersionAcl 632 | - s3:PutObjectVersionTagging 633 | - s3:PutObjectRetention 634 | - s3:PutObjectLegalHold 635 | - s3:PutBucketObjectLockConfiguration 636 | - s3:AbortMultipartUpload 637 | - s3:DeleteObject 638 | - s3:DeleteObjectTagging 639 | - s3:DeleteObjectVersion 640 | - s3:DeleteObjectVersionTagging 641 | Resource: 642 | - !Sub arn:aws:s3:::${TaskDataBucket}/* 643 | - !Sub arn:aws:s3:::${TaskDataBucket} 644 | ManagedPolicyArns: 645 | - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy 646 | - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy 647 | - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly 648 | - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM 649 | 650 | NodeSecurityGroup: 651 | Type: AWS::EC2::SecurityGroup 652 | Properties: 653 | GroupDescription: Security group for all nodes in the cluster 654 | VpcId: !Ref Vpc 655 | Tags: 656 | - Key: !Sub kubernetes.io/cluster/${EksCluster} 657 | 
Value: owned 658 | DependsOn: EksCluster 659 | 660 | NodeSecurityGroupIngress: 661 | Type: AWS::EC2::SecurityGroupIngress 662 | Properties: 663 | Description: Allow node to communicate with each other 664 | GroupId: !Ref NodeSecurityGroup 665 | SourceSecurityGroupId: !Ref NodeSecurityGroup 666 | IpProtocol: tcp 667 | FromPort: 0 668 | ToPort: 65535 669 | DependsOn: NodeSecurityGroup 670 | 671 | NodeSecurityGroupFromControlPlaneIngress: 672 | Type: AWS::EC2::SecurityGroupIngress 673 | Properties: 674 | Description: Allow worker Kubelets and pods to receive communication from the cluster control plane 675 | GroupId: !Ref NodeSecurityGroup 676 | SourceSecurityGroupId: !Ref ClusterControlPlaneSecurityGroup 677 | IpProtocol: tcp 678 | FromPort: 1025 679 | ToPort: 65535 680 | DependsOn: NodeSecurityGroup 681 | 682 | ControlPlaneEgressToNodeSecurityGroup: 683 | Type: AWS::EC2::SecurityGroupEgress 684 | DependsOn: NodeSecurityGroup 685 | Properties: 686 | Description: Allow the cluster control plane to communicate with worker Kubelet and pods 687 | GroupId: !Ref ClusterControlPlaneSecurityGroup 688 | DestinationSecurityGroupId: !Ref NodeSecurityGroup 689 | IpProtocol: tcp 690 | FromPort: 1025 691 | ToPort: 65535 692 | 693 | NodeSecurityGroupFromControlPlaneOn443Ingress: 694 | Type: AWS::EC2::SecurityGroupIngress 695 | DependsOn: NodeSecurityGroup 696 | Properties: 697 | Description: Allow pods running extension API servers on port 443 to receive communication from cluster control plane 698 | GroupId: !Ref NodeSecurityGroup 699 | SourceSecurityGroupId: !Ref ClusterControlPlaneSecurityGroup 700 | IpProtocol: tcp 701 | FromPort: 443 702 | ToPort: 443 703 | 704 | ControlPlaneEgressToNodeSecurityGroupOn443: 705 | Type: AWS::EC2::SecurityGroupEgress 706 | DependsOn: NodeSecurityGroup 707 | Properties: 708 | Description: Allow the cluster control plane to communicate with pods running extension API servers on port 443 709 | GroupId: !Ref ClusterControlPlaneSecurityGroup 710 | 
DestinationSecurityGroupId: !Ref NodeSecurityGroup 711 | IpProtocol: tcp 712 | FromPort: 443 713 | ToPort: 443 714 | 715 | ClusterControlPlaneSecurityGroupIngress: 716 | Type: AWS::EC2::SecurityGroupIngress 717 | Properties: 718 | Description: Allow pods to communicate with the cluster API Server 719 | GroupId: !Ref ClusterControlPlaneSecurityGroup 720 | SourceSecurityGroupId: !Ref NodeSecurityGroup 721 | IpProtocol: tcp 722 | ToPort: 443 723 | FromPort: 443 724 | DependsOn: NodeSecurityGroup 725 | 726 | InferenceNodeGroup: 727 | Type: AWS::AutoScaling::AutoScalingGroup 728 | Properties: 729 | DesiredCapacity: !Ref InferenceNodeAutoScalingGroupDesiredCapacity 730 | MinSize: !Ref InferenceNodeAutoScalingGroupMinSize 731 | MaxSize: !Ref InferenceNodeAutoScalingGroupMaxSize 732 | LaunchTemplate: 733 | LaunchTemplateId: !Ref InferenceNodeLaunchTemplate 734 | Version: 1 735 | VPCZoneIdentifier: 736 | - !Ref PublicSubnet0 737 | - !Ref PublicSubnet1 738 | Tags: 739 | - Key: Name 740 | Value: !Sub ${EksCluster}-${InferenceNodeGroupName}-Node 741 | PropagateAtLaunch: true 742 | - Key: !Sub kubernetes.io/cluster/${EksCluster} 743 | Value: owned 744 | PropagateAtLaunch: true 745 | TerminationPolicies: 746 | - OldestInstance 747 | - OldestLaunchTemplate 748 | UpdatePolicy: 749 | AutoScalingRollingUpdate: 750 | MaxBatchSize: 1 751 | MinInstancesInService: !Ref InferenceNodeAutoScalingGroupDesiredCapacity 752 | PauseTime: PT5M 753 | DependsOn: 754 | - AwsAuthConfigMapConfigCustomResource 755 | 756 | NodeGroup: 757 | Type: AWS::AutoScaling::AutoScalingGroup 758 | Properties: 759 | DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity 760 | LaunchTemplate: 761 | LaunchTemplateId: !Ref NodeLaunchTemplate 762 | Version: 1 763 | MinSize: !Ref NodeAutoScalingGroupMinSize 764 | MaxSize: !Ref NodeAutoScalingGroupMaxSize 765 | VPCZoneIdentifier: 766 | - !Ref PublicSubnet0 767 | - !Ref PublicSubnet1 768 | Tags: 769 | - Key: Name 770 | Value: !Sub ${EksCluster}-${NodeGroupName}-Node 771 | 
PropagateAtLaunch: true 772 | - Key: !Sub kubernetes.io/cluster/${EksCluster} 773 | Value: owned 774 | PropagateAtLaunch: true 775 | TerminationPolicies: 776 | - OldestInstance 777 | - OldestLaunchTemplate 778 | UpdatePolicy: 779 | AutoScalingRollingUpdate: 780 | MaxBatchSize: 1 781 | MinInstancesInService: !Ref NodeAutoScalingGroupDesiredCapacity 782 | PauseTime: PT5M 783 | DependsOn: 784 | - AwsAuthConfigMapConfigCustomResource 785 | 786 | InferenceNodeLaunchTemplate: 787 | Type: AWS::EC2::LaunchTemplate 788 | Properties: 789 | LaunchTemplateData: 790 | IamInstanceProfile: 791 | Name: !Ref NodeInstanceProfile 792 | ImageId: !FindInMap [ AMIMap, !Ref "AWS::Region", AMI ] 793 | InstanceType: !Ref InferenceNodeInstanceType 794 | KeyName: !Ref KeyName 795 | ElasticInferenceAccelerators: 796 | - Type: !Ref ElasticInferenceType 797 | Monitoring: 798 | Enabled: true 799 | BlockDeviceMappings: 800 | - DeviceName: /dev/xvda 801 | Ebs: 802 | VolumeSize: !Ref InferenceNodeVolumeSize 803 | VolumeType: gp2 804 | DeleteOnTermination: true 805 | UserData: 806 | Fn::Base64: 807 | !Sub | 808 | #!/bin/bash 809 | set -o xtrace 810 | yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm 811 | mv /etc/amazon/ssm/seelog.xml.template /etc/amazon/ssm/seelog.xml 812 | systemctl restart amazon-ssm-agent 813 | /etc/eks/bootstrap.sh ${EksCluster} ${InferenceBootstrapArguments} 814 | /opt/aws/bin/cfn-signal --exit-code $? 
--stack ${AWS::StackName} --resource InferenceNodeGroup --region ${AWS::Region} 815 | NetworkInterfaces: 816 | - DeviceIndex: 0 817 | AssociatePublicIpAddress: true 818 | DeleteOnTermination: true 819 | Groups: 820 | - !Ref NodeSecurityGroup 821 | 822 | NodeLaunchTemplate: 823 | Type: AWS::EC2::LaunchTemplate 824 | Properties: 825 | LaunchTemplateData: 826 | IamInstanceProfile: 827 | Name: !Ref NodeInstanceProfile 828 | ImageId: !FindInMap [ AMIMap, !Ref "AWS::Region", AMI ] 829 | InstanceType: !Ref NodeInstanceType 830 | KeyName: !Ref KeyName 831 | Monitoring: 832 | Enabled: true 833 | BlockDeviceMappings: 834 | - DeviceName: /dev/xvda 835 | Ebs: 836 | VolumeSize: !Ref NodeVolumeSize 837 | VolumeType: gp2 838 | DeleteOnTermination: true 839 | UserData: 840 | Fn::Base64: 841 | !Sub | 842 | #!/bin/bash 843 | set -o xtrace 844 | yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm 845 | mv /etc/amazon/ssm/seelog.xml.template /etc/amazon/ssm/seelog.xml 846 | systemctl restart amazon-ssm-agent 847 | /etc/eks/bootstrap.sh ${EksCluster} ${BootstrapArguments} 848 | /opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} --resource NodeGroup --region ${AWS::Region} 849 | NetworkInterfaces: 850 | - DeviceIndex: 0 851 | AssociatePublicIpAddress: true 852 | DeleteOnTermination: true 853 | Groups: 854 | - !Ref NodeSecurityGroup 855 | 856 | TaskQueueDepthLambdaEventRule: 857 | Type: AWS::Events::Rule 858 | Properties: 859 | State: ENABLED 860 | ScheduleExpression: 'cron(* * * * ? 
*)' 861 | Targets: 862 | - Arn: !GetAtt TaskQueueDepthLambda.Arn 863 | Id: !Ref TaskQueueDepthLambda 864 | 865 | TaskQueueDepthLambdaPermission: 866 | Type: AWS::Lambda::Permission 867 | Properties: 868 | Action: lambda:InvokeFunction 869 | Principal: events.amazonaws.com 870 | FunctionName: !Ref TaskQueueDepthLambda 871 | SourceArn: !GetAtt TaskQueueDepthLambdaEventRule.Arn 872 | 873 | # https://github.com/rnzsgh/lambda-update-sqs-queue-depth 874 | TaskQueueDepthLambda: 875 | Type: AWS::Lambda::Function 876 | Properties: 877 | Environment: 878 | Variables: 879 | QueueUrl: !Ref TaskQueue 880 | CloudWatchMetricNamespace: eks-ei/sqs 881 | CloudWatchMetricName: TaskQueueApproximateNumberOfMessages 882 | Code: 883 | S3Bucket: !Sub ${LambdaCustomResourceBucketPrefix}-${AWS::Region} 884 | S3Key: lambda-update-sqs-queue-depth.zip 885 | Handler: main 886 | Role: !GetAtt TaskQueueDepthLambdaExecutionRole.Arn 887 | Runtime: go1.x 888 | Timeout: 300 889 | 890 | TaskQueueDepthLambdaExecutionRole: 891 | Type: AWS::IAM::Role 892 | Properties: 893 | ManagedPolicyArns: 894 | - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 895 | Path: / 896 | AssumeRolePolicyDocument: 897 | Version: 2012-10-17 898 | Statement: 899 | - Effect: Allow 900 | Action: sts:AssumeRole 901 | Principal: 902 | Service: 903 | - lambda.amazonaws.com 904 | Policies: 905 | - PolicyName: sqs-attr 906 | PolicyDocument: 907 | Version: 2012-10-17 908 | Statement: 909 | - Effect: Allow 910 | Action: 911 | - sqs:GetQueueAttributes 912 | Resource: 913 | - !GetAtt TaskQueue.Arn 914 | - PolicyName: cloudwatch-put 915 | PolicyDocument: 916 | Version: 2012-10-17 917 | Statement: 918 | - Effect: Allow 919 | Action: 920 | - cloudwatch:PutMetricData 921 | Resource: '*' 922 | 923 | InferenceScaleOutPolicy: 924 | Type: AWS::AutoScaling::ScalingPolicy 925 | Properties: 926 | AdjustmentType: ChangeInCapacity 927 | AutoScalingGroupName: !Ref InferenceNodeGroup 928 | ScalingAdjustment: 2 929 | Cooldown: 60 930 | 
931 | InferenceScaleInPolicy: 932 | Type: AWS::AutoScaling::ScalingPolicy 933 | Properties: 934 | AdjustmentType: ChangeInCapacity 935 | AutoScalingGroupName: !Ref InferenceNodeGroup 936 | ScalingAdjustment: -1 937 | Cooldown: 60 938 | 939 | InferenceScaleOutAlarm: 940 | Type: AWS::CloudWatch::Alarm 941 | Properties: 942 | EvaluationPeriods: !Ref InferenceScaleEvaluationPeriods 943 | Statistic: Average 944 | TreatMissingData: notBreaching 945 | Threshold: !Ref InferenceQueueDepthScaleOutThreshold 946 | AlarmDescription: Alarm to add capacity if queue depth is high 947 | Period: 60 948 | AlarmActions: 949 | - !Ref InferenceScaleOutPolicy 950 | Namespace: eks-ei/sqs 951 | MetricName: TaskQueueApproximateNumberOfMessages 952 | ComparisonOperator: GreaterThanThreshold 953 | 954 | InferenceScaleInAlarm: 955 | Type: AWS::CloudWatch::Alarm 956 | Properties: 957 | EvaluationPeriods: !Ref InferenceScaleEvaluationPeriods 958 | Statistic: Average 959 | TreatMissingData: notBreaching 960 | Threshold: !Ref InferenceQueueDepthScaleInThreshold 961 | AlarmDescription: Alarm to reduce capacity if container queue depth is low 962 | Period: 300 963 | AlarmActions: 964 | - !Ref InferenceScaleInPolicy 965 | Namespace: eks-ei/sqs 966 | MetricName: TaskQueueApproximateNumberOfMessages 967 | ComparisonOperator: LessThanThreshold 968 | 969 | 970 | Outputs: 971 | 972 | Name: 973 | Value: !Ref AWS::StackName 974 | Export: 975 | Name: !Sub ${AWS::StackName}-Name 976 | 977 | VpcId: 978 | Value: !Ref Vpc 979 | Export: 980 | Name: !Sub ${AWS::StackName}-VpcId 981 | 982 | VpcCidr: 983 | Description: Vpc cidr block 984 | Value: !FindInMap [ CIDRMap, VPC, CIDR ] 985 | Export: 986 | Name: !Sub ${AWS::StackName}-vpc-cidr 987 | 988 | PublicSubnet0: 989 | Value: !Ref PublicSubnet0 990 | Export: 991 | Name: !Sub ${AWS::StackName}-PublicSubnet0Id 992 | 993 | PublicSubnet1: 994 | Value: !Ref PublicSubnet1 995 | Export: 996 | Name: !Sub ${AWS::StackName}-PublicSubnet1Id 997 | 998 | NodeInstanceRoleArn: 999 
| Value: !GetAtt NodeInstanceRole.Arn 1000 | Export: 1001 | Name: !Sub ${AWS::StackName}-NodeInstanceRoleArn 1002 | 1003 | NodeInstanceRoleId: 1004 | Value: !GetAtt NodeInstanceRole.RoleId 1005 | Export: 1006 | Name: !Sub ${AWS::StackName}-NodeInstanceRoleId 1007 | 1008 | EksClusterName: 1009 | Value: !Ref EksCluster 1010 | Export: 1011 | Name: !Sub ${AWS::StackName}-EksClusterName 1012 | 1013 | EksClusterArn: 1014 | Value: !GetAtt EksCluster.Arn 1015 | Export: 1016 | Name: !Sub ${AWS::StackName}-EksClusterArn 1017 | 1018 | EksClusterEndpoint: 1019 | Value: !GetAtt EksCluster.Endpoint 1020 | Export: 1021 | Name: !Sub ${AWS::StackName}-EksClusterEndpoint 1022 | 1023 | ClusterControlPlaneSecurityGroup: 1024 | Value: !Ref ClusterControlPlaneSecurityGroup 1025 | Export: 1026 | Name: !Sub ${AWS::StackName}-ClusterControlPlaneSecurityGroup 1027 | 1028 | TaskQueueUrl: 1029 | Description: Task queue url 1030 | Value: !Ref TaskQueue 1031 | Export: 1032 | Name: !Sub ${AWS::StackName}-TaskQueueUrl 1033 | 1034 | TaskQueueArn: 1035 | Description: Task queue arn 1036 | Value: !GetAtt TaskQueue.Arn 1037 | Export: 1038 | Name: !Sub ${AWS::StackName}-TaskQueueArn 1039 | 1040 | TaskQueueName: 1041 | Description: Task queue name 1042 | Value: !GetAtt TaskQueue.QueueName 1043 | Export: 1044 | Name: !Sub ${AWS::StackName}-TaskQueueName 1045 | 1046 | TaskCompletedQueueUrl: 1047 | Description: TaskCompleted queue url 1048 | Value: !Ref TaskCompletedQueue 1049 | Export: 1050 | Name: !Sub ${AWS::StackName}-TaskCompletedQueueUrl 1051 | 1052 | TaskCompletedQueueArn: 1053 | Description: TaskCompleted queue arn 1054 | Value: !GetAtt TaskCompletedQueue.Arn 1055 | Export: 1056 | Name: !Sub ${AWS::StackName}-TaskCompletedQueueArn 1057 | 1058 | TaskCompletedQueueName: 1059 | Description: TaskCompleted queue name 1060 | Value: !GetAtt TaskCompletedQueue.QueueName 1061 | Export: 1062 | Name: !Sub ${AWS::StackName}-TaskCompletedQueueName 1063 | 1064 | 1065 | 
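The two alarms in this template compare the same custom metric (`TaskQueueApproximateNumberOfMessages`) against their thresholds with strict operators: `GreaterThanThreshold` for scale-out and `LessThanThreshold` for scale-in. With both thresholds left at their default of 2, an average depth of exactly 2 triggers neither policy. The following is a minimal Python sketch of that decision only. It assumes a single evaluation period and ignores cooldowns, the differing alarm periods (60 s and 300 s), and the ASG min/max limits; the function name `scaling_adjustment` is hypothetical and not part of this repository.

```python
def scaling_adjustment(avg_queue_depth, scale_out_threshold=2, scale_in_threshold=2):
    """Return the capacity change the template's alarms would request.

    Mirrors InferenceScaleOutAlarm (GreaterThanThreshold, ScalingAdjustment 2)
    and InferenceScaleInAlarm (LessThanThreshold, ScalingAdjustment -1).
    Cooldowns, alarm periods and ASG size limits are deliberately ignored.
    """
    if avg_queue_depth > scale_out_threshold:
        return 2   # InferenceScaleOutPolicy adds two instances
    if avg_queue_depth < scale_in_threshold:
        return -1  # InferenceScaleInPolicy removes one instance
    return 0       # depth equals the threshold: neither alarm fires


if __name__ == "__main__":
    for depth in (0, 2, 5):
        print(depth, scaling_adjustment(depth))
```

If the thresholds are set apart (for example scale-in at 1 and scale-out at 4), the band between them acts as a dead zone in which capacity stays put.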
-------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import os 3 | import sys 4 | import cv2 5 | import numpy 6 | import requests 7 | import json 8 | import logging 9 | import threading 10 | import queue 11 | 12 | import coco_label_map 13 | 14 | ENDPOINT = 'http://localhost:8501/v1/models/default:predict' 15 | TMP_FILE = "./tmp.mov" 16 | 17 | FRAME_BATCH=5 18 | 19 | FRAME_MAX=20 20 | 21 | logging.basicConfig( 22 | level=logging.INFO, 23 | format='%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s', 24 | handlers=[ logging.StreamHandler(sys.stdout) ], 25 | ) 26 | 27 | log = logging.getLogger() 28 | 29 | def get_predictions_from_image_array(batch): 30 | res = requests.post(ENDPOINT, json={ 'instances': batch }) 31 | return res.json()['predictions'] 32 | 33 | def get_classes_with_scores(predictions): 34 | vals = [] 35 | for p in predictions: 36 | num_detections = int(p['num_detections']) 37 | detected_classes = p['detection_classes'][:num_detections] 38 | detected_classes =[coco_label_map.label_map[int(x)] for x in detected_classes] 39 | detection_scores = p['detection_scores'][:num_detections] 40 | vals.append(list(zip(detected_classes, detection_scores))) 41 | 42 | return vals 43 | 44 | def prepare(prepare_queue, inference_queue): 45 | while True: 46 | inference_queue.put(prepare_queue.get().tolist()) 47 | 48 | def add_to_prepare(prepare_queue, frames): 49 | for f in frames: 50 | prepare_queue.put(f) 51 | frames.clear() 52 | 53 | def process_video_from_file(file_path, prepare_queue, inference_queue): 54 | 55 | log.info('process_video_from_file') 56 | 57 | frames = [] 58 | vidcap = cv2.VideoCapture(file_path) 59 | success, frame = vidcap.read() 60 | # Keep the flag returned by read(); an unreadable file must not enqueue a None frame 61 | 62 | log.info('start frame extraction') 63 | 64 | max_frame = 0 65 | while success: 66 | frames.append(frame) 67 | success, frame = vidcap.read() 
68 | max_frame += 1 69 | if max_frame == FRAME_MAX: 70 | break 71 | 72 | log.info('end frame extraction') 73 | 74 | count = len(frames) 75 | 76 | add_worker = threading.Thread(target=add_to_prepare, args=(prepare_queue, frames,)) 77 | add_worker.start() 78 | 79 | log.info('frame count: %d', count) 80 | batch = [] 81 | predictions = [] 82 | 83 | log.info('frame batch %d', FRAME_BATCH) 84 | 85 | for i in range(count): 86 | batch.append(inference_queue.get()) 87 | 88 | if len(batch) == FRAME_BATCH or i == (count - 1): 89 | log.info('range: %d - batch: %d', i, len(batch)) 90 | for v in get_classes_with_scores(get_predictions_from_image_array(batch)): 91 | predictions.append(str(v)) 92 | predictions.append('\n') 93 | batch.clear() 94 | 95 | vidcap.release() 96 | #cv2.destroyAllWindows() 97 | 98 | return predictions 99 | 100 | def main(): 101 | 102 | task_queue_name = None 103 | task_completed_queue_name = None 104 | 105 | try: 106 | task_queue_name = os.environ['SQS_TASK_QUEUE'] 107 | task_completed_queue_name = os.environ['SQS_TASK_COMPLETED_QUEUE'] 108 | except KeyError: 109 | log.error('Please set the environment variables for SQS_TASK_QUEUE and SQS_TASK_COMPLETED_QUEUE') 110 | sys.exit(1) 111 | 112 | # Get the instance information 113 | r = requests.get("http://169.254.169.254/latest/dynamic/instance-identity/document") 114 | r.raise_for_status() 115 | response_json = r.json() 116 | region = response_json.get('region') 117 | instance_id = response_json.get('instanceId') 118 | 119 | ec2 = boto3.client('ec2', region_name=region) 120 | s3 = boto3.client('s3', region_name=region) 121 | 122 | task_queue = boto3.resource('sqs', region_name=region).get_queue_by_name(QueueName=task_queue_name) 123 | task_completed_queue = boto3.resource('sqs', region_name=region).get_queue_by_name(QueueName=task_completed_queue_name) 124 | 125 | log.info('Initialized - instance: %s', instance_id) 126 | 127 | prepare_queue = queue.Queue() 128 | inference_queue = 
queue.Queue(maxsize=FRAME_BATCH) 129 | 130 | prepare_worker = threading.Thread(target=prepare, args=(prepare_queue, inference_queue,)) 131 | prepare_worker.start() 132 | 133 | while True: 134 | for message in task_queue.receive_messages(WaitTimeSeconds=10): 135 | try: 136 | log.info('Message received - instance: %s', instance_id) 137 | 138 | ec2.modify_instance_attribute( 139 | InstanceId=instance_id, 140 | DisableApiTermination={ 'Value': True }, 141 | ) 142 | log.info('Termination protection engaged - instance: %s', instance_id) 143 | 144 | message.change_visibility(VisibilityTimeout=600) 145 | log.info('Message visibility updated - instance: %s', instance_id) 146 | 147 | # Process the message 148 | doc = json.loads(message.body) 149 | log.info('Message body is loaded - instance: %s', instance_id) 150 | 151 | s3.download_file(doc['bucket'], doc['object'], TMP_FILE) 152 | log.info('File is downloaded - instance: %s', instance_id) 153 | 154 | log.info('Starting predictions - instance: %s', instance_id) 155 | predictions_for_frames = process_video_from_file(TMP_FILE, prepare_queue, inference_queue) 156 | log.info('Predictions completed - instance: %s', instance_id) 157 | 158 | log.info(''.join(e for e in predictions_for_frames)) 159 | 160 | task_completed_queue.send_message(MessageBody=''.join(e for e in predictions_for_frames)) 161 | log.info('Task completed msg sent - instance: %s', instance_id) 162 | message.delete() 163 | log.info('Message deleted - instance: %s', instance_id) 164 | 165 | ec2.modify_instance_attribute( 166 | InstanceId=instance_id, 167 | DisableApiTermination={ 'Value': False }, 168 | ) 169 | log.info('Termination protection disengaged - instance: %s', instance_id) 170 | 171 | if os.path.exists(TMP_FILE): 172 | os.remove(TMP_FILE) 173 | 174 | except Exception: 175 | log.error('Problem processing message: %s - instance: %s', sys.exc_info()[0], instance_id) 176 | 177 | if __name__ == '__main__': 178 | main() 179 | 180 | 
--------------------------------------------------------------------------------
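test.py relies on two small contracts that are easy to miss when reading the worker loop: each task message body is JSON with a `bucket` and an `object` key (read via `doc['bucket']` and `doc['object']`), and frames are sent to the model in groups of `FRAME_BATCH`, with the last partial group flushed when the final frame arrives. The sketch below illustrates both using only the standard library. The helper names `build_task_message` and `batch_frames`, and the bucket and key values, are hypothetical and not part of the repository; actually enqueuing the message (for example with the SQS queue's `send_message(MessageBody=...)`, as test.py does for completed tasks) is left out so the example runs without AWS access.

```python
import json


def build_task_message(bucket, key):
    """Serialize a task in the shape test.py reads: doc['bucket'], doc['object']."""
    return json.dumps({'bucket': bucket, 'object': key})


def batch_frames(frames, batch_size=5):
    """Yield lists of at most batch_size frames.

    Uses the same flush condition as process_video_from_file: a batch is
    emitted when it is full or when the last frame has been appended.
    """
    batch = []
    for i, frame in enumerate(frames):
        batch.append(frame)
        if len(batch) == batch_size or i == len(frames) - 1:
            yield list(batch)
            batch.clear()


if __name__ == "__main__":
    # Placeholder bucket and key; substitute the stack's TaskDataBucket name.
    body = build_task_message('example-task-data-bucket', 'videos/sample.mov')
    doc = json.loads(body)
    print(doc['bucket'], doc['object'])

    # Integers stand in for decoded video frames.
    print([len(b) for b in batch_frames(list(range(12)), batch_size=5)])
```

Copying the batch before yielding (`list(batch)`) keeps each emitted group intact even though the working list is cleared and reused.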