├── functions └── replace-route │ ├── __init__.py │ ├── tests │ ├── __init__.py │ ├── test_requirements.txt │ ├── pyproject.toml │ └── test_replace_route.py │ ├── requirements.txt │ ├── cloudwatch-event.json │ ├── template.yaml │ ├── sns-event.json │ └── app.py ├── CODEOWNERS ├── assets ├── nat-table.png ├── architecture.png └── Chime_company_logo.png ├── .gitignore ├── examples ├── provider.tf ├── outputs.tf ├── variables.tf └── main.tf ├── versions.tf ├── alternat.conf.tftpl ├── Dockerfile ├── outputs.tf ├── .github └── workflows │ ├── main.yaml │ └── test.yaml ├── LICENSE ├── cwagent.json.tftpl ├── CODE_OF_CONDUCT.md ├── docs └── 0.2.0-migration-guide.md ├── test ├── go.mod └── alternat_test.go ├── scripts └── alternat.sh ├── lambda.tf ├── variables.tf ├── main.tf └── README.md /functions/replace-route/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /functions/replace-route/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @chime/maintainers @bwhaley 2 | -------------------------------------------------------------------------------- /functions/replace-route/requirements.txt: -------------------------------------------------------------------------------- 1 | boto3==1.34.90 2 | -------------------------------------------------------------------------------- /functions/replace-route/tests/test_requirements.txt: -------------------------------------------------------------------------------- 1 | moto==5.0.5 2 | pytest==8.1.1 3 | sure==2.0.0 4 | -------------------------------------------------------------------------------- /assets/nat-table.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/chime/terraform-aws-alternat/HEAD/assets/nat-table.png -------------------------------------------------------------------------------- /assets/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chime/terraform-aws-alternat/HEAD/assets/architecture.png -------------------------------------------------------------------------------- /assets/Chime_company_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chime/terraform-aws-alternat/HEAD/assets/Chime_company_logo.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | **/.aws-sam/ 2 | **/out.txt 3 | **/__pycache__ 4 | **/terraform.tfstate* 5 | **/.terraform* 6 | **/.test-data 7 | .idea 8 | -------------------------------------------------------------------------------- /functions/replace-route/tests/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.pytest.ini_options] 2 | filterwarnings = [ 3 | 'ignore:datetime.datetime.utcnow\(\) is deprecated:DeprecationWarning', # https://github.com/boto/botocore/pull/3145 4 | ] 5 | -------------------------------------------------------------------------------- /examples/provider.tf: -------------------------------------------------------------------------------- 1 | terraform { 2 | required_providers { 3 | aws = { 4 | source = "hashicorp/aws" 5 | version = ">= 5.0" 6 | } 7 | } 8 | } 9 | 10 | provider "aws" { 11 | region = var.aws_region 12 | } 13 | -------------------------------------------------------------------------------- /examples/outputs.tf: -------------------------------------------------------------------------------- 1 
| output "vpc_id" { 2 | description = "VPC ID" 3 | value = module.vpc.vpc_id 4 | } 5 | 6 | output "nat_instance_security_group_id" { 7 | description = "NAT Instance Security Group ID" 8 | value = module.alternat.nat_instance_security_group_id 9 | } 10 | -------------------------------------------------------------------------------- /versions.tf: -------------------------------------------------------------------------------- 1 | terraform { 2 | required_version = ">= 1.1" 3 | 4 | required_providers { 5 | aws = { 6 | source = "hashicorp/aws" 7 | version = ">= 5.32.0" 8 | } 9 | archive = { 10 | source = "hashicorp/archive" 11 | version = ">= 2.7.0" 12 | } 13 | } 14 | } 15 | -------------------------------------------------------------------------------- /alternat.conf.tftpl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | USERDATA_CONFIG_FILE="/etc/alternat.conf" 3 | echo eip_allocation_ids_csv=${eip_allocation_ids_csv} >> "$USERDATA_CONFIG_FILE" 4 | echo route_table_ids_csv=${route_table_ids_csv} >> "$USERDATA_CONFIG_FILE" 5 | echo enable_ssm=${enable_ssm} >> "$USERDATA_CONFIG_FILE" 6 | echo enable_cloudwatch_agent=${enable_cloudwatch_agent} >> "$USERDATA_CONFIG_FILE" 7 | -------------------------------------------------------------------------------- /functions/replace-route/cloudwatch-event.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "0", 3 | "id": "b15ce58e-555e-bb98-e00a-3a71dbd9f1d8", 4 | "detail-type": "Scheduled Event", 5 | "source": "aws.events", 6 | "account": "0123456789012", 7 | "time": "2022-09-01T21:02:19Z", 8 | "region": "us-east-1", 9 | "resources": [ 10 | "arn:aws:events:us-east-1:0123456789012:rule/alternat-test-every-minute" 11 | ], 12 | "detail": {} 13 | } 14 | -------------------------------------------------------------------------------- /functions/replace-route/template.yaml: 
-------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion : '2010-09-09' 2 | Transform: AWS::Serverless-2016-10-31 3 | Resources: 4 | AutoScalingTerminationFunction: 5 | Type: AWS::Serverless::Function 6 | Properties: 7 | Handler: app.handler 8 | Runtime: python3.12 9 | Timeout: 30 10 | ConnectivityTestFunction: 11 | Type: AWS::Serverless::Function 12 | Properties: 13 | Handler: app.connectivity_test_handler 14 | Runtime: python3.12 15 | Timeout: 30 16 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/lambda/python:3.12 2 | 3 | # Copy function code 4 | COPY functions/replace-route/app.py ${LAMBDA_TASK_ROOT} 5 | 6 | # Install the function's dependencies using file requirements.txt 7 | # from your project folder. 8 | 9 | COPY functions/replace-route/requirements.txt . 10 | RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}" 11 | 12 | # Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile) 13 | CMD [ "app.handler" ] 14 | -------------------------------------------------------------------------------- /outputs.tf: -------------------------------------------------------------------------------- 1 | 2 | output "nat_instance_eips" { 3 | description = "List of Elastic IP addresses used by the NAT instances. This will be empty if EIPs are provided in var.nat_instance_eip_ids." 4 | value = (local.reuse_nat_instance_eips 5 | ? [] 6 | : local.nat_instance_eips[*].public_ip) 7 | } 8 | 9 | output "nat_gateway_eips" { 10 | description = "List of Elastic IP addresses used by the standby NAT gateways." 
11 | value = [ 12 | for eip in local.nat_gateway_eips 13 | : eip.public_ip 14 | if var.create_nat_gateways 15 | ] 16 | } 17 | 18 | output "nat_instance_security_group_id" { 19 | description = "NAT Instance Security Group ID." 20 | value = aws_security_group.nat_instance.id 21 | } 22 | 23 | output "autoscaling_group_names" { 24 | description = "Name of autoscaling groups for NAT instances." 25 | value = [ 26 | for asg in aws_autoscaling_group.nat_instance 27 | : asg.name 28 | ] 29 | } 30 | -------------------------------------------------------------------------------- /.github/workflows/main.yaml: -------------------------------------------------------------------------------- 1 | name: Build 2 | 3 | on: push 4 | 5 | jobs: 6 | build_and_push_image: 7 | permissions: 8 | id-token: write 9 | contents: read 10 | runs-on: ubuntu-24.04 11 | steps: 12 | - uses: actions/checkout@v4 13 | - name: Configure AWS Credentials 14 | uses: aws-actions/configure-aws-credentials@v4 15 | with: 16 | role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }} 17 | aws-region: ${{ secrets.AWS_REGION }} 18 | - name: Set outputs 19 | id: sha 20 | run: echo "short_sha=$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT" 21 | - name: Login to Amazon ECR 22 | id: login-ecr 23 | uses: aws-actions/amazon-ecr-login@v2 24 | - name: Build, tag, and push lambda image to Amazon ECR 25 | run: | 26 | docker build . 
-t "${{ secrets.ECR_REGISTRY }}/${{ secrets.ECR_REPOSITORY }}:${{ steps.sha.outputs.short_sha }}" 27 | docker push "${{ secrets.ECR_REGISTRY}}/${{ secrets.ECR_REPOSITORY }}:${{ steps.sha.outputs.short_sha }}" 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2022 Chime Financial 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 
22 | -------------------------------------------------------------------------------- /functions/replace-route/sns-event.json: -------------------------------------------------------------------------------- 1 | { 2 | "Records": [ 3 | { 4 | "EventSource": "aws:sns", 5 | "EventVersion": "1.0", 6 | "EventSubscriptionArn": "arn:aws:sns:us-east-1:123456789012:alternat-topic20220825201854040700000002:ca6bfdb1-b0ec-4392-8bba-e2265586770d", 7 | "Sns": { 8 | "Type": "Notification", 9 | "TopicArn": "arn:aws:sns:us-east-1:0123456789012:alternat-topic20220825201854040700000002", 10 | "Subject": "Auto Scaling: Lifecycle action 'TERMINATING' for instance i-0ce69e1a05d46bb3c in progress.", 11 | "Message": "{\"Origin\":\"AutoScalingGroup\",\"LifecycleHookName\":\"NATInstanceTerminationLifeCycleHook\",\"Destination\":\"EC2\",\"AccountId\":\"0123456789012\",\"RequestId\":\"99660bc1-e85f-69cf-3905-c9a7390dcd5e\",\"LifecycleTransition\":\"autoscaling:EC2_INSTANCE_TERMINATING\",\"AutoScalingGroupName\":\"alternat-asg\",\"Service\":\"AWS Auto Scaling\",\"Time\":\"2022-08-30T20:38:31.798Z\",\"EC2InstanceId\":\"i-0ce69e1a05d46bb3c\",\"NotificationMetadata\":\"{\\\"goodbye\\\":\\\"world\\\"}\",\"LifecycleActionToken\":\"9852eafb-6c45-43f8-917c-9353bf43d286\"}", 12 | "Timestamp": "2022-08-30T20:38:31.829Z", 13 | "MessageAttributes": {} 14 | } 15 | } 16 | ] 17 | } 18 | -------------------------------------------------------------------------------- /.github/workflows/test.yaml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | on: push 4 | 5 | jobs: 6 | lambda_tests: 7 | permissions: 8 | contents: read 9 | runs-on: ubuntu-latest 10 | env: 11 | AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }} 12 | steps: 13 | - uses: actions/checkout@v4 14 | 15 | - name: Run tests 16 | run: | 17 | pip install pip --upgrade 18 | pip install pyopenssl --upgrade 19 | pip install -r functions/replace-route/requirements.txt 20 | pip install -r 
functions/replace-route/tests/test_requirements.txt 21 | python -m pytest 22 | 23 | terraform_tests: 24 | permissions: 25 | id-token: write 26 | contents: read 27 | runs-on: ubuntu-latest 28 | defaults: 29 | run: 30 | working-directory: ./test 31 | steps: 32 | - uses: actions/checkout@v4 33 | 34 | - name: Install Terraform 35 | uses: hashicorp/setup-terraform@v3 36 | 37 | - name: Install Go 38 | uses: actions/setup-go@v5 39 | with: 40 | go-version-file: test/go.mod 41 | cache-dependency-path: test/go.sum 42 | 43 | - name: Go Tidy 44 | run: go mod tidy && git diff --exit-code 45 | 46 | - name: Go Mod 47 | run: go mod download 48 | 49 | - name: Go mod verify 50 | run: go mod verify 51 | 52 | - name: Configure AWS Credentials 53 | uses: aws-actions/configure-aws-credentials@v4 54 | with: 55 | role-to-assume: ${{ secrets.TERRATEST_ROLE_TO_ASSUME }} 56 | aws-region: ${{ secrets.AWS_REGION }} 57 | 58 | - name: Run tests 59 | run: go test -v -timeout 60m 60 | 61 | -------------------------------------------------------------------------------- /examples/variables.tf: -------------------------------------------------------------------------------- 1 | variable "alternat_instance_type" { 2 | description = "The instance type to use for the Alternat instances." 3 | type = string 4 | default = "t4g.medium" 5 | } 6 | 7 | variable "aws_region" { 8 | description = "The AWS region to deploy to." 9 | type = string 10 | default = "us-west-2" 11 | } 12 | 13 | variable "create_nat_gateways" { 14 | description = "Whether to create NAT Gateways using the Alternat module." 15 | type = bool 16 | default = true 17 | } 18 | 19 | variable "enable_nat_gateway" { 20 | description = "Whether to create NAT Gateways using the VPC module." 21 | type = bool 22 | default = false 23 | } 24 | 25 | variable "enable_nat_restore" { 26 | description = "Whether to enable NAT restore." 27 | type = bool 28 | default = true 29 | } 30 | 31 | variable "enable_ssm" { 32 | description = "Whether to enable SSM." 
33 | type = bool 34 | default = true 35 | } 36 | 37 | variable "nat_instance_key_name" { 38 | description = "The name of the key pair to use for the NAT instances." 39 | type = string 40 | default = "" 41 | } 42 | 43 | variable "private_subnets" { 44 | description = "List of private subnets to use in the example VPC." 45 | type = list(string) 46 | default = ["10.10.20.0/24", "10.10.21.0/24"] 47 | } 48 | 49 | variable "public_subnets" { 50 | description = "List of public subnets to use in the example VPC. Alternat instances and NAT Gateways reside in these subnets." 51 | type = list(string) 52 | default = ["10.10.0.0/24", "10.10.1.0/24"] 53 | } 54 | 55 | variable "vpc_cidr" { 56 | description = "The CIDR block to use for the example VPC." 57 | type = string 58 | default = "10.10.0.0/16" 59 | } 60 | 61 | variable "vpc_secondary_subnets" { 62 | description = "List of private subnets in the secondary cidr space." 63 | type = list(string) 64 | default = ["10.20.20.0/24", "10.20.21.0/24"] 65 | } 66 | 67 | variable "vpc_secondary_cidr" { 68 | description = "A secondary CIDR block to use with the example VPC." 69 | type = string 70 | default = "10.20.0.0/16" 71 | } 72 | 73 | variable "vpc_name" { 74 | description = "The name to use for the example VPC." 75 | type = string 76 | default = "alternat-example" 77 | } 78 | 79 | variable "enable_cloudwatch_agent" { 80 | description = "Whether to enable CloudWatch Agent on the NAT instances." 81 | type = bool 82 | default = true 83 | } 84 | -------------------------------------------------------------------------------- /cwagent.json.tftpl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | CWAGENT_CONFIG_DIR="/opt/aws/amazon-cloudwatch-agent/etc" 4 | CWAGENT_CONFIG_FILE="$CWAGENT_CONFIG_DIR/amazon-cloudwatch-agent.json" 5 | 6 | # Check if parent directory exists and is writable 7 | if [ ! -d $CWAGENT_CONFIG_DIR ] || [ ! 
-w $CWAGENT_CONFIG_DIR ]; then 8 | echo "Error: Directory $CWAGENT_CONFIG_DIR does not exist or is not writable" 9 | exit 1 10 | fi 11 | 12 | # Configuration that aims to report most metrics as provided for managed NAT Gateways. 13 | # 14 | # Adapted from the fck-nat project. All credit goes to its original contributors. 15 | # Original documentation: https://fck-nat.dev/develop/features/#metrics 16 | cat <<EOF > $CWAGENT_CONFIG_FILE 17 | { 18 | "agent": { 19 | "metrics_collection_interval": 60, 20 | "run_as_user": "root", 21 | "usage_data": false 22 | }, 23 | "metrics": { 24 | "namespace": "${cloudwatch_namespace}", 25 | "metrics_collected": { 26 | "net": { 27 | "resources": ${cloudwatch_interfaces}, 28 | "measurement": [ 29 | { "name": "bytes_recv", "rename": "BytesIn", "unit": "Bytes" }, 30 | { "name": "bytes_sent", "rename": "BytesOut", "unit": "Bytes" }, 31 | { "name": "packets_sent", "rename": "PacketsOutCount", "unit": "Count" }, 32 | { "name": "packets_recv", "rename": "PacketsInCount", "unit": "Count" }, 33 | { "name": "drop_in", "rename": "PacketsDropInCount", "unit": "Count" }, 34 | { "name": "drop_out", "rename": "PacketsDropOutCount", "unit": "Count" } 35 | ] 36 | }, 37 | "netstat": { 38 | "measurement": [ 39 | { "name": "tcp_syn_sent", "rename": "ConnectionAttemptOutCount", "unit": "Count" }, 40 | { "name": "tcp_syn_recv", "rename": "ConnectionAttemptInCount", "unit": "Count" }, 41 | { "name": "tcp_established", "rename": "ConnectionEstablishedCount", "unit": "Count" } 42 | ] 43 | }, 44 | "ethtool": { 45 | "interface_include": ${cloudwatch_interfaces}, 46 | "metrics_include": [ 47 | "bw_in_allowance_exceeded", 48 | "bw_out_allowance_exceeded", 49 | "conntrack_allowance_exceeded", 50 | "pps_allowance_exceeded" 51 | ] 52 | }, 53 | "mem": { 54 | "measurement": [ 55 | { "name": "used_percent", "rename": "MemoryUsed", "unit": "Percent" } 56 | ] 57 | } 58 | }, 59 | "append_dimensions": { 60 | "InstanceId": "\$${aws:InstanceId}", 61 | "AutoScalingGroupName": 
"\$${aws:AutoScalingGroupName}" 62 | } 63 | } 64 | } 65 | EOF 66 | 67 | systemctl restart amazon-cloudwatch-agent 68 | -------------------------------------------------------------------------------- /examples/main.tf: -------------------------------------------------------------------------------- 1 | data "aws_availability_zones" "available" {} 2 | 3 | locals { 4 | azs = slice(data.aws_availability_zones.available.names, 0, 2) 5 | } 6 | 7 | module "vpc" { 8 | source = "terraform-aws-modules/vpc/aws" 9 | version = "~> 4" 10 | 11 | name = var.vpc_name 12 | cidr = var.vpc_cidr 13 | secondary_cidr_blocks = [var.vpc_secondary_cidr] 14 | private_subnets = var.private_subnets 15 | public_subnets = var.public_subnets 16 | azs = local.azs 17 | enable_nat_gateway = var.enable_nat_gateway 18 | } 19 | 20 | resource "aws_subnet" "secondary_subnets" { 21 | count = length(var.vpc_secondary_subnets) 22 | vpc_id = module.vpc.vpc_id 23 | cidr_block = var.vpc_secondary_subnets[count.index] 24 | availability_zone = local.azs[count.index] 25 | } 26 | 27 | resource "aws_route_table_association" "secondary_subnets" { 28 | count = length(var.vpc_secondary_subnets) 29 | subnet_id = aws_subnet.secondary_subnets[count.index].id 30 | route_table_id = module.vpc.private_route_table_ids[count.index] 31 | } 32 | 33 | resource "aws_eip" "preallocated_eip" { 34 | tags = { 35 | "Name" = "alternat-example-explicit-eip" 36 | } 37 | } 38 | 39 | data "aws_subnet" "subnet" { 40 | count = length(module.vpc.private_subnets) 41 | id = module.vpc.private_subnets[count.index] 42 | } 43 | 44 | locals { 45 | vpc_az_maps = [ 46 | for index, rt in module.vpc.private_route_table_ids 47 | : { 48 | az = local.azs[index] 49 | route_table_ids = [rt] 50 | public_subnet_id = module.vpc.public_subnets[index] 51 | # The secondary subnets do not need to be included here. this data is 52 | # used for the connectivity test function and VPC endpoint which are 53 | # only needed in one subnet per zone. 
54 | private_subnet_ids = [module.vpc.private_subnets[index]] 55 | } 56 | ] 57 | } 58 | 59 | module "alternat" { 60 | # To use Alternat from the Terraform Registry: 61 | # source = "chime/alternat/aws" 62 | source = "./.." 63 | 64 | create_nat_gateways = var.create_nat_gateways 65 | ingress_security_group_cidr_blocks = var.private_subnets 66 | vpc_az_maps = local.vpc_az_maps 67 | vpc_id = module.vpc.vpc_id 68 | 69 | lambda_package_type = "Zip" 70 | 71 | nat_instance_type = var.alternat_instance_type 72 | nat_instance_key_name = var.nat_instance_key_name 73 | enable_nat_restore = var.enable_nat_restore 74 | enable_ssm = var.enable_ssm 75 | enable_cloudwatch_agent = var.enable_cloudwatch_agent 76 | fallback_ngw_eip_allocation_ids = { (local.azs[0]) = aws_eip.preallocated_eip.allocation_id } 77 | } 78 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, gender identity and expression, level of experience, 9 | nationality, personal appearance, race, religion, or sexual identity and 10 | orientation. 
11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. 
Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at support@chime.com. All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at [http://contributor-covenant.org/version/1/4][version] 72 | 73 | [homepage]: http://contributor-covenant.org 74 | [version]: http://contributor-covenant.org/version/1/4/ 75 | -------------------------------------------------------------------------------- /docs/0.2.0-migration-guide.md: -------------------------------------------------------------------------------- 1 | ## Migration guide for v0.2.0 2 | 3 | The v0.2.0 release changes the Terraform module inputs in order to associate a route table with a set of private and public subnets and an availability zone. Follow the steps below to migrate to this version. The migration can be completed without interrupting or terminating the NAT instances. 4 | 5 | The modules accept a new input called `var.vpc_az_maps`, a list of objects mapping route tables to their corresponding subnets and AZs: 6 | 7 | ``` 8 | variable "vpc_az_maps" { 9 | description = "A map of az to private route tables that the NAT instances will manage." 
10 | type = list(object({ 11 | az = string 12 | private_subnet_ids = list(string) 13 | public_subnet_id = string 14 | route_table_ids = list(string) 15 | })) 16 | } 17 | ``` 18 | 19 | Previously, using the alternat module with the open source [`terraform-aws-vpc` module](https://github.com/terraform-aws-modules/terraform-aws-vpc) looked something like this: 20 | 21 | ``` 22 | module "alternat" { 23 | source = "git@github.com:chime/terraform-aws-alternat.git//modules/terraform-aws-alternat?ref=v0.1.3" 24 | 25 | alternat_image_uri = "012345678901.dkr.ecr.us-west-2.amazonaws.com/alternat" 26 | alternat_image_tag = "v0.1.3" 27 | ingress_security_group_ids = [aws_security_group.client_sg.id] 28 | private_route_table_ids = module.vpc.private_route_table_ids 29 | vpc_private_subnet_ids = module.vpc.private_subnets 30 | vpc_public_subnet_ids = module.vpc.public_subnets 31 | vpc_id = module.vpc.vpc_id 32 | } 33 | ``` 34 | 35 | With `vpc_az_maps`, the module call now looks like: 36 | 37 | ``` 38 | data "aws_subnet" "subnet" { 39 | count = length(module.vpc.private_subnets) 40 | id = module.vpc.private_subnets[count.index] 41 | } 42 | 43 | locals { 44 | vpc_az_maps = [ 45 | for index, rt in module.vpc.private_route_table_ids 46 | : { 47 | az = data.aws_subnet.subnet[index].availability_zone 48 | route_table_ids = [rt] 49 | public_subnet_id = module.vpc.public_subnets[index] 50 | private_subnet_ids = [module.vpc.private_subnets[index]] 51 | } 52 | ] 53 | } 54 | 55 | module "alternat" { 56 | source = "git@github.com:chime/terraform-aws-alternat.git//modules/terraform-aws-alternat?ref=v0.2.0" 57 | 58 | alternat_image_uri = "188238883601.dkr.ecr.us-west-2.amazonaws.com/alternat" 59 | alternat_image_tag = "v0.2.0" 60 | ingress_security_group_ids = [aws_security_group.client_sg.id] 61 | vpc_az_maps = local.vpc_az_maps 62 | vpc_id = module.vpc.vpc_id 63 | } 64 | ``` 65 | 66 | In the code above, the `availability_zone` of the `aws_subnet` data source will not resolve until the VPC is 
created. Therefore, for new deployments where the VPC and alterNAT are being set up for the first time, Terraform will fail with an `Invalid for_each argument` error because "Terraform cannot predict how many instances will be created." To work around this, follow the advice in the error message and use the `-target` argument to create the VPC first. Users with existing VPCs and subnets are not affected. 67 | 68 | After making the above changes, running `terraform plan` will show some resources as changed and others as replaced. You can avoid replacing the critical resources (NAT gateways, EIPs, and Auto Scaling groups) by using `moved` blocks. For example, in the `us-west-2` region: 69 | 70 | ``` 71 | moved { 72 | from = module.alternat_instances.aws_nat_gateway.main[0] 73 | to = module.alternat_instances.aws_nat_gateway.main["us-west-2a"] 74 | } 75 | moved { 76 | from = module.alternat_instances.aws_eip.nat_gateway_eips[0] 77 | to = module.alternat_instances.aws_eip.nat_gateway_eips["us-west-2a"] 78 | } 79 | moved { 80 | from = module.alternat_instances.aws_autoscaling_group.nat_instance[0] 81 | to = module.alternat_instances.aws_autoscaling_group.nat_instance["us-west-2a"] 82 | } 83 | ``` 84 | 85 | You'll need to repeat the above blocks for each availability zone. The Lambda functions and CloudWatch event targets will still be replaced, but this will not cause any downtime for the NAT instances or NAT gateways. 
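
As a sketch of repeating the `moved` blocks for a second availability zone, the following assumes that index `1` of the old count-based resources corresponds to `us-west-2b`. That mapping is an assumption, not something the module guarantees; confirm it against your own state (for example with `terraform state list`) before applying.

```hcl
# Assumption: index 1 in the pre-0.2.0 state maps to us-west-2b.
# Adjust the index and AZ name to match your own deployment.
moved {
  from = module.alternat_instances.aws_nat_gateway.main[1]
  to   = module.alternat_instances.aws_nat_gateway.main["us-west-2b"]
}
moved {
  from = module.alternat_instances.aws_eip.nat_gateway_eips[1]
  to   = module.alternat_instances.aws_eip.nat_gateway_eips["us-west-2b"]
}
moved {
  from = module.alternat_instances.aws_autoscaling_group.nat_instance[1]
  to   = module.alternat_instances.aws_autoscaling_group.nat_instance["us-west-2b"]
}
```

If the indexes and zones line up, `terraform plan` should report these resources as moved instead of destroyed and recreated.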
86 | -------------------------------------------------------------------------------- /test/go.mod: -------------------------------------------------------------------------------- 1 | module github.com/chime/terraform-aws-alternat 2 | 3 | go 1.24.0 4 | 5 | require ( 6 | github.com/aws/aws-sdk-go-v2 v1.26.1 7 | github.com/aws/aws-sdk-go-v2/config v1.27.11 8 | github.com/aws/aws-sdk-go-v2/service/ec2 v1.160.0 9 | github.com/gruntwork-io/terratest v0.46.14 10 | github.com/stretchr/testify v1.9.0 11 | ) 12 | 13 | require ( 14 | cloud.google.com/go v0.110.0 // indirect 15 | cloud.google.com/go/compute/metadata v0.3.0 // indirect 16 | cloud.google.com/go/iam v0.13.0 // indirect 17 | cloud.google.com/go/storage v1.29.0 // indirect 18 | github.com/agext/levenshtein v1.2.3 // indirect 19 | github.com/apparentlymart/go-textseg/v13 v13.0.0 // indirect 20 | github.com/aws/aws-sdk-go v1.44.122 // indirect 21 | github.com/aws/aws-sdk-go-v2/credentials v1.17.11 // indirect 22 | github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.1 // indirect 23 | github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.5 // indirect 24 | github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.5 // indirect 25 | github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect 26 | github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.2 // indirect 27 | github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.7 // indirect 28 | github.com/aws/aws-sdk-go-v2/service/sso v1.20.5 // indirect 29 | github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.4 // indirect 30 | github.com/aws/aws-sdk-go-v2/service/sts v1.28.6 // indirect 31 | github.com/aws/smithy-go v1.20.2 // indirect 32 | github.com/bgentry/go-netrc v0.0.0-20140422174119-9fd32a8b3d3d // indirect 33 | github.com/boombuler/barcode v1.0.1 // indirect 34 | github.com/cpuguy83/go-md2man/v2 v2.0.0 // indirect 35 | github.com/davecgh/go-spew v1.1.1 // indirect 36 | github.com/emicklei/go-restful/v3 v3.9.0 // indirect 37 | 
github.com/go-errors/errors v1.0.2-0.20180813162953-d98b870cc4e0 // indirect 38 | github.com/go-logr/logr v1.2.4 // indirect 39 | github.com/go-openapi/jsonpointer v0.19.6 // indirect 40 | github.com/go-openapi/jsonreference v0.20.2 // indirect 41 | github.com/go-openapi/swag v0.22.3 // indirect 42 | github.com/go-sql-driver/mysql v1.4.1 // indirect 43 | github.com/gogo/protobuf v1.3.2 // indirect 44 | github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect 45 | github.com/golang/protobuf v1.5.3 // indirect 46 | github.com/google/gnostic-models v0.6.8 // indirect 47 | github.com/google/go-cmp v0.6.0 // indirect 48 | github.com/google/gofuzz v1.2.0 // indirect 49 | github.com/google/uuid v1.3.0 // indirect 50 | github.com/googleapis/enterprise-certificate-proxy v0.2.3 // indirect 51 | github.com/googleapis/gax-go/v2 v2.7.1 // indirect 52 | github.com/gruntwork-io/go-commons v0.8.0 // indirect 53 | github.com/hashicorp/errwrap v1.0.0 // indirect 54 | github.com/hashicorp/go-cleanhttp v0.5.2 // indirect 55 | github.com/hashicorp/go-getter v1.7.9 // indirect 56 | github.com/hashicorp/go-multierror v1.1.0 // indirect 57 | github.com/hashicorp/go-safetemp v1.0.0 // indirect 58 | github.com/hashicorp/go-version v1.6.0 // indirect 59 | github.com/hashicorp/hcl/v2 v2.9.1 // indirect 60 | github.com/hashicorp/terraform-json v0.13.0 // indirect 61 | github.com/imdario/mergo v0.3.11 // indirect 62 | github.com/jinzhu/copier v0.0.0-20190924061706-b57f9002281a // indirect 63 | github.com/jmespath/go-jmespath v0.4.0 // indirect 64 | github.com/josharian/intern v1.0.0 // indirect 65 | github.com/json-iterator/go v1.1.12 // indirect 66 | github.com/klauspost/compress v1.15.11 // indirect 67 | github.com/mailru/easyjson v0.7.7 // indirect 68 | github.com/mattn/go-zglob v0.0.2-0.20190814121620-e3c945676326 // indirect 69 | github.com/mitchellh/go-homedir v1.1.0 // indirect 70 | github.com/mitchellh/go-wordwrap v1.0.1 // indirect 71 | github.com/moby/spdystream 
v0.2.0 // indirect 72 | github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect 73 | github.com/modern-go/reflect2 v1.0.2 // indirect 74 | github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect 75 | github.com/pmezard/go-difflib v1.0.0 // indirect 76 | github.com/pquerna/otp v1.2.0 // indirect 77 | github.com/russross/blackfriday/v2 v2.1.0 // indirect 78 | github.com/spf13/pflag v1.0.5 // indirect 79 | github.com/tmccombs/hcl2json v0.3.3 // indirect 80 | github.com/ulikunitz/xz v0.5.14 // indirect 81 | github.com/urfave/cli v1.22.2 // indirect 82 | github.com/zclconf/go-cty v1.9.1 // indirect 83 | go.opencensus.io v0.24.0 // indirect 84 | golang.org/x/crypto v0.45.0 // indirect 85 | golang.org/x/net v0.47.0 // indirect 86 | golang.org/x/oauth2 v0.27.0 // indirect 87 | golang.org/x/sys v0.38.0 // indirect 88 | golang.org/x/term v0.37.0 // indirect 89 | golang.org/x/text v0.31.0 // indirect 90 | golang.org/x/time v0.3.0 // indirect 91 | golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect 92 | google.golang.org/api v0.114.0 // indirect 93 | google.golang.org/appengine v1.6.7 // indirect 94 | google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1 // indirect 95 | google.golang.org/grpc v1.56.3 // indirect 96 | google.golang.org/protobuf v1.33.0 // indirect 97 | gopkg.in/inf.v0 v0.9.1 // indirect 98 | gopkg.in/yaml.v2 v2.4.0 // indirect 99 | gopkg.in/yaml.v3 v3.0.1 // indirect 100 | k8s.io/api v0.28.4 // indirect 101 | k8s.io/apimachinery v0.28.4 // indirect 102 | k8s.io/client-go v0.28.4 // indirect 103 | k8s.io/klog/v2 v2.100.1 // indirect 104 | k8s.io/kube-openapi v0.0.0-20230717233707-2695361300d9 // indirect 105 | k8s.io/utils v0.0.0-20230406110748-d93618cff8a2 // indirect 106 | sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect 107 | sigs.k8s.io/structured-merge-diff/v4 v4.2.3 // indirect 108 | sigs.k8s.io/yaml v1.3.0 // indirect 109 | ) 110 | 
-------------------------------------------------------------------------------- /scripts/alternat.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Send output to a file and to the console 4 | # Credit to the alestic blog for this one-liner 5 | # https://alestic.com/2010/12/ec2-user-data-output/ 6 | exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1 7 | 8 | shopt -s expand_aliases 9 | 10 | panic() { 11 | [ -n "$1" ] && echo "$1" 12 | complete_asg_lifecycle_action ABANDON 13 | echo "alterNAT setup failed" 14 | exit 1 15 | } 16 | 17 | load_config() { 18 | if [ -f "$CONFIG_FILE" ]; then 19 | . "$CONFIG_FILE" 20 | else 21 | panic "Config file $CONFIG_FILE not found" 22 | fi 23 | validate_var "eip_allocation_ids_csv" "$eip_allocation_ids_csv" 24 | validate_var "route_table_ids_csv" "$route_table_ids_csv" 25 | validate_var "enable_ssm" "$enable_ssm" 26 | validate_var "enable_cloudwatch_agent" "$enable_cloudwatch_agent" 27 | } 28 | 29 | validate_var() { 30 | var_name="$1" 31 | var_val="$2" 32 | if [ -z "$var_val" ]; then 33 | echo "Config var \"$var_name\" is unset" 34 | exit 1 35 | fi 36 | } 37 | 38 | # configure_nat() sets up Linux to act as a NAT device. 39 | # See https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#NATInstance 40 | configure_nat() { 41 | $dnf_cmd install nftables 42 | systemctl enable --now nftables 43 | 44 | local nic_name="$(ip route show | grep default | sed -n 's/.*dev \([^\ ]*\).*/\1/p')" 45 | echo "Found interface name ${nic_name}" 46 | 47 | echo "Determining the MAC address on ${nic_name}" 48 | local nic_mac; nic_mac="$(cat /sys/class/net/${nic_name}/address)" || panic "Unable to determine MAC address on ${nic_name}." 49 | echo "Found MAC ${nic_mac} for ${nic_name}." 
50 | 51 | local vpc_cidr_uri="http://169.254.169.254/latest/meta-data/network/interfaces/macs/${nic_mac}/vpc-ipv4-cidr-blocks" 52 | echo "Metadata location for vpc ipv4 ranges: $vpc_cidr_uri" 53 | 54 | readarray -t vpc_cidrs <<< "$(CURL_WITH_TOKEN "$vpc_cidr_uri")" 55 | if [ -z "${vpc_cidrs[0]}" ]; then 56 | panic "Unable to obtain VPC CIDR range from metadata." 57 | else 58 | echo "Retrieved VPC CIDR range(s) ${vpc_cidrs[@]} from metadata." 59 | fi 60 | 61 | echo "Enabling NAT..." 62 | # Read more about these settings here: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt 63 | 64 | sysctl -q -w "net.ipv4.ip_forward"=1 "net.ipv4.conf.$nic_name.send_redirects"=0 "net.ipv4.ip_local_port_range"="1024 65535" || 65 | panic 66 | 67 | nft add table ip nat 68 | nft add chain ip nat postrouting { type nat hook postrouting priority 100 \; } 69 | 70 | for cidr in "${vpc_cidrs[@]}"; 71 | do 72 | nft add rule ip nat postrouting ip saddr "$cidr" oif "$nic_name" masquerade; nft_status=$? 73 | if [ "$nft_status" -ne 0 ]; then 74 | panic "Unable to add nft rule for cidr $cidr. nft exited with status $nft_status" 75 | fi 76 | done 77 | 78 | sysctl "net.ipv4.ip_forward" "net.ipv4.conf.${nic_name}.send_redirects" "net.ipv4.ip_local_port_range" 79 | nft list ruleset 80 | 81 | echo "NAT configuration complete" 82 | } 83 | 84 | # Disabling source/dest check is what makes a NAT instance a NAT instance. 85 | # See https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck 86 | disable_source_dest_check() { 87 | echo "Disabling source/destination check" 88 | aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --source-dest-check "{\"Value\": false}" 89 | if [ $? -ne 0 ]; then 90 | panic "Unable to disable source/dest check." 91 | fi 92 | echo "source/destination check disabled for $INSTANCE_ID" 93 | } 94 | 95 | # associate_eip() tries each provided EIP allocation id until it finds one that is not already associated. 
96 | function associate_eip() { 97 | echo "Associating an EIP from the pool of addresses" 98 | 99 | local associated_allocation_id="" 100 | local eip="" 101 | local num_retries=10 102 | local sleep_len=60 103 | 104 | IFS=',' read -r -a eip_allocation_ids <<< "${eip_allocation_ids_csv}" 105 | 106 | # Retry the allocation operation $num_retries times with a $sleep_len wait between retries. 107 | # This is to handle any delays in releasing an EIP allocation during instance termination. 108 | for n in $(seq 1 "$num_retries"); do 109 | for eip_allocation_id in "${eip_allocation_ids[@]}" 110 | do 111 | eip=$(aws ec2 describe-addresses --allocation-ids "$eip_allocation_id" --query 'Addresses[0].PublicIp' | tr -d '"') 112 | echo "Trying IP $eip" 113 | aws ec2 associate-address --no-allow-reassociation --allocation-id "$eip_allocation_id" --instance-id "$INSTANCE_ID" 114 | if [ $? -eq 0 ]; then 115 | break 116 | fi 117 | echo "Failed to associate IP $eip" 118 | eip="" 119 | done 120 | if [ ! -z "$eip" ]; then 121 | break 122 | else 123 | echo "Unable to associate an EIP ($n of $num_retries attempts)." 124 | sleep "$sleep_len" 125 | fi 126 | done 127 | 128 | if [ -z "$eip" ]; then 129 | panic "Unable to associate an EIP!" 130 | fi 131 | 132 | echo "Associated EIP $eip with instance $INSTANCE_ID"; 133 | } 134 | 135 | # First try to replace an existing route 136 | # If no route exists already (e.g. first time set up) then create the route. 
137 | configure_route_table() { 138 | echo "Configuring route tables" 139 | 140 | IFS=',' read -r -a route_table_ids <<< "${route_table_ids_csv}" 141 | 142 | for route_table_id in "${route_table_ids[@]}" 143 | do 144 | echo "Attempting to find route table $route_table_id" 145 | local rtb_id=$(aws ec2 describe-route-tables --filters "Name=route-table-id,Values=${route_table_id}" --query 'RouteTables[0].RouteTableId' | tr -d '"') 146 | if [ -z "$rtb_id" ] || [ "$rtb_id" = "None" ]; then 147 | panic "Unable to find route table $route_table_id" 148 | fi 149 | 150 | echo "Found route table $rtb_id" 151 | echo "Replacing route to 0.0.0.0/0 for $rtb_id" 152 | aws ec2 replace-route --route-table-id "$rtb_id" --instance-id "$INSTANCE_ID" --destination-cidr-block 0.0.0.0/0 153 | if [ $? -eq 0 ]; then 154 | echo "Successfully replaced route to 0.0.0.0/0 via instance $INSTANCE_ID for route table $rtb_id" 155 | continue 156 | fi 157 | 158 | echo "Unable to replace route. Attempting to create route" 159 | aws ec2 create-route --route-table-id "$rtb_id" --instance-id "$INSTANCE_ID" --destination-cidr-block 0.0.0.0/0 160 | if [ $? -eq 0 ]; then 161 | echo "Successfully created route to 0.0.0.0/0 via instance $INSTANCE_ID for route table $rtb_id" 162 | else 163 | panic "Unable to replace or create the route!" 164 | fi 165 | done 166 | } 167 | 168 | # install_ssm_agent() installs the SSM agent if enable_ssm is true. 169 | install_ssm_agent() { 170 | if [ "$enable_ssm" = "true" ]; then 171 | echo "Installing SSM agent" 172 | $dnf_cmd install amazon-ssm-agent && \ 173 | systemctl enable --now amazon-ssm-agent 174 | if [ $? -ne 0 ]; then 175 | panic "Unable to install SSM agent" 176 | fi 177 | echo "SSM agent installed successfully" 178 | fi 179 | } 180 | 181 | # install_cloudwatch_agent() installs the CloudWatch Agent if enable_cloudwatch_agent is true. 
182 | install_cloudwatch_agent() { 183 | if [ "$enable_cloudwatch_agent" = "true" ]; then 184 | echo "Installing CloudWatch agent" 185 | $dnf_cmd install amazon-cloudwatch-agent && \ 186 | systemctl enable --now amazon-cloudwatch-agent 187 | if [ $? -ne 0 ]; then 188 | panic "Unable to install CloudWatch Agent" 189 | fi 190 | echo "CloudWatch Agent installed successfully" 191 | fi 192 | } 193 | 194 | ASG_LIFECYCLE_HOOK_NAME="NATInstanceLaunchScript" 195 | complete_asg_lifecycle_action() { 196 | if [[ -z "$1" ]]; then 197 | echo "No lifecycle action result given" 198 | fi 199 | 200 | local auto_scaling_group_name 201 | auto_scaling_group_name="$(ec2-metadata --quiet --tags | grep 'aws:autoscaling:groupName' | awk '{print $2}')" 202 | if [[ -z "${auto_scaling_group_name}" ]]; then 203 | echo "Could not detect auto scaling group name" 204 | fi 205 | 206 | local output status 207 | output="$(aws autoscaling complete-lifecycle-action \ 208 | --lifecycle-hook-name "${ASG_LIFECYCLE_HOOK_NAME}" \ 209 | --auto-scaling-group-name "${auto_scaling_group_name}" \ 210 | --lifecycle-action-result "$1" \ 211 | --instance-id "${INSTANCE_ID}" 2>&1)" 212 | status=$? 
213 | if [[ $status -ne 0 ]]; then 214 | if grep -q "No active Lifecycle Action found" <<< "${output}"; then 215 | echo "Ignoring missing ASG lifecycle action" 216 | else 217 | echo "${output}" 218 | echo "Failed to complete ASG lifecycle action" 219 | fi 220 | fi 221 | 222 | echo "Completed ASG lifecycle action with result $1" 223 | } 224 | 225 | curl_cmd="curl --silent --fail" 226 | dnf_cmd="dnf --quiet --assumeyes" 227 | 228 | echo "Requesting IMDSv2 token" 229 | token=$($curl_cmd -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 900") 230 | alias CURL_WITH_TOKEN="$curl_cmd -H \"X-aws-ec2-metadata-token: $token\"" 231 | 232 | # Set CLI Output to text 233 | export AWS_DEFAULT_OUTPUT="text" 234 | 235 | # Disable pager output 236 | # https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-pagination.html#cli-usage-pagination-clientside 237 | export AWS_PAGER="" 238 | 239 | # Set Instance Identity URI 240 | II_URI="http://169.254.169.254/latest/dynamic/instance-identity/document" 241 | 242 | # Retrieve the instance ID 243 | INSTANCE_ID=$(CURL_WITH_TOKEN $II_URI | grep instanceId | awk -F\" '{print $4}') 244 | 245 | # Set region of NAT instance 246 | export AWS_DEFAULT_REGION=$(CURL_WITH_TOKEN $II_URI | grep region | awk -F\" '{print $4}') 247 | 248 | # alterNAT config file containing inputs needed for initialization 249 | CONFIG_FILE="/etc/alternat.conf" 250 | load_config 251 | 252 | echo "Beginning self-managed NAT configuration" 253 | install_ssm_agent 254 | install_cloudwatch_agent 255 | configure_nat 256 | disable_source_dest_check 257 | associate_eip 258 | configure_route_table 259 | complete_asg_lifecycle_action CONTINUE 260 | echo "Configuration completed successfully!" 
261 | -------------------------------------------------------------------------------- /lambda.tf: -------------------------------------------------------------------------------- 1 | locals { 2 | autoscaling_func_env_vars = { 3 | # Lambda function env vars cannot contain hyphens 4 | for obj in var.vpc_az_maps 5 | : replace(upper(obj.az), "-", "_") => join(",", obj.route_table_ids) 6 | } 7 | has_ipv6_env_var = { "HAS_IPV6" = var.lambda_has_ipv6 } 8 | lambda_runtime = "python3.12" 9 | } 10 | 11 | resource "archive_file" "lambda" { 12 | count = var.lambda_package_type == "Zip" ? 1 : 0 13 | type = "zip" 14 | source_dir = "${path.module}/functions/replace-route" 15 | excludes = ["__pycache__"] 16 | output_path = var.lambda_zip_path 17 | } 18 | 19 | # Lambda function for Auto Scaling Group Lifecycle Hook 20 | resource "aws_lambda_function" "alternat_autoscaling_hook" { 21 | function_name = var.autoscaling_hook_function_name 22 | architectures = var.lambda_function_architectures 23 | package_type = var.lambda_package_type 24 | memory_size = var.lambda_memory_size 25 | timeout = var.lambda_timeout 26 | role = aws_iam_role.nat_lambda_role.arn 27 | 28 | layers = var.lambda_layer_arns 29 | 30 | image_uri = var.lambda_package_type == "Image" ? "${var.alternat_image_uri}:${var.alternat_image_tag}" : null 31 | 32 | runtime = var.lambda_package_type == "Zip" ? local.lambda_runtime : null 33 | handler = var.lambda_package_type == "Zip" ? var.lambda_handlers.alternat_autoscaling_hook : null 34 | filename = var.lambda_package_type == "Zip" ? archive_file.lambda[0].output_path : null 35 | source_code_hash = var.lambda_package_type == "Zip" ? 
archive_file.lambda[0].output_base64sha256 : null 36 | 37 | environment { 38 | variables = merge( 39 | local.autoscaling_func_env_vars, 40 | { NAT_GATEWAY_ID = var.nat_gateway_id }, 41 | var.lambda_environment_variables, 42 | ) 43 | } 44 | 45 | tags = merge({ 46 | FunctionName = "alternat-autoscaling-lifecycle-hook", 47 | }, var.tags) 48 | } 49 | 50 | resource "aws_iam_role" "nat_lambda_role" { 51 | name = var.nat_lambda_function_role_name == "" ? null : var.nat_lambda_function_role_name 52 | name_prefix = var.nat_lambda_function_role_name == "" ? "alternat-lambda-role-" : null 53 | assume_role_policy = data.aws_iam_policy_document.nat_lambda_policy.json 54 | tags = var.tags 55 | } 56 | 57 | resource "aws_iam_role_policy_attachment" "nat_lambda_basic_execution_role_attachment" { 58 | role = aws_iam_role.nat_lambda_role.name 59 | policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole" 60 | } 61 | 62 | data "aws_iam_policy_document" "alternat_lambda_permissions" { 63 | statement { 64 | sid = "alterNATDescribePermissions" 65 | effect = "Allow" 66 | actions = [ 67 | "ec2:DescribeNatGateways", 68 | "ec2:DescribeRouteTables", 69 | "ec2:DescribeSubnets", 70 | ] 71 | resources = ["*"] 72 | } 73 | 74 | statement { 75 | sid = "alterNATDescribeASG" 76 | effect = "Allow" 77 | actions = [ 78 | "autoscaling:DescribeAutoScalingGroups" 79 | ] 80 | resources = ["*"] 81 | } 82 | 83 | statement { 84 | sid = "alterNATModifyRoutePermissions" 85 | effect = "Allow" 86 | actions = [ 87 | "ec2:ReplaceRoute" 88 | ] 89 | resources = [ 90 | for route_table in local.all_route_tables 91 | : "arn:aws:ec2:${data.aws_region.current.id}:${data.aws_caller_identity.current.id}:route-table/${route_table}" 92 | ] 93 | } 94 | 95 | statement { 96 | sid = "alterNATASGLifecyclePermissions" 97 | effect = "Allow" 98 | actions = [ 99 | "autoscaling:CompleteLifecycleAction", 100 | ] 101 | resources = [ 102 | 
"arn:aws:autoscaling:${data.aws_region.current.id}:${data.aws_caller_identity.current.account_id}:autoScalingGroup:*:autoScalingGroupName/${var.nat_instance_name_prefix}*", 103 | ] 104 | } 105 | } 106 | 107 | data "aws_iam_policy_document" "nat_lambda_policy" { 108 | statement { 109 | actions = ["sts:AssumeRole"] 110 | principals { 111 | type = "Service" 112 | identifiers = ["lambda.amazonaws.com"] 113 | } 114 | effect = "Allow" 115 | } 116 | } 117 | 118 | resource "aws_iam_role_policy" "alternat_lambda_permissions" { 119 | name = "alternat-lambda-permissions-policy" 120 | policy = data.aws_iam_policy_document.alternat_lambda_permissions.json 121 | role = aws_iam_role.nat_lambda_role.name 122 | } 123 | 124 | resource "aws_lambda_permission" "sns_topic_to_alternat_lambda" { 125 | statement_id = "AllowExecutionFromSNS" 126 | action = "lambda:InvokeFunction" 127 | function_name = aws_lambda_function.alternat_autoscaling_hook.function_name 128 | principal = "sns.amazonaws.com" 129 | source_arn = aws_sns_topic.alternat_topic.arn 130 | } 131 | 132 | resource "aws_sns_topic_subscription" "nat_lambda_topic_subscription" { 133 | topic_arn = aws_sns_topic.alternat_topic.arn 134 | protocol = "lambda" 135 | endpoint = aws_lambda_function.alternat_autoscaling_hook.arn 136 | } 137 | 138 | # Lambda function for monitoring connectivity through the NAT instance 139 | resource "aws_lambda_function" "alternat_connectivity_tester" { 140 | for_each = { for obj in var.vpc_az_maps : obj.az => obj } 141 | 142 | function_name = "${var.connectivity_tester_function_name}-${each.key}" 143 | architectures = var.lambda_function_architectures 144 | package_type = var.lambda_package_type 145 | memory_size = var.lambda_memory_size 146 | timeout = var.lambda_timeout 147 | role = aws_iam_role.nat_lambda_role.arn 148 | 149 | layers = var.lambda_layer_arns 150 | 151 | image_uri = var.lambda_package_type == "Image" ? 
"${var.alternat_image_uri}:${var.alternat_image_tag}" : null 152 | 153 | runtime = var.lambda_package_type == "Zip" ? local.lambda_runtime : null 154 | handler = var.lambda_package_type == "Zip" ? var.lambda_handlers.connectivity_tester : null 155 | filename = var.lambda_package_type == "Zip" ? archive_file.lambda[0].output_path : null 156 | source_code_hash = var.lambda_package_type == "Zip" ? archive_file.lambda[0].output_base64sha256 : null 157 | 158 | dynamic "image_config" { 159 | for_each = var.lambda_package_type == "Image" ? [var.lambda_handlers.connectivity_tester] : [] 160 | 161 | content { 162 | command = [image_config.value] 163 | } 164 | } 165 | 166 | environment { 167 | variables = merge( 168 | { 169 | ROUTE_TABLE_IDS_CSV = join(",", each.value.route_table_ids), 170 | PUBLIC_SUBNET_ID = each.value.public_subnet_id 171 | CHECK_URLS = join(",", var.connectivity_test_check_urls) 172 | NAT_GATEWAY_ID = var.nat_gateway_id 173 | NAT_ASG_NAME = aws_autoscaling_group.nat_instance[each.key].name 174 | ENABLE_NAT_RESTORE = var.enable_nat_restore 175 | }, 176 | local.has_ipv6_env_var, 177 | var.lambda_environment_variables, 178 | ) 179 | } 180 | 181 | vpc_config { 182 | subnet_ids = each.value.private_subnet_ids 183 | security_group_ids = [aws_security_group.nat_lambda.id] 184 | } 185 | 186 | tags = merge({ 187 | FunctionName = "alternat-connectivity-tester-${each.key}", 188 | }, var.tags) 189 | } 190 | 191 | resource "aws_security_group" "nat_lambda" { 192 | name_prefix = "alternat-lambda" 193 | vpc_id = var.vpc_id 194 | tags = var.tags 195 | } 196 | 197 | resource "aws_security_group_rule" "nat_lambda_egress" { 198 | type = "egress" 199 | protocol = "-1" 200 | from_port = 0 201 | to_port = 0 202 | cidr_blocks = ["0.0.0.0/0"] 203 | security_group_id = aws_security_group.nat_lambda.id 204 | } 205 | 206 | resource "aws_cloudwatch_event_rule" "every_minute" { 207 | name = var.connectivity_test_event_rule_name 208 | description = "Fires every minute" 209 | 
schedule_expression = "rate(1 minute)" 210 | tags = var.tags 211 | } 212 | 213 | resource "aws_cloudwatch_event_target" "test_connection_every_minute" { 214 | for_each = { for obj in var.vpc_az_maps : obj.az => obj } 215 | 216 | rule = aws_cloudwatch_event_rule.every_minute.name 217 | target_id = "connectivity-tester-${each.key}" 218 | arn = aws_lambda_function.alternat_connectivity_tester[each.key].arn 219 | } 220 | 221 | resource "aws_lambda_permission" "allow_cloudwatch_to_call_connectivity_tester" { 222 | for_each = { for obj in var.vpc_az_maps : obj.az => obj } 223 | 224 | statement_id = "AllowExecutionFromCloudWatch" 225 | action = "lambda:InvokeFunction" 226 | function_name = aws_lambda_function.alternat_connectivity_tester[each.key].function_name 227 | principal = "events.amazonaws.com" 228 | source_arn = aws_cloudwatch_event_rule.every_minute.arn 229 | } 230 | 231 | data "aws_iam_policy_document" "lambda_ssm_send_command_document" { 232 | statement { 233 | sid = "AllowSSMSendCommandOnDocument" 234 | effect = "Allow" 235 | 236 | actions = [ 237 | "ssm:SendCommand", 238 | ] 239 | 240 | resources = [ 241 | "arn:aws:ssm:${data.aws_region.current.id}::document/AWS-RunShellScript", 242 | ] 243 | } 244 | statement { 245 | sid = "AllowSSMSendCommandOnInstances" 246 | effect = "Allow" 247 | 248 | actions = [ 249 | "ssm:SendCommand", 250 | ] 251 | 252 | resources = [ 253 | "arn:aws:ec2:${data.aws_region.current.id}:${data.aws_caller_identity.current.id}:instance/*" 254 | ] 255 | condition { 256 | test = "StringEquals" 257 | variable = "aws:ResourceTag/alterNATInstance" 258 | values = [ 259 | "true" 260 | ] 261 | } 262 | } 263 | statement { 264 | sid = "AllowSSMAndEC2ReadOps" 265 | effect = "Allow" 266 | 267 | actions = [ 268 | "ssm:GetCommandInvocation", 269 | "ec2:DescribeInstances" 270 | ] 271 | resources = ["*"] 272 | } 273 | } 274 | 275 | resource "aws_iam_role_policy" "lambda_ssm_send_command_policy" { 276 | count = var.enable_nat_restore ? 
1 : 0 277 | name = "AllowLambdaToSendSSMCommand" 278 | role = aws_iam_role.nat_lambda_role.id 279 | policy = data.aws_iam_policy_document.lambda_ssm_send_command_document.json 280 | } 281 | -------------------------------------------------------------------------------- /variables.tf: -------------------------------------------------------------------------------- 1 | variable "additional_instance_policies" { 2 | description = "Additional policies for the Alternat instance IAM role." 3 | type = list(object({ 4 | policy_name = string 5 | policy_json = string 6 | })) 7 | default = [] 8 | } 9 | 10 | variable "alternat_image_tag" { 11 | description = "The tag of the container image for the Alternat Lambda functions." 12 | type = string 13 | default = "latest" 14 | } 15 | 16 | variable "alternat_image_uri" { 17 | description = "The URI of the container image for the Alternat Lambda functions." 18 | type = string 19 | default = "" 20 | } 21 | 22 | variable "architecture" { 23 | description = "Architecture of the NAT instance image. Must be compatible with nat_instance_type." 24 | type = string 25 | default = "arm64" 26 | } 27 | 28 | variable "autoscaling_hook_function_name" { 29 | description = "The name to use for the autoscaling hook Lambda function." 30 | type = string 31 | default = "alternat-autoscaling-hook" 32 | } 33 | 34 | variable "create_nat_gateways" { 35 | description = "Whether to create the NAT Gateway and the NAT Gateway EIP in this module. If false, you must create and manage NAT Gateways separately." 36 | type = bool 37 | default = true 38 | } 39 | 40 | variable "connectivity_test_check_urls" { 41 | description = "List of URLs to check with the connectivity tester function." 42 | type = list(string) 43 | default = ["https://www.example.com", "https://www.google.com"] 44 | } 45 | 46 | variable "connectivity_test_event_rule_name" { 47 | description = "The name to use for the event rule that invokes the connectivity test Lambda function." 
48 | type = string 49 | default = "alternat-test-every-minute" 50 | } 51 | 52 | variable "connectivity_tester_function_name" { 53 | description = "The prefix to use for the name of the connectivity tester Lambda function. Because one function is created per availability zone, the name is suffixed with the AZ name." 54 | type = string 55 | default = "alternat-connectivity-tester" 56 | } 57 | 58 | variable "enable_ec2_endpoint" { 59 | description = "Whether to create a VPC endpoint to EC2 for Internet Connectivity testing." 60 | type = bool 61 | default = true 62 | } 63 | 64 | variable "enable_ssm" { 65 | description = "Whether to enable SSM on the Alternat instances." 66 | type = bool 67 | default = true 68 | } 69 | 70 | variable "enable_nat_restore" { 71 | description = "Whether to enable NAT restore functionality." 72 | type = bool 73 | default = false 74 | } 75 | 76 | variable "ingress_security_group_ids" { 77 | description = "A list of security group IDs that are allowed by the NAT instance." 78 | type = list(string) 79 | default = [] 80 | } 81 | 82 | variable "ingress_security_group_cidr_blocks" { 83 | description = "A list of CIDR blocks that are allowed by the NAT instance." 84 | type = list(string) 85 | default = [] 86 | } 87 | 88 | variable "ingress_security_group_ipv6_cidr_blocks" { 89 | description = "A list of IPv6 CIDR blocks that are allowed by the NAT instance." 90 | type = list(string) 91 | default = [] 92 | } 93 | 94 | variable "lifecycle_heartbeat_timeout" { 95 | description = "The length of time, in seconds, that autoscaled NAT instances should wait in the terminate state before being fully terminated." 96 | type = number 97 | default = 180 98 | } 99 | 100 | variable "max_instance_lifetime" { 101 | description = "Max instance life in seconds. Defaults to 14 days. Set to 0 to disable." 102 | type = number 103 | default = 1209600 104 | } 105 | 106 | variable "nat_ami" { 107 | description = "The AMI to use for the NAT instance. 
Defaults to the latest Amazon Linux 2 AMI." 108 | type = string 109 | default = "" 110 | } 111 | 112 | variable "nat_instance_block_devices" { 113 | description = "Optional custom EBS volume settings for the NAT instance." 114 | type = any 115 | default = {} 116 | } 117 | 118 | variable "nat_instance_iam_profile_name" { 119 | description = "Name to use for the IAM profile used by the NAT instance. Must be globally unique in this AWS account. Defaults to alternat-instance- as a prefix." 120 | type = string 121 | default = "" 122 | } 123 | 124 | variable "nat_instance_iam_role_name" { 125 | description = "Name to use for the IAM role used by the NAT instance. Must be globally unique in this AWS account. Defaults to alternat-instance- as a prefix." 126 | type = string 127 | default = "" 128 | } 129 | 130 | variable "nat_instance_key_name" { 131 | description = "The name of the key pair to use for the NAT instance. This is primarily used for testing." 132 | type = string 133 | default = "" 134 | } 135 | 136 | variable "nat_instance_lifecycle_hook_role_name" { 137 | description = "Name to use for the IAM role used by the NAT instance lifecycle hook. Must be globally unique in this AWS account. Defaults to alternat-lifecycle-hook as a prefix." 138 | type = string 139 | default = "" 140 | } 141 | 142 | variable "nat_instance_name_prefix" { 143 | description = "Prefix for the NAT Auto Scaling Group and instance names. Because there is an instance created in each ASG, the name will be suffixed with an index." 144 | type = string 145 | default = "alternat-" 146 | } 147 | 148 | variable "nat_instance_sg_name_prefix" { 149 | description = "Prefix for the NAT instance security group name." 150 | type = string 151 | default = "alternat-instance" 152 | } 153 | 154 | variable "nat_lambda_function_role_name" { 155 | description = "Name to use for the IAM role used by the replace-route Lambda function. Must be globally unique in this AWS account." 
156 | type = string 157 | default = "" 158 | } 159 | 160 | variable "nat_instance_type" { 161 | description = "Instance type to use for NAT instances." 162 | type = string 163 | default = "c6gn.8xlarge" 164 | } 165 | 166 | variable "nat_instance_eip_ids" { 167 | description = <<-EOT 168 | Allocation IDs of Elastic IPs to associate with the NAT instances. If not specified, EIPs will be created. 169 | 170 | Note: if the number of EIPs does not match the number of subnets specified in `vpc_public_subnet_ids`, this variable will be ignored. 171 | EOT 172 | type = list(string) 173 | default = [] 174 | } 175 | 176 | variable "fallback_ngw_eip_allocation_ids" { 177 | type = map(string) 178 | default = {} 179 | description = "Explicitly specified allocation_ids for fallback NAT Gateway by AZ (e.g., { eu-west-1a = \"eipalloc-0123456789abcdef0\" }). If specified for an AZ, the EIP will not be created automatically." 180 | validation { 181 | condition = alltrue([for v in values(var.fallback_ngw_eip_allocation_ids) : can(regex("^eipalloc-[0-9a-f]+$", v))]) 182 | error_message = "Each allocation_id must be in the format eipalloc-xxxxxxxxxxxxxxx (hex)." 183 | } 184 | } 185 | 186 | variable "nat_instance_user_data_pre_install" { 187 | description = "Pre-install shell script to run at boot before configuring alternat." 188 | type = string 189 | default = "" 190 | } 191 | 192 | variable "nat_instance_user_data_post_install" { 193 | description = "Post-install shell script to run at boot after configuring alternat." 194 | type = string 195 | default = "" 196 | } 197 | 198 | variable "prevent_destroy_eips" { 199 | description = "Prevents accidental destruction of EIPs by setting `prevent_destroy=true`" 200 | type = bool 201 | default = false 202 | } 203 | 204 | variable "tags" { 205 | description = "A map of tags to add to all supported resources managed by the module." 
206 | type = map(string) 207 | default = {} 208 | } 209 | 210 | variable "vpc_az_maps" { 211 | description = "A map of az to private route tables that the NAT instances will manage." 212 | type = list(object({ 213 | az = string 214 | private_subnet_ids = list(string) 215 | public_subnet_id = string 216 | route_table_ids = list(string) 217 | })) 218 | } 219 | 220 | variable "nat_gateway_id" { 221 | description = "NAT Gateway ID to use for fallback. If not provided, the gateway in the same subnet as relevant NAT instance is selected." 222 | type = string 223 | default = "" 224 | } 225 | 226 | variable "vpc_id" { 227 | description = "The ID of the VPC." 228 | type = string 229 | } 230 | 231 | variable "lambda_package_type" { 232 | description = "The lambda deployment package type. Valid values are \"Zip\" and \"Image\". Defaults to \"Image\"." 233 | type = string 234 | default = "Image" 235 | nullable = false 236 | 237 | validation { 238 | condition = contains(["Zip", "Image"], var.lambda_package_type) 239 | error_message = "Must be a supported package type: \"Zip\" or \"Image\"." 240 | } 241 | } 242 | 243 | variable "lambda_memory_size" { 244 | description = "Amount of memory in MB your Lambda Function can use at runtime. Defaults to 256." 245 | type = number 246 | default = 256 247 | } 248 | 249 | variable "lambda_timeout" { 250 | description = "Amount of time your Lambda Function has to run in seconds. Defaults to 300." 251 | type = number 252 | default = 300 253 | } 254 | 255 | variable "lambda_handlers" { 256 | description = "Lambda handlers." 257 | type = object({ 258 | connectivity_tester = string, 259 | alternat_autoscaling_hook = string, 260 | }) 261 | default = { 262 | connectivity_tester = "app.connectivity_test_handler", 263 | alternat_autoscaling_hook = "app.handler" 264 | } 265 | } 266 | 267 | variable "lambda_environment_variables" { 268 | description = "Environment variables to be provided to the lambda function." 
269 | type = map(string) 270 | default = null 271 | } 272 | 273 | variable "lambda_has_ipv6" { 274 | description = "Controls whether or not the lambda function can use IPv6." 275 | type = bool 276 | default = true 277 | } 278 | 279 | variable "lambda_zip_path" { 280 | description = "The location where the generated zip file should be stored. Required when `lambda_package_type` is \"Zip\"." 281 | type = string 282 | default = "/tmp/alternat-lambda.zip" 283 | } 284 | 285 | variable "lambda_function_architectures" { 286 | description = "CPU architecture(s) to use for the lambda functions." 287 | type = list(string) 288 | default = ["x86_64"] 289 | } 290 | 291 | variable "lambda_layer_arns" { 292 | type = list(string) 293 | description = "List of Lambda layer ARNs to attach to the functions." 294 | default = null 295 | } 296 | 297 | variable "enable_cloudwatch_agent" { 298 | description = "Whether to enable CloudWatch Agent on the NAT instances." 299 | type = bool 300 | default = false 301 | } 302 | 303 | variable "cloudwatch_namespace" { 304 | description = "The name of the CloudWatch namespace for the CloudWatch Agent." 305 | type = string 306 | default = "alterNAT" 307 | } 308 | 309 | variable "cloudwatch_interfaces" { 310 | description = "List of NAT instance interfaces that should be monitored by the CloudWatch Agent." 311 | type = list(string) 312 | default = ["ens5", "ens6"] 313 | } 314 | 315 | variable "enable_launch_script_lifecycle_hook" { 316 | description = "Whether to enable the ASG lifecycle hook for the launch script." 
317 | type = bool 318 | default = false 319 | } 320 | -------------------------------------------------------------------------------- /functions/replace-route/app.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import logging 4 | import time 5 | import urllib.request 6 | import socket 7 | 8 | import botocore 9 | import boto3 10 | 11 | 12 | logger = logging.getLogger() 13 | logger.setLevel(logging.INFO) 14 | logging.getLogger('boto3').setLevel(logging.CRITICAL) 15 | logging.getLogger('botocore').setLevel(logging.CRITICAL) 16 | 17 | 18 | ec2_client = boto3.client("ec2") 19 | 20 | LIFECYCLE_HOOK_NAME_KEY = "LifecycleHookName" 21 | AUTO_SCALING_GROUP_NAME_KEY = "AutoScalingGroupName" 22 | LIFECYCLE_ACTION_TOKEN_KEY = "LifecycleActionToken" 23 | 24 | # Checks every CONNECTIVITY_CHECK_INTERVAL seconds, exits after 1 minute 25 | DEFAULT_CONNECTIVITY_CHECK_INTERVAL = "5" 26 | 27 | # Which URLs to check for connectivity 28 | DEFAULT_CHECK_URLS = ["https://www.example.com", "https://www.google.com"] 29 | 30 | # The timeout, in seconds, for the connectivity checks. 31 | REQUEST_TIMEOUT = 5 32 | 33 | # Time in seconds to wait for SSM to start a command. 34 | SSM_TIMEOUT_SECONDS = 30 35 | 36 | # Whether to return to the NAT instance when the NAT gateway is in use. 37 | DEFAULT_ENABLE_NAT_RESTORE = False 38 | 39 | # Whether to use IPv6.
40 | DEFAULT_HAS_IPV6 = True 41 | 42 | 43 | # Overrides socket.getaddrinfo to perform IPv4 lookups 44 | # See https://github.com/chime/terraform-aws-alternat/issues/87 45 | def disable_ipv6(): 46 | prv_getaddrinfo = socket.getaddrinfo 47 | def getaddrinfo_ipv4(*args): 48 | modified_args = (args[0], args[1], socket.AF_INET) + args[3:] 49 | res = prv_getaddrinfo(*modified_args) 50 | return res 51 | socket.getaddrinfo = getaddrinfo_ipv4 52 | 53 | 54 | def get_az_and_vpc_zone_identifier(auto_scaling_group): 55 | autoscaling = boto3.client("autoscaling") 56 | 57 | try: 58 | asg_objects = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[auto_scaling_group]) 59 | except botocore.exceptions.ClientError as error: 60 | logger.error("Unable to describe autoscaling groups") 61 | raise error 62 | 63 | if asg_objects["AutoScalingGroups"] and len(asg_objects["AutoScalingGroups"]) > 0: 64 | asg = asg_objects["AutoScalingGroups"][0] 65 | logger.debug("Auto Scaling Group: %s", asg) 66 | 67 | availability_zone = asg["AvailabilityZones"][0] 68 | logger.debug("Availability Zone: %s", availability_zone) 69 | 70 | vpc_zone_identifier = asg["VPCZoneIdentifier"] 71 | logger.debug("VPC zone identifier: %s", vpc_zone_identifier) 72 | 73 | return availability_zone, vpc_zone_identifier 74 | 75 | raise MissingVPCZoneIdentifierError(asg_objects) 76 | 77 | 78 | def get_vpc_id(route_table): 79 | try: 80 | route_tables = ec2_client.describe_route_tables(RouteTableIds=[route_table]) 81 | except botocore.exceptions.ClientError as error: 82 | logger.error("Unable to get vpc id") 83 | raise error 84 | if "RouteTables" in route_tables and len(route_tables["RouteTables"]) == 1: 85 | vpc_id = route_tables["RouteTables"][0]["VpcId"] 86 | logger.debug("VPC ID: %s", vpc_id) 87 | return vpc_id 88 | 89 | 90 | def get_nat_gateway_id(vpc_id, subnet_id): 91 | nat_gateway_id = os.getenv("NAT_GATEWAY_ID") 92 | if nat_gateway_id: 93 | logger.info("Using NAT_GATEWAY_ID env. 
variable (%s)", nat_gateway_id) 94 | return nat_gateway_id 95 | 96 | try: 97 | nat_gateways = ec2_client.describe_nat_gateways( 98 | Filters=[ 99 | { 100 | "Name": "vpc-id", 101 | "Values": [vpc_id] 102 | }, 103 | { 104 | "Name": "subnet-id", 105 | "Values": [subnet_id] 106 | }, 107 | { 108 | "Name": "state", 109 | "Values": ["available"] 110 | } 111 | ] 112 | ) 113 | except botocore.exceptions.ClientError as error: 114 | logger.error("Unable to describe nat gateway") 115 | raise error 116 | 117 | logger.debug("NAT Gateways: %s", nat_gateways) 118 | if not nat_gateways.get("NatGateways"): 119 | raise MissingNatGatewayError(nat_gateways) 120 | 121 | nat_gateway_id = nat_gateways['NatGateways'][0]["NatGatewayId"] 122 | logger.debug("NAT Gateway ID: %s", nat_gateway_id) 123 | return nat_gateway_id 124 | 125 | 126 | def replace_route(route_table_id, target_id): 127 | new_route_table = {} 128 | target_key = "NatGatewayId" 129 | if target_id.startswith("i-"): 130 | target_key = "InstanceId" 131 | new_route_table = { 132 | "DestinationCidrBlock": "0.0.0.0/0", 133 | target_key: target_id, 134 | "RouteTableId": route_table_id 135 | } 136 | 137 | try: 138 | logger.info("Replacing existing route %s for route table %s", new_route_table, route_table_id) 139 | ec2_client.replace_route(**new_route_table) 140 | except botocore.exceptions.ClientError as error: 141 | logger.error("Unable to replace route") 142 | raise error 143 | 144 | def run_nat_instance_diagnostics(instance_id): 145 | """ 146 | Runs a basic diagnostic script via SSM on the NAT instance. 147 | It checks if IP forwarding is enabled and lists the nftables NAT configuration. 148 | Returns True if the configuration is healthy, False otherwise.
149 | """ 150 | ssm_client = boto3.client("ssm") 151 | 152 | diagnostic_script = [ 153 | "#!/bin/bash", 154 | "set -e", 155 | "echo 'ip_forward='$(cat /proc/sys/net/ipv4/ip_forward)", 156 | "echo 'nft_nat_table='$(nft list table ip nat 2>/dev/null || echo 'nftables nat table not found')" 157 | ] 158 | 159 | try: 160 | response = ssm_client.send_command( 161 | InstanceIds=[instance_id], 162 | DocumentName="AWS-RunShellScript", 163 | Parameters={"commands": diagnostic_script}, 164 | Comment="Run NAT instance diagnostics", 165 | TimeoutSeconds=SSM_TIMEOUT_SECONDS 166 | ) 167 | command_id = response['Command']['CommandId'] 168 | time.sleep(5) # Allow some time for the command to execute 169 | 170 | invocation = ssm_client.get_command_invocation( 171 | CommandId=command_id, 172 | InstanceId=instance_id, 173 | ) 174 | 175 | output = invocation.get('StandardOutputContent', '') 176 | 177 | if invocation.get('StandardErrorContent'): 178 | logger.warning("NAT instance diagnostic errors:\n%s", invocation['StandardErrorContent']) 179 | 180 | # Check conditions 181 | if "ip_forward=0" in output: 182 | logger.warning("NAT instance has ip_forward=0 — IP forwarding is disabled.") 183 | return False 184 | 185 | if "masquerade" not in output: 186 | logger.warning("NAT instance nftables missing 'masquerade' rule — SNAT may be broken.") 187 | return False 188 | 189 | if is_source_dest_check_enabled(instance_id) is True: 190 | logger.warning("Source/destination check is ENABLED — this will break NAT functionality.") 191 | return False 192 | if is_source_dest_check_enabled(instance_id) is None: 193 | logger.warning("Skipping NAT restore due to error checking source/dest.") 194 | return False 195 | 196 | return True 197 | 198 | except botocore.exceptions.ClientError as e: 199 | logger.error("SSM diagnostic command failed: %s", str(e)) 200 | return False 201 | 202 | def is_source_dest_check_enabled(instance_id): 203 | ec2 = boto3.client('ec2') 204 | try: 205 | response = 
ec2.describe_instances(InstanceIds=[instance_id]) 206 | attr = response['Reservations'][0]['Instances'][0].get('SourceDestCheck', True) 207 | return attr 208 | except Exception as e: 209 | logger.error(f"Error checking source/dest check: {e}") 210 | return None 211 | 212 | def are_any_routes_pointing_to_nat_gateway(route_table_ids): 213 | ec2 = boto3.client('ec2') 214 | try: 215 | response = ec2.describe_route_tables(RouteTableIds=route_table_ids) 216 | for rtb in response.get('RouteTables', []): 217 | for route in rtb.get('Routes', []): 218 | if route.get('DestinationCidrBlock') == "0.0.0.0/0" and 'NatGatewayId' in route and route.get('State') == 'active': 219 | return True 220 | return False 221 | except Exception as e: 222 | logger.error(f"Error checking NAT Gateway routes: {e}") 223 | return False 224 | 225 | def attempt_nat_instance_restore(): 226 | ssm_client = boto3.client('ssm') 227 | nat_instance_id = get_current_nat_instance_id(os.getenv("NAT_ASG_NAME")) 228 | route_tables = [rtb for rtb in os.getenv("ROUTE_TABLE_IDS_CSV", "").split(",") if rtb] 229 | 230 | if not nat_instance_id or not route_tables: 231 | logger.warning("No in-service NAT instance found or ROUTE_TABLE_IDS_CSV not set. 
Skipping NAT restore.") 232 | return 233 | 234 | logger.info("Attempting to restore route to NAT Instance: %s", nat_instance_id) 235 | 236 | try: 237 | check_urls = os.getenv("CHECK_URLS", ",".join(DEFAULT_CHECK_URLS)).split(",") 238 | commands = [] 239 | for url in check_urls: 240 | command = f"curl -s -o /dev/null -w '%{{http_code}}\\n' --max-time 5 {url.strip()}" 241 | commands.append(command) 242 | # Send SSM command to test connectivity 243 | response = ssm_client.send_command( 244 | InstanceIds=[nat_instance_id], 245 | DocumentName="AWS-RunShellScript", 246 | TimeoutSeconds=SSM_TIMEOUT_SECONDS, 247 | Parameters={ 248 | "commands": commands 249 | }, 250 | Comment="Check Internet access from NAT instance via Lambda", 251 | ) 252 | command_id = response['Command']['CommandId'] 253 | time.sleep(5) # Wait briefly before checking command result 254 | 255 | # Poll command result 256 | invocation = ssm_client.get_command_invocation( 257 | CommandId=command_id, 258 | InstanceId=nat_instance_id, 259 | ) 260 | if invocation['Status'] == "Success": 261 | output = invocation['StandardOutputContent'].strip() 262 | http_codes = output.splitlines() 263 | if all(int(code) < 500 for code in http_codes): 264 | logger.info("NAT instance has Internet access, we can diagnose the NAT configuration.") 265 | try: 266 | if not run_nat_instance_diagnostics(nat_instance_id): 267 | logger.warning("Skipping route restore due to failed NAT diagnostics.") 268 | return 269 | except Exception as diag_error: 270 | logger.error("Unexpected error during NAT diagnostics: %s", str(diag_error)) 271 | return 272 | for rtb in route_tables: 273 | replace_route(rtb, nat_instance_id) 274 | logger.info("Route table %s now points to NAT instance %s", rtb, nat_instance_id) 275 | return 276 | else: 277 | logger.warning("Invocation output: %s", invocation['StandardOutputContent']) 278 | else: 279 | logger.warning("NAT instance connectivity test failed or did not return expected result.") 280 | 281 | 282 | 
except botocore.exceptions.ClientError as e: 283 | logger.error("SSM command failed: %s", str(e)) 284 | except Exception as ex: 285 | logger.error("Unexpected error during NAT restore: %s", str(ex)) 286 | 287 | def check_connection(check_urls): 288 | """ 289 | Checks connectivity to check_urls. If any of them succeeds, returns success. 290 | If all fail, replaces the routes to point at a standby NAT Gateway and 291 | returns failure. 292 | 293 | If ENABLE_NAT_RESTORE is set and we're currently using the NAT Gateway, 294 | attempts to restore the route to the NAT instance before checking connectivity. 295 | """ 296 | route_tables = [rtb for rtb in os.getenv("ROUTE_TABLE_IDS_CSV", "").split(",") if rtb] 297 | if not route_tables: 298 | raise MissingEnvironmentVariableError("ROUTE_TABLE_IDS_CSV") 299 | 300 | restore_enabled = get_env_bool("ENABLE_NAT_RESTORE", DEFAULT_ENABLE_NAT_RESTORE) 301 | 302 | # Step 1: Try failback to NAT instance if allowed and current route is NAT Gateway 303 | if restore_enabled and are_any_routes_pointing_to_nat_gateway(route_tables): 304 | logger.info("ENABLE_NAT_RESTORE=true and route is NAT Gateway. Trying to restore NAT instance...") 305 | attempt_nat_instance_restore() 306 | time.sleep(5) 307 | 308 | # Step 2: Test connectivity 309 | for url in check_urls: 310 | try: 311 | req = urllib.request.Request(url) 312 | req.add_header('User-Agent', 'alternat/1.0') 313 | urllib.request.urlopen(req, timeout=REQUEST_TIMEOUT) 314 | logger.debug("Successfully connected to %s", url) 315 | return True 316 | except urllib.error.HTTPError as error: 317 | logger.warning("Response error from %s: %s, treating as success", url, error) 318 | return True 319 | except urllib.error.URLError as error: 320 | logger.error("error connecting to %s: %s", url, error) 321 | except socket.timeout as error: 322 | logger.error("timeout error connecting to %s: %s", url, error) 323 | 324 | logger.warning("Failed connectivity tests! 
Replacing route") 325 | 326 | public_subnet_id = os.getenv("PUBLIC_SUBNET_ID") 327 | if not public_subnet_id: 328 | raise MissingEnvironmentVariableError("PUBLIC_SUBNET_ID") 329 | 330 | vpc_id = get_vpc_id(route_tables[0]) 331 | 332 | nat_gateway_id = get_nat_gateway_id(vpc_id, public_subnet_id) 333 | 334 | for rtb in route_tables: 335 | replace_route(rtb, nat_gateway_id) 336 | logger.info("Route replacement succeeded") 337 | return False 338 | 339 | def get_current_nat_instance_id(asg_name): 340 | try: 341 | autoscaling = boto3.client("autoscaling") 342 | response = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name]) 343 | instances = response['AutoScalingGroups'][0]['Instances'] 344 | for instance in instances: 345 | if instance['LifecycleState'] == 'InService': 346 | return instance['InstanceId'] 347 | except Exception as e: 348 | logger.error(f"Failed to retrieve NAT instance ID from ASG {asg_name}: {e}") 349 | return None 350 | 351 | def connectivity_test_handler(event, context): 352 | if not isinstance(event, dict): 353 | logger.error(f"Unknown event: {event}") 354 | return 355 | 356 | if event.get("source") != "aws.events": 357 | logger.error(f"Unable to handle unknown event type: {json.dumps(event)}") 358 | raise UnknownEventTypeError 359 | 360 | logger.debug("Starting NAT instance connectivity test") 361 | 362 | check_interval = int(os.getenv("CONNECTIVITY_CHECK_INTERVAL", DEFAULT_CONNECTIVITY_CHECK_INTERVAL)) 363 | check_urls = "CHECK_URLS" in os.environ and os.getenv("CHECK_URLS").split(",") or DEFAULT_CHECK_URLS 364 | 365 | has_ipv6 = get_env_bool("HAS_IPV6", DEFAULT_HAS_IPV6) 366 | if not has_ipv6: 367 | disable_ipv6() 368 | 369 | # Run connectivity checks for approximately 1 minute 370 | run = 0 371 | num_runs = 60 / check_interval 372 | while run < num_runs: 373 | if check_connection(check_urls): 374 | time.sleep(check_interval) 375 | run += 1 376 | else: 377 | break 378 | 379 | def get_env_bool(var_name, default_value=False): 
380 | value = os.getenv(var_name, default_value) 381 | true_values = ["t", "true", "y", "yes", "1"] 382 | return str(value).lower() in true_values 383 | 384 | 385 | def complete_asg_lifecycle_action( 386 | auto_scaling_group_name, 387 | lifecycle_hook_name, 388 | lifecycle_action_token, 389 | lifecycle_action_result, 390 | ignore_validation_error=True, 391 | ): 392 | autoscaling_client = boto3.client("autoscaling") 393 | try: 394 | autoscaling_client.complete_lifecycle_action( 395 | AutoScalingGroupName=auto_scaling_group_name, 396 | LifecycleHookName=lifecycle_hook_name, 397 | LifecycleActionToken=lifecycle_action_token, 398 | LifecycleActionResult=lifecycle_action_result, 399 | ) 400 | except botocore.exceptions.ClientError as error: 401 | if ( 402 | ignore_validation_error 403 | and error.response["Error"]["Code"] == "ValidationError" 404 | ): 405 | return False 406 | raise 407 | return True 408 | 409 | 410 | def handler(event, _): 411 | try: 412 | for record in event["Records"]: 413 | message = json.loads(record["Sns"]["Message"]) 414 | if ( 415 | LIFECYCLE_HOOK_NAME_KEY in message 416 | and AUTO_SCALING_GROUP_NAME_KEY in message 417 | ): 418 | asg = message[AUTO_SCALING_GROUP_NAME_KEY] 419 | lifecycle_hook_name = message[LIFECYCLE_HOOK_NAME_KEY] 420 | lifecycle_action_token = message[LIFECYCLE_ACTION_TOKEN_KEY] 421 | else: 422 | logger.error("Failed to find lifecycle message to parse") 423 | raise LifecycleMessageError 424 | except Exception as error: 425 | logger.error("Error: %s", error) 426 | raise error 427 | 428 | availability_zone, vpc_zone_identifier = get_az_and_vpc_zone_identifier(asg) 429 | public_subnet_id = vpc_zone_identifier.split(",")[0] 430 | az = availability_zone.upper().replace("-", "_") 431 | route_tables = az in os.environ and os.getenv(az).split(",") 432 | if not route_tables: 433 | raise MissingEnvironmentVariableError(az) 434 | vpc_id = get_vpc_id(route_tables[0]) 435 | 436 | nat_gateway_id = get_nat_gateway_id(vpc_id, public_subnet_id) 437 
| 438 | for rtb in route_tables: 439 | replace_route(rtb, nat_gateway_id) 440 | logger.info("Route replacement succeeded") 441 | 442 | complete_asg_lifecycle_action( 443 | asg, lifecycle_hook_name, lifecycle_action_token, "CONTINUE" 444 | ) 445 | 446 | 447 | class UnknownEventTypeError(Exception): pass 448 | 449 | 450 | class MissingVpcConfigError(Exception): pass 451 | 452 | 453 | class MissingFunctionSubnetError(Exception): pass 454 | 455 | 456 | class MissingAZSubnetError(Exception): pass 457 | 458 | 459 | class MissingVPCZoneIdentifierError(Exception): pass 460 | 461 | 462 | class MissingVPCandSubnetError(Exception): pass 463 | 464 | 465 | class MissingNatGatewayError(Exception): pass 466 | 467 | 468 | class MissingRouteTableError(Exception): pass 469 | 470 | 471 | class LifecycleMessageError(Exception): pass 472 | 473 | 474 | class MissingEnvironmentVariableError(Exception): pass 475 | -------------------------------------------------------------------------------- /main.tf: -------------------------------------------------------------------------------- 1 | ## NAT instance configuration 2 | locals { 3 | initial_lifecycle_hooks = [ 4 | { 5 | name = "NATInstanceTerminationLifeCycleHook" 6 | default_result = "CONTINUE" 7 | heartbeat_timeout = var.lifecycle_heartbeat_timeout 8 | lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING" 9 | notification_target_arn = aws_sns_topic.alternat_topic.arn 10 | role_arn = aws_iam_role.alternat_lifecycle_hook.arn 11 | } 12 | ] 13 | launch_script_lifecycle_hook_name = "NATInstanceLaunchScript" 14 | 15 | nat_instance_image_id = var.nat_ami == "" ? 
"resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-minimal-kernel-default-${var.architecture}" : var.nat_ami 16 | 17 | nat_instance_ingress_sgs = concat(var.ingress_security_group_ids, [aws_security_group.nat_lambda.id]) 18 | 19 | all_route_tables = flatten([ 20 | for obj in var.vpc_az_maps : obj.route_table_ids 21 | ]) 22 | 23 | # One private subnet in each AZ to use for the VPC endpoints 24 | az_private_subnets = [for obj in var.vpc_az_maps : element(obj.private_subnet_ids, 0)] 25 | ec2_endpoint = ( 26 | var.enable_ec2_endpoint 27 | ? { 28 | ec2 = { 29 | service = "ec2" 30 | private_dns_enabled = true 31 | subnet_ids = local.az_private_subnets 32 | tags = { Name = "ec2-vpc-endpoint" } 33 | } 34 | } 35 | : {} 36 | ) 37 | 38 | # Must provide exactly 1 EIP per AZ 39 | # var.nat_instance_eip_ids ignored if doesn't match AZ count 40 | reuse_nat_instance_eips = length(var.nat_instance_eip_ids) == length(var.vpc_az_maps) 41 | nat_instance_eip_ids = local.reuse_nat_instance_eips ? var.nat_instance_eip_ids : (var.prevent_destroy_eips ? aws_eip.protected_nat_instance_eips[*].id : aws_eip.nat_instance_eips[*].id) 42 | nat_instance_eips = var.prevent_destroy_eips ? aws_eip.protected_nat_instance_eips : aws_eip.nat_instance_eips 43 | nat_gateway_eips = var.prevent_destroy_eips ? 
aws_eip.protected_nat_gateway_eips : aws_eip.nat_gateway_eips 44 | 45 | created_ngw_eip_alloc_ids = try({ for az, e in aws_eip.nat_gateway_eips : az => e.id }, {}) 46 | protected_ngw_eip_alloc_ids = try({ for az, e in aws_eip.protected_nat_gateway_eips : az => e.id }, {}) 47 | explicit_ngw_eip_alloc_ids = var.fallback_ngw_eip_allocation_ids 48 | 49 | # NAT Gateway EIP allocation IDs to use for fallback routes 50 | # Explicit preferred, then protected, then created 51 | ngw_alloc_ids = merge( 52 | local.created_ngw_eip_alloc_ids, 53 | local.protected_ngw_eip_alloc_ids, 54 | local.explicit_ngw_eip_alloc_ids 55 | ) 56 | } 57 | 58 | resource "aws_eip" "protected_nat_instance_eips" { 59 | count = (local.reuse_nat_instance_eips 60 | ? 0 61 | : var.prevent_destroy_eips ? length(var.vpc_az_maps) : 0) 62 | 63 | tags = merge(var.tags, { 64 | "Name" = "alternat-instance-${count.index}" 65 | }) 66 | 67 | lifecycle { 68 | prevent_destroy = true 69 | } 70 | } 71 | 72 | resource "aws_eip" "nat_instance_eips" { 73 | count = (local.reuse_nat_instance_eips 74 | ? 0 75 | : (var.prevent_destroy_eips ? 
0 : length(var.vpc_az_maps))) 76 | 77 | tags = merge(var.tags, { 78 | "Name" = "alternat-instance-${count.index}" 79 | }) 80 | } 81 | 82 | resource "aws_sns_topic" "alternat_topic" { 83 | name_prefix = "alternat-topic" 84 | kms_master_key_id = "alias/aws/sns" 85 | tags = var.tags 86 | } 87 | 88 | resource "aws_autoscaling_group" "nat_instance" { 89 | for_each = { for obj in var.vpc_az_maps : obj.az => obj.public_subnet_id } 90 | 91 | name_prefix = var.nat_instance_name_prefix 92 | max_size = 1 93 | min_size = 1 94 | max_instance_lifetime = var.max_instance_lifetime 95 | vpc_zone_identifier = [each.value] 96 | 97 | launch_template { 98 | id = aws_launch_template.nat_instance_template[each.key].id 99 | version = "$Latest" 100 | } 101 | 102 | dynamic "initial_lifecycle_hook" { 103 | for_each = local.initial_lifecycle_hooks 104 | content { 105 | name = initial_lifecycle_hook.value.name 106 | default_result = try(initial_lifecycle_hook.value.default_result, null) 107 | heartbeat_timeout = try(initial_lifecycle_hook.value.heartbeat_timeout, null) 108 | lifecycle_transition = initial_lifecycle_hook.value.lifecycle_transition 109 | notification_metadata = try(initial_lifecycle_hook.value.notification_metadata, null) 110 | notification_target_arn = try(initial_lifecycle_hook.value.notification_target_arn, null) 111 | role_arn = try(initial_lifecycle_hook.value.role_arn, null) 112 | } 113 | } 114 | 115 | health_check_grace_period = var.enable_launch_script_lifecycle_hook ? 0 : 300 116 | 117 | dynamic "tag" { 118 | for_each = merge( 119 | var.tags, 120 | { Name = "${var.nat_instance_name_prefix}${each.key}" }, 121 | data.aws_default_tags.current.tags, 122 | ) 123 | 124 | content { 125 | key = tag.key 126 | value = tag.value 127 | propagate_at_launch = true 128 | } 129 | } 130 | } 131 | 132 | resource "aws_autoscaling_lifecycle_hook" "nat_instance_launch_script" { 133 | for_each = var.enable_launch_script_lifecycle_hook ? 
toset([for obj in var.vpc_az_maps : obj.az]) : [] 134 | 135 | autoscaling_group_name = aws_autoscaling_group.nat_instance[each.key].name 136 | 137 | name = local.launch_script_lifecycle_hook_name 138 | default_result = "ABANDON" 139 | heartbeat_timeout = 900 140 | lifecycle_transition = "autoscaling:EC2_INSTANCE_LAUNCHING" 141 | } 142 | 143 | resource "aws_iam_role" "alternat_lifecycle_hook" { 144 | name = var.nat_instance_lifecycle_hook_role_name == "" ? null : var.nat_instance_lifecycle_hook_role_name 145 | name_prefix = var.nat_instance_lifecycle_hook_role_name == "" ? "alternat-lifecycle-hook-" : null 146 | 147 | assume_role_policy = data.aws_iam_policy_document.lifecycle_hook_assume_role.json 148 | tags = var.tags 149 | } 150 | 151 | data "aws_iam_policy_document" "lifecycle_hook_assume_role" { 152 | statement { 153 | sid = "AutoScalingAssumeRole" 154 | 155 | actions = [ 156 | "sts:AssumeRole", 157 | ] 158 | 159 | principals { 160 | type = "Service" 161 | identifiers = ["autoscaling.amazonaws.com"] 162 | } 163 | } 164 | } 165 | 166 | data "aws_iam_policy_document" "lifecycle_hook_policy" { 167 | statement { 168 | sid = "alterNATLifecycleHookPermissions" 169 | effect = "Allow" 170 | actions = [ 171 | "sns:Publish", 172 | ] 173 | resources = [aws_sns_topic.alternat_topic.arn] 174 | } 175 | } 176 | 177 | resource "aws_iam_role_policy" "alternat_lifecycle_hook" { 178 | name = "lifecycle-publish-policy" 179 | policy = data.aws_iam_policy_document.lifecycle_hook_policy.json 180 | role = aws_iam_role.alternat_lifecycle_hook.name 181 | } 182 | 183 | data "cloudinit_config" "config" { 184 | for_each = { for obj in var.vpc_az_maps : obj.az => obj.route_table_ids } 185 | 186 | gzip = true 187 | base64_encode = true 188 | 189 | dynamic "part" { 190 | for_each = var.nat_instance_user_data_pre_install != "" ? 
[1] : [] 191 | 192 | content { 193 | content_type = "text/x-shellscript" 194 | content = var.nat_instance_user_data_pre_install 195 | } 196 | } 197 | 198 | part { 199 | content_type = "text/x-shellscript" 200 | content = templatefile("${path.module}/alternat.conf.tftpl", { 201 | eip_allocation_ids_csv = join(",", local.nat_instance_eip_ids), 202 | route_table_ids_csv = join(",", each.value), 203 | enable_ssm = var.enable_ssm, 204 | enable_cloudwatch_agent = var.enable_cloudwatch_agent 205 | }) 206 | } 207 | 208 | part { 209 | content_type = "text/x-shellscript" 210 | content = file("${path.module}/scripts/alternat.sh") 211 | } 212 | 213 | dynamic "part" { 214 | for_each = var.enable_cloudwatch_agent ? [1] : [] 215 | 216 | content { 217 | content_type = "text/x-shellscript" 218 | content = templatefile("${path.module}/cwagent.json.tftpl", { 219 | cloudwatch_namespace = var.cloudwatch_namespace, 220 | cloudwatch_interfaces = jsonencode(var.cloudwatch_interfaces) 221 | }) 222 | } 223 | } 224 | 225 | dynamic "part" { 226 | for_each = var.nat_instance_user_data_post_install != "" ? [1] : [] 227 | 228 | content { 229 | content_type = "text/x-shellscript" 230 | content = var.nat_instance_user_data_post_install 231 | } 232 | } 233 | } 234 | 235 | resource "aws_launch_template" "nat_instance_template" { 236 | for_each = { for obj in var.vpc_az_maps : obj.az => obj.route_table_ids } 237 | 238 | name_prefix = var.nat_instance_name_prefix 239 | 240 | image_id = local.nat_instance_image_id 241 | 242 | # Conditional block device mapping for AL2023 Minimal AMI. 243 | # By default the root volume is only 2GB and not enough free space 244 | # to safely install and use the CloudWatch Agent. 245 | dynamic "block_device_mappings" { 246 | for_each = (try(strcontains(local.nat_instance_image_id, "al2023-ami-minimal"), false) && var.enable_cloudwatch_agent) ? 
[1] : [] 247 | 248 | content { 249 | device_name = "/dev/xvda" 250 | 251 | ebs { 252 | volume_size = 3 253 | volume_type = "gp3" 254 | encrypted = true 255 | } 256 | } 257 | } 258 | 259 | dynamic "block_device_mappings" { 260 | for_each = try(var.nat_instance_block_devices, {}) 261 | 262 | content { 263 | device_name = try(block_device_mappings.value.device_name, null) 264 | 265 | dynamic "ebs" { 266 | for_each = try([block_device_mappings.value.ebs], []) 267 | 268 | content { 269 | encrypted = try(ebs.value.encrypted, null) 270 | volume_size = try(ebs.value.volume_size, null) 271 | volume_type = try(ebs.value.volume_type, null) 272 | } 273 | } 274 | } 275 | } 276 | 277 | iam_instance_profile { 278 | name = aws_iam_instance_profile.nat_instance.name 279 | } 280 | 281 | instance_type = var.nat_instance_type 282 | 283 | key_name = var.nat_instance_key_name 284 | 285 | metadata_options { 286 | http_endpoint = "enabled" 287 | http_tokens = "required" 288 | http_put_response_hop_limit = 1 289 | instance_metadata_tags = "enabled" 290 | } 291 | 292 | monitoring { 293 | enabled = true 294 | } 295 | 296 | network_interfaces { 297 | associate_public_ip_address = true 298 | security_groups = [aws_security_group.nat_instance.id] 299 | } 300 | 301 | tags = var.tags 302 | tag_specifications { 303 | resource_type = "instance" 304 | 305 | tags = merge(var.tags, { 306 | alterNATInstance = "true", 307 | }) 308 | } 309 | 310 | tag_specifications { 311 | resource_type = "volume" 312 | 313 | tags = merge(var.tags, { 314 | alterNATInstance = "true", 315 | }) 316 | } 317 | user_data = data.cloudinit_config.config[each.key].rendered 318 | } 319 | 320 | resource "aws_security_group" "nat_instance" { 321 | name_prefix = var.nat_instance_sg_name_prefix 322 | vpc_id = var.vpc_id 323 | tags = var.tags 324 | } 325 | 326 | resource "aws_security_group_rule" "nat_instance_egress" { 327 | type = "egress" 328 | protocol = "-1" 329 | from_port = 0 330 | to_port = 0 331 | cidr_blocks = ["0.0.0.0/0"] 
332 | ipv6_cidr_blocks = ["::/0"] 333 | security_group_id = aws_security_group.nat_instance.id 334 | } 335 | 336 | resource "aws_security_group_rule" "nat_instance_ingress" { 337 | count = length(local.nat_instance_ingress_sgs) 338 | 339 | type = "ingress" 340 | protocol = "-1" 341 | from_port = 0 342 | to_port = 0 343 | security_group_id = aws_security_group.nat_instance.id 344 | source_security_group_id = local.nat_instance_ingress_sgs[count.index] 345 | } 346 | 347 | resource "aws_security_group_rule" "nat_instance_ip_range_ingress" { 348 | count = length(var.ingress_security_group_cidr_blocks) > 0 ? 1 : 0 349 | 350 | type = "ingress" 351 | protocol = "-1" 352 | from_port = 0 353 | to_port = 0 354 | security_group_id = aws_security_group.nat_instance.id 355 | cidr_blocks = var.ingress_security_group_cidr_blocks 356 | } 357 | 358 | resource "aws_security_group_rule" "nat_instance_ipv6_range_ingress" { 359 | count = length(var.ingress_security_group_ipv6_cidr_blocks) > 0 ? 1 : 0 360 | 361 | type = "ingress" 362 | protocol = "-1" 363 | from_port = 0 364 | to_port = 0 365 | security_group_id = aws_security_group.nat_instance.id 366 | ipv6_cidr_blocks = var.ingress_security_group_ipv6_cidr_blocks 367 | } 368 | 369 | ### NAT instance IAM 370 | 371 | resource "aws_iam_instance_profile" "nat_instance" { 372 | name = var.nat_instance_iam_profile_name == "" ? null : var.nat_instance_iam_profile_name 373 | name_prefix = var.nat_instance_iam_profile_name == "" ? "alternat-instance-" : null 374 | 375 | role = aws_iam_role.alternat_instance.name 376 | tags = var.tags 377 | } 378 | 379 | resource "aws_iam_role" "alternat_instance" { 380 | name = var.nat_instance_iam_role_name == "" ? null : var.nat_instance_iam_role_name 381 | name_prefix = var.nat_instance_iam_role_name == "" ? 
"alternat-instance-" : null 382 | 383 | assume_role_policy = data.aws_iam_policy_document.nat_instance_assume_role.json 384 | tags = var.tags 385 | } 386 | 387 | data "aws_iam_policy_document" "nat_instance_assume_role" { 388 | statement { 389 | sid = "NATInstanceAssumeRole" 390 | 391 | actions = [ 392 | "sts:AssumeRole", 393 | ] 394 | 395 | principals { 396 | type = "Service" 397 | identifiers = ["ec2.amazonaws.com"] 398 | } 399 | } 400 | } 401 | 402 | resource "aws_iam_role_policy_attachment" "ssm" { 403 | count = var.enable_ssm ? 1 : 0 404 | role = aws_iam_role.alternat_instance.name 405 | policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" 406 | } 407 | 408 | resource "aws_iam_role_policy_attachment" "cloudwatch" { 409 | count = var.enable_cloudwatch_agent ? 1 : 0 410 | role = aws_iam_role.alternat_instance.name 411 | policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy" 412 | } 413 | 414 | data "aws_iam_policy_document" "alternat_ec2_policy" { 415 | statement { 416 | sid = "alterNATInstancePermissions" 417 | effect = "Allow" 418 | actions = [ 419 | "ec2:ModifyInstanceAttribute", 420 | "ec2:DescribeInstanceAttribute" 421 | ] 422 | resources = ["*"] 423 | condition { 424 | test = "StringEquals" 425 | variable = "aws:ResourceTag/alterNATInstance" 426 | values = [ 427 | "true" 428 | ] 429 | } 430 | } 431 | 432 | statement { 433 | sid = "alterNATDescribeRoutePermissions" 434 | effect = "Allow" 435 | actions = [ 436 | "ec2:DescribeRouteTables" 437 | ] 438 | resources = ["*"] 439 | } 440 | 441 | statement { 442 | sid = "alterNATModifyRoutePermissions" 443 | effect = "Allow" 444 | actions = [ 445 | "ec2:CreateRoute", 446 | "ec2:ReplaceRoute" 447 | ] 448 | resources = [ 449 | for route_table in local.all_route_tables 450 | : "arn:aws:ec2:${data.aws_region.current.id}:${data.aws_caller_identity.current.id}:route-table/${route_table}" 451 | ] 452 | } 453 | 454 | statement { 455 | sid = "alterNATEIPPermissions" 456 | effect = "Allow" 457 | 
actions = [ 458 | "ec2:DescribeAddresses", 459 | "ec2:AssociateAddress" 460 | ] 461 | resources = ["*"] 462 | } 463 | 464 | statement { 465 | sid = "alterNATASGLifecyclePermissions" 466 | effect = "Allow" 467 | actions = [ 468 | "autoscaling:CompleteLifecycleAction", 469 | ] 470 | resources = [ 471 | "arn:aws:autoscaling:${data.aws_region.current.id}:${data.aws_caller_identity.current.account_id}:autoScalingGroup:*:autoScalingGroupName/${var.nat_instance_name_prefix}*", 472 | ] 473 | } 474 | } 475 | 476 | resource "aws_iam_role_policy" "alternat_ec2" { 477 | name = "alternat-policy" 478 | policy = data.aws_iam_policy_document.alternat_ec2_policy.json 479 | role = aws_iam_role.alternat_instance.name 480 | } 481 | 482 | resource "aws_iam_role_policy" "alternat_additional_policies" { 483 | count = length(var.additional_instance_policies) 484 | 485 | name = var.additional_instance_policies[count.index].policy_name 486 | policy = var.additional_instance_policies[count.index].policy_json 487 | role = aws_iam_role.alternat_instance.name 488 | } 489 | 490 | ## NAT Gateway used as a backup route 491 | resource "aws_eip" "protected_nat_gateway_eips" { 492 | for_each = { 493 | for obj in var.vpc_az_maps 494 | : obj.az => obj.public_subnet_id 495 | if var.create_nat_gateways && var.prevent_destroy_eips && !contains(keys(var.fallback_ngw_eip_allocation_ids), obj.az) 496 | } 497 | tags = merge(var.tags, { 498 | "Name" = "alternat-gateway-eip" 499 | }) 500 | lifecycle { 501 | prevent_destroy = true 502 | } 503 | } 504 | 505 | resource "aws_eip" "nat_gateway_eips" { 506 | for_each = { 507 | for obj in var.vpc_az_maps 508 | : obj.az => obj.public_subnet_id 509 | if var.create_nat_gateways && !var.prevent_destroy_eips && !contains(keys(var.fallback_ngw_eip_allocation_ids), obj.az) 510 | } 511 | tags = merge(var.tags, { 512 | "Name" = "alternat-gateway-eip" 513 | }) 514 | } 515 | 516 | resource "aws_nat_gateway" "main" { 517 | for_each = { 518 | for obj in var.vpc_az_maps 519 | : 
obj.az => obj.public_subnet_id 520 | if var.create_nat_gateways 521 | } 522 | allocation_id = local.ngw_alloc_ids[each.key] 523 | subnet_id = each.value 524 | tags = merge(var.tags, { 525 | Name = "alternat-${each.key}" 526 | }) 527 | } 528 | 529 | data "aws_vpc" "vpc" { 530 | id = var.vpc_id 531 | } 532 | 533 | locals { 534 | all_vpc_cidr_ranges = [ 535 | for cidr_assoc in data.aws_vpc.vpc.cidr_block_associations 536 | : cidr_assoc.cidr_block 537 | ] 538 | } 539 | 540 | resource "aws_security_group" "vpc_endpoint" { 541 | count = length(local.ec2_endpoint) > 0 ? 1 : 0 542 | 543 | name_prefix = "ec2-vpc-endpoints-" 544 | description = "Allow TLS from the VPC CIDR to the AWS API." 545 | vpc_id = var.vpc_id 546 | 547 | ingress { 548 | description = "TLS from within the VPC" 549 | from_port = 443 550 | to_port = 443 551 | protocol = "tcp" 552 | cidr_blocks = local.all_vpc_cidr_ranges 553 | } 554 | 555 | egress { 556 | from_port = 0 557 | to_port = 0 558 | protocol = "-1" 559 | cidr_blocks = ["0.0.0.0/0"] 560 | ipv6_cidr_blocks = ["::/0"] 561 | } 562 | 563 | tags = var.tags 564 | } 565 | 566 | module "vpc_endpoints" { 567 | count = length(local.ec2_endpoint) > 0 ? 
1 : 0 568 | 569 | source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints" 570 | version = "~> 3.14.0" 571 | vpc_id = var.vpc_id 572 | security_group_ids = [aws_security_group.vpc_endpoint[0].id] 573 | endpoints = local.ec2_endpoint 574 | tags = var.tags 575 | } 576 | 577 | data "aws_default_tags" "current" {} 578 | data "aws_region" "current" {} 579 | data "aws_caller_identity" "current" {} 580 | -------------------------------------------------------------------------------- /functions/replace-route/tests/test_replace_route.py: -------------------------------------------------------------------------------- 1 | """ 2 | Run like this: `AWS_DEFAULT_REGION='us-east-1' pytest` 3 | """ 4 | 5 | import os 6 | import json 7 | import sys 8 | import zipfile 9 | import io 10 | import logging 11 | import mock 12 | import socket 13 | import sure 14 | 15 | import boto3 16 | import botocore 17 | 18 | from moto import mock_aws 19 | 20 | sys.path.append('..') 21 | 22 | logger = logging.getLogger() 23 | logger.setLevel(logging.DEBUG) 24 | logging.getLogger('boto3').setLevel(logging.CRITICAL) 25 | logging.getLogger('botocore').setLevel(logging.CRITICAL) 26 | 27 | EXAMPLE_AMI_ID = "ami-12c6146b" 28 | 29 | 30 | @mock_aws 31 | def setup_networking(): 32 | az = f"{os.environ['AWS_DEFAULT_REGION']}a" 33 | ec2 = boto3.resource("ec2") 34 | 35 | vpc = ec2.create_vpc(CidrBlock="10.1.0.0/16") 36 | 37 | public_subnet = ec2.create_subnet( 38 | VpcId=vpc.id, 39 | CidrBlock="10.1.1.0/24", 40 | AvailabilityZone=f"{az}" 41 | ) 42 | 43 | private_subnet = ec2.create_subnet( 44 | VpcId=vpc.id, 45 | CidrBlock="10.1.2.0/24", 46 | AvailabilityZone=f"{az}", 47 | ) 48 | 49 | private_subnet_two = ec2.create_subnet( 50 | VpcId=vpc.id, 51 | CidrBlock="10.1.3.0/24", 52 | AvailabilityZone=f"{az}", 53 | ) 54 | 55 | 56 | route_table = ec2.create_route_table(VpcId=vpc.id) 57 | route_table_two = ec2.create_route_table(VpcId=vpc.id) 58 | sg = ec2.create_security_group(GroupName="test-sg",
Description="test-sg") 59 | 60 | ec2_client = boto3.client("ec2") 61 | allocation_id = ec2_client.allocate_address(Domain="vpc")["AllocationId"] 62 | nat_gw_id = ec2_client.create_nat_gateway( 63 | SubnetId=public_subnet.id, 64 | AllocationId=allocation_id 65 | )["NatGateway"]["NatGatewayId"] 66 | 67 | eni = ec2_client.create_network_interface( 68 | SubnetId=public_subnet.id, PrivateIpAddress="10.1.1.5" 69 | ) 70 | ec2_client.associate_route_table( 71 | RouteTableId=route_table.id, 72 | SubnetId=private_subnet.id 73 | ) 74 | ec2_client.create_route( 75 | DestinationCidrBlock="0.0.0.0/0", 76 | NetworkInterfaceId=eni["NetworkInterface"]["NetworkInterfaceId"], 77 | RouteTableId=route_table.id 78 | ) 79 | ec2_client.associate_route_table( 80 | RouteTableId=route_table_two.id, 81 | SubnetId=private_subnet_two.id 82 | ) 83 | ec2_client.create_route( 84 | DestinationCidrBlock="0.0.0.0/0", 85 | NetworkInterfaceId=eni["NetworkInterface"]["NetworkInterfaceId"], 86 | RouteTableId=route_table_two.id 87 | ) 88 | 89 | return { 90 | "vpc": vpc.id, 91 | "public_subnet": public_subnet.id, 92 | "private_subnet": private_subnet.id, 93 | "private_subnet_two": private_subnet_two.id, 94 | "nat_gw": nat_gw_id, 95 | "route_table": route_table.id, 96 | "route_table_two": route_table_two.id, 97 | "sg": sg.id, 98 | } 99 | 100 | 101 | def verify_nat_gateway_route(mocked_networking): 102 | ec2_client = boto3.client("ec2") 103 | 104 | filters = [{"Name": "route-table-id", "Values": [mocked_networking["route_table"],mocked_networking["route_table_two"]]}] 105 | route_tables = ec2_client.describe_route_tables(Filters=filters)["RouteTables"] 106 | 107 | route_tables.should.have.length_of(2) 108 | route_tables[0]["Routes"].should.have.length_of(2) 109 | route_tables[1]["Routes"].should.have.length_of(2) 110 | 111 | for rt in route_tables: 112 | for route in rt["Routes"]: 113 | if route["DestinationCidrBlock"] == "0.0.0.0/0": 114 | zero_route = route 115 |
zero_route.should.have.key("NatGatewayId").equals(mocked_networking["nat_gw"]) 116 | 117 | 118 | def verify_nat_instance_route(mocked_networking, instance_id): 119 | ec2_client = boto3.client("ec2") 120 | 121 | filters = [{"Name": "route-table-id", "Values": [mocked_networking["route_table"],mocked_networking["route_table_two"]]}] 122 | route_tables = ec2_client.describe_route_tables(Filters=filters)["RouteTables"] 123 | 124 | route_tables.should.have.length_of(2) 125 | route_tables[0]["Routes"].should.have.length_of(2) 126 | route_tables[1]["Routes"].should.have.length_of(2) 127 | 128 | for rt in route_tables: 129 | for route in rt["Routes"]: 130 | if route["DestinationCidrBlock"] == "0.0.0.0/0": 131 | zero_route = route 132 | zero_route.should.have.key("InstanceId").equals(instance_id) 133 | 134 | 135 | @mock_aws 136 | def test_handler(monkeypatch): 137 | mocked_networking = setup_networking() 138 | ec2_client = boto3.client("ec2") 139 | template = ec2_client.create_launch_template( 140 | LaunchTemplateName="test_launch_template", 141 | LaunchTemplateData={"ImageId": EXAMPLE_AMI_ID, "InstanceType": "t2.micro"}, 142 | )["LaunchTemplate"] 143 | 144 | autoscaling_client = boto3.client("autoscaling") 145 | autoscaling_client.create_auto_scaling_group( 146 | AutoScalingGroupName="alternat-asg", 147 | VPCZoneIdentifier=mocked_networking["public_subnet"], 148 | MinSize=1, 149 | MaxSize=1, 150 | LaunchTemplate={ 151 | "LaunchTemplateId": template["LaunchTemplateId"], 152 | "Version": str(template["LatestVersionNumber"]), 153 | }, 154 | ) 155 | 156 | from app import handler 157 | 158 | script_dir = os.path.dirname(__file__) 159 | with open(os.path.join(script_dir, "../sns-event.json"), "r") as file: 160 | asg_termination_event = file.read() 161 | 162 | az = f"{os.environ['AWS_DEFAULT_REGION']}a".upper().replace("-", "_") 163 | monkeypatch.setenv(az, ",".join([mocked_networking["route_table"],mocked_networking["route_table_two"]])) 164 | 165 | # CompleteLifecycleAction is 
not implemented by Moto 166 | orig_make_api_call = botocore.client.BaseClient._make_api_call 167 | mock_complete_lifecycle_action = mock.Mock() 168 | def mock_make_api_call(self, operation_name, kwarg): 169 | if operation_name == "CompleteLifecycleAction": 170 | return mock_complete_lifecycle_action(self, operation_name, kwarg) 171 | return orig_make_api_call(self, operation_name, kwarg) 172 | with mock.patch("botocore.client.BaseClient._make_api_call", new=mock_make_api_call): 173 | handler(json.loads(asg_termination_event), {}) 174 | mock_complete_lifecycle_action.assert_called_once() 175 | 176 | verify_nat_gateway_route(mocked_networking) 177 | 178 | 179 | @mock_aws 180 | def get_role(): 181 | iam = boto3.client("iam") 182 | return iam.create_role( 183 | RoleName="my-role", 184 | AssumeRolePolicyDocument="some policy", 185 | Path="/my-path/", 186 | )["Role"]["Arn"] 187 | 188 | 189 | def get_test_zip_file1(): 190 | pfunc = """ 191 | def lambda_handler(event, context): 192 | print("custom log event") 193 | return event 194 | """ 195 | return _process_lambda(pfunc) 196 | 197 | 198 | def _process_lambda(func_str): 199 | zip_output = io.BytesIO() 200 | zip_file = zipfile.ZipFile(zip_output, "w", zipfile.ZIP_DEFLATED) 201 | zip_file.writestr("lambda_function.py", func_str) 202 | zip_file.close() 203 | zip_output.seek(0) 204 | return zip_output.read() 205 | 206 | 207 | @mock_aws 208 | @mock.patch('time.sleep') 209 | @mock.patch('urllib.request.urlopen') 210 | def test_connectivity_test_handler(mock_urlopen, mock_sleep, monkeypatch): 211 | from app import connectivity_test_handler 212 | mocked_networking = setup_networking() 213 | 214 | lambda_client = boto3.client("lambda") 215 | lambda_function_name = "alternat-connectivity-test" 216 | lambda_client.create_function( 217 | FunctionName=lambda_function_name, 218 | Role=get_role(), 219 | Code={"ZipFile": get_test_zip_file1()}, 220 | ) 221 | 222 | script_dir = os.path.dirname(__file__) 223 | with 
open(os.path.join(script_dir, "../cloudwatch-event.json"), "r") as file: 224 | cloudwatch_event = file.read() 225 | 226 | class Context: 227 | function_name = lambda_function_name 228 | 229 | mock_urlopen.side_effect = socket.timeout() 230 | monkeypatch.setenv("ROUTE_TABLE_IDS_CSV", ",".join([mocked_networking["route_table"], mocked_networking["route_table_two"]])) 231 | monkeypatch.setenv("PUBLIC_SUBNET_ID", mocked_networking["public_subnet"]) 232 | monkeypatch.setenv("ENABLE_NAT_RESTORE", "false") # Disable NAT restore for this test 233 | 234 | connectivity_test_handler(event=json.loads(cloudwatch_event), context=Context()) 235 | 236 | verify_nat_gateway_route(mocked_networking) 237 | 238 | 239 | def test_disable_ipv6(): 240 | with mock.patch('socket.getaddrinfo') as mock_getaddrinfo: 241 | from app import disable_ipv6 242 | disable_ipv6() 243 | socket.getaddrinfo('example.com', 80) 244 | mock_getaddrinfo.assert_called() 245 | call_args = mock_getaddrinfo.call_args.args 246 | assert len(call_args) == 3, f"With IPv6 disabled, expected 3 arguments to getaddrinfo, found {len(call_args)}" 247 | assert call_args[2] == socket.AF_INET, "Did not find AF_INET family in args to getaddrinfo" 248 | 249 | 250 | @mock_aws 251 | def test_is_source_dest_check_enabled(): 252 | mocked_networking = setup_networking() 253 | ec2_client = boto3.client("ec2") 254 | 255 | # Create a test instance 256 | instance = ec2_client.run_instances( 257 | ImageId=EXAMPLE_AMI_ID, 258 | MinCount=1, 259 | MaxCount=1, 260 | InstanceType="t2.micro", 261 | SubnetId=mocked_networking["public_subnet"] 262 | )["Instances"][0] 263 | instance_id = instance["InstanceId"] 264 | 265 | from app import is_source_dest_check_enabled 266 | 267 | # Default is True 268 | assert is_source_dest_check_enabled(instance_id) == True 269 | 270 | # Disable source/dest check 271 | ec2_client.modify_instance_attribute( 272 | InstanceId=instance_id, 273 | SourceDestCheck={"Value": False} 274 | ) 275 | 276 | # Should return False 
now 277 | assert is_source_dest_check_enabled(instance_id) == False 278 | 279 | # Test error handling (invalid instance ID) 280 | assert is_source_dest_check_enabled("i-invalid") is None 281 | 282 | 283 | @mock_aws 284 | def test_are_any_routes_pointing_to_nat_gateway(): 285 | mocked_networking = setup_networking() 286 | ec2_client = boto3.client("ec2") 287 | 288 | # Setup: Modify routes to use NAT Gateway for testing 289 | for rtb_id in [mocked_networking["route_table"], mocked_networking["route_table_two"]]: 290 | # First delete any existing default routes 291 | ec2_client.delete_route( 292 | RouteTableId=rtb_id, 293 | DestinationCidrBlock="0.0.0.0/0" 294 | ) 295 | # Create new route using NAT Gateway 296 | ec2_client.create_route( 297 | RouteTableId=rtb_id, 298 | DestinationCidrBlock="0.0.0.0/0", 299 | NatGatewayId=mocked_networking["nat_gw"] 300 | ) 301 | 302 | from app import are_any_routes_pointing_to_nat_gateway 303 | 304 | # Now our setup has routes with NAT gateway, should return True 305 | route_tables = [mocked_networking["route_table"], mocked_networking["route_table_two"]] 306 | assert are_any_routes_pointing_to_nat_gateway(route_tables) == True 307 | 308 | # Test with invalid route table 309 | assert are_any_routes_pointing_to_nat_gateway(["rtb-invalid"]) == False 310 | 311 | 312 | @mock_aws 313 | @mock.patch('time.sleep') 314 | def test_run_nat_instance_diagnostics(mock_sleep): 315 | from app import run_nat_instance_diagnostics 316 | 317 | # Common text fragments for building diagnostic outputs 318 | nft_table_start = "nft_nat_table=table ip nat {\nchain postrouting {\ntype nat hook postrouting priority 100; policy accept;" 319 | masquerade_rule = "\nip saddr 10.0.0.0/8 oifname \"eth0\" counter packets 0 bytes 0 masquerade" 320 | nft_table_end = "\n}\n}\n" 321 | 322 | # Build diagnostic outputs 323 | good_nft_output = nft_table_start + masquerade_rule + nft_table_end 324 | missing_masquerade_output = nft_table_start + nft_table_end 325 | 326 | # Test 
scenarios: (description, ssm_output, source_dest_check, expected_result) 327 | test_cases = [ 328 | ("successful diagnostics", f"ip_forward=1\n{good_nft_output}", False, True), 329 | ("ip_forward=0 failure", f"ip_forward=0\n{good_nft_output}", False, False), 330 | ("missing masquerade rule failure", f"ip_forward=1\n{missing_masquerade_output}", False, False), 331 | ("source/dest check enabled failure", f"ip_forward=1\n{good_nft_output}", True, False), 332 | ("error in source/dest check", f"ip_forward=1\n{good_nft_output}", None, False), 333 | ] 334 | 335 | with mock.patch('boto3.client') as mock_boto_client: 336 | mock_ssm = mock.MagicMock() 337 | mock_boto_client.return_value = mock_ssm 338 | mock_ssm.send_command.return_value = {'Command': {'CommandId': 'test-command-id'}} 339 | 340 | for description, ssm_output, source_dest_check, expected_result in test_cases: 341 | mock_ssm.get_command_invocation.return_value = { 342 | 'Status': 'Success', 343 | 'StandardOutputContent': ssm_output, 344 | 'StandardErrorContent': '' 345 | } 346 | 347 | with mock.patch('app.is_source_dest_check_enabled', return_value=source_dest_check): 348 | result = run_nat_instance_diagnostics('i-12345678') 349 | assert result == expected_result, f"Failed test case: {description}" 350 | 351 | # Test SSM command failure separately 352 | mock_ssm.send_command.side_effect = botocore.exceptions.ClientError( 353 | {'Error': {'Code': 'InvalidInstanceId', 'Message': 'Test error'}}, 354 | 'SendCommand' 355 | ) 356 | result = run_nat_instance_diagnostics('i-12345678') 357 | assert result == False 358 | 359 | 360 | @mock_aws 361 | @mock.patch('time.sleep') 362 | def test_attempt_nat_instance_restore(mock_sleep, monkeypatch): 363 | from app import attempt_nat_instance_restore 364 | 365 | # Need to mock boto3.client to avoid calling AWS API 366 | with mock.patch('boto3.client') as mock_boto_client: 367 | # Mock AWS clients 368 | mock_ssm = mock.MagicMock() 369 | mock_ec2 = mock.MagicMock() 370 | 371 | def 
get_boto_client(service): 372 | if service == 'ssm': 373 | return mock_ssm 374 | elif service == 'ec2': 375 | return mock_ec2 376 | return mock.MagicMock() 377 | 378 | mock_boto_client.side_effect = get_boto_client 379 | 380 | # Setup environment 381 | route_tables = ['rtb-12345', 'rtb-67890'] 382 | monkeypatch.setenv("ROUTE_TABLE_IDS_CSV", ",".join(route_tables)) 383 | monkeypatch.setenv("NAT_ASG_NAME", "test-nat-asg") 384 | 385 | # Mock successful test and diagnostic 386 | with mock.patch('app.get_current_nat_instance_id', return_value='i-test123'): 387 | with mock.patch('app.run_nat_instance_diagnostics', return_value=True): 388 | # Mock successful connectivity test 389 | mock_ssm.send_command.return_value = { 390 | 'Command': {'CommandId': 'test-command-id'} 391 | } 392 | mock_ssm.get_command_invocation.return_value = { 393 | 'Status': 'Success', 394 | 'StandardOutputContent': '200\n200', 395 | 'StandardErrorContent': '' 396 | } 397 | 398 | # Mock replace_route 399 | with mock.patch('app.replace_route') as mock_replace_route: 400 | # Test successful restore 401 | attempt_nat_instance_restore() 402 | 403 | # Verify replace_route called for both route tables 404 | assert mock_replace_route.call_count == 2 405 | mock_replace_route.assert_any_call(route_tables[0], 'i-test123') 406 | mock_replace_route.assert_any_call(route_tables[1], 'i-test123') 407 | 408 | # Test when NAT instance has no internet 409 | with mock.patch('app.get_current_nat_instance_id', return_value='i-test123'): 410 | mock_ssm.get_command_invocation.return_value = { 411 | 'Status': 'Success', 412 | 'StandardOutputContent': '404\n500', 413 | 'StandardErrorContent': '' 414 | } 415 | with mock.patch('app.replace_route') as mock_replace_route: 416 | attempt_nat_instance_restore() 417 | # Should not call replace_route 418 | assert mock_replace_route.call_count == 0 419 | 420 | # Test when diagnostics fail 421 | with mock.patch('app.get_current_nat_instance_id', return_value='i-test123'): 422 | with 
mock.patch('app.run_nat_instance_diagnostics', return_value=False): 423 | mock_ssm.get_command_invocation.return_value = { 424 | 'Status': 'Success', 425 | 'StandardOutputContent': '200\n200', 426 | 'StandardErrorContent': '' 427 | } 428 | with mock.patch('app.replace_route') as mock_replace_route: 429 | attempt_nat_instance_restore() 430 | # Should not call replace_route 431 | assert mock_replace_route.call_count == 0 432 | 433 | 434 | @mock_aws 435 | @mock.patch('time.sleep') 436 | def test_nat_restore_option(mock_sleep, monkeypatch): 437 | from app import connectivity_test_handler 438 | mocked_networking = setup_networking() 439 | 440 | with mock.patch('app.get_current_nat_instance_id') as mock_get_instance: 441 | # Just mock a NAT instance ID directly instead of creating instances 442 | mock_get_instance.return_value = 'i-test123' 443 | 444 | # Setup environment variables 445 | script_dir = os.path.dirname(__file__) 446 | with open(os.path.join(script_dir, "../cloudwatch-event.json"), "r") as file: 447 | cloudwatch_event = file.read() 448 | 449 | class Context: 450 | function_name = "alternat-connectivity-test" 451 | 452 | monkeypatch.setenv("ROUTE_TABLE_IDS_CSV", ",".join([mocked_networking["route_table"], mocked_networking["route_table_two"]])) 453 | monkeypatch.setenv("PUBLIC_SUBNET_ID", mocked_networking["public_subnet"]) 454 | monkeypatch.setenv("NAT_ASG_NAME", "alternat-nat-asg") 455 | monkeypatch.setenv("CONNECTIVITY_CHECK_INTERVAL", "60") 456 | 457 | # Use a with block for urllib mocking 458 | with mock.patch('urllib.request.urlopen') as mock_urlopen: 459 | # Test with NAT restore disabled (default) 460 | mock_urlopen.side_effect = None # Connection succeeds 461 | 462 | # Run test with restore DISABLED (default behavior) 463 | with mock.patch('app.attempt_nat_instance_restore') as mock_restore: 464 | connectivity_test_handler(event=json.loads(cloudwatch_event), context=Context()) 465 | mock_restore.assert_not_called() # Should not try to restore 466 | 
467 | # Test with NAT restore enabled 468 | monkeypatch.setenv("ENABLE_NAT_RESTORE", "true") 469 | 470 | # Mock that we're using NAT Gateway 471 | with mock.patch('app.are_any_routes_pointing_to_nat_gateway', return_value=True): 472 | # Mock the attempt_nat_instance_restore function 473 | with mock.patch('app.attempt_nat_instance_restore') as mock_restore: 474 | connectivity_test_handler(event=json.loads(cloudwatch_event), context=Context()) 475 | mock_restore.assert_called_once() # Should try to restore 476 | -------------------------------------------------------------------------------- /test/alternat_test.go: -------------------------------------------------------------------------------- 1 | package test 2 | 3 | import ( 4 | "context" 5 | "fmt" 6 | "io" 7 | "net/http" 8 | "strings" 9 | "testing" 10 | "time" 11 | 12 | "github.com/aws/aws-sdk-go-v2/aws" 13 | "github.com/aws/aws-sdk-go-v2/config" 14 | 15 | "github.com/aws/aws-sdk-go-v2/service/ec2" 16 | ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" 17 | 18 | terraws "github.com/gruntwork-io/terratest/modules/aws" 19 | "github.com/gruntwork-io/terratest/modules/logger" 20 | "github.com/gruntwork-io/terratest/modules/random" 21 | "github.com/gruntwork-io/terratest/modules/retry" 22 | "github.com/gruntwork-io/terratest/modules/ssh" 23 | "github.com/gruntwork-io/terratest/modules/terraform" 24 | test_structure "github.com/gruntwork-io/terratest/modules/test-structure" 25 | 26 | "github.com/stretchr/testify/assert" 27 | "github.com/stretchr/testify/require" 28 | ) 29 | 30 | // Maintainer's note: This test will currently cause name collisions if multiple tests run in parallel 31 | // in the same account. This is because the test uses a fixed name prefix for resources. This could be fixed 32 | // by using GetRandomStableRegion and updating some resources (such as IAM role and CloudWatch event name) 33 | // to use a random suffix. 
34 | 35 | func TestAlternat(t *testing.T) { 36 | // Uncomment any of the following lines to skip that part of the test. 37 | // This is useful for iterating during test development. 38 | // See https://terratest.gruntwork.io/docs/testing-best-practices/iterating-locally-using-test-stages/ 39 | // os.Setenv("SKIP_setup", "true") 40 | // os.Setenv("SKIP_apply_vpc", "true") 41 | // os.Setenv("SKIP_apply_alternat_basic", "true") 42 | // os.Setenv("SKIP_validate_alternat_basic", "true") 43 | // os.Setenv("SKIP_validate_alternat_setup", "true") 44 | // os.Setenv("SKIP_validate_alternat_replace_route", "true") 45 | // os.Setenv("SKIP_validate_alternat_return_to_nat_instance", "true") 46 | // os.Setenv("SKIP_cleanup", "true") 47 | 48 | exampleFolder := test_structure.CopyTerraformFolderToTemp(t, "..", "examples/") 49 | 50 | // logger := logger.Logger{} 51 | 52 | defer test_structure.RunTestStage(t, "cleanup", func() { 53 | terraformOptions := test_structure.LoadTerraformOptions(t, exampleFolder) 54 | awsKeyPair := test_structure.LoadEc2KeyPair(t, exampleFolder) 55 | terraws.DeleteEC2KeyPair(t, awsKeyPair) 56 | terraform.Destroy(t, terraformOptions) 57 | }) 58 | 59 | test_structure.RunTestStage(t, "setup", func() { 60 | //Use a random region if the SCP allows, otherwise hardcode. 
61 | //awsRegion := terraws.GetRandomStableRegion(t, nil, nil) 62 | awsRegion := "us-east-1" 63 | 64 | uniqueID := random.UniqueId() 65 | keyPair := ssh.GenerateRSAKeyPair(t, 2048) 66 | awsKeyPair := terraws.ImportEC2KeyPair(t, awsRegion, uniqueID, keyPair) 67 | 68 | terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{ 69 | TerraformDir: exampleFolder, 70 | Vars: map[string]interface{}{ 71 | "aws_region": awsRegion, 72 | "nat_instance_key_name": awsKeyPair.Name, 73 | }, 74 | }) 75 | 76 | test_structure.SaveString(t, exampleFolder, "awsRegion", awsRegion) 77 | test_structure.SaveEc2KeyPair(t, exampleFolder, awsKeyPair) 78 | test_structure.SaveTerraformOptions(t, exampleFolder, terraformOptions) 79 | }) 80 | 81 | test_structure.RunTestStage(t, "apply_vpc", func() { 82 | terraformOptions := test_structure.LoadTerraformOptions(t, exampleFolder) 83 | terraformOptionsVpcOnly, err := terraformOptions.Clone() 84 | if err != nil { 85 | t.Fatal(err) 86 | } 87 | terraformOptionsVpcOnly.Targets = []string{"module.vpc"} 88 | terraform.InitAndApply(t, terraformOptionsVpcOnly) 89 | 90 | vpcID := terraform.Output(t, terraformOptions, "vpc_id") 91 | test_structure.SaveString(t, exampleFolder, "vpcID", vpcID) 92 | }) 93 | 94 | test_structure.RunTestStage(t, "apply_alternat_basic", func() { 95 | terraformOptions := test_structure.LoadTerraformOptions(t, exampleFolder) 96 | terraform.InitAndApply(t, terraformOptions) 97 | assert.Equal(t, 0, terraform.InitAndPlanWithExitCode(t, terraformOptions)) 98 | sgId := terraform.Output(t, terraformOptions, "nat_instance_security_group_id") 99 | test_structure.SaveString(t, exampleFolder, "sgId", sgId) 100 | }) 101 | 102 | test_structure.RunTestStage(t, "validate_alternat_basic", func() { 103 | vpcID := test_structure.LoadString(t, exampleFolder, "vpcID") 104 | awsRegion := test_structure.LoadString(t, exampleFolder, "awsRegion") 105 | ec2Client := getEc2Client(t, awsRegion) 106 | routeTables, err := getRouteTables(t, 
ec2Client, vpcID) 107 | require.NoError(t, err) 108 | 109 | // Validate that private route tables have routes to the Internet via ENI 110 | for _, rt := range routeTables { 111 | for _, r := range rt.Routes { 112 | // If the route has a gateway ID, it must be a public route table. 113 | // Otherwise, it must be a private route table, and it must route to the Internet via ENI. 114 | if aws.ToString(r.DestinationCidrBlock) == "0.0.0.0/0" && r.GatewayId == nil && r.NetworkInterfaceId == nil { 115 | t.Fatalf("Private route table %v does not have a default route via ENI", rt.RouteTableId) 116 | } 117 | } 118 | } 119 | }) 120 | 121 | test_structure.RunTestStage(t, "validate_alternat_setup", func() { 122 | sgId := aws.String(test_structure.LoadString(t, exampleFolder, "sgId")) 123 | awsRegion := test_structure.LoadString(t, exampleFolder, "awsRegion") 124 | ec2Client := getEc2Client(t, awsRegion) 125 | awsKeyPair := test_structure.LoadEc2KeyPair(t, exampleFolder) 126 | 127 | authorizeSshIngress(t, ec2Client, sgId) 128 | ip, err := getNatInstancePublicIp(t, ec2Client) 129 | require.NoError(t, err) 130 | 131 | natInstance := ssh.Host{ 132 | Hostname: ip, 133 | SshUserName: "ec2-user", 134 | SshKeyPair: awsKeyPair.KeyPair, 135 | } 136 | 137 | maxRetries := 6 138 | waitTime := 10 * time.Second 139 | retry.DoWithRetry(t, fmt.Sprintf("Check SSH connection to %s", ip), maxRetries, waitTime, func() (string, error) { 140 | return "", ssh.CheckSshConnectionE(t, natInstance) 141 | }) 142 | 143 | command := "sudo /usr/sbin/nft list ruleset" 144 | 145 | expectedText := `table ip nat { 146 | chain postrouting { 147 | type nat hook postrouting priority srcnat; policy accept; 148 | ip saddr 10.10.0.0/16 oif "ens5" masquerade 149 | ip saddr 10.20.0.0/16 oif "ens5" masquerade 150 | } 151 | } 152 | ` 153 | 154 | maxRetries = 5 155 | waitTime = 10 * time.Second 156 | retry.DoWithRetry(t, fmt.Sprintf("SSH to NAT instance at IP %s", ip), maxRetries, waitTime, func() (string, error) { 157 | 
actualText, err := ssh.CheckSshCommandE(t, natInstance, command) 158 | assert.NoError(t, err) 159 | if actualText != expectedText { 160 | return "", fmt.Errorf("Expected SSH command to return '%s' but got '%s'", expectedText, actualText) 161 | } 162 | return "", nil 163 | }) 164 | 165 | userdataLogFile := "/var/log/user-data.log" 166 | output := retry.DoWithRetry(t, fmt.Sprintf("Check contents of file %s", userdataLogFile), maxRetries, waitTime, func() (string, error) { 167 | return ssh.FetchContentsOfFileE(t, natInstance, false, userdataLogFile) 168 | }) 169 | assert.Contains(t, output, "Configuration completed successfully!", "Success string not found in user-data log: %s", output) 170 | }) 171 | 172 | // Delete the egress rules that allow access to the Internet from the instance, then 173 | // validate that Alternat has updated the route to use the NAT Gateway. 174 | test_structure.RunTestStage(t, "validate_alternat_replace_route", func() { 175 | sgId := aws.String(test_structure.LoadString(t, exampleFolder, "sgId")) 176 | vpcID := test_structure.LoadString(t, exampleFolder, "vpcID") 177 | awsRegion := test_structure.LoadString(t, exampleFolder, "awsRegion") 178 | ec2Client := getEc2Client(t, awsRegion) 179 | 180 | updateEgress(t, ec2Client, sgId, true) 181 | 182 | // Get the NAT Gateway IDs to validate routes point to any of the correct targets 183 | expectedNatGwIds, err := getNatGatewayIds(t, ec2Client, vpcID) 184 | require.NoError(t, err) 185 | require.Greater(t, len(expectedNatGwIds), 0, "No NAT Gateway IDs found") 186 | 187 | // Validate that private route tables have routes to the Internet via any of the NAT Gateways 188 | maxRetries := 12 189 | waitTime := 10 * time.Second 190 | output := retry.DoWithRetry(t, "Validating route through NAT Gateway", maxRetries, waitTime, func() (string, error) { 191 | routeTables, err := getPrivateRouteTables(t, ec2Client, vpcID) 192 | require.NoError(t, err) 193 | 194 | for _, rt := range routeTables { 195 | 
foundCorrectRoute := false 196 | var currentRouteTarget string 197 | for _, r := range rt.Routes { 198 | if aws.ToString(r.DestinationCidrBlock) == "0.0.0.0/0" { 199 | // Check that this default route points to one of our expected NAT Gateways 200 | currentNatGwId := aws.ToString(r.NatGatewayId) 201 | currentRouteTarget = fmt.Sprintf("NatGatewayId=%s, InstanceId=%s, NetworkInterfaceId=%s, GatewayId=%s", 202 | aws.ToString(r.NatGatewayId), aws.ToString(r.InstanceId), aws.ToString(r.NetworkInterfaceId), aws.ToString(r.GatewayId)) 203 | 204 | for _, expectedNatGwId := range expectedNatGwIds { 205 | if currentNatGwId == expectedNatGwId { 206 | foundCorrectRoute = true 207 | break 208 | } 209 | } 210 | if foundCorrectRoute { 211 | break 212 | } else if currentNatGwId != "" { 213 | // Route exists but points to wrong NAT Gateway 214 | return "", fmt.Errorf("Private route table %v has 0.0.0.0/0 route pointing to NAT Gateway %v, which is not one of the expected NAT Gateways %v", 215 | *rt.RouteTableId, currentNatGwId, expectedNatGwIds) 216 | } 217 | } 218 | } 219 | if !foundCorrectRoute { 220 | if currentRouteTarget != "" { 221 | return "", fmt.Errorf("Private route table %v has 0.0.0.0/0 route pointing to %s instead of expected NAT Gateways %v", 222 | *rt.RouteTableId, currentRouteTarget, expectedNatGwIds) 223 | } else { 224 | return "", fmt.Errorf("Private route table %v does not have a 0.0.0.0/0 route pointing to any NAT Gateway %v", 225 | *rt.RouteTableId, expectedNatGwIds) 226 | } 227 | } 228 | } 229 | return "All private route tables route through NAT Gateway", nil 230 | }) 231 | logger := logger.Logger{} 232 | logger.Logf(t, output) 233 | }) 234 | 235 | // Validate that Alternat returns to the NAT instance when the egress rules are restored 236 | test_structure.RunTestStage(t, "validate_alternat_return_to_nat_instance", func() { 237 | sgId := aws.String(test_structure.LoadString(t, exampleFolder, "sgId")) 238 | vpcID := test_structure.LoadString(t, exampleFolder, 
"vpcID") 239 | awsRegion := test_structure.LoadString(t, exampleFolder, "awsRegion") 240 | ec2Client := getEc2Client(t, awsRegion) 241 | 242 | // Restore the egress rules that allow access to the Internet from the instance 243 | updateEgress(t, ec2Client, sgId, false) 244 | 245 | // Get the NAT instance ENI IDs to validate routes point to any of the correct targets 246 | expectedEniIds, err := getNatInstanceEniIds(t, ec2Client) 247 | require.NoError(t, err) 248 | require.Greater(t, len(expectedEniIds), 0, "No NAT instance ENI IDs found") 249 | 250 | // Validate that private route tables have routes to the Internet via any of the NAT instance ENIs 251 | maxRetries := 12 252 | waitTime := 10 * time.Second 253 | output := retry.DoWithRetry(t, "Validating route returns to the NAT instance", maxRetries, waitTime, func() (string, error) { 254 | routeTables, err := getPrivateRouteTables(t, ec2Client, vpcID) 255 | require.NoError(t, err) 256 | 257 | for _, rt := range routeTables { 258 | foundCorrectRoute := false 259 | var currentRouteTarget string 260 | for _, r := range rt.Routes { 261 | if aws.ToString(r.DestinationCidrBlock) == "0.0.0.0/0" { 262 | // Check that this default route points to one of our expected NAT instance ENIs 263 | currentEniId := aws.ToString(r.NetworkInterfaceId) 264 | currentRouteTarget = fmt.Sprintf("NatGatewayId=%s, InstanceId=%s, NetworkInterfaceId=%s, GatewayId=%s", 265 | aws.ToString(r.NatGatewayId), aws.ToString(r.InstanceId), aws.ToString(r.NetworkInterfaceId), aws.ToString(r.GatewayId)) 266 | 267 | for _, expectedEniId := range expectedEniIds { 268 | if currentEniId == expectedEniId { 269 | foundCorrectRoute = true 270 | break 271 | } 272 | } 273 | if foundCorrectRoute { 274 | break 275 | } else if currentEniId != "" { 276 | // Route exists but points to wrong ENI 277 | return "", fmt.Errorf("Private route table %v has 0.0.0.0/0 route pointing to ENI %v, which is not one of the expected NAT instance ENIs %v", 278 | *rt.RouteTableId, 
currentEniId, expectedEniIds) 279 | } 280 | } 281 | } 282 | if !foundCorrectRoute { 283 | if currentRouteTarget != "" { 284 | return "", fmt.Errorf("Private route table %v has 0.0.0.0/0 route pointing to %s instead of expected NAT instance ENIs %v", 285 | *rt.RouteTableId, currentRouteTarget, expectedEniIds) 286 | } else { 287 | return "", fmt.Errorf("Private route table %v does not have a 0.0.0.0/0 route pointing to any NAT instance ENI %v", 288 | *rt.RouteTableId, expectedEniIds) 289 | } 290 | } 291 | } 292 | return "All private route tables route through NAT instance", nil 293 | }) 294 | logger := logger.Logger{} 295 | logger.Logf(t, output) 296 | }) 297 | } 298 | 299 | func updateEgress(t *testing.T, ec2Client *ec2.Client, sgId *string, revoke bool) { 300 | basePermission := ec2types.IpPermission{ 301 | FromPort: aws.Int32(0), 302 | ToPort: aws.Int32(0), 303 | IpProtocol: aws.String("-1"), 304 | } 305 | ipv4Permission := basePermission 306 | ipv4Permission.IpRanges = []ec2types.IpRange{ 307 | { 308 | CidrIp: aws.String("0.0.0.0/0"), 309 | }, 310 | } 311 | ipv6Permission := basePermission 312 | ipv6Permission.Ipv6Ranges = []ec2types.Ipv6Range{ 313 | { 314 | CidrIpv6: aws.String("::/0"), 315 | }, 316 | } 317 | allPermissions := []ec2types.IpPermission{ipv4Permission, ipv6Permission} 318 | 319 | var err error 320 | if revoke { 321 | _, err = ec2Client.RevokeSecurityGroupEgress(context.TODO(), &ec2.RevokeSecurityGroupEgressInput{ 322 | GroupId: sgId, 323 | IpPermissions: allPermissions, 324 | }, 325 | ) 326 | require.NoError(t, err) 327 | } else { 328 | _, err = ec2Client.AuthorizeSecurityGroupEgress(context.TODO(), &ec2.AuthorizeSecurityGroupEgressInput{ 329 | GroupId: sgId, 330 | IpPermissions: allPermissions, 331 | }, 332 | ) 333 | require.NoError(t, err) 334 | } 335 | } 336 | 337 | func getPrivateRouteTables(t *testing.T, client *ec2.Client, vpcID string) ([]ec2types.RouteTable, error) { 338 | allRouteTables, err := getRouteTables(t, client, vpcID) 339 | if err 
!= nil { 340 | return nil, err 341 | } 342 | 343 | var privateRouteTables []ec2types.RouteTable 344 | for _, rt := range allRouteTables { 345 | // Skip the main/default route table 346 | isMainRouteTable := false 347 | for _, assoc := range rt.Associations { 348 | if aws.ToBool(assoc.Main) { 349 | isMainRouteTable = true 350 | break 351 | } 352 | } 353 | if isMainRouteTable { 354 | continue // Skip main route table 355 | } 356 | 357 | // Check if this is a private route table (doesn't route to IGW) 358 | isPrivate := true 359 | for _, route := range rt.Routes { 360 | if aws.ToString(route.DestinationCidrBlock) == "0.0.0.0/0" && route.GatewayId != nil && strings.HasPrefix(aws.ToString(route.GatewayId), "igw-") { 361 | isPrivate = false 362 | break 363 | } 364 | } 365 | if isPrivate { 366 | privateRouteTables = append(privateRouteTables, rt) 367 | } 368 | } 369 | 370 | return privateRouteTables, nil 371 | } 372 | 373 | func getRouteTables(t *testing.T, client *ec2.Client, vpcID string) ([]ec2types.RouteTable, error) { 374 | input := &ec2.DescribeRouteTablesInput{ 375 | Filters: []ec2types.Filter{ 376 | { 377 | Name: aws.String("vpc-id"), 378 | Values: []string{vpcID}, 379 | }, 380 | }, 381 | } 382 | 383 | result, err := client.DescribeRouteTables(context.TODO(), input) 384 | if err != nil { 385 | return nil, err 386 | } 387 | require.Greaterf(t, len(result.RouteTables), 0, "Could not find a route table for vpc %s", vpcID) 388 | 389 | return result.RouteTables, nil 390 | } 391 | 392 | func getNatGatewayIds(t *testing.T, ec2Client *ec2.Client, vpcID string) ([]string, error) { 393 | input := &ec2.DescribeNatGatewaysInput{ 394 | Filter: []ec2types.Filter{ 395 | { 396 | Name: aws.String("vpc-id"), 397 | Values: []string{vpcID}, 398 | }, 399 | { 400 | Name: aws.String("state"), 401 | Values: []string{"available"}, 402 | }, 403 | }, 404 | } 405 | 406 | maxRetries := 6 407 | waitTime := 10 * time.Second 408 | var finalNatGwIds []string 409 | 410 | retry.DoWithRetry(t, "Get 
NAT Gateway IDs", maxRetries, waitTime, func() (string, error) { 411 | result, err := ec2Client.DescribeNatGateways(context.TODO(), input) 412 | if err != nil { 413 | return "", err 414 | } 415 | 416 | if len(result.NatGateways) == 0 { 417 | return "", fmt.Errorf("No NAT Gateways found in VPC %v", vpcID) 418 | } 419 | 420 | var natGwIds []string 421 | for _, natGw := range result.NatGateways { 422 | natGwId := aws.ToString(natGw.NatGatewayId) 423 | if natGwId != "" { 424 | natGwIds = append(natGwIds, natGwId) 425 | } 426 | } 427 | 428 | if len(natGwIds) == 0 { 429 | return "", fmt.Errorf("No valid NAT Gateway IDs found in VPC %v", vpcID) 430 | } 431 | 432 | finalNatGwIds = natGwIds 433 | return "success", nil 434 | }) 435 | 436 | return finalNatGwIds, nil 437 | } 438 | 439 | func getNatInstanceEniIds(t *testing.T, ec2Client *ec2.Client) ([]string, error) { 440 | namePrefix := "alternat-" 441 | input := &ec2.DescribeInstancesInput{ 442 | Filters: []ec2types.Filter{ 443 | { 444 | Name: aws.String("tag:Name"), 445 | Values: []string{namePrefix + "*"}, 446 | }, 447 | { 448 | Name: aws.String("instance-state-name"), 449 | Values: []string{"running"}, 450 | }, 451 | }, 452 | } 453 | 454 | maxRetries := 6 455 | waitTime := 10 * time.Second 456 | var finalEniIds []string 457 | 458 | retry.DoWithRetry(t, "Get NAT Instance ENI IDs", maxRetries, waitTime, func() (string, error) { 459 | result, err := ec2Client.DescribeInstances(context.TODO(), input) 460 | if err != nil { 461 | return "", err 462 | } 463 | 464 | if len(result.Reservations) == 0 { 465 | return "", fmt.Errorf("No NAT instances found") 466 | } 467 | 468 | var eniIds []string 469 | for _, reservation := range result.Reservations { 470 | for _, instance := range reservation.Instances { 471 | if len(instance.NetworkInterfaces) == 0 { 472 | continue // Skip instances without network interfaces 473 | } 474 | // Get the primary network interface (index 0) 475 | eniId := 
aws.ToString(instance.NetworkInterfaces[0].NetworkInterfaceId) 476 | if eniId != "" { 477 | eniIds = append(eniIds, eniId) 478 | } 479 | } 480 | } 481 | 482 | if len(eniIds) == 0 { 483 | return "", fmt.Errorf("No valid ENI IDs found for NAT instances") 484 | } 485 | 486 | finalEniIds = eniIds 487 | return "success", nil 488 | }) 489 | 490 | return finalEniIds, nil 491 | } 492 | 493 | func getNatInstancePublicIp(t *testing.T, ec2Client *ec2.Client) (string, error) { 494 | namePrefix := "alternat-" 495 | input := &ec2.DescribeInstancesInput{ 496 | Filters: []ec2types.Filter{ 497 | { 498 | Name: aws.String("tag:Name"), 499 | Values: []string{namePrefix + "*"}, 500 | }, 501 | { 502 | Name: aws.String("instance-state-name"), 503 | Values: []string{"running"}, 504 | }, 505 | }, 506 | } 507 | maxRetries := 6 508 | waitTime := 10 * time.Second 509 | ip := retry.DoWithRetry(t, "Get NAT Instance public IP", maxRetries, waitTime, func() (string, error) { 510 | result, err := ec2Client.DescribeInstances(context.TODO(), input) 511 | if err != nil { 512 | return "", err 513 | } 514 | 515 | publicIp := aws.ToString(result.Reservations[0].Instances[0].PublicIpAddress) 516 | if publicIp == "" { 517 | return "", fmt.Errorf("Public IP not found") 518 | } 519 | return publicIp, nil 520 | }) 521 | 522 | return ip, nil 523 | } 524 | 525 | func getThisPublicIp() (string, error) { 526 | url := "https://api.ipify.org" 527 | resp, err := http.Get(url) 528 | if err != nil { 529 | return "", fmt.Errorf("Error fetching IP: %v\n", err) 530 | } 531 | defer resp.Body.Close() 532 | 533 | ip, err := io.ReadAll(resp.Body) 534 | if err != nil { 535 | return "", fmt.Errorf("Error reading response: %v", err) 536 | } 537 | 538 | return string(ip), nil 539 | } 540 | 541 | func authorizeSshIngress(t *testing.T, ec2Client *ec2.Client, sgId *string) { 542 | ip, err := getThisPublicIp() 543 | require.NoError(t, err) 544 | 545 | ipPermission := []ec2types.IpPermission{ 546 | { 547 | FromPort: aws.Int32(22), 
548 | ToPort: aws.Int32(22), 549 | IpProtocol: aws.String("tcp"), 550 | IpRanges: []ec2types.IpRange{ 551 | { 552 | CidrIp: aws.String(ip + "/32"), 553 | }, 554 | }, 555 | }, 556 | } 557 | 558 | _, err = ec2Client.AuthorizeSecurityGroupIngress(context.TODO(), &ec2.AuthorizeSecurityGroupIngressInput{ 559 | GroupId: sgId, 560 | IpPermissions: ipPermission, 561 | }, 562 | ) 563 | require.NoError(t, err) 564 | } 565 | 566 | func getEc2Client(t *testing.T, awsRegion string) *ec2.Client { 567 | cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion(awsRegion)) 568 | if err != nil { 569 | t.Fatalf("Unable to load SDK config, %v", err) 570 | } 571 | return ec2.NewFromConfig(cfg) 572 | } 573 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # alterNAT 2 | 3 | NAT Gateways are dead. Long live NAT instances! 4 | 5 | Built and released with 💚 by Chime Engineering 6 | 7 | 8 | [![Test](https://github.com/chime/terraform-aws-alternat/actions/workflows/test.yaml/badge.svg)](https://github.com/chime/terraform-aws-alternat/actions/workflows/test.yaml) 9 | 10 | 11 | ## Background 12 | 13 | On AWS, [NAT devices](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat.html) are required for accessing the Internet from private VPC subnets. Usually, the best option is a NAT gateway, a fully managed NAT service. The [pricing structure of NAT gateway](https://aws.amazon.com/vpc/pricing/) includes charges of $0.045 per hour per NAT Gateway, plus **$0.045 per GB** processed. The former charge is reasonable at about $32.40 per month. However, the latter charge can be *extremely* expensive for larger traffic volumes. 14 | 15 | In addition to the direct NAT Gateway charges, there are also Data Transfer charges for outbound traffic leaving AWS (known as egress traffic). 
The cost varies depending on destination and volume, ranging from $0.09 per GB to $0.01 per GB (after a free tier of 100GB). That’s right: traffic traversing the NAT Gateway is first charged for processing, then charged again for egress to the Internet. 16 | 17 | For instance, sending 1PB to and from the Internet through a NAT Gateway - not an unusual amount for some use cases - costs $75,604. Many customers may be dealing with far less than 1PB, but the cost can be high even at relatively low traffic volumes. This drawback of NAT Gateway is [widely](https://www.lastweekinaws.com/blog/the-aws-managed-nat-gateway-is-unpleasant-and-not-recommended/) [lamented](https://www.cloudforecast.io/blog/aws-nat-gateway-pricing-and-cost/) [among](https://www.vantage.sh/blog/nat-gateway-vpc-endpoint-savings) [AWS users](https://www.stephengrier.com/reducing-the-cost-of-aws-nat-gateways/). 18 | 19 | Plug the numbers into the [AWS Pricing Calculator](https://calculator.aws/#/estimate?id=25774f7303040fde173fe274a8dd6ef268a16087) and you may well be flabbergasted. Rather than 1PB, which may be less relatable for some users, let’s choose a nice, relatively low round number as an example. Say, 10TB. The cost of sending 10TB over the Internet (5TB ingress, 5TB egress) through NAT Gateway works out to $954 per month, or $11,448 per year. 20 | 21 | Unlike NAT Gateways, NAT instances do not suffer from data processing charges. With NAT instances, you pay for: 22 | 23 | 1. The cost of the EC2 instances 24 | 1. [Data transfer](https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer) out of AWS (the same as NAT Gateway) 25 | 1. The operational expense of maintaining EC2 instances 26 | 27 | Of these, at scale, data transfer is the most significant. NAT instances are subject to the same data transfer sliding scale as NAT Gateways. Inbound data transfer is free, and most importantly, there is no $0.045 per GB data processing charge.
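
The $954 figure above can be reproduced with a little arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in this README: 1TB is taken as 1,024GB, a month as 720 hours, and egress is billed at a flat $0.09 per GB with no free tier applied. The function names are hypothetical.

```python
# Rough NAT Gateway cost model. Assumptions (not from the README):
# 1 TB = 1,024 GB, a 720-hour month, flat $0.09/GB egress, no free tier.
HOURS_PER_MONTH = 720
GB_PER_TB = 1024

NAT_GW_HOURLY_USD = 0.045             # per NAT Gateway, per hour
NAT_GW_PROCESSING_USD_PER_GB = 0.045  # charged on traffic in both directions
EGRESS_USD_PER_GB = 0.09              # first data transfer pricing tier


def nat_gateway_monthly_cost(ingress_tb: float, egress_tb: float) -> float:
    """Hourly charge + data processing (both directions) + egress transfer."""
    hourly = NAT_GW_HOURLY_USD * HOURS_PER_MONTH
    processing = (ingress_tb + egress_tb) * GB_PER_TB * NAT_GW_PROCESSING_USD_PER_GB
    egress = egress_tb * GB_PER_TB * EGRESS_USD_PER_GB
    return round(hourly + processing + egress, 2)


def data_transfer_only_cost(egress_tb: float) -> float:
    """Egress transfer alone: the traffic charge that remains without a NAT Gateway."""
    return round(egress_tb * GB_PER_TB * EGRESS_USD_PER_GB, 2)


if __name__ == "__main__":
    print(nat_gateway_monthly_cost(5, 5))  # 954.0
    print(data_transfer_only_cost(5))      # 460.8
```

Under these assumptions, roughly half of the monthly bill ($460.80) is the data processing charge, which is the part a NAT instance avoids. The AWS Pricing Calculator may use different conventions, so treat the output as an approximation.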
28 | 29 | Consider the cost of transferring that same 5TB inbound and 5TB outbound through a NAT instance. Using the EC2 Data Transfer sliding scale for egress traffic and a `c6gn.large` NAT instance (optimized for networking), the cost comes to about $526 per month. This is a $428 per month savings (~45%) compared to the NAT Gateway. The more data processed - especially on the ingress side - the higher the savings. 30 | 31 | NAT instances aren't for everyone. You might benefit from alterNAT if NAT Gateway data processing costs are a significant item on your AWS bill. If the hourly cost of the NAT instances and/or the NAT Gateways is a material line item on your bill, alterNAT is probably not for you. As a rule of thumb, assuming a roughly equal volume of ingress/egress traffic, and considering the slight overhead of operating NAT instances, you might save money using this solution if you are processing more than 10TB per month with NAT Gateway. 32 | 33 | Features: 34 | 35 | * Self-provisioned NAT instances in Auto Scaling Groups 36 | * Standby NAT Gateways with health checks and automated failover, facilitated by a Lambda function 37 | * Failback to the NAT instance upon recovery (optional) 38 | * Always uses the latest vanilla Amazon Linux 2023 AMI (no AMI management requirement) 39 | * Optional use of SSM for connecting to the NAT instances 40 | * Optional use of CloudWatch Agent to monitor the NAT instances 41 | * Max instance lifetimes (no long-lived instances!) with automated failover 42 | * A Terraform module to set everything up 43 | 44 | Read on to learn more about alterNAT. 45 | 46 | ## Architecture Overview 47 | 48 | ![Architecture diagram](/assets/architecture.png) 49 | 50 | The two main elements of the NAT instance solution are: 51 | 52 | 1. The NAT instance Auto Scaling Groups, one per zone, with a corresponding standby NAT Gateway 53 | 1. The replace-route Lambda function 54 | 55 | Both are deployed by the Terraform module.
56 | 57 | ### NAT Instance Auto Scaling Group and Standby NAT Gateway 58 | 59 | The solution deploys an Auto Scaling Group (ASG) for each provided public subnet. Each ASG contains a single instance. When the instance boots, the [user data](alternat.sh.tftpl) initializes the instance to do the NAT stuff. 60 | 61 | By default, the ASGs are configured with a [maximum instance lifetime](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-max-instance-lifetime.html). This is to facilitate periodic replacement of the instance to automate patching. When the maximum instance lifetime is reached (14 days by default), the following occurs: 62 | 63 | 1. The instance is terminated by the Auto Scaling service. 64 | 1. A [`Terminating:Wait` lifecycle hook](https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html) fires to an SNS topic. 65 | 1. The replace-route function updates the route table of the corresponding private subnet to instead route through a standby NAT Gateway. 66 | 1. When the new instance boots, its user data automatically reclaims the Elastic IP address and updates the route table to route through itself. 67 | 68 | The standby NAT Gateway is a safety measure. It is only used if the NAT instance is actively being replaced, either due to the maximum instance lifetime or due to some other failure scenario. 69 | 70 | ### `replace-route` Lambda Function 71 | 72 | The purpose of [the replace-route Lambda Function](functions/replace-route) is to update the route table of the private subnets to route through the standby NAT gateway. It does this in response to two events: 73 | 74 | 1. By the lifecycle hook (via SNS topic) when the ASG terminates a NAT instance (such as when the max instance lifetime is reached), and 75 | 1. by a CloudWatch Event rule, once per minute for every private subnet. 
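
Both triggers listed above end in the same action: pointing the default route of the affected private route tables at the standby NAT Gateway. The sketch below shows roughly what that step could look like. It is not the code in `functions/replace-route/app.py`; the function name `route_through_nat_gateway` and its parameters are hypothetical. The only API it relies on is the EC2 `ReplaceRoute` operation, which boto3 exposes as `replace_route`.

```python
from typing import Iterable, List

DEFAULT_ROUTE_CIDR = "0.0.0.0/0"


def route_through_nat_gateway(
    ec2_client, route_table_ids: Iterable[str], nat_gateway_id: str
) -> List[str]:
    """Point the default route of each route table at the standby NAT Gateway.

    `ec2_client` is assumed to behave like a boto3 EC2 client, i.e. it has a
    `replace_route` method that accepts these keyword arguments.
    """
    updated = []
    for route_table_id in route_table_ids:
        ec2_client.replace_route(
            RouteTableId=route_table_id,
            DestinationCidrBlock=DEFAULT_ROUTE_CIDR,
            NatGatewayId=nat_gateway_id,
        )
        updated.append(route_table_id)
    return updated
```

With boto3 installed, the client would typically be created with `boto3.client("ec2")`. Passing the client in as an argument keeps the helper easy to exercise with a stub in unit tests.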
76 | 77 | When a NAT instance in any of the zonal ASGs is terminated, the lifecycle hook publishes an event to an SNS topic to which the Lambda function is subscribed. The Lambda then performs the necessary steps to identify which zone is affected and updates the respective private route table to point at its standby NAT gateway. 78 | 79 | The replace-route function also acts as a health check. Every minute, in the private subnet of each availability zone, the function checks that connectivity to the Internet works by requesting https://www.example.com and, if that fails, https://www.google.com. If the request succeeds, the function exits. If both requests fail, the NAT instance is presumably borked, and the function updates the route to point at the standby NAT gateway. 80 | 81 | In the event that a NAT instance is unavailable, the function would have no route to the AWS EC2 API to perform the necessary steps to update the route table. This is mitigated by the use of an [interface VPC endpoint](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/interface-vpc-endpoints.html) to EC2. 82 | 83 | ### NAT instance recovery 84 | 85 | If the route has previously been updated to use the standby NAT gateway due to a health check failure, the replace-route function can optionally attempt to detect recovery of the NAT instance. This allows the system to automatically recover and return to the preferred, cost-effective path of the NAT instance as soon as it is healthy. 86 | 87 | This feature is disabled by default. To enable it, set `enable_nat_restore=true` and `enable_ssm=true`. 88 | 89 | How it works: 90 | 91 | 0. Assume that the connectivity check has failed and the route was updated to use the NAT Gateway. 92 | 1. During the next connectivity check, attempt to restore the NAT instance 93 | 2. Send an SSM command to `curl` the connectivity check URLs 94 | 3. If the connection is successful, check the NAT configuration 95 | 4. 
If the configuration is correct, update the route to use the NAT instance. 96 | 5. Continue with regular connectivity checks through the NAT instance. 97 | 98 | Note that the route recovery feature does _not_ attempt to remediate any configuration issue on the instance; the instance remains immutable. 99 | 100 | Also, under certain edge cases, this can potentially lead to flapping between NAT Gateway => NAT Instance => NAT Gateway => NAT Instance. Imagine a scenario where `curl` commands succeed from the NAT instance, and it appears to be configured correctly, so the NAT instance route is restored. But in actuality, a missing security group rule prevents traffic from reaching the NAT instance. During every connectivity check interval, the Lambda will update the route to use the instance since it appears healthy, but then the regular connectivity checks fail due to the missing security group rule, so the Lambda will immediately replace the route again pointing at NAT Gateway. This can happen until the security group rule is fixed. 101 | 102 | ## Drawbacks 103 | 104 | No solution is without its downsides. To understand the primary drawback of this design, a brief discussion about how NAT works is warranted. 105 | 106 | NAT stands for Network Address Translation. NAT devices act as proxies, allowing hosts in private networks to communicate over the Internet without public, Internet-routable addresses. They have a network presence in both the private network and on the Internet. NAT devices accept connections from hosts on the private network, mark the connection in a translation table, then open a corresponding connection to the destination using their public-facing Internet connection. 107 | 108 | ![NAT Translation Table](/assets/nat-table.png) 109 | 110 | The table, typically stored in memory on the NAT device, tracks the state of open connections. If the state is lost or changes abruptly, the connections will be unexpectedly closed. 
Processes on clients in the private network with open connections to the Internet will need to open new connections. 111 | 112 | In the design described above, NAT instances are intentionally terminated for automated patching. The route is updated to use the NAT Gateway, then back to the newly launched, freshly patched NAT instance. During these changes the NAT table is lost. Established TCP connections present at the time of the change will still appear to be open on both ends of the connection (client and server) because no TCP FIN or RST has been sent, but will in fact be closed because the table is lost and the public IP address of the NAT has changed. 113 | 114 | Importantly, **connectivity to the Internet is never lost**. A route to the Internet is available at all times. 115 | 116 | For our use case, and for many others, this limitation is acceptable. Many clients will open new connections. Other clients may use primarily short-lived connections that retry after a failure. 117 | 118 | For some use cases - for example, file transfers, or other operations that are unable to recover from failures - this drawback may be unacceptable. In this case, the max instance lifetime can be disabled, and route changes would only occur in the unlikely event that a NAT instance failed for another reason, in which case the connectivity checker automatically redirects through the NAT Gateway. 119 | 120 | [The Internet is unreliable](https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing), so failure modes such as connection loss should be a consideration in any resilient system. 121 | 122 | ### Edge Cases 123 | 124 | As described above, alterNAT uses the [`ReplaceRoute` API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ReplaceRoute.html) (among others) to switch the route in the event of a NAT instance failure or Auto Scaling termination event. One possible failure scenario could occur where the EC2 control plane is for some reason not functional (e.g. 
an outage within AWS) and a NAT instance fails at the same time. The replace-route function may be unable to automatically switch the route to the NAT Gateway because the control plane is down. One mitigation would be to attempt to manually replace the route for the impacted subnet(s) using the CLI or console. However, if the control plane is in fact down and no APIs are working, waiting until the issue is resolved may be the only option. 125 | 126 | ## Usage and Considerations 127 | 128 | There are two ways to deploy alterNAT: 129 | 130 | - By building a Docker image and using AWS Lambda support for containers 131 | - By using AWS Lambda runtime for Python directly 132 | 133 | Use this project directly, as provided, or draw inspiration from it and use only the parts you need. We cut [releases](https://github.com/chime/terraform-aws-alternat/releases) following the [Semantic Versioning](https://semver.org/) method. We recommend pinning to our tagged releases or using the short commit SHA if you decide to use this repo directly. 134 | 135 | ### Building and Pushing the Container Image 136 | 137 | Build and push the container image using the [`Dockerfile`](Dockerfile). 138 | 139 | We do not provide a public image, so you'll need to build an image and push it to the registry and repo of your choice. [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is the obvious choice. 140 | 141 | ``` 142 | docker build . -t / 143 | docker push / 144 | ``` 145 | 146 | ### Use the Terraform Module 147 | 148 | Start by reviewing the available [input variables](variables.tf). 
149 | 150 | Example usage using the [terraform module](https://registry.terraform.io/modules/chime/alternat/aws/latest): 151 | 152 | ```hcl 153 | locals { 154 | vpc_az_maps = [ 155 | for index, rt in module.vpc.private_route_table_ids : { 156 | az = data.aws_subnet.subnet[index].availability_zone 157 | route_table_ids = [rt] 158 | public_subnet_id = module.vpc.public_subnets[index] 159 | private_subnet_ids = [module.vpc.private_subnets[index]] 160 | } 161 | ] 162 | } 163 | 164 | data "aws_subnet" "subnet" { 165 | count = length(module.vpc.private_subnets) 166 | id = module.vpc.private_subnets[count.index] 167 | } 168 | 169 | module "alternat_instances" { 170 | source = "chime/alternat/aws" 171 | # It's recommended to pin every module to a specific version 172 | # version = "x.x.x" 173 | 174 | alternat_image_uri = "0123456789012.dkr.ecr.us-east-1.amazonaws.com/alternat-functions-lambda" 175 | alternat_image_tag = "v0.3.3" 176 | 177 | ingress_security_group_ids = var.ingress_security_group_ids 178 | 179 | lambda_package_type = "Image" 180 | 181 | # Optional EBS volume settings. If omitted, the AMI defaults will be used. 182 | nat_instance_block_devices = { 183 | xvda = { 184 | device_name = "/dev/xvda" 185 | ebs = { 186 | encrypted = true 187 | volume_type = "gp3" 188 | volume_size = 20 189 | } 190 | } 191 | } 192 | 193 | tags = var.tags 194 | 195 | vpc_id = module.vpc.vpc_id 196 | vpc_az_maps = local.vpc_az_maps 197 | } 198 | ``` 199 | 200 | To use AWS Lambda runtime for Python, remove `alternat_image_*` inputs and set `lambda_package_type` to `Zip`, e.g.: 201 | 202 | ```hcl 203 | module "alternat_instances" { 204 | ... 205 | lambda_package_type = "Zip" 206 | ... 207 | } 208 | ``` 209 | 210 | The `nat_instance_user_data_post_install` variable allows you to run an additional script to be executed after the main configuration has been installed. 211 | 212 | ```hcl 213 | module "alternat_instances" { 214 | ... 
215 | nat_instance_user_data_post_install = templatefile("${path.root}/post_install.tpl", { 216 | VERSION_ENV = var.third_party_version 217 | }) 218 | ... 219 | } 220 | ``` 221 | 222 | Feel free to submit a pull request or create an issue if you need an input or output that isn't available. 223 | 224 | #### Can I use my own NAT Gateways? 225 | 226 | Yes, but with caveats. You can set `create_nat_gateways = false` and alterNAT will not create NAT Gateways or EIPs for the NAT Gateways. However, alterNAT needs to manage the route to the Internet (`0.0.0.0/0`) for the private route tables. You have to ensure that you do not have an `aws_route` resource that points to the NAT Gateway from the route tables that you want to route through the alterNAT instances. 227 | 228 | If you are using the open source terraform-aws-vpc module, you can set `nat_gateway_destination_cidr_block` to a value that is unlikely to affect your network. For instance, you could set `nat_gateway_destination_cidr_block=192.0.2.0/24`, an example CIDR range as discussed in [RFC5735](https://www.rfc-editor.org/rfc/rfc5735). This way the terraform-aws-vpc module will create and manage the NAT Gateways and their EIPs, but will not set the route to the Internet. 229 | 230 | AlterNATively, you can remove the NAT Gateways and their EIPs from your existing configuration and then `terraform import` them to allow alterNAT to manage them. 231 | 232 | #### Providing explicit Elastic IPs for fallback NAT Gateways 233 | 234 | You can optionally supply your own Elastic IP allocation IDs for the fallback NAT Gateways instead of letting alterNAT create them automatically. 235 | 236 | This is useful if you already have pre-allocated EIPs (for example, allow-listed IPs) that must be reused by the fallback NAT Gateways. 
237 | 238 | ```hcl 239 | fallback_ngw_eip_allocation_ids = { 240 | "eu-west-1a" = "eipalloc-0123456789abcdef0" 241 | "eu-west-1b" = "eipalloc-1111222233334444" 242 | } 243 | ``` 244 | When an allocation ID is provided for an Availability Zone: 245 | - The module will not create a new aws_eip resource for that zone. 246 | - The corresponding NAT Gateway will use the specified allocation ID. 247 | - All other zones (without explicit IDs) will behave as before — alterNAT will create EIPs automatically or reuse protected ones. 248 | 249 | If you provide explicit EIPs for all zones, no new aws_eip.nat_gateway_eips resources will be created. 250 | 251 | 252 | ### Other Considerations 253 | 254 | - Read [the Amazon EC2 instance network bandwidth page](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html) carefully. In particular: 255 | 256 | > To other Regions, an internet gateway, Direct Connect, or local gateways (LGW) – Traffic can utilize up to 50% of the network bandwidth available to a current generation instance with a minimum of 32 vCPUs. Bandwidth for a current generation instance with less than 32 vCPUs is limited to 5 Gbps. 257 | 258 | - Hence if you need more than 5Gbps, make sure to use an instance type with at least 32 vCPUs, and divide the bandwidth in half. So the `c6gn.8xlarge` which offers 50Gbps guaranteed bandwidth will have 25Gbps available for egress to other regions, an internet gateway, etc. 259 | 260 | - It's wise to start by overprovisioning, observing patterns, and resizing if necessary. Don't be surprised by the network I/O credit mechanism explained in [the AWS EC2 docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html) thusly: 261 | 262 | > Typically, instances with 16 vCPUs or fewer (size 4xlarge and smaller) are documented as having "up to" a specified bandwidth; for example, "up to 10 Gbps". These instances have a baseline bandwidth. 
To meet additional demand, they can use a network I/O credit mechanism to burst beyond their baseline bandwidth. Instances can use burst bandwidth for a limited time, typically from 5 to 60 minutes, depending on the instance size. 263 | 264 | - [SSM Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html) is enabled by default. To view NAT connections on an instance, connect with Session Manager, then run `sudo cat /proc/net/nf_conntrack`. Disable SSM by setting `enable_ssm=false`. 265 | 266 | - When the maximum instance lifetime is reached, a new instance is launched automatically using the latest AMI. 267 | 268 | - Most of the time, except when the instance is actively being replaced, NAT traffic should be routed through the NAT instance and NOT through the NAT Gateway. You can monitor the logs for the text "Failed connectivity tests! Replacing route" to be alerted to NAT instance failures. 269 | 270 | - There are four Elastic IP addresses for the NAT instances and four for the NAT Gateways. Be sure to add all eight addresses to any external allow lists if necessary. 271 | 272 | - If you plan on running this in a dual stack network (IPv4 and IPv6), you may notice that it takes ~10 minutes for an alterNAT node to start. In that case, you can use the `nat_instance_user_data_pre_install` variable to prefer IPv4 over IPv6 before running any user data. 273 | 274 | ```tf 275 | nat_instance_user_data_pre_install = <<-EOF 276 | # Prefer IPv4 over IPv6 277 | echo 'precedence ::ffff:0:0/96 100' >> /etc/gai.conf 278 | EOF 279 | ``` 280 | - If you see errors like: `error connecting to https://www.google.com/: ` in the connectivity tester logs, you can set `lambda_has_ipv6 = false`. This will cause the Lambda to request IPv4 addresses only in DNS lookups. 281 | 282 | - If you want to use just a single NAT Gateway for fallback, you can create it externally and provide its ID through the `nat_gateway_id` variable.
Note that you will incur cross AZ traffic charges of $0.01/GB. 283 | 284 | ```tf 285 | create_nat_gateways = false 286 | nat_gateway_id = "nat-..." 287 | ``` 288 | 289 | - If your EIPs are critical, for example if they have been allow listed by third parties, use `prevent_destroy_eips=true` to prevent accidental deletion. 290 | 291 | - Monitoring by the [CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) is disabled by default. Enable by setting `enable_cloudwatch_agent=true`. Note that you will incur custom metric charges: 292 | 293 | > Metrics collected by the CloudWatch agent are billed as custom metrics. For more information about CloudWatch metrics pricing, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/). 294 | 295 | - There is a small risk that the NAT instance launch will fail due to transient errors. With `enable_launch_script_lifecycle_hook` set to true the ASG waits ~15 minutes for the script to complete successfully and starts over with a new instance if necessary. 296 | 297 | ## Contributing 298 | 299 | [Issues](https://github.com/chime/terraform-aws-alternat/issues) and [pull requests](https://github.com/chime/terraform-aws-alternat/pulls) are most welcome! 300 | 301 | alterNAT is intended to be a safe, welcoming space for collaboration. Contributors are expected to adhere to the [Contributor Covenant code of conduct](CODE_OF_CONDUCT.md). 302 | 303 | 304 | ## Local Testing 305 | 306 | ### Terraform module testing 307 | 308 | The `test/` directory uses the [Terratest](https://terratest.gruntwork.io/) library to run integration tests on the Terraform module. The test uses the example located in `examples/` to set up Alternat, runs validations, then destroys the resources. 
Unfortunately, because of how the [Lambda Hyperplane ENI](https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html#foundation-nw-eni) deletion process works, this takes a very long time (about 35 minutes) to run. 309 | 310 | ### Lambda function testing 311 | 312 | To test locally, install the AWS SAM CLI client: 313 | 314 | ```shell 315 | brew tap aws/tap 316 | brew install aws-sam-cli 317 | ``` 318 | 319 | Build with SAM and invoke the functions: 320 | 321 | ```shell 322 | sam build 323 | sam local invoke -e .json 324 | ``` 325 | 326 | Example: 327 | 328 | ```shell 329 | cd functions/replace-route 330 | sam local invoke AutoScalingTerminationFunction -e sns-event.json 331 | sam local invoke ConnectivityTestFunction -e cloudwatch-event.json 332 | ``` 333 | 334 | 335 | ## Testing with SAM 336 | 337 | In the first terminal: 338 | 339 | ```shell 340 | cd functions/replace-route 341 | sam build && sam local start-lambda # This will start up a Docker container running locally 342 | ``` 343 | 344 | In a second terminal, invoke the functions served by the first terminal: 345 | 346 | ```shell 347 | cd functions/replace-route 348 | aws lambda invoke --function-name "AutoScalingTerminationFunction" --endpoint-url "http://127.0.0.1:3001" --region us-east-1 --cli-binary-format raw-in-base64-out --payload file://./sns-event.json --no-verify-ssl out.txt 349 | aws lambda invoke --function-name "ConnectivityTestFunction" --endpoint-url "http://127.0.0.1:3001" --region us-east-1 --cli-binary-format raw-in-base64-out --payload file://./cloudwatch-event.json --no-verify-ssl out.txt 350 | ``` 351 | --------------------------------------------------------------------------------