├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── cdk
│   ├── .gitignore
│   ├── bin
│   │   └── cdk.ts
│   ├── cdk.json
│   ├── docker
│   │   ├── Dockerfile
│   │   ├── client-iam.properties
│   │   ├── client-tls.properties
│   │   └── run-kafka-command.sh
│   ├── jest.config.js
│   ├── lambda
│   │   ├── manage-infrastructure.js
│   │   ├── query-test-output.js
│   │   └── test-parameters.py
│   ├── lib
│   │   ├── batch-ami.ts
│   │   ├── cdk-stack.ts
│   │   ├── credit-depletion-sfn.ts
│   │   └── vpc.ts
│   ├── package-lock.json
│   ├── package.json
│   └── tsconfig.json
├── docs
│   ├── create-env.md
│   ├── framework-overview.png
│   └── test-result.png
└── notebooks
    ├── 00000000--empty-template.ipynb
    ├── 00000000--install-dependencies.ipynb
    ├── aggregate_statistics.py
    ├── get_test_details.py
    ├── plot.py
    └── query_experiment_details.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | cdk/cdk.context.json
4 | .DS_Store
5 | .vscode
6 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Performance Testing Framework for Apache Kafka
2 | 
3 | 
4 | *******
5 | WARNING: This code is NOT intended for production clusters. It will delete topics on your clusters. Only use it on dedicated test clusters.
6 | *******
7 | 
8 | The tool is designed to evaluate the maximum throughput of a cluster and compare the put latency of different broker, producer, and consumer configurations. To run a test, you specify the different parameters that should be tested, and the tool iterates through all combinations of the parameters, producing a graph similar to the one below.
9 | 
10 | ![Put latencies for different broker sizes](docs/test-result.png)
11 | 
12 | ## Configure Local Environment
13 | 
14 | The framework is based on the `kafka-producer-perf-test.sh` and `kafka-consumer-perf-test.sh` tools that are part of the Apache Kafka distribution. It builds automation and visualization around these tools, leveraging AWS services such as AWS Step Functions for scheduling and AWS Batch for running individual tests.
15 | 
16 | ![Framework overview](docs/framework-overview.png)
17 | 
18 | You just need to execute the AWS CDK template in the `cdk` folder of this repository to create the required resources to run a performance test. The [create environment](docs/create-env.md) section contains details on how to install the dependencies required to successfully deploy the template. It also explains how to configure an EC2 instance that matches all prerequisites to execute the template successfully.
19 | 
20 | Once your environment is configured, you need to adapt the `defaults` variable in `bin/cdk.ts` so that it matches your account details.
21 | 
22 | ```node
23 | const defaults : { 'env': Environment } = {
24 |   env: {
25 |     account: '123456789012',
26 |     region: 'eu-west-1'
27 |   }
28 | };
29 | ```
30 | 
31 | Finally, you need to install all dependencies of the CDK template by running the following command in the `cdk` folder.
32 | 
33 | ```bash
34 | # Install CDK template dependencies
35 | npm install
36 | ```
37 | 
38 | ## Choose MSK Cluster Configuration
39 | 
40 | The CDK template can either create a new MSK cluster for the performance tests, or you can specify a pre-created cluster. The framework creates and deletes topics on the cluster to run the performance tests, so it's highly recommended to use a dedicated cluster to avoid any data loss.
41 | 
42 | To create a new cluster, you need to specify the respective cluster configuration.
43 | 
44 | ```node
45 | new CdkStack(app, 'stack-prefix--', {
46 |   ...defaults,
47 |   vpc: vpc,
48 |   clusterProps: {
49 |     numberOfBrokerNodes: 1,
50 |     instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.LARGE),
51 |     ebsStorageInfo: {
52 |       volumeSize: 5334
53 |     },
54 |     encryptionInTransit: {
55 |       enableInCluster: false,
56 |       clientBroker: ClientBrokerEncryption.PLAINTEXT
57 |     },
58 |     kafkaVersion: KafkaVersion.V2_8_0,
59 |   }
60 | });
61 | ```
62 | 
63 | Alternatively, you can specify an existing cluster.
64 | 
65 | ```node
66 | new CdkStack(app, 'stack-prefix--', {
67 |   ...defaults,
68 |   vpc: vpc,
69 |   clusterName: '...',
70 |   bootstrapBrokerString: '...',
71 |   sg: '...',
72 | });
73 | ```
74 | 
75 | You can then deploy the CDK template to have it create the respective resources in your account.
76 | 
77 | ```bash
78 | # List the available stacks
79 | cdk list
80 | 
81 | # Deploy a specific stack
82 | cdk deploy stack-name
83 | ```
84 | 
85 | Note that the created resources will start to incur cost, even if no test is executed. `cdk destroy` will remove most resources again, but it will retain the CloudWatch log streams that contain the experiment output, so the raw output of the performance tests is preserved for later visualization. These retained resources will continue to incur cost unless they are manually deleted.
86 | 
87 | 
88 | ## Test Input Configuration
89 | 
90 | Once the infrastructure has been created, you can start a series of performance tests by executing the Step Functions workflow that has been created.
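You can trigger the workflow by pasting a test specification into the Step Functions console as the execution input, or start the execution programmatically. The snippet below is a minimal sketch using boto3; the state machine ARN and the `test-specification.json` file name are placeholders for illustration and are not names created by the framework.

```python
import json

import boto3

# Placeholder ARN; copy the actual value of the deployed state machine from the stack output.
STATE_MACHINE_ARN = 'arn:aws:states:eu-west-1:123456789012:stateMachine:...'

# The execution input is a test specification as described in the remainder of this section.
with open('test-specification.json') as f:
    test_specification = json.load(f)

sfn = boto3.client('stepfunctions')

response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps(test_specification),
)

# Keep the execution ARN; the visualization notebooks described below need it.
print(response['executionArn'])
```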
91 | The following simplified test specification determines the parameters of the individual tests.
92 | ```json
93 | {
94 |   "test_specification": {
95 |     "parameters": {
96 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
97 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
98 |       "num_producers": [ 6 ],
99 |       "client_props": [
100 |         {
101 |           "producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
102 |           "consumer": "security.protocol=PLAINTEXT"
103 |         }
104 |       ],
105 |       "num_partitions": [ 72 ],
106 |       "record_size_byte": [ 1024 ],
107 |       "replication_factor": [ 3 ],
108 |       "duration_sec": [ 3600 ]
109 |     },
110 |     ...
111 |   }
112 | }
113 | ```
114 | 
115 | With this specification, a total of 54 tests will be carried out (18 throughput values times 3 consumer group configurations). All individual tests use `6` producers, the same producer and consumer properties, `72` partitions, a record size of `1024` bytes, and a replication factor of `3`, and they run for 1 hour (`3600` seconds). The tests only differ in the aggregate cluster throughput (between `16` and `88` MB/sec) and the number of consumer groups (`0`, `1`, or `2`, with `6` consumers per group).
116 | 
117 | You can add additional values to anything that is a list, and the framework will automatically iterate through the given options.
118 | 
119 | The framework iterates through different throughput configurations and observes how the put latency changes. This means that it first iterates through the throughput options before it changes any other option.
120 | 
121 | Given the above specification, the test will gradually increase the throughput from `16` to `88` MB/sec with no consumer groups reading from the cluster, then it will increase the throughput from `16` to `88` MB/sec with one consumer group, and finally with two consumer groups.
122 | 
123 | 
124 | ### Stop Conditions
125 | 
126 | As you may not know the maximum throughput of a cluster, you can specify a stop condition that determines when the framework should reset the cluster throughput to the lowest value and choose the next test option (or stop if there is none).
127 | 
128 | ```json
129 | {
130 |   "test_specification": {
131 |     "parameters": {
132 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
133 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
134 |       ...
135 |     },
136 |     "skip_remaining_throughput": {
137 |       "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
138 |     },
139 |     ...
140 |   }
141 | }
142 | ```
143 | 
144 | With the above condition, the framework will skip the remaining throughput values if the throughput that was actually sent into the cluster by the producers falls more than 0.5% short of what the test requested (more precisely, if the ratio of actual sent throughput to requested throughput drops below `0.995`).
145 | 
146 | The first test that is carried out will have `0` consumer groups and will be sending `16` MB/sec into the cluster. The next test again has `0` consumer groups and sends `24` MB/sec, and so on. This continues until the cluster becomes saturated. For example, the cluster may only be able to absorb `81` MB/sec instead of the `82` MB/sec that should be ingested. The skip condition then matches, as the difference is more than 0.5%, so the next test will use `1` consumer group and will again be sending `16` MB/sec into the cluster.
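To make the iteration order and the skip condition more concrete, the following sketch emulates the scheduling behavior in plain Python. This is only an illustration of the logic described above, not code that the framework runs; the assumed saturation point of `81` MB/sec is the hypothetical value from the example.

```python
throughputs = [16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88]
consumer_group_options = [0, 1, 2]  # number of consumer groups, 6 consumers each


def run_test(requested_mb_per_sec, num_groups):
    # Placeholder for a single performance test (an AWS Batch array job in the framework).
    # Pretend the cluster saturates at roughly 81 MB/sec, as in the example above.
    return min(requested_mb_per_sec, 81.0)


# The cluster throughput is the innermost dimension: it is exhausted (or skipped)
# before the next consumer group configuration is selected.
for num_groups in consumer_group_options:
    for requested in throughputs:
        sent = run_test(requested, num_groups)

        # "skip_remaining_throughput": { "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ] }
        if sent / requested < 0.995:
            print(f'saturated at {requested} MB/sec with {num_groups} consumer group(s), '
                  'skipping the remaining throughput values')
            break
```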
147 | 
148 | ### Depleting credits before running performance tests
149 | 
150 | Amazon EC2 networking, Amazon EBS, and Amazon EBS networking all leverage a credit system that allows performance to burst above a given baseline for a certain amount of time. This means that the cluster is able to ingest substantially more throughput than its baseline for a limited period.
151 | 
152 | To obtain stable results for the baseline performance of a cluster, the test framework can deplete these credits before a test is carried out. To this end, it generates traffic that exceeds the peak performance of the cluster and waits until the measured throughput drops below configurable thresholds.
153 | 
154 | ```json
155 | {
156 |   "test_specification": {
157 |     "parameters": {
158 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
159 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
160 |       "replication_factor": [ 3 ],
161 |       ...
162 |     },
163 |     "depletion_configuration": {
164 |       "upper_threshold": {
165 |         "mb_per_sec": 74
166 |       },
167 |       "lower_threshold": {
168 |         "mb_per_sec": 58
169 |       },
170 |       "approximate_timeout_hours": 1
171 |     }
172 |     ...
173 |   }
174 | }
175 | ```
176 | 
177 | With this configuration, the framework will generate a load of 250 MB/sec for credit depletion. To complete the credit depletion, the measured cluster throughput first needs to exceed `74` MB/sec (the upper threshold). Then, the throughput must drop below `58` MB/sec (the lower threshold). Only then will the actual performance test be started.
178 | 
179 | ## Visualize results
180 | 
181 | You can start visualizing the test results through Jupyter notebooks once the Step Functions workflow is running. The CloudFormation stack will output a URL to a notebook that is managed through Amazon SageMaker. Just follow the link to gain access to a notebook that has been preconfigured appropriately.
182 | 
183 | The first test needs to be completed before you can see any output, which can take a couple of hours when credit depletion is enabled.
184 | 
185 | To produce a visualization, start with the [`00000000--empty-template.ipynb`](notebooks/00000000--empty-template.ipynb) notebook. You just need to add the execution ARN of the Step Functions workflow and execute all cells of the notebook.
186 | 
187 | ```python
188 | test_params.extend([
189 |     {'execution_arn': '' }
190 | ])
191 | ```
192 | 
193 | The logic in the notebook will then retrieve the test results from CloudWatch Logs, apply grouping and aggregations, and produce a graph as output.
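If you want to inspect the raw producer summaries outside of the provided notebooks, the sketch below shows roughly how they can be retrieved with boto3. The filter pattern and the MB/sec parsing mirror what `cdk/lambda/query-test-output.js` does; the log group name is a placeholder that you need to replace with the performance test log group created by the CDK stack.

```python
import json
import re

import boto3

LOG_GROUP_NAME = '...'  # placeholder for the performance test log group

logs = boto3.client('logs')
paginator = logs.get_paginator('filter_log_events')

# Producer tasks emit a JSON summary line at the end of each test.
pages = paginator.paginate(
    logGroupName=LOG_GROUP_NAME,
    filterPattern='{$.type="producer" && $.test_summary=* && $.error_count=*}',
)

for page in pages:
    for event in page['events']:
        result = json.loads(event['message'])
        # The summary contains the standard kafka-producer-perf-test.sh output,
        # including the throughput the producer actually achieved.
        match = re.search(r'([0-9.]+) MB/sec', result['test_summary'])
        if match:
            print(event['logStreamName'], float(match.group(1)), 'MB/sec')
```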
194 | 
195 | You can adapt the grouping of different test runs by changing the `partitions` variable. By adapting `row_keys` and `column_keys` you can create a grid of diagrams, and all measurements within the same sub-diagram will have identical test settings for the specified keys.
196 | 
197 | ```python
198 | partitions = {
199 |     'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'provisioned_throughput', 'throughput_series_id', 'brokers_type_numeric', ],
200 |     'title_keys': [ 'kafka_version', 'broker_storage', 'in_cluster_encryption', 'producer.security.protocol', ],
201 |     'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],
202 |     'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],
203 |     'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],
204 | }
205 | ```
206 | 
207 | In this case, all diagrams in the same row will have the same number of producers and consumer groups. Within a column, all diagrams will have identical producer settings and the same number of partitions. And within a diagram, tests that ran against different broker types will have different colors.
208 | 
209 | If you experience errors indicating that libraries such as boto3 cannot be found, execute the cells in the [`00000000--install-dependencies`](notebooks/00000000--install-dependencies.ipynb) notebook.
210 | 
211 | 
212 | ## Security
213 | 
214 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
215 | 
216 | ## License
217 | 
218 | This library is licensed under the MIT-0 License. See the LICENSE file.
--------------------------------------------------------------------------------
/cdk/.gitignore:
--------------------------------------------------------------------------------
1 | *.js
2 | !lambda/*js
3 | !jest.config.js
4 | *.d.ts
5 | node_modules
6 | 
7 | # CDK asset staging directory
8 | .cdk.staging
9 | cdk.out
10 | 
--------------------------------------------------------------------------------
/cdk/bin/cdk.ts:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env node
2 | 
3 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | //
5 | // Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | // this software and associated documentation files (the "Software"), to deal in
7 | // the Software without restriction, including without limitation the rights to
8 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | // the Software, and to permit persons to whom the Software is furnished to do so.
10 | //
11 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 18 | import { App, Environment } from 'aws-cdk-lib'; 19 | import { CdkStack } from '../lib/cdk-stack'; 20 | import { VpcStack } from '../lib/vpc'; 21 | import { InstanceClass, InstanceSize, InstanceType } from 'aws-cdk-lib/aws-ec2'; 22 | import { ClientBrokerEncryption, KafkaVersion } from '@aws-cdk/aws-msk-alpha'; 23 | 24 | 25 | const app = new App(); 26 | 27 | 28 | const defaults : { 'env': Environment } = { 29 | env: { 30 | account: '123456789012', 31 | region: 'eu-west-1' 32 | } 33 | }; 34 | 35 | 36 | const vpc = new VpcStack(app, 'MskPerformanceVpc', defaults).vpc; 37 | 38 | 39 | 40 | const throughputSpec = (consumer: number, protocol: string, batchSize:number, partitions: number[], throughput: number[]) => ({ 41 | "test_specification": { 42 | "parameters": { 43 | "cluster_throughput_mb_per_sec": throughput, 44 | "num_producers": [ 6 ], 45 | "consumer_groups" : [ { "num_groups": consumer, "size": 6 } ], 46 | "client_props": [{ 47 | "producer": `acks=all linger.ms=5 batch.size=${batchSize} buffer.memory=2147483648 security.protocol=${protocol}`, 48 | "consumer": `security.protocol=${protocol}` 49 | }], 50 | "num_partitions": partitions, 51 | "record_size_byte": [ 1024 ], 52 | "replication_factor": [ 3 ], 53 | "duration_sec": [ 3600 ] 54 | }, 55 | "skip_remaining_throughput": { 56 | "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ] 57 | }, 58 | "depletion_configuration": { 59 | "upper_threshold": { 60 | "mb_per_sec": 200 61 | }, 62 | "approximate_timeout_hours": 0.5 63 | } 64 | } 65 | }); 66 | 67 | 68 | const throughput056 = [8, 16, 24, 32, 40, 44, 48, 52, 56]; 69 | const througphut096 = [8, 16, 32, 48, 56, 64, 72, 80, 88, 96]; 70 | const throughput192 = [8, 16, 32, 64, 96, 112, 128, 144, 160, 176, 192]; 71 | const throughput200 = [8, 16, 32, 64, 96, 112, 128, 144, 160, 176, 192, 200]; 72 | const throughput384 = [8, 64, 128, 192, 256, 320, 336, 352, 368, 384] 73 | const throughput672 = [8, 128, 256, 384, 448, 512, 576, 608, 640, 672]; 74 | 75 | 76 | new CdkStack(app, 'm5large-perf-test--', { 77 | ...defaults, 78 | vpc: vpc, 79 | clusterProps: { 80 | numberOfBrokerNodes: 1, 81 | instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.LARGE), 82 | ebsStorageInfo: { 83 | volumeSize: 5334 84 | }, 85 | encryptionInTransit: { 86 | enableInCluster: false, 87 | clientBroker: ClientBrokerEncryption.PLAINTEXT 88 | }, 89 | kafkaVersion: KafkaVersion.V2_8_0, 90 | }, 91 | // initialPerformanceTest: throughputSpec(2, "PLAINTEXT", 262114, [36], throughput056) 92 | }); 93 | 94 | new CdkStack(app, 'm52xlarge-perf-test', { 95 | ...defaults, 96 | vpc: vpc, 97 | clusterProps: { 98 | numberOfBrokerNodes: 1, 99 | instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.XLARGE), 100 | ebsStorageInfo: { 101 | volumeSize: 5334 102 | }, 103 | encryptionInTransit: { 104 | enableInCluster: true, 105 | clientBroker: ClientBrokerEncryption.TLS 106 | }, 107 | kafkaVersion: KafkaVersion.V2_8_0 108 | }, 109 | // initialPerformanceTest: throughputSpec(2, "PLAINTEXT", 262114, [36], throughput192) 110 | }); 111 | -------------------------------------------------------------------------------- /cdk/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "npx ts-node --prefer-ts-exts bin/cdk.ts", 3 | "context": { 4 | "aws-cdk:enableDiffNoFail": "true", 5 | "@aws-cdk/core:stackRelativeExports": "true", 6 | "@aws-cdk/aws-secretsmanager:parseOwnedSecretName": true, 7 | "@aws-cdk/aws-kms:defaultKeyPolicies": true, 8 | "@aws-cdk/aws-s3:grantWriteWithoutAcl": 
true 9 | } 10 | } 11 | -------------------------------------------------------------------------------- /cdk/docker/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | FROM alpine:3.15 17 | 18 | ENV KAFKA_VERSION=2.8.0 19 | ENV SCALA_VERSION=2.12 20 | 21 | RUN apk add --no-cache tini zsh bash parallel mawk procps wget jq py-pip openjdk17-jre-headless \ 22 | && pip install awscli \ 23 | && wget -qO - https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz | tar xz -C /opt/ \ 24 | && wget --no-verbose https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.1/aws-msk-iam-auth-1.1.1-all.jar -P /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}/libs/ 25 | 26 | RUN cp /usr/lib/jvm/java-*/lib/security/cacerts /opt/kafka.client.truststore.jks 27 | 28 | COPY run-kafka-command.sh /opt/ 29 | COPY client-*.properties /opt/ 30 | 31 | ENTRYPOINT [ "tini", "--", "/opt/run-kafka-command.sh" ] -------------------------------------------------------------------------------- /cdk/docker/client-iam.properties: -------------------------------------------------------------------------------- 1 | ssl.truststore.location=/opt/kafka.client.truststore.jks 2 | security.protocol=SASL_SSL 3 | sasl.mechanism=AWS_MSK_IAM 4 | sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required; 5 | sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler 6 | -------------------------------------------------------------------------------- /cdk/docker/client-tls.properties: -------------------------------------------------------------------------------- 1 | security.protocol=SSL 2 | ssl.truststore.location=/opt/kafka.client.truststore.jks 3 | -------------------------------------------------------------------------------- /cdk/docker/run-kafka-command.sh: -------------------------------------------------------------------------------- 1 | #!/bin/zsh -x 2 | 3 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | # this software and associated documentation files (the "Software"), to deal in 7 | # the Software without restriction, including without limitation the rights to 8 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | # the Software, and to permit persons to whom the Software is furnished to do so. 
10 | # 11 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | 18 | 19 | set -eo pipefail 20 | setopt shwordsplit 21 | 22 | signal_handler() { 23 | echo "trap triggered by signal $1" 24 | ps -weo pid,%cpu,%mem,size,vsize,cmd --sort=-%mem 25 | trap - EXIT 26 | exit $2 27 | } 28 | 29 | output_mem_usage() { 30 | while true; do 31 | ps -weo pid,%cpu,%mem,rss,cmd --sort=-%mem 32 | sleep 60 33 | done 34 | } 35 | 36 | wait_for_all_tasks() { 37 | # wait until all jobs/containers are actually running 38 | echo -n "waiting for jobs: " 39 | while [[ ${running_jobs:--1} -lt "$args[--num-jobs]" ]]; do 40 | sleep 1 41 | echo -n "." 42 | running_jobs=$(aws --region $args[--region] --output text batch list-jobs --array-job-id ${AWS_BATCH_JOB_ID%:*} --job-status RUNNING --query 'jobSummaryList | length(@)') || running_jobs=-1 43 | done 44 | echo " done" 45 | } 46 | 47 | producer_command() { 48 | KAFKA_HEAP_OPTS="-Xms1G -Xmx3584m" /opt/kafka_?.??-?.?.?/bin/kafka-producer-perf-test.sh \ 49 | --topic $TOPIC \ 50 | --num-records $(printf '%.0f' $args[--num-records-producer]) \ 51 | --throughput $THROUGHPUT \ 52 | --record-size $(printf '%.0f' $args[--record-size-byte]) \ 53 | --producer-props bootstrap.servers=$PRODUCER_BOOTSTRAP_SERVERS ${(@s/ /)args[--producer-props]} \ 54 | --producer.config /opt/client.properties 2>&1 55 | } 56 | 57 | consumer_command() { 58 | for config in $args[--consumer-props]; do 59 | if [[ $config == *SSL* ]]; then # skip encryption config as it's already part of the properties file 60 | continue 61 | fi 62 | 63 | echo $config >> /opt/client.properties 64 | done 65 | 66 | cat /opt/client.properties 67 | 68 | KAFKA_HEAP_OPTS="-Xms1G -Xmx3584m" /opt/kafka_?.??-?.?.?/bin/kafka-consumer-perf-test.sh \ 69 | --topic $TOPIC \ 70 | --messages $(printf '%.0f' $args[--num-records-consumer]) \ 71 | --broker-list $CONSUMER_BOOTSTRAP_SERVERS \ 72 | --consumer.config /opt/client.properties \ 73 | --group $CONSUMER_GROUP \ 74 | --print-metrics \ 75 | --show-detailed-stats \ 76 | --timeout 16000 2>&1 77 | } 78 | 79 | query_msk_endpoint() { 80 | if [[ $args[--bootstrap-broker] = "-" ]] 81 | then 82 | case $args[--producer-props] in 83 | *SASL_SSL*) 84 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringSaslIam" --output text) 85 | ;; 86 | *SSL*) 87 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringTls" --output text) 88 | ;; 89 | *) 90 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerString" --output text) 91 | ;; 92 | esac 93 | 94 | case $args[--consumer-props] in 95 | *SASL_SSL*) 96 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringSaslIam" --output text) 97 | ;; 98 | *SSL*) 99 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka 
get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringTls" --output text) 100 | ;; 101 | *) 102 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerString" --output text) 103 | ;; 104 | esac 105 | else 106 | PRODUCER_BOOTSTRAP_SERVERS=$args[--bootstrap-broker] 107 | CONSUMER_BOOTSTRAP_SERVERS=$args[--bootstrap-broker] 108 | fi 109 | } 110 | 111 | 112 | trap 'signal_handler SIGTERM 15 $LINENO' SIGTERM 113 | # trap 'signal_handler EXIT $? $LINENO' EXIT 114 | 115 | 116 | # parse aruments 117 | zparseopts -A args -E -- -region: -msk-cluster-arn: -bootstrap-broker: -num-jobs: -command: -replication-factor: -topic-name: -depletion-topic-name: -record-size-byte: -records-per-sec: -num-partitions: -duration-sec: -producer-props: -num-producers: -num-records-producer: -consumer-props: -size-consumer-group: -num-records-consumer: 118 | 119 | date 120 | echo "parsed args: ${(kv)args}" 121 | echo "running command: $args[--command]" 122 | 123 | if [[ -f "$ECS_CONTAINER_METADATA_FILE" ]]; then 124 | CLUSTER=$(cat $ECS_CONTAINER_METADATA_FILE | jq -r '.Cluster') 125 | INSTANCE_ARN=$(cat $ECS_CONTAINER_METADATA_FILE | jq -r '.ContainerInstanceARN') 126 | echo -n "running on instance: " 127 | aws ecs describe-container-instances --region $args[--region] --cluster $CLUSTER --container-instances $INSTANCE_ARN | jq -r '.containerInstances[].ec2InstanceId' 128 | fi 129 | 130 | 131 | # periodically output mem usage for debugging purposes 132 | # output_mem_usage & 133 | 134 | 135 | # create authentication config properties 136 | case $args[--producer-props] in 137 | *SASL_SSL*) 138 | cp /opt/client-iam.properties /opt/client.properties 139 | ;; 140 | *SSL*) 141 | cp /opt/client-tls.properties /opt/client.properties 142 | ;; 143 | *) 144 | touch /opt/client.properties 145 | ;; 146 | esac 147 | 148 | 149 | # obtain cluster endpoint; retry if the command fails, eg, when credentials cannot be obtained from environment 150 | query_msk_endpoint 151 | while [ -z "$PRODUCER_BOOTSTRAP_SERVERS" ]; do 152 | sleep 10 153 | query_msk_endpoint 154 | done 155 | 156 | 157 | case $args[--command] in 158 | create-topics) 159 | # create topics with 5 sec retention for depletion task 160 | 161 | while 162 | ! KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 163 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 164 | --command-config /opt/client.properties \ 165 | --list \ 166 | | grep -q "$args[--depletion-topic-name]" 167 | do 168 | # try to create topic if none exists 169 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" \ 170 | /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 171 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 172 | --create \ 173 | --topic "$args[--depletion-topic-name]" \ 174 | --partitions $args[--num-partitions] \ 175 | --replication-factor $args[--replication-factor] \ 176 | --config retention.ms=5000 \ 177 | --command-config /opt/client.properties 178 | 179 | sleep 5 180 | done 181 | 182 | while 183 | # check if topic already exists 184 | ! 
KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 185 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 186 | --command-config /opt/client.properties \ 187 | --list \ 188 | | grep -q "$args[--topic-name]" 189 | do 190 | # try to create topics for actual performance tests 191 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" \ 192 | /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 193 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 194 | --create \ 195 | --topic "$args[--topic-name]" \ 196 | --partitions $args[--num-partitions] \ 197 | --replication-factor $args[--replication-factor] \ 198 | --command-config /opt/client.properties 199 | 200 | sleep 5 201 | done 202 | 203 | # list the created topics for debugging purposes 204 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 205 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 206 | --command-config /opt/client.properties \ 207 | --list 208 | ;; 209 | 210 | delete-topics) 211 | # delete all topics on the cluster (except for consumer group related ones) 212 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 213 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 214 | --command-config /opt/client.properties \ 215 | --delete \ 216 | --topic 'test-id-\d+--throughput-series-id-\d+((--\S+-\d+)+|--depletion)' \ 217 | || : # don't fail if no topic exists 218 | ;; 219 | 220 | deplete-credits) 221 | TOPIC="$args[--depletion-topic-name]" 222 | THROUGHPUT=-1 223 | 224 | wait_for_all_tasks 225 | 226 | if [[ "$AWS_BATCH_JOB_ARRAY_INDEX" -lt $args[--num-producers] ]]; then 227 | producer_command | mawk -W interactive ' 228 | /TimeoutException/ { if(++exceptions % 100000 == 0) {print "Filtered", exceptions, "TimeoutExceptions."} } 229 | !/TimeoutException/ { print $0 } 230 | ' 231 | else 232 | CONSUMER_GROUP="$(((AWS_BATCH_JOB_ARRAY_INDEX - args[--num-producers]) / args[--size-consumer-group]))" 233 | 234 | consumer_command 235 | fi 236 | ;; 237 | 238 | run-performance-test) 239 | TOPIC="$args[--topic-name]" 240 | THROUGHPUT=$(printf '%.0f' $args[--records-per-sec]) 241 | 242 | wait_for_all_tasks 243 | 244 | # build the command for the actual performance test, the array index determines whether to produce or consume with this job 245 | if [[ "$AWS_BATCH_JOB_ARRAY_INDEX" -lt $args[--num-producers] ]]; then 246 | producer_command | mawk -W interactive ' 247 | /TimeoutException/ { if(++timeouts % 100000 == 0) {print "Filtered", timeouts, "TimeoutExceptions."} } 248 | /error/ { errors++ } 249 | !/TimeoutException/ { print $0 } 250 | END { 251 | printf "{ \"type\": \"producer\", \"test_summary\": \"%s\", \"timeout_exception_count\": \"%d\", \"error_count\": \"%d\" }\n", $0, timeouts, errors 252 | } 253 | ' 254 | else 255 | CONSUMER_GROUP="$(((AWS_BATCH_JOB_ARRAY_INDEX - args[--num-producers]) / args[--size-consumer-group]))" 256 | 257 | consumer_command | mawk -W interactive ' 258 | /WARNING: Exiting before consuming the expected number of messages/ { warnings++ } 259 | // { print $0 } 260 | END { 261 | printf "{ \"type\": \"consumer\", \"too_few_messages_count\": \"%d\" }\n", warnings 262 | } 263 | ' 264 | fi 265 | ;; 266 | 267 | *) 268 | print NIL 269 | ;; 270 | esac -------------------------------------------------------------------------------- /cdk/jest.config.js: -------------------------------------------------------------------------------- 1 | module.exports = { 2 | roots: ['/test'], 3 | testMatch: ['**/*.test.ts'], 4 | transform: { 5 | '^.+\\.tsx?$': 'ts-jest' 6 | } 7 | 
}; 8 | -------------------------------------------------------------------------------- /cdk/lambda/manage-infrastructure.js: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | const AWS = require('aws-sdk'); 17 | const batch = new AWS.Batch(); 18 | const cloudwatch = new AWS.CloudWatch(); 19 | 20 | exports.queryMskClusterThroughput = async function(event) { 21 | console.log(JSON.stringify(event)); 22 | 23 | const jobs = { 24 | jobs: [ 25 | event.depletion_job.JobId 26 | ] 27 | }; 28 | 29 | const jobDetails = await batch.describeJobs(jobs).promise(); 30 | 31 | console.log(JSON.stringify(jobDetails)); 32 | 33 | const jobCreatedAt = jobDetails.jobs[0].createdAt; 34 | const statusSummary = jobDetails.jobs[0].arrayProperties.statusSummary; 35 | 36 | // query max cluster throughput in the last 5 min, but not before the job started 37 | const queryEndTime = Date.now() 38 | const queryStartTime = Math.max(queryEndTime - 300_000, jobCreatedAt); 39 | 40 | const params = { 41 | EndTime: new Date(queryEndTime), 42 | StartTime: new Date(queryStartTime), 43 | MetricDataQueries: [ 44 | { 45 | Id: 'bytesInSum', 46 | Expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${process.env.MSK_CLUSTER_NAME}" MetricName="BytesInPerSec"', 'Maximum', 300))`, 47 | Period: '300', 48 | }, 49 | { 50 | Id: 'bytesInStddev', 51 | Expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${process.env.MSK_CLUSTER_NAME}" MetricName="BytesInPerSec"', 'Minimum', 60))`, 52 | Period: '60', 53 | } 54 | ], 55 | }; 56 | 57 | const metrics = await cloudwatch.getMetricData(params).promise(); 58 | 59 | console.log(JSON.stringify(metrics)); 60 | 61 | const result = { 62 | clusterMbInPerSec: metrics.MetricDataResults.find(m => m.Id == 'bytesInSum').Values[0] / 1024**2, 63 | brokerMbInPerSecStddev: metrics.MetricDataResults.find(m => m.Id == 'bytesInStddev').Values.reduce((acc,v) => v>acc ? v : acc, 0) / 1024**2, 64 | succeededPlusFailedJobs: statusSummary.SUCCEEDED + statusSummary.FAILED 65 | }; 66 | 67 | console.log(JSON.stringify(result)); 68 | 69 | return result; 70 | } 71 | 72 | exports.terminateDepletionJob = async function(event) { 73 | console.log(JSON.stringify(event)); 74 | 75 | //fixme: change to query pending jobs of the job queue and terminate them 76 | 77 | const params = { 78 | jobId: event.JobId, 79 | reason: "Batch job terminated by StepFunctions workflow." 
80 | }; 81 | 82 | return batch.terminateJob(params).promise(); 83 | } 84 | 85 | 86 | exports.updateMinCpus = async function(event) { 87 | console.log(JSON.stringify(event)); 88 | 89 | const params = { 90 | computeEnvironment: process.env.COMPUTE_ENVIRONMENT, 91 | computeResources: { 92 | minvCpus: 'current_test' in event ? event.current_test.parameters.num_jobs * 2 : 0 93 | } 94 | }; 95 | 96 | return batch.updateComputeEnvironment(params).promise(); 97 | } -------------------------------------------------------------------------------- /cdk/lambda/query-test-output.js: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | const AWS = require('aws-sdk') 17 | const batch = new AWS.Batch(); 18 | const cloudwatchlogs = new AWS.CloudWatchLogs(); 19 | 20 | exports.queryProducerOutput = async function(event) { 21 | console.log(JSON.stringify(event)); 22 | 23 | const numProducer = event.current_test.parameters.num_producers 24 | const jobId = event.job_result.JobId 25 | 26 | const describeParams = { 27 | jobs: [...Array(numProducer).keys()].map(x => jobId + ":" + x) 28 | } 29 | 30 | const jobDecriptions = await batch.describeJobs(describeParams).promise(); 31 | 32 | const logStreams = jobDecriptions.jobs.map(job => job.attempts[0].container.logStreamName); 33 | 34 | console.log(JSON.stringify(logStreams)); 35 | 36 | const filterParams = { 37 | logGroupName: process.env.LOG_GROUP_NAME, 38 | logStreamNames: logStreams, 39 | filterPattern: '{$.type="producer" && $.test_summary=* && $.error_count=*}', 40 | } 41 | 42 | return queryAndParseLogs(filterParams, numProducer); 43 | } 44 | 45 | 46 | async function queryAndParseLogs(filterParams, numProducer) { 47 | let filteredEvents = await cloudwatchlogs.filterLogEvents(filterParams).promise(); 48 | 49 | console.log(JSON.stringify(filteredEvents)); 50 | 51 | let queryResult = filteredEvents.events; 52 | 53 | // obtain all results through pagination 54 | while ('nextToken' in filteredEvents) { 55 | filteredEvents = await cloudwatchlogs.filterLogEvents({...filterParams, nextToken: filteredEvents.nextToken}).promise(); 56 | 57 | console.log(JSON.stringify(filteredEvents)); 58 | 59 | queryResult = [...queryResult, ...filteredEvents.events] 60 | } 61 | 62 | console.log(JSON.stringify(queryResult)); 63 | 64 | if (queryResult.length < numProducer) { 65 | // if we didn't obtain a result for all producer tasks, fail so that the stepfunctions workflow can retry later 66 | throw new Error(`incomplete query results: 
expected ${numProducer} results, found ${queryResult.length}`) 67 | } else { 68 | const testResults = queryResult.map(e => JSON.parse(e.message)); 69 | 70 | console.log(JSON.stringify(testResults)); 71 | 72 | // aggregate number of timeout errors 73 | const errorCountSum = testResults.reduce((acc,message) => acc + parseInt(message.error_count), 0) 74 | // parse producer throughput 75 | const mbPerSec = testResults 76 | .map(message => message.test_summary.match(/([0-9.]+) MB\/sec/)) 77 | .map(match => parseFloat(match[1])) 78 | 79 | return { 80 | errorCountSum: errorCountSum, 81 | mbPerSecSum: mbPerSec.reduce((acc,v) => acc + v, 0), 82 | mbPerSecMin: mbPerSec.reduce((acc,v) => v evaluate_skip_condition(args[1], event) 57 | elif "less-than" in condition: 58 | args = condition["less-than"] 59 | 60 | return evaluate_skip_condition(args[0], event) < evaluate_skip_condition(args[1], event) 61 | else: 62 | raise Exception("unable to parse condition: " + condition) 63 | elif isinstance(condition, str): 64 | if condition == "sent_div_requested_mb_per_sec": 65 | return event["producer_result"]["Payload"]["mbPerSecSum"] / event["current_test"]["parameters"]["cluster_throughput_mb_per_sec"] 66 | else: 67 | raise Exception("unable to parse condition: " + condition) 68 | elif isinstance(condition, int) or isinstance(condition, float): 69 | return condition 70 | else: 71 | raise Exception("unable to parse condition: " + condition) 72 | 73 | 74 | def increment_test_index(index, event, pos): 75 | parameters = [key for key in event["test_specification"]["parameters"].keys()] 76 | 77 | # if all options to increment the index are exhausted 78 | if not (0 <= pos < len(parameters)): 79 | raise OverflowError 80 | 81 | paramater_name = parameters[pos] 82 | parameter_value = event["current_test"]["index"][paramater_name] 83 | 84 | if (parameter_value+1 < len(event["test_specification"]["parameters"][paramater_name])): 85 | # if there are still values that need to be tested for this parameter, increment the index 86 | return ({**index, paramater_name: parameter_value+1}, event["current_test"]["parameters"]["throughput_series_id"]) 87 | else: 88 | if paramater_name == "cluster_throughput_mb_per_sec": 89 | # if the cluster throughput is exhausted, increment the next index and update the throughput series id 90 | (updated_index, _) = increment_test_index({**index, paramater_name: 0}, event, pos+1) 91 | updated_throughput_series_id = random.randint(0,100000) 92 | 93 | return (updated_index, updated_throughput_series_id) 94 | else: 95 | return increment_test_index({**index, paramater_name: 0}, event, pos+1) 96 | 97 | 98 | 99 | def update_parameters(test_specification, updated_test_index, test_id, throughput_series_id): 100 | if updated_test_index is None: 101 | # all test have been completed 102 | return { 103 | "test_specification": test_specification 104 | } 105 | else: 106 | # get the new parameters for the updated index 107 | updated_parameters = { 108 | index_name: test_specification["parameters"][index_name][index_value] for (index_name,index_value) in updated_test_index.items() 109 | } 110 | 111 | # encode the current index into the topic name, so that the parameters can be extracted later 112 | max_topic_property_length = 15 113 | remove_invalid_chars = lambda x: re.sub(r'[^a-zA-Z0-9-]', '-', x.replace(' ', '-')) 114 | 115 | topic_name = 'test-id-' + str(test_id) + '--' + 'throughput-series-id-' + str(throughput_series_id) + '--' + '--'.join([f"{remove_invalid_chars(k)[:max_topic_property_length]}-{str(v)}" for 
k,v in updated_test_index.items()]) 116 | depletion_topic_name = 'test-id-' + str(test_id) + '--' + 'throughput-series-id-' + str(throughput_series_id) + '--depletion' 117 | 118 | record_size_byte = updated_parameters["record_size_byte"] 119 | 120 | if updated_parameters["cluster_throughput_mb_per_sec"]>0: 121 | producer_throughput_byte = updated_parameters["cluster_throughput_mb_per_sec"] * 1024**2 // updated_parameters["num_producers"] 122 | # one consumer group consumes the entire cluster throughput 123 | consumer_throughput_byte = updated_parameters["cluster_throughput_mb_per_sec"] * 1024**2 // updated_parameters["consumer_groups"]["size"] if updated_parameters["consumer_groups"]["size"] > 0 else 0 124 | 125 | num_records_producer = producer_throughput_byte * updated_parameters["duration_sec"] // record_size_byte 126 | num_records_consumer = consumer_throughput_byte * updated_parameters["duration_sec"] // record_size_byte 127 | else: 128 | # for depletion, publish messages as fast as possible 129 | producer_throughput_byte = -1 130 | consumer_throughput_byte = -1 131 | num_records_producer = 2147483647 132 | num_records_consumer = 2147483647 133 | 134 | # derive additional parameters for convenience 135 | updated_parameters = { 136 | **updated_parameters, 137 | "num_jobs": updated_parameters["num_producers"] + updated_parameters["consumer_groups"]["num_groups"]*updated_parameters["consumer_groups"]["size"], 138 | "producer_throughput": producer_throughput_byte, 139 | "records_per_sec": max(1, producer_throughput_byte // record_size_byte), 140 | "num_records_producer": num_records_producer, 141 | "num_records_consumer": num_records_consumer, 142 | "test_id": test_id, 143 | "throughput_series_id": throughput_series_id, 144 | "topic_name": topic_name, 145 | "depletion_topic_name": depletion_topic_name 146 | } 147 | 148 | updated_event = { 149 | "test_specification": test_specification, 150 | "current_test": { 151 | "index": updated_test_index, 152 | "parameters": updated_parameters 153 | } 154 | } 155 | 156 | return updated_event 157 | 158 | 159 | 160 | def skip_remaining_throughput(index, event): 161 | parameters = [key for key in event["test_specification"]["parameters"].keys()] 162 | throughput_index = parameters.index("cluster_throughput_mb_per_sec") 163 | 164 | # set all parameters up to cluster_throughput_mb_per_sec to 0 165 | reset_index = { 166 | index_name:0 for i, index_name in enumerate(index) if i<=throughput_index 167 | } 168 | 169 | # keep values of other parameters and increase next parameter after cluster_throughput_mb_per_sec 170 | (updated_index, _) = increment_test_index({**index, **reset_index}, event, pos=throughput_index+1) 171 | 172 | # choose new series id, as we are starting with new throughput 173 | throughput_series_id = random.randint(0,100000) 174 | 175 | return (updated_index, throughput_series_id) 176 | 177 | 178 | 179 | def update_parameters_for_depletion(event, context): 180 | test_specification = event["test_specification"] 181 | test_parameters = test_specification["parameters"] 182 | test_index = event["current_test"]["index"] 183 | test_id = event["current_test"]["parameters"]["test_id"] 184 | throughput_series_id = event["current_test"]["parameters"]["throughput_series_id"] 185 | depletion_duration_sec = test_specification["depletion_configuration"]["approximate_timeout_hours"] * 60 * 60 186 | 187 | # overwrite throughput of tests with default value for credit depletion 188 | depletion_parameters = { 189 | **test_parameters, 190 | "duration_sec": 
list(map(lambda _: depletion_duration_sec, test_parameters["duration_sec"])), 191 | "cluster_throughput_mb_per_sec": list(map(lambda _: -1, test_parameters["cluster_throughput_mb_per_sec"])), 192 | } 193 | 194 | # keep index unchanged and determine parameters for depletion 195 | return update_parameters({**test_specification, "parameters": depletion_parameters}, test_index, test_id, throughput_series_id) -------------------------------------------------------------------------------- /cdk/lib/batch-ami.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | 17 | import { Construct } from 'constructs'; 18 | import { Aws, Stack, StackProps } from 'aws-cdk-lib'; 19 | import { aws_ecs as ecs } from 'aws-cdk-lib'; 20 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 21 | import { aws_iam as iam } from 'aws-cdk-lib'; 22 | import { aws_imagebuilder as imagebuilder } from 'aws-cdk-lib'; 23 | 24 | 25 | export class BatchAmiStack extends Stack { 26 | ami: ec2.IMachineImage; 27 | 28 | constructor(scope: Construct, id: string, props?: StackProps) { 29 | super(scope, id, props); 30 | 31 | 32 | const agentComponent = new imagebuilder.CfnComponent(this, 'ImageBuilderComponent', { 33 | name: `${Aws.STACK_NAME}`, 34 | platform: 'Linux', 35 | version: '0.1.0', 36 | data: `name: Custom image for AWS Batch 37 | description: Builds a custom image for AWS with additional debugging tools 38 | schemaVersion: 1.0 39 | 40 | phases: 41 | - name: build 42 | steps: 43 | - name: InstallCloudWatchAgentStep 44 | action: ExecuteBash 45 | inputs: 46 | commands: 47 | - | 48 | cat <<'EOF' >> /etc/ecs/ecs.config 49 | ECS_ENABLE_CONTAINER_METADATA=true 50 | EOF 51 | - rpm -Uvh https://amazoncloudwatch-agent-us-west-2.s3.us-west-2.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm 52 | - | 53 | cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF' 54 | { 55 | "agent":{ 56 | "metrics_collection_interval":10, 57 | "omit_hostname":true 58 | }, 59 | "logs":{ 60 | "logs_collected":{ 61 | "files":{ 62 | "collect_list":[ 63 | { 64 | "file_path":"/var/log/ecs/ecs-agent.log*", 65 | "log_group_name":"/aws-batch/ecs-agent.log", 66 | "log_stream_name":"{instance_id}", 67 | "timezone":"Local" 68 | }, 69 | { 70 | "file_path":"/var/log/ecs/ecs-init.log", 71 | "log_group_name":"/aws-batch/ecs-init.log", 72 | "log_stream_name":"{instance_id}", 73 | "timezone":"Local" 74 | }, 75 | { 76 | "file_path":"/var/log/ecs/audit.log*", 77 | 
"log_group_name":"/aws-batch/audit.log", 78 | "log_stream_name":"{instance_id}", 79 | "timezone":"Local" 80 | }, 81 | { 82 | "file_path":"/var/log/messages", 83 | "log_group_name":"/aws-batch/messages", 84 | "log_stream_name":"{instance_id}", 85 | "timezone":"Local" 86 | } 87 | ] 88 | } 89 | } 90 | } 91 | } 92 | EOF 93 | - /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s` 94 | }); 95 | 96 | const imageReceipe = new imagebuilder.CfnImageRecipe(this, 'ImageBuilderReceipt', { 97 | components: [{ 98 | componentArn: agentComponent.attrArn 99 | }], 100 | name: Aws.STACK_NAME, 101 | parentImage: ecs.EcsOptimizedImage.amazonLinux2().getImage(this).imageId, 102 | version: '0.1.0' 103 | }); 104 | 105 | const imageBuilderRole = new iam.Role(this, 'ImageBuilderRole', { 106 | assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'), 107 | managedPolicies: [ 108 | iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'), 109 | iam.ManagedPolicy.fromAwsManagedPolicyName('EC2InstanceProfileForImageBuilder') 110 | ] 111 | }); 112 | 113 | const imageBuilderProfile = new iam.CfnInstanceProfile(this, 'ImageBuilderInstanceProfile', { 114 | roles: [ imageBuilderRole.roleName ], 115 | }); 116 | 117 | const infrastructureconfiguration = new imagebuilder.CfnInfrastructureConfiguration(this, 'InfrastructureConfiguration', { 118 | name: Aws.STACK_NAME, 119 | instanceProfileName: imageBuilderProfile.ref 120 | }); 121 | 122 | const image = new imagebuilder.CfnImage(this, 'Image', { 123 | imageRecipeArn: imageReceipe.attrArn, 124 | infrastructureConfigurationArn: infrastructureconfiguration.attrArn, 125 | }); 126 | 127 | this.ami = ec2.MachineImage.genericLinux({ 128 | [ this.region ] : image.attrImageId 129 | }); 130 | } 131 | } -------------------------------------------------------------------------------- /cdk/lib/cdk-stack.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | import { Construct } from 'constructs'; 17 | import { Aws, Stack, StackProps, CfnResource, CfnOutput, Duration, RemovalPolicy } from 'aws-cdk-lib'; 18 | import { aws_ecs as ecs } from 'aws-cdk-lib'; 19 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 20 | import { aws_iam as iam } from 'aws-cdk-lib'; 21 | import { aws_logs as logs } from 'aws-cdk-lib'; 22 | import { aws_stepfunctions as sfn } from 'aws-cdk-lib'; 23 | import { aws_stepfunctions_tasks as tasks } from 'aws-cdk-lib'; 24 | import { aws_lambda as lambda } from 'aws-cdk-lib'; 25 | import { aws_cloudwatch as cloudwatch} from 'aws-cdk-lib'; 26 | import { aws_sagemaker as sagemaker } from 'aws-cdk-lib'; 27 | import { custom_resources } from 'aws-cdk-lib'; 28 | 29 | import * as msk from '@aws-cdk/aws-msk-alpha'; 30 | import * as batch from '@aws-cdk/aws-batch-alpha'; 31 | 32 | import { InstanceType } from 'aws-cdk-lib/aws-ec2'; 33 | import { RetentionDays } from 'aws-cdk-lib/aws-logs'; 34 | import { IntegrationPattern, TaskInput } from 'aws-cdk-lib/aws-stepfunctions'; 35 | import { AwsCustomResource } from 'aws-cdk-lib/custom-resources'; 36 | import { CreditDepletion } from './credit-depletion-sfn'; 37 | import { ClusterMonitoringLevel, KafkaVersion } from '@aws-cdk/aws-msk-alpha'; 38 | 39 | 40 | export interface MskClusterParametersNewCluster extends StackProps { 41 | clusterProps: { 42 | numberOfBrokerNodes: number, //brokers per AZ 43 | instanceType: ec2.InstanceType, 44 | ebsStorageInfo: msk.EbsStorageInfo, 45 | encryptionInTransit: msk.EncryptionInTransitConfig, 46 | clientAuthetication?: msk.ClientAuthentication, 47 | kafkaVersion: KafkaVersion, 48 | configurationInfo?: msk.ClusterConfigurationInfo 49 | }, 50 | vpc?: ec2.IVpc, 51 | sg?: ec2.ISecurityGroup, 52 | initialPerformanceTest?: any, 53 | } 54 | 55 | export interface BootstrapBroker extends StackProps { 56 | bootstrapBrokerString: string, 57 | clusterName: string, 58 | vpc?: ec2.IVpc | string, 59 | sg: ec2.ISecurityGroup | string, 60 | initialPerformanceTest?: any, 61 | } 62 | 63 | 64 | 65 | function toStackName(params: MskClusterParametersNewCluster | BootstrapBroker, prefix: string) : string { 66 | if ('bootstrapBrokerString' in params) { 67 | return prefix 68 | } else { 69 | const brokerNodeType = params.clusterProps.instanceType.toString().replace(/[^0-9a-zA-Z]/g, ''); 70 | const inClusterEncryption = params.clusterProps.encryptionInTransit.enableInCluster ? "t" : "f" 71 | const clusterVersion = params.clusterProps.kafkaVersion.version.replace(/[^0-9a-zA-Z]/g, ''); 72 | 73 | const numSubnets = params.vpc ? 
Math.max(params.vpc.privateSubnets.length, 3) : 3; 74 | 75 | const name = `${params.clusterProps.numberOfBrokerNodes*numSubnets}-${brokerNodeType}-${params.clusterProps.ebsStorageInfo.volumeSize}-${inClusterEncryption}-${clusterVersion}`; 76 | 77 | return `${prefix}${name}`; 78 | } 79 | } 80 | 81 | 82 | export class CdkStack extends Stack { 83 | constructor(scope: Construct, id: string, props: MskClusterParametersNewCluster | BootstrapBroker) { 84 | super(scope, toStackName(props, id), { ...props, description: 'Creates resources to run automated performance tests against Amazon MSK clusters (shausma-msk-performance)' }); 85 | 86 | var vpc; 87 | if (props.vpc) { 88 | if (typeof props.vpc === "string") { 89 | vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcId: props.vpc }) 90 | } else { 91 | vpc = props.vpc 92 | } 93 | } else { 94 | vpc = new ec2.Vpc(this, 'Vpc'); 95 | } 96 | 97 | var sg; 98 | if (props.sg) { 99 | if (typeof props.sg === "string") { 100 | sg = ec2.SecurityGroup.fromSecurityGroupId(this, 'SecurityGroup', props.sg) 101 | } else { 102 | sg = props.sg 103 | } 104 | } else { 105 | sg = new ec2.SecurityGroup(this, 'SecurityGroup', { vpc: vpc }); 106 | sg.addIngressRule(sg, ec2.Port.allTraffic()); 107 | } 108 | 109 | var kafka : { cfnCluster?: CfnResource, numBrokers?: number, clusterName: string }; 110 | 111 | 112 | if ('bootstrapBrokerString' in props) { 113 | kafka = { 114 | clusterName: props.clusterName 115 | }; 116 | } else { 117 | const cluster = new msk.Cluster(this, 'KafkaCluster', { 118 | ...props.clusterProps, 119 | monitoring: { 120 | clusterMonitoringLevel: ClusterMonitoringLevel.PER_BROKER 121 | }, 122 | vpc: vpc, 123 | vpcSubnets: { 124 | subnetType: ec2.SubnetType.PRIVATE_WITH_NAT 125 | }, 126 | securityGroups: [ sg ], 127 | clusterName: Aws.STACK_NAME, 128 | removalPolicy: RemovalPolicy.DESTROY, 129 | }); 130 | 131 | kafka = { 132 | cfnCluster: cluster.node.findChild('Resource') as CfnResource, 133 | numBrokers: Math.max(vpc.privateSubnets.length, 3) * props.clusterProps.numberOfBrokerNodes, 134 | clusterName: cluster.clusterName 135 | } 136 | } 137 | 138 | 139 | const defaultLogGroup = new logs.LogGroup(this, 'DefaultLogGroup', { 140 | retention: RetentionDays.ONE_WEEK 141 | }); 142 | 143 | const performanceTestLogGroup = new logs.LogGroup(this, 'PerformanceTestLogGroup', { 144 | retention: RetentionDays.ONE_YEAR 145 | }); 146 | 147 | 148 | const batchEcsInstanceRole = new iam.Role(this, 'BatchEcsInstanceRole', { 149 | assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'), 150 | managedPolicies: [ 151 | iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role'), 152 | iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore') 153 | ] 154 | }); 155 | 156 | const batchEcsInstanceProfile = new iam.CfnInstanceProfile(this, 'BatchEcsInstanceProfile', { 157 | roles: [ batchEcsInstanceRole.roleName ], 158 | }); 159 | 160 | const computeEnvironment = new batch.ComputeEnvironment(this, 'ComputeEnvironment', { 161 | computeEnvironmentName: `${Aws.STACK_NAME}`, 162 | computeResources: { 163 | vpc, 164 | instanceTypes: [ new InstanceType('c5n.large') ], 165 | instanceRole: batchEcsInstanceProfile.attrArn, 166 | securityGroups: [ sg ], 167 | }, 168 | }); 169 | 170 | const jobQueue = new batch.JobQueue(this, 'JobQueue', { 171 | jobQueueName: Aws.STACK_NAME, 172 | computeEnvironments: [ 173 | { 174 | computeEnvironment, 175 | order: 1, 176 | }, 177 | ], 178 | }); 179 | 180 | const jobRole = new iam.Role(this, 'JobRole', { 181 | 
assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'), 182 | managedPolicies: [ 183 | iam.ManagedPolicy.fromAwsManagedPolicyName('ReadOnlyAccess') 184 | ], 185 | inlinePolicies: { 186 | TerminateJob: new iam.PolicyDocument({ 187 | statements: [ 188 | new iam.PolicyStatement({ 189 | actions: ["batch:TerminateJob"], 190 | resources: ["*"] 191 | }), 192 | new iam.PolicyStatement({ 193 | actions: [ 194 | "kafka-cluster:Connect", 195 | "kafka-cluster:DescribeCluster" 196 | ], 197 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:cluster/${kafka.clusterName}/*`] 198 | }), 199 | new iam.PolicyStatement({ 200 | actions: [ 201 | "kafka-cluster:*Topic*", 202 | "kafka-cluster:WriteData", 203 | "kafka-cluster:ReadData" 204 | ], 205 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:topic/${kafka.clusterName}/*`] 206 | }), 207 | new iam.PolicyStatement({ 208 | actions: [ 209 | "kafka-cluster:AlterGroup", 210 | "kafka-cluster:DescribeGroup" 211 | ], 212 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:group/${kafka.clusterName}/*`] 213 | }) 214 | ] 215 | }) 216 | } 217 | }); 218 | 219 | 220 | const containerImage = ecs.ContainerImage.fromAsset('./docker'); 221 | 222 | const performanceTestJobDefinition = new batch.JobDefinition(this, 'PerformanceTestJobDefinition', { 223 | container: { 224 | image: containerImage, 225 | vcpus: 2, 226 | memoryLimitMiB: 4900, 227 | logConfiguration: { 228 | logDriver: batch.LogDriver.AWSLOGS, 229 | options: { 230 | 'awslogs-region': Aws.REGION, 231 | 'awslogs-group': performanceTestLogGroup.logGroupName 232 | } 233 | }, 234 | jobRole: jobRole 235 | }, 236 | }); 237 | 238 | const defaultJobDefinition = new batch.JobDefinition(this, 'DefaultJobDefinition', { 239 | container: { 240 | image: containerImage, 241 | vcpus: 2, 242 | memoryLimitMiB: 4900, 243 | logConfiguration: { 244 | logDriver: batch.LogDriver.AWSLOGS, 245 | options: { 246 | 'awslogs-region': Aws.REGION, 247 | 'awslogs-group': defaultLogGroup.logGroupName 248 | } 249 | }, 250 | jobRole: jobRole 251 | }, 252 | }); 253 | 254 | 255 | 256 | const payload = TaskInput.fromObject({ 257 | 'num_jobs.$': 'States.JsonToString($.current_test.parameters.num_jobs)', //hack to convert number to string 258 | 'replication_factor.$': 'States.JsonToString($.current_test.parameters.replication_factor)', 259 | topic_name: sfn.JsonPath.stringAt('$.current_test.parameters.topic_name'), 260 | depletion_topic_name: sfn.JsonPath.stringAt('$.current_test.parameters.depletion_topic_name'), 261 | 'record_size_byte.$': 'States.JsonToString($.current_test.parameters.record_size_byte)', 262 | 'records_per_sec.$': 'States.JsonToString($.current_test.parameters.records_per_sec)', 263 | 'num_partitions.$': 'States.JsonToString($.current_test.parameters.num_partitions)', 264 | 'duration_sec.$': 'States.JsonToString($.current_test.parameters.duration_sec)', 265 | producer_props: sfn.JsonPath.stringAt('$.current_test.parameters.client_props.producer'), //this fails for empty strings :-( 266 | 'num_producers.$': 'States.JsonToString($.current_test.parameters.num_producers)', 267 | 'num_records_producer.$': 'States.JsonToString($.current_test.parameters.num_records_producer)', 268 | consumer_props: sfn.JsonPath.stringAt('$.current_test.parameters.client_props.consumer'), 269 | 'num_consumer_groups.$': 'States.JsonToString($.current_test.parameters.consumer_groups.num_groups)', 270 | 'size_consumer_group.$': 
'States.JsonToString($.current_test.parameters.consumer_groups.size)', 271 | 'num_records_consumer.$': 'States.JsonToString($.current_test.parameters.num_records_consumer)', 272 | }); 273 | 274 | const commandParameters = [ 275 | '--region', Aws.REGION, 276 | '--msk-cluster-arn', kafka.cfnCluster? kafka.cfnCluster.ref : '-', 277 | '--bootstrap-broker', 'bootstrapBrokerString' in props ? props.bootstrapBrokerString : '-', 278 | '--num-jobs', 'Ref::num_jobs', 279 | '--replication-factor', 'Ref::replication_factor', 280 | '--topic-name', 'Ref::topic_name', 281 | '--depletion-topic-name', 'Ref::depletion_topic_name', 282 | '--record-size-byte', 'Ref::record_size_byte', 283 | '--records-per-sec', 'Ref::records_per_sec', 284 | '--num-partitions', 'Ref::num_partitions', 285 | '--duration-sec', 'Ref::duration_sec', 286 | '--producer-props', 'Ref::producer_props', 287 | '--num-producers', 'Ref::num_producers', 288 | '--num-records-producer', 'Ref::num_records_producer', 289 | '--consumer-props', 'Ref::consumer_props', 290 | '--num-consumer-groups', 'Ref::num_consumer_groups', 291 | '--num-records-consumer', 'Ref::num_records_consumer', 292 | '--size-consumer-group', 'Ref::size_consumer_group' 293 | ] 294 | 295 | 296 | const queryProducerResultLambda = new lambda.Function(this, 'QueryProducerResultLambda', { 297 | runtime: lambda.Runtime.NODEJS_14_X, 298 | code: lambda.Code.fromAsset('lambda'), 299 | timeout: Duration.minutes(2), 300 | handler: 'query-test-output.queryProducerOutput', 301 | environment: { 302 | LOG_GROUP_NAME: performanceTestLogGroup.logGroupName 303 | } 304 | }); 305 | 306 | performanceTestLogGroup.grant(queryProducerResultLambda, 'logs:FilterLogEvents') 307 | 308 | queryProducerResultLambda.addToRolePolicy( 309 | new iam.PolicyStatement({ 310 | actions: ['batch:DescribeJobs'], 311 | resources: ['*'] 312 | }) 313 | ); 314 | 315 | const queryProducerResult = new tasks.LambdaInvoke(this, 'QueryProducerResult', { 316 | lambdaFunction: queryProducerResultLambda, 317 | resultPath: '$.producer_result' 318 | }); 319 | 320 | // if the query restult is not available in cloudwatch yet, retry query 321 | queryProducerResult.addRetry({ 322 | interval: Duration.seconds(10), 323 | maxAttempts: 3 324 | }); 325 | 326 | 327 | const runPerformanceTest = new tasks.BatchSubmitJob(this, 'RunPerformanceTest', { 328 | jobName: 'RunPerformanceTest', 329 | jobDefinitionArn: performanceTestJobDefinition.jobDefinitionArn, 330 | jobQueueArn: jobQueue.jobQueueArn, 331 | arraySize: sfn.JsonPath.numberAt('$.current_test.parameters.num_jobs'), 332 | containerOverrides: { 333 | command: [...commandParameters, '--command', 'run-performance-test' ] 334 | }, 335 | payload: payload, 336 | resultPath: '$.job_result', 337 | }); 338 | 339 | runPerformanceTest.next(queryProducerResult); 340 | 341 | 342 | const depleteCreditsConstruct = new CreditDepletion(this, 'DepleteCreditsSfn', { 343 | clusterName: kafka.clusterName, 344 | jobDefinition: defaultJobDefinition, 345 | jobQueue: jobQueue, 346 | commandParameters: commandParameters, 347 | payload: payload 348 | }) 349 | 350 | const depleteCredits = new tasks.StepFunctionsStartExecution(this, 'DepleteCredits', { 351 | stateMachine: depleteCreditsConstruct.stateMachine, 352 | integrationPattern: IntegrationPattern.RUN_JOB, 353 | inputPath: '$', 354 | resultPath: 'DISCARD', 355 | timeout: Duration.hours(6) //fixme: use timeoutPath when it becomes exposed through cdk 356 | }); 357 | 358 | depleteCredits.next(runPerformanceTest); 359 | 360 | 361 | const createTopics = new 
tasks.BatchSubmitJob(this, 'CreateTopics', { 362 | jobName: 'CreateTopics', 363 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 364 | jobQueueArn: jobQueue.jobQueueArn, 365 | containerOverrides: { 366 | command: [...commandParameters, '--command', 'create-topics' ] 367 | }, 368 | payload: payload, 369 | resultPath: 'DISCARD', 370 | }); 371 | 372 | createTopics.next(depleteCredits); 373 | 374 | 375 | const finalState = new sfn.Succeed(this, 'Succeed'); 376 | 377 | const checkCompleted = new sfn.Choice(this, 'CheckAllTestsCompleted') 378 | .when(sfn.Condition.isPresent('$.current_test'), createTopics) 379 | .otherwise(finalState); 380 | 381 | 382 | const updateTestParameterLambda = new lambda.Function(this, 'UpdateTestParameterLambda', { 383 | runtime: lambda.Runtime.PYTHON_3_8, 384 | code: lambda.Code.fromAsset('lambda'), 385 | timeout: Duration.seconds(5), 386 | handler: 'test-parameters.increment_index_and_update_parameters', 387 | }); 388 | 389 | const updateTestParameter = new tasks.LambdaInvoke(this, 'UpdateTestParameter', { 390 | lambdaFunction: updateTestParameterLambda, 391 | outputPath: '$.Payload', 392 | }); 393 | 394 | updateTestParameter.next(checkCompleted); 395 | 396 | 397 | const deleteTopics = new tasks.BatchSubmitJob(this, 'DeleteAllTopics', { 398 | jobName: 'DeleteAllTopics', 399 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 400 | jobQueueArn: jobQueue.jobQueueArn, 401 | containerOverrides: { 402 | command: [...commandParameters, '--command', 'delete-topics' ] 403 | }, 404 | payload: payload, 405 | resultPath: 'DISCARD', 406 | }); 407 | 408 | deleteTopics.next(updateTestParameter); 409 | queryProducerResult.next(deleteTopics); 410 | 411 | 412 | const fail = new sfn.Fail(this, 'Fail'); 413 | 414 | 415 | const cleanup = new tasks.BatchSubmitJob(this, 'DeleteAllTopicsTrap', { 416 | jobName: 'DeleteAllTopics', 417 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 418 | jobQueueArn: jobQueue.jobQueueArn, 419 | containerOverrides: { 420 | command: [ 421 | '--region', Aws.REGION, 422 | '--msk-cluster-arn', kafka.cfnCluster? kafka.cfnCluster.ref : '-', 423 | '--bootstrap-broker', 'bootstrapBrokerString' in props ? 
props.bootstrapBrokerString : '-', 424 | '--producer-props', 'Ref::producer_props', 425 | '--command', 'delete-topics' 426 | ] 427 | }, 428 | payload: TaskInput.fromObject({ 429 | 'producer_props.$': '$.Cause' 430 | }), 431 | resultPath: 'DISCARD' 432 | }); 433 | 434 | cleanup.next(fail); 435 | 436 | 437 | const parallel = new sfn.Parallel(this, 'IterateThroughPerformanceTests'); 438 | 439 | parallel.branch(updateTestParameter); 440 | parallel.addCatch(cleanup); 441 | 442 | const stateMachine = new sfn.StateMachine(this, 'StateMachine', { 443 | definition: parallel, 444 | stateMachineName: `${Aws.STACK_NAME}-main` 445 | }); 446 | 447 | if (props.initialPerformanceTest) { 448 | new AwsCustomResource(this, 'InitialPerformanceTestResource', { 449 | policy: custom_resources.AwsCustomResourcePolicy.fromSdkCalls({ 450 | resources: [ stateMachine.stateMachineArn ] 451 | }), 452 | onCreate: { 453 | action: "startExecution", 454 | service: "StepFunctions", 455 | parameters: { 456 | input: JSON.stringify(props.initialPerformanceTest), 457 | stateMachineArn: stateMachine.stateMachineArn 458 | }, 459 | physicalResourceId: { 460 | id: 'InitialPerformanceTestCall' 461 | } 462 | } 463 | }) 464 | } 465 | 466 | 467 | const bytesInWidget = new cloudwatch.GraphWidget({ 468 | left: [ 469 | new cloudwatch.MathExpression({ 470 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60))/1024/1024`, 471 | label: `cluster ${kafka.clusterName}`, 472 | usingMetrics: {} 473 | }), 474 | new cloudwatch.MathExpression({ 475 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60)/1024/1024`, 476 | label: 'broker', 477 | usingMetrics: {} 478 | }), 479 | ], 480 | leftYAxis: { 481 | min: 0 482 | }, 483 | title: 'throughput in (MB/sec)', 484 | width: 24, 485 | liveData: true 486 | }); 487 | 488 | const bytesInOutStddevWidget = new cloudwatch.GraphWidget({ 489 | left: [ 490 | new cloudwatch.MathExpression({ 491 | expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60))/1024/1024`, 492 | label: 'in', 493 | usingMetrics: {} 494 | }), 495 | new cloudwatch.MathExpression({ 496 | expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Minimum', 60))/1024/1024`, 497 | label: 'out', 498 | usingMetrics: {} 499 | }), 500 | ], 501 | leftYAxis: { 502 | min: 0, 503 | max: 3 504 | }, 505 | leftAnnotations: [{ 506 | value: 0.75 507 | }], 508 | title: 'stdev broker throughput (MB/sec)', 509 | width: 24, 510 | liveData: true, 511 | }); 512 | 513 | 514 | const bytesOutWidget = new cloudwatch.GraphWidget({ 515 | left: [ 516 | new cloudwatch.MathExpression({ 517 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Average', 60))/1024/1024`, 518 | label: `cluster ${kafka.clusterName}`, 519 | usingMetrics: {} 520 | }), 521 | new cloudwatch.MathExpression({ 522 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Average', 60)/1024/1024`, 523 | label: 'broker', 524 | usingMetrics: {} 525 | }), 526 | ], 527 | title: 'throughput out (MB/sec)', 528 | width: 24, 529 | liveData: true 530 | }); 531 | 532 | const burstBalance = new 
cloudwatch.GraphWidget({ 533 | left: [ 534 | new cloudwatch.MathExpression({ 535 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BurstBalance"', 'Minimum', 60)`, 536 | label: `EBS volume BurstBalance`, 537 | usingMetrics: {} 538 | }), 539 | new cloudwatch.MathExpression({ 540 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CPUCreditBalance"', 'Minimum', 60)`, 541 | label: `CPUCreditBalance`, 542 | usingMetrics: {} 543 | }), 544 | ], 545 | title: 'burst balances', 546 | width: 24, 547 | liveData: true 548 | }); 549 | 550 | 551 | const trafficShaping = new cloudwatch.GraphWidget({ 552 | left: [ 553 | new cloudwatch.MathExpression({ 554 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BwInAllowanceExceeded"', 'Maximum', 60)`, 555 | label: `BwInAllowanceExceeded`, 556 | usingMetrics: {} 557 | }), 558 | new cloudwatch.MathExpression({ 559 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BwOutAllowanceExceeded"', 'Maximum', 60)`, 560 | label: `BwOutAllowanceExceeded`, 561 | usingMetrics: {} 562 | }), 563 | new cloudwatch.MathExpression({ 564 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PpsAllowanceExceeded"', 'Maximum', 60)`, 565 | label: `PpsAllowanceExceeded`, 566 | usingMetrics: {} 567 | }), 568 | new cloudwatch.MathExpression({ 569 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="ConntrackAllowanceExceeded"', 'Maximum', 60)`, 570 | label: `ConntrackAllowanceExceeded`, 571 | usingMetrics: {} 572 | }), 573 | ], 574 | title: 'traffic shaping applied', 575 | width: 24, 576 | liveData: true 577 | }); 578 | 579 | 580 | const cpuIdleWidget = new cloudwatch.GraphWidget({ 581 | left: [ 582 | new cloudwatch.MathExpression({ 583 | expression: `AVG(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CpuIdle"', 'Minimum', 60))`, 584 | label: `cluster ${kafka.clusterName}`, 585 | usingMetrics: {} 586 | }), 587 | new cloudwatch.MathExpression({ 588 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CpuIdle"', 'Minimum', 60)`, 589 | label: 'broker', 590 | usingMetrics: {} 591 | }), 592 | ], 593 | title: 'cpu idle (percent)', 594 | width: 24, 595 | leftYAxis: { 596 | min: 0 597 | }, 598 | liveData: true 599 | }); 600 | 601 | const replicationWidget = new cloudwatch.GraphWidget({ 602 | left: [ 603 | new cloudwatch.MathExpression({ 604 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="UnderReplicatedPartitions"', 'Maximum', 60))`, 605 | label: 'under replicated', 606 | usingMetrics: {} 607 | }), 608 | new cloudwatch.MathExpression({ 609 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="UnderMinIsrPartitionCount"', 'Maximum', 60))`, 610 | label: 'under min isr', 611 | usingMetrics: {} 612 | }), 613 | ], 614 | leftYAxis: { 615 | min: 0 616 | }, 617 | title: 'partition (count)', 618 | width: 24, 619 | liveData: true 620 | }); 621 | 622 | const partitionWidget = new cloudwatch.GraphWidget({ 623 | left: [ 624 | new cloudwatch.MathExpression({ 625 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster 
Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PartitionCount"', 'Maximum', 60))`, 626 | label: 'partiton count', 627 | usingMetrics: {} 628 | }), 629 | new cloudwatch.MathExpression({ 630 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PartitionCount"', 'Maximum', 60)`, 631 | label: 'partiton count', 632 | usingMetrics: {} 633 | }), 634 | ], 635 | leftYAxis: { 636 | min: 0 637 | }, 638 | title: 'partition (count)', 639 | width: 24, 640 | liveData: true 641 | }); 642 | 643 | const dashboard = new cloudwatch.Dashboard(this, 'Dashboard', { 644 | dashboardName: Aws.STACK_NAME, 645 | widgets: [ 646 | [ bytesInWidget ], 647 | [ bytesOutWidget ], 648 | [ trafficShaping ], 649 | [ burstBalance ], 650 | [ bytesInOutStddevWidget ], 651 | [ cpuIdleWidget ], 652 | [ replicationWidget ], 653 | [ partitionWidget ] 654 | ], 655 | }); 656 | 657 | const sagemakerInstanceRole = new iam.Role(this, 'SagemakerInstanceRole', { 658 | assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'), 659 | inlinePolicies: { 660 | cloudFormation: new iam.PolicyDocument({ 661 | statements: [new iam.PolicyStatement({ 662 | actions: [ 663 | 'cloudformation:DescribeStacks' 664 | ], 665 | resources: [ Aws.STACK_ID ], 666 | })] 667 | }), 668 | cwLogs: new iam.PolicyDocument({ 669 | statements: [ 670 | new iam.PolicyStatement({ 671 | actions: [ 672 | 'logs:StartQuery' 673 | ], 674 | resources: [ performanceTestLogGroup.logGroupArn ], 675 | }), 676 | new iam.PolicyStatement({ 677 | actions: [ 678 | 'logs:GetQueryResults' 679 | ], 680 | resources: [ '*' ], 681 | }) 682 | ] 683 | }), 684 | stepFunctions: new iam.PolicyDocument({ 685 | statements: [new iam.PolicyStatement({ 686 | actions: [ 687 | 'states:DescribeExecution' 688 | ], 689 | resources: [ `arn:${Aws.PARTITION}:states:${Aws.REGION}:${Aws.ACCOUNT_ID}:execution:${stateMachine.stateMachineName}:*` ], 690 | })] 691 | }) 692 | } 693 | }); 694 | 695 | const sagemakerNotebook = new sagemaker.CfnNotebookInstance(this, 'SagemakerNotebookInstance', { 696 | instanceType: 'ml.t3.medium', 697 | roleArn: sagemakerInstanceRole.roleArn, 698 | notebookInstanceName: Aws.STACK_NAME, 699 | defaultCodeRepository: 'https://github.com/aws-samples/performance-testing-framework-for-apache-kafka/' 700 | }); 701 | 702 | new CfnOutput(this, 'SagemakerNotebook', { 703 | value: `https://console.aws.amazon.com/sagemaker/home#/notebook-instances/openNotebook/${sagemakerNotebook.attrNotebookInstanceName}?view=lab` 704 | }); 705 | 706 | new CfnOutput(this, 'LogGroupName', { 707 | value: performanceTestLogGroup.logGroupName 708 | }); 709 | } 710 | } 711 | -------------------------------------------------------------------------------- /cdk/lib/credit-depletion-sfn.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 
8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | 17 | import { Construct } from 'constructs'; 18 | import { Aws, StackProps, Duration } from 'aws-cdk-lib'; 19 | import { aws_iam as iam } from 'aws-cdk-lib'; 20 | import { aws_stepfunctions as sfn } from 'aws-cdk-lib'; 21 | import { aws_stepfunctions_tasks as tasks } from 'aws-cdk-lib'; 22 | import { aws_lambda as lambda } from 'aws-cdk-lib'; 23 | import * as batch from '@aws-cdk/aws-batch-alpha'; 24 | 25 | import { IntegrationPattern } from 'aws-cdk-lib/aws-stepfunctions'; 26 | 27 | export interface CreditDepletionParameters extends StackProps { 28 | clusterName: string, 29 | jobDefinition: batch.IJobDefinition, 30 | jobQueue: batch.IJobQueue, 31 | commandParameters: string[], 32 | payload: sfn.TaskInput 33 | } 34 | 35 | 36 | export class CreditDepletion extends Construct { 37 | 38 | stateMachine: sfn.StateMachine; 39 | 40 | constructor(scope: Construct, id: string, props: CreditDepletionParameters) { 41 | super(scope, id); 42 | 43 | 44 | const fail = new sfn.Fail(this, 'FailDepletion'); 45 | const succeed = new sfn.Succeed(this, 'SucceedDepletion'); 46 | 47 | // the number of jobs is determined before all jobs are manually failed, so if there were already failed or succeeded jobs, the depletion exited early and needs to be retried 48 | const checkAllJobsRunning = new sfn.Choice(this, 'CheckAllJobsRunning') 49 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), fail) 50 | .otherwise(succeed); 51 | 52 | 53 | const terminateCreditDepletionLambda = new lambda.Function(this, 'TerminateCreditDepletionLambda', { 54 | runtime: lambda.Runtime.NODEJS_14_X, 55 | code: lambda.Code.fromAsset('lambda'), 56 | timeout: Duration.seconds(5), 57 | handler: 'manage-infrastructure.terminateDepletionJob', 58 | }); 59 | 60 | terminateCreditDepletionLambda.addToRolePolicy( 61 | new iam.PolicyStatement({ 62 | actions: ['batch:TerminateJob'], 63 | resources: ['*'] 64 | }) 65 | ); 66 | 67 | const terminateCreditDepletion = new tasks.LambdaInvoke(this, 'TerminateCreditDepletion', { 68 | lambdaFunction: terminateCreditDepletionLambda, 69 | inputPath: '$.depletion_job', 70 | resultPath: 'DISCARD' 71 | }); 72 | 73 | terminateCreditDepletion.next(checkAllJobsRunning); 74 | 75 | 76 | const queryClusterThroughputLambda = new lambda.Function(this, 'QueryClusterThroughputLambda', { 77 | runtime: lambda.Runtime.NODEJS_14_X, 78 | code: lambda.Code.fromAsset('lambda'), 79 | timeout: Duration.seconds(5), 80 | handler: 'manage-infrastructure.queryMskClusterThroughput', 81 | environment: { 82 | MSK_CLUSTER_NAME: props.clusterName 83 | } 84 | }); 85 | 86 | queryClusterThroughputLambda.addToRolePolicy( 87 | new iam.PolicyStatement({ 88 | actions: ['cloudwatch:GetMetricData', 'batch:DescribeJobs'], 89 | resources: ['*'] 90 | }) 91 | ); 92 | 93 | const queryClusterThroughputLowerThan = new tasks.LambdaInvoke(this, 'QueryClusterThroughputLowerThan', { 94 | lambdaFunction: queryClusterThroughputLambda, 95 | inputPath: '$', 96 | resultPath: '$.cluster_throughput' 97 | }); 
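// Lower-threshold polling loop (editor's note, added for clarity): once the upper throughput
// threshold has been exceeded, the workflow waits one minute at a time and re-queries the cluster
// throughput until it drops below depletion_configuration.lower_threshold.mb_per_sec, or until a
// depletion job has unexpectedly failed or succeeded; in either case the depletion job is then
// terminated via TerminateCreditDepletion.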
98 | 99 | const waitThroughputLowerThan = new sfn.Wait(this, 'WaitThroughputLowerThan', { 100 | time: sfn.WaitTime.duration(Duration.minutes(1)) 101 | }); 102 | 103 | waitThroughputLowerThan.next(queryClusterThroughputLowerThan); 104 | 105 | 106 | const checkThroughputLowerThan = new sfn.Choice(this, 'CheckThroughputLowerThanThreshold') 107 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), terminateCreditDepletion) 108 | .when( 109 | sfn.Condition.and( 110 | sfn.Condition.numberLessThanJsonPath('$.cluster_throughput.Payload.clusterMbInPerSec', '$.test_specification.depletion_configuration.lower_threshold.mb_per_sec'), 111 | // sfn.Condition.numberLessThanJsonPath('$.cluster_throughput.Payload.brokerMbInPerSecStddev', '$.test_specification.depletion_configuration.lower_threshold.max_broker_stddev') 112 | ), 113 | terminateCreditDepletion 114 | ) 115 | .otherwise(waitThroughputLowerThan); 116 | 117 | queryClusterThroughputLowerThan.next(checkThroughputLowerThan); 118 | 119 | 120 | const queryClusterThroughputExceeded = new tasks.LambdaInvoke(this, 'QueryClusterThroughputExceeded', { 121 | lambdaFunction: queryClusterThroughputLambda, 122 | inputPath: '$', 123 | resultPath: '$.cluster_throughput' 124 | }); 125 | 126 | const waitThroughputExceeded = new sfn.Wait(this, 'WaitThroughputExceeded', { 127 | time: sfn.WaitTime.duration(Duration.minutes(1)) 128 | }); 129 | 130 | waitThroughputExceeded.next(queryClusterThroughputExceeded); 131 | 132 | 133 | const checkLowerThanPresent = new sfn.Choice(this, 'CheckLowerThanPresent') 134 | .when(sfn.Condition.isPresent('$.test_specification.depletion_configuration.lower_threshold.mb_per_sec'), waitThroughputLowerThan) 135 | .otherwise(terminateCreditDepletion); 136 | 137 | 138 | const checkThroughputExceeded = new sfn.Choice(this, 'CheckThroughputExceeded') 139 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), terminateCreditDepletion) 140 | .when(sfn.Condition.numberGreaterThanJsonPath('$.cluster_throughput.Payload.clusterMbInPerSec', '$.test_specification.depletion_configuration.upper_threshold.mb_per_sec'), checkLowerThanPresent) 141 | .otherwise(waitThroughputExceeded); 142 | 143 | queryClusterThroughputExceeded.next(checkThroughputExceeded); 144 | 145 | 146 | 147 | const submitCreditDepletion = new tasks.BatchSubmitJob(this, 'SubmitCreditDepletion', { 148 | jobName: 'RunCreditDepletion', 149 | jobDefinitionArn: props.jobDefinition.jobDefinitionArn, 150 | jobQueueArn: props.jobQueue.jobQueueArn, 151 | arraySize: sfn.JsonPath.numberAt('$.current_test.parameters.num_jobs'), 152 | containerOverrides: { 153 | command: [...props.commandParameters, '--command', 'deplete-credits' ] 154 | }, 155 | payload: props.payload, 156 | integrationPattern: IntegrationPattern.REQUEST_RESPONSE, 157 | inputPath: '$', 158 | resultPath: '$.depletion_job', 159 | }); 160 | 161 | submitCreditDepletion.next(waitThroughputExceeded); 162 | 163 | 164 | const updateDepletionParameterLambda = new lambda.Function(this, 'UpdateDepletionParameterLambda', { 165 | runtime: lambda.Runtime.PYTHON_3_8, 166 | code: lambda.Code.fromAsset('lambda'), 167 | timeout: Duration.seconds(5), 168 | handler: 'test-parameters.update_parameters_for_depletion', 169 | }); 170 | 171 | const updateDepletionParameter = new tasks.LambdaInvoke(this, 'UpdateDepletionParameter', { 172 | lambdaFunction: updateDepletionParameterLambda, 173 | outputPath: '$.Payload', 174 | }); 175 | 176 | 
updateDepletionParameter.next(submitCreditDepletion); 177 | 178 | const checkDepletion = new sfn.Choice(this, 'CheckDepletionRequired') 179 | .when(sfn.Condition.isNotPresent('$.test_specification.depletion_configuration'), succeed) 180 | .otherwise(updateDepletionParameter); 181 | 182 | this.stateMachine = new sfn.StateMachine(this, 'DepleteCreditsStateMachine', { 183 | definition: checkDepletion, 184 | stateMachineName: `${Aws.STACK_NAME}-credit-depletion`, 185 | }); 186 | } 187 | } 188 | -------------------------------------------------------------------------------- /cdk/lib/vpc.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import { Construct } from 'constructs'; 17 | import { Stack, StackProps } from 'aws-cdk-lib'; 18 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 19 | 20 | export class VpcStack extends Stack { 21 | vpc: ec2.Vpc; 22 | 23 | constructor(scope: Construct, id: string, props?: StackProps) { 24 | super(scope, id, props); 25 | 26 | this.vpc = new ec2.Vpc(this, 'Vpc'); 27 | } 28 | } -------------------------------------------------------------------------------- /cdk/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "tmp", 3 | "version": "0.1.0", 4 | "bin": { 5 | "tmp": "bin/tmp.js" 6 | }, 7 | "scripts": { 8 | "build": "tsc", 9 | "watch": "tsc -w", 10 | "test": "jest", 11 | "cdk": "cdk" 12 | }, 13 | "devDependencies": { 14 | "@types/jest": "^26.0.10", 15 | "@types/node": "10.17.27", 16 | "aws-cdk": "2.15.0", 17 | "jest": "^26.4.2", 18 | "ts-jest": "^26.2.0", 19 | "ts-node": "^9.0.0", 20 | "typescript": "~3.9.7" 21 | }, 22 | "dependencies": { 23 | "@aws-cdk/aws-batch-alpha": "2.15.0-alpha.0", 24 | "@aws-cdk/aws-msk-alpha": "2.15.0-alpha.0", 25 | "aws-cdk-lib": "2.15.0", 26 | "constructs": "^10.0.0", 27 | "source-map-support": "^0.5.16" 28 | } 29 | } 30 | -------------------------------------------------------------------------------- /cdk/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "ES2018", 4 | "module": "commonjs", 5 | "lib": ["es2018"], 6 | "declaration": true, 7 | "strict": true, 8 | "noImplicitAny": true, 9 | "strictNullChecks": true, 10 | "noImplicitThis": true, 11 | "alwaysStrict": true, 12 | "noUnusedLocals": false, 13 | "noUnusedParameters": false, 14 | "noImplicitReturns": true, 15 | "noFallthroughCasesInSwitch": false, 16 | "inlineSourceMap": true, 
17 | "inlineSources": true, 18 | "experimentalDecorators": true, 19 | "strictPropertyInitialization": false, 20 | "typeRoots": ["./node_modules/@types"] 21 | }, 22 | "exclude": ["cdk.out"] 23 | } 24 | -------------------------------------------------------------------------------- /docs/create-env.md: -------------------------------------------------------------------------------- 1 | ## Setup CDK Environment 2 | 3 | To deploy the CDK template, you need to have node, cdk, and docker configured locally. Alternatively, you can use the `amzn2-ami-ecs-hvm-2.0.20210106-x86_64-ebs` AMI with an Amazon EC2 instance and the following steps to configure an environment that satisfies these constraints. 4 | 5 | ```bash 6 | # Install Node Version Manager 7 | curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash 8 | . ~/.nvm/nvm.sh 9 | 10 | # Install Node and CDK 11 | nvm install node 12 | npm -g install typescript 13 | npm install -g aws-cdk 14 | 15 | # Bootstrap CDK envitonment, if you have never executed CDK in this AWS account and region. 16 | cdk bootstrap 17 | 18 | sudo yum install -y git awscli 19 | ``` -------------------------------------------------------------------------------- /docs/framework-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/performance-testing-framework-for-apache-kafka/e38bf5d8392b024351dc3b049646af1019fe2864/docs/framework-overview.png -------------------------------------------------------------------------------- /docs/test-result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/performance-testing-framework-for-apache-kafka/e38bf5d8392b024351dc3b049646af1019fe2864/docs/test-result.png -------------------------------------------------------------------------------- /notebooks/00000000--empty-template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import boto3\n", 10 | "import datetime\n", 11 | "import json\n", 12 | "from dateutil.tz import tzlocal\n", 13 | "\n", 14 | "import get_test_details\n", 15 | "import query_experiment_details\n", 16 | "import aggregate_statistics\n", 17 | "import plot\n", 18 | "\n", 19 | "stepfunctions = boto3.client('stepfunctions')\n", 20 | "cloudwatch_logs = boto3.client('logs')\n", 21 | "cloudformation = boto3.client('cloudformation')\n", 22 | "cloudwatch = boto3.client('cloudwatch')" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Sample test specification. 
Submit is as the input of the StepFunction workflow that ends with `-main` to start a test.\n", 30 | "```\n", 31 | "{\n", 32 | " \"test_specification\": {\n", 33 | " \"parameters\": {\n", 34 | " \"cluster_throughput_mb_per_sec\": [ 8, 16, 24, 32, 40, 44, 48, 52, 56 ],\n", 35 | " \"num_producers\": [ 6 ],\n", 36 | " \"consumer_groups\": [ { \"num_groups\": 2, \"size\": 6 }\n", 37 | " ],\n", 38 | " \"client_props\": [\n", 39 | " {\n", 40 | " \"producer\": \"acks=all linger.ms=5 batch.size=65536 buffer.memory=2147483648 security.protocol=PLAINTEXT\",\n", 41 | " \"consumer\": \"security.protocol=PLAINTEXT\"\n", 42 | " }\n", 43 | " ],\n", 44 | " \"num_partitions\": [ 36 ],\n", 45 | " \"record_size_byte\": [ 1024 ],\n", 46 | " \"replication_factor\": [ 3 ],\n", 47 | " \"duration_sec\": [ 3600 ]\n", 48 | " },\n", 49 | " \"skip_remaining_throughput\": {\n", 50 | " \"less-than\": [\n", 51 | " \"sent_div_requested_mb_per_sec\",\n", 52 | " 0.995\n", 53 | " ]\n", 54 | " },\n", 55 | " \"depletion_configuration\": {\n", 56 | " \"upper_threshold\": {\n", 57 | " \"mb_per_sec\": 200\n", 58 | " },\n", 59 | " \"approximate_timeout_hours\": 0.5\n", 60 | " }\n", 61 | " }\n", 62 | "}\n", 63 | "```" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "tags": [] 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "test_params = []" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "tags": [] 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "test_params.extend([\n", 86 | " {'execution_arn': '' }\n", 87 | "])" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "test_details = get_test_details.get_test_details(test_params, stepfunctions, cloudformation)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "(producer_stats, consumer_stats) = query_experiment_details.query_cw_logs(test_details, cloudwatch_logs)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "partitions = {\n", 115 | " 'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'throughput_series_id', 'brokers_type_numeric', ],\n", 116 | " 'title_keys': [ 'kafka_version', 'broker_storage', 'provisioned_throughput', 'in_cluster_encryption', 'producer.security.protocol', ],\n", 117 | " 'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],\n", 118 | " 'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],\n", 119 | " 'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],\n", 120 | "}\n", 121 | "\n", 122 | "filter_fn = lambda x: True\n", 123 | "filter_agg_fn = lambda x: True\n", 124 | "\n", 125 | "filtered_producer_stats = list(filter(filter_fn, producer_stats))\n", 126 | "filtered_consumer_stats = filter(filter_fn, consumer_stats)\n", 127 | "\n", 128 | "(producer_aggregated_stats, consumer_aggregated_stats, combined_stats) = aggregate_statistics.aggregate_cw_logs(filtered_producer_stats, filtered_consumer_stats, partitions)\n", 129 | "filtered_producer_aggregated_stats = list(filter(filter_agg_fn, producer_aggregated_stats))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | 
"plot.plot_measurements(filtered_producer_aggregated_stats, ['latency_ms_p50_mean', 'latency_ms_p99_mean', ], 'producer put latency (ms)', **partitions, )\n", 139 | "plot.plot_measurements(filtered_producer_aggregated_stats, ['latency_ms_p50_stdev', 'latency_ms_p99_stdev', ], 'producer put latency stdev (ms)', **partitions, )" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "plot.plot_measurements(producer_aggregated_stats, ['sent_div_requested_mb_per_sec'], 'sent / requested throughput (mb/sec)', **partitions, ymin=0.990, ymax=1.01)\n", 149 | "plot.plot_measurements(producer_aggregated_stats, ['actual_duration_div_requested_duration_sec_max'], 'actual test duration/requested test duration (ratio)', **partitions)\n", 150 | "plot.plot_measurements(producer_aggregated_stats, ['num_tests'], 'number of tests (count)', **partitions)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3 (ipykernel)", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.10.2" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 4 182 | } 183 | -------------------------------------------------------------------------------- /notebooks/00000000--install-dependencies.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import sys\n", 10 | "!{sys.executable} -m pip install --upgrade pip\n", 11 | "!{sys.executable} -m pip install boto3\n", 12 | "!{sys.executable} -m pip install numpy\n", 13 | "!{sys.executable} -m pip install more-itertools\n", 14 | "!{sys.executable} -m pip install matplotlib\n", 15 | "!{sys.executable} -m pip install jupyterlab-git\n", 16 | "!{sys.executable} -m pip install nbdime\n", 17 | "!{sys.executable} -m pip install flatten-dict" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [] 26 | } 27 | ], 28 | "metadata": { 29 | "kernelspec": { 30 | "display_name": "conda_amazonei_mxnet_p36", 31 | "language": "python", 32 | "name": "conda_amazonei_mxnet_p36" 33 | }, 34 | "language_info": { 35 | "codemirror_mode": { 36 | "name": "ipython", 37 | "version": 3 38 | }, 39 | "file_extension": ".py", 40 | "mimetype": "text/x-python", 41 | "name": "python", 42 | "nbconvert_exporter": "python", 43 | "pygments_lexer": "ipython3", 44 | "version": "3.6.13" 45 | } 46 | }, 47 | "nbformat": 4, 48 | "nbformat_minor": 4 49 | } 50 | -------------------------------------------------------------------------------- /notebooks/aggregate_statistics.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import re 17 | import pprint 18 | import itertools 19 | import numpy as np 20 | from more_itertools import ilen 21 | from statistics import mean, median, variance, stdev 22 | from dateutil.tz import tzlocal 23 | 24 | 25 | producer_aggregation_fns = [ 26 | # ('sent_records', sum), 27 | # ('sent_records_sec', min), 28 | # ('sent_records_sec', sum), 29 | ('sent_mb_sec', min), 30 | # ('sent_mb_sec', sum), 31 | ('latency_ms_avg', mean), 32 | ('latency_ms_avg', stdev), 33 | ('latency_ms_p50', mean), 34 | ('latency_ms_p50', stdev), 35 | ('latency_ms_p95', mean), 36 | ('latency_ms_p95', stdev), 37 | ('latency_ms_p99', mean), 38 | ('latency_ms_p99', stdev), 39 | ('latency_ms_p999', mean), 40 | ('latency_ms_p999', stdev), 41 | ('latency_ms_max', max), 42 | ('latency_ms_max', stdev), 43 | ('actual_duration_div_requested_duration_sec', max), 44 | ('start_ts', min), 45 | ('end_ts', max) 46 | ] 47 | 48 | consumer_aggregation_fns = [ 49 | ('consumed_mb', sum), 50 | ('consumed_mb_sec', sum), 51 | ('consumed_records', sum), 52 | ('records_lag_max', max), 53 | ('actual_duration_div_requested_duration_sec', max), 54 | ('start_ts', min), 55 | ('end_ts', max) 56 | ] 57 | 58 | def broker_type_to_num(broker): 59 | if "m5large" in broker: 60 | return 0 61 | elif "m5xlarge" in broker: 62 | return 1 63 | else: 64 | return int(re.search('m5(\d+)xlarge', broker).group(1)) 65 | 66 | # aggregate intividual producer/consumer results into one for identical test parameters 67 | 68 | def aggregate_cw_logs(producer_stats, consumer_stats, partitons): 69 | producer_aggregated_stats = [] 70 | consumer_aggregated_stats = [] 71 | 72 | partiton_by_fn = lambda x: tuple([v for k,v in x['test_params'].items() if k not in partitons['ignore_keys']]) 73 | 74 | for params,group in itertools.groupby(sorted(producer_stats, key=partiton_by_fn), key=partiton_by_fn): 75 | # convert iterator to list, as we need to iterate over it several times 76 | lst = list(group) 77 | 78 | # aggregate individual test results into a single result 79 | agg_test_results = { 80 | "{}_{}".format(attr, aggregation_fn.__name__): aggregation_fn(map(lambda x: x['test_results'][attr], lst)) 81 | for (attr,aggregation_fn) in producer_aggregation_fns 82 | } 83 | 84 | # all test params within a group are identical, so just pick the first one 85 | test_params = lst[0]['test_params'] 86 | 87 | producer_aggregated_stats.append({ 88 | 'test_params': { 89 | **test_params, 90 | 'brokers' : "{} {}".format(test_params['num_brokers'], test_params['broker_type']), 91 | 'brokers_type_numeric' : broker_type_to_num(test_params['broker_type']) 92 | }, 
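            # (editor's note, added for clarity) derived metrics below: 'num_tests' normalises the
            # number of raw log entries by the number of producers per test, and
            # 'sent_div_requested_mb_per_sec' compares the slowest producer's actual throughput
            # against its share of the requested cluster throughput; a value close to 1.0 means the
            # requested throughput was actually reached.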
93 | 'test_results': { 94 | **agg_test_results, 95 | 'num_tests': len(lst) / test_params['num_producers'], 96 | 'sent_div_requested_mb_per_sec' : agg_test_results['sent_mb_sec_min'] / (test_params['cluster_throughput_mb_per_sec']/test_params['num_producers']) 97 | } 98 | }) 99 | 100 | 101 | for params,group in itertools.groupby(sorted(consumer_stats, key=partiton_by_fn), key=partiton_by_fn): 102 | # convert iterator to list, as we need to iterate over it several times 103 | lst = list(group) 104 | 105 | # aggregate individual test results into a single result 106 | agg_test_results = { 107 | "{}_{}".format(attr, aggregation_fn.__name__): aggregation_fn(map(lambda x: x['test_results'][attr], lst)) 108 | for (attr,aggregation_fn) in consumer_aggregation_fns 109 | } 110 | 111 | # all test params within a group are identical, so just pick the first one 112 | test_params = lst[0]['test_params'] 113 | 114 | consumer_aggregated_stats.append({ 115 | 'test_params': { 116 | **test_params, 117 | 'brokers' : "{} {}".format(test_params['num_brokers'], test_params['broker_type']), 118 | }, 119 | 'test_results': { 120 | **agg_test_results, 121 | 'num_tests': len(lst) / (test_params['consumer_groups.num_groups'] * test_params['consumer_groups.size']), 122 | } 123 | }) 124 | 125 | 126 | combined_stats = [] 127 | 128 | return (producer_aggregated_stats, consumer_aggregated_stats, combined_stats) 129 | -------------------------------------------------------------------------------- /notebooks/get_test_details.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
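# (editor's note, added for clarity) get_test_details() resolves each Step Functions execution ARN
# to the CloudFormation stack that ran it, reads the test log group name from the stack outputs,
# and parses the cluster properties (broker count and type, storage, in-cluster encryption, Kafka
# version, optional provisioned throughput) from the stack name. For example, a hypothetical stack
# name like "msk-performance--3-m5large-1000-f-282" would be interpreted as 3 brokers of type
# m5large with 1000 GiB storage each, in-cluster encryption disabled, and Kafka version "282"
# (i.e. 2.8.2 with punctuation stripped).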
15 | 16 | import re 17 | import json 18 | import datetime 19 | import collections 20 | 21 | def get_test_details(test_params, stepfunctions, cloudformation): 22 | sfn_status = [] 23 | test_details = [] 24 | 25 | for test_param in test_params: 26 | if 'execution_arn' not in test_param: 27 | # if the execution_arn is not included, we cannot query them and just assume that the details are complete and correct 28 | test_details.append(test_param) 29 | continue 30 | 31 | execution_arn = test_param['execution_arn'].strip() 32 | execution_details = stepfunctions.describe_execution(executionArn=execution_arn) 33 | 34 | if 'stopDate' in execution_details: 35 | stop_date = execution_details['stopDate'] 36 | else: 37 | # if the workflow is still running, just query everything up until now 38 | stop_date = datetime.datetime.now() 39 | 40 | # the name of the state machine includes the CFN stack name 41 | stack_name = test_param['execution_arn'].split(':')[-2].split('-main')[0] 42 | 43 | # obtain the name of the log group for containing all test results from cloud watch 44 | stack_details = cloudformation.describe_stacks(StackName=stack_name) 45 | log_group_name = next(filter(lambda x: x['OutputKey']=='LogGroupName', stack_details['Stacks'][0]['Outputs']))['OutputValue'] 46 | 47 | # extract the cluster parameters from the stack name 48 | cluster_properties_p = re.compile(r'(\S+)--(\d+)-([a-zA-Z0-9]+)-(\d+)-([tf])-(\d+)(?:-(\d+))?.*') 49 | cluster_properties_m = cluster_properties_p.search(stack_name) 50 | 51 | cluster_properties = { 52 | 'cluster_name': stack_name if cluster_properties_m else "n/a", 53 | 'cluster_id': cluster_properties_m.group(1) if cluster_properties_m else "n/a", 54 | 'num_brokers': int(cluster_properties_m.group(2)) if cluster_properties_m else "n/a", 55 | 'broker_type': cluster_properties_m.group(3) if cluster_properties_m else "n/a", 56 | 'broker_storage': int(cluster_properties_m.group(4)) if cluster_properties_m else "n/a", 57 | 'in_cluster_encryption': cluster_properties_m.group(5) if cluster_properties_m else "n/a", 58 | 'kafka_version': cluster_properties_m.group(6) if cluster_properties_m else "n/a", 59 | 'provisioned_throughput': int(cluster_properties_m.group(7)) if cluster_properties_m and cluster_properties_m.group(7) is not None else 0, 60 | } 61 | 62 | status = { 63 | 'log_group_name': log_group_name, 64 | 'start_date': execution_details['startDate'], 65 | 'stop_date': stop_date, 66 | 'test_parameters': json.loads(execution_details['input'])['test_specification']['parameters'], 67 | 'cluster_properties': cluster_properties 68 | } 69 | 70 | # if the test is still running, add the execution arn, so that the details can be queried again 71 | if execution_details['status'] == 'RUNNING': 72 | status['execution_arn'] = test_param['execution_arn'] 73 | 74 | test_details.append(status) 75 | sfn_status.append(execution_details['status']) 76 | 77 | if len(sfn_status) > 0: 78 | print("sfn workflow status: ", dict(collections.Counter(sfn_status))) 79 | print() 80 | for test_detail in test_details: 81 | print("test_params.extend([", test_detail, "])") 82 | print() 83 | 84 | return test_details -------------------------------------------------------------------------------- /notebooks/plot.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import re 17 | import pprint 18 | import itertools 19 | import numpy as np 20 | import matplotlib.pyplot as plt 21 | import matplotlib.colors as mcolors 22 | import warnings 23 | 24 | key_by_fn = lambda key_by: lambda stat: tuple(stat['test_params'][key] for key in key_by) 25 | map_to_throughput_fn = lambda x: x['test_params']['cluster_throughput_mb_per_sec'] 26 | 27 | 28 | colors = list(mcolors.TABLEAU_COLORS.values()) 29 | markers = ['o', 'x', '.', '.'] 30 | # linestyles = ['-', '--', '-.', ':'] 31 | 32 | 33 | def plot_measurements(data, metric_line_style_names, ylabel, 34 | xlogscale=False, ylogscale=False, ymin=0, ymax=None, xmin=0, xmax=None, exclude=[], indicate_max=False, 35 | ignore_keys = [ 'topic_id', 'cluster_id', 'test_id', 'cluster_name', 'num_producers', ], 36 | title_keys = [ 'brokers', 'producers', 'at_rest', 'in_transit', 'duration_sec' ], 37 | row_keys = [ 'replication_factor', 'num_partitions' ], 38 | column_keys = [ 'num_consumer_groups', 'consumer_props', 'num_producers', 'producer_props' ], 39 | metric_color_keys = ['message_size_byte'], 40 | legend_replace_str = r"(buffer.memory=\d*|_mean)", 41 | linestyles = ['-', '--', ':', '-.']): 42 | 43 | rows_key_by = key_by_fn(row_keys) 44 | columns_key_by = key_by_fn(column_keys) 45 | metric_color_key_by = key_by_fn(metric_color_keys) 46 | 47 | num_rows = len(set(map(rows_key_by, data))) 48 | num_columns = len(set(map(columns_key_by, data))) 49 | 50 | fig, ax = plt.subplots(num_rows, num_columns, figsize=(num_columns*10, num_rows*7), sharex='all', sharey='all', squeeze=False) 51 | fig.subplots_adjust(hspace=.4) 52 | 53 | row_index = 0 54 | for row_key,row_group in itertools.groupby(sorted(data, key=rows_key_by), key=rows_key_by): 55 | column_index = 0 56 | for column_key,column_group in itertools.groupby(sorted(row_group, key=columns_key_by), key=columns_key_by): 57 | max_throughput_values = set() 58 | metric_color_index = 0 59 | for metric_color_key,metric_color_group in itertools.groupby(sorted(column_group, key=metric_color_key_by), key=metric_color_key_by): 60 | metric_line_style_index = 0 61 | 62 | subplot = ax[row_index][column_index] 63 | 64 | results = sorted(metric_color_group, key=map_to_throughput_fn) 65 | filtered_results = [result for result in results if result['test_params'] not in exclude] 66 | 67 | test_params = next(map(lambda x: x['test_params'], filtered_results), None) 68 | 69 | if not test_params: 70 | continue 71 | 72 | for metric_name in metric_line_style_names: 73 | data_x = np.array(list(map(map_to_throughput_fn, filtered_results))) 74 | data_y = 
75 | 
76 |                     if len(np.unique(data_x)) != len(data_x):
77 |                         warnings.warn("Datapoints with identical throughput detected: check the partitions parameter")
78 | 
79 |                     label = str(dict((k,v) for k,v in zip(metric_color_keys, metric_color_key) if k not in ignore_keys))
80 |                     simplified_label = re.sub(legend_replace_str, "", "{} {}".format(label, metric_name))
81 | 
82 | 
83 |                     subplot.plot(data_x, data_y, label=simplified_label, marker=markers[metric_line_style_index], linestyle=linestyles[metric_line_style_index], color=colors[metric_color_index % len(colors)])
84 | 
85 |                     max_x = max(data_x)
86 |                     metric_line_style_index += 1
87 | 
88 |                 subplot.grid(True)
89 |                 subplot.tick_params(labelleft=True, labelbottom=True)
90 |                 # subplot.ticklabel_format(useOffset=False)
91 |                 subplot.set_xlabel("requested aggregated producer throughput (mb/sec)")
92 |                 subplot.set_ylabel(ylabel)
93 |                 subplot.legend()
94 | 
95 | 
96 |                 title = "{}\n{}\n{}\n".format(
97 |                     str(dict((k, test_params[k]) for k in title_keys if k in test_params and k not in ignore_keys)),
98 |                     str(dict((k,v) for k,v in zip(row_keys, row_key) if k not in ignore_keys)),
99 |                     str(dict((k,v) for k,v in zip(column_keys, column_key) if k not in ignore_keys))
100 |                 )
101 |                 subplot.set_title(re.sub(legend_replace_str, "_", title))
102 | 
103 |                 if xmax or xmin:
104 |                     subplot.set_xlim(left=xmin, right=xmax)
105 | 
106 |                 if ymax or ymin:
107 |                     subplot.set_ylim(bottom=ymin, top=ymax)
108 | 
109 |                 if xlogscale:
110 |                     subplot.set_xscale('log')
111 | 
112 |                 if ylogscale:
113 |                     subplot.set_yscale('log')
114 | 
115 |                 if indicate_max:
116 |                     max_throughput_values.add(max_x)
117 |                     subplot.axvline(x=max_x, linestyle='-.', color=colors[metric_color_index% len(colors)], alpha=.5)
118 | 
119 |                 metric_color_index += 1
120 | 
121 |             if num_rows<=1 or num_columns<=1:
122 |                 subplot.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1))
123 | 
124 |             column_index += 1
125 |         row_index += 1
126 | 
127 |     if indicate_max and len(ax)>0:
128 |         subplot = ax[0][0]
129 |         ticks = list(subplot.get_xticks())
130 | 
131 |         for max_throughput_value in max_throughput_values:
132 |             closest_tick = min(ticks, key=lambda x:abs(x-max_throughput_value))
133 |             ticks = [max_throughput_value if x==closest_tick else x for x in ticks]
134 | 
135 |         subplot.set_xticks(ticks)
136 | 
137 |         for plots in ax:
138 |             for plot in plots:
139 |                 plot.axvline(x=indicate_max, linestyle=':', color=mcolors.CSS4_COLORS['dimgray'])
--------------------------------------------------------------------------------
/notebooks/query_experiment_details.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | # this software and associated documentation files (the "Software"), to deal in
5 | # the Software without restriction, including without limitation the rights to
6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | # the Software, and to permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
16 | import re
17 | import sys
18 | import time
19 | import itertools
20 | import collections
21 | import dateutil.parser
22 | from flatten_dict import flatten
23 | from collections import defaultdict
24 | 
25 | 
26 | cwlogs_query_limit = 4
27 | 
28 | def obtain_raw_logs(test_details, cloudwatch_logs):
29 |     # query cw logs to get the raw logs during the period the step functions workflow was running
30 |     # cw logs limits the number of concurrent queries, so the number of running queries always needs to stay at or below cwlogs_query_limit
31 | 
32 |     # filter queries that are currently running (query metadata is added to the test details as cwlogs_query once a query is submitted)
33 |     incomplete_query_results_iter = lambda: filter(lambda x: 'cwlogs_query' in x and ('cwlogs_query_result' not in x or x['cwlogs_query_result']['status'] == 'Running'), test_details)
34 | 
35 |     # print a summary of running, completed, and failed queries
36 |     output_statistics = lambda: print("query status:", dict(collections.Counter(map(lambda x: x['cwlogs_query_result']['status'] if 'cwlogs_query_result' in x else 'Pending', test_details))))
37 | 
38 |     # populate the queue with all queries that need to be executed
39 |     queue = test_details.copy()
40 |     cwlogs_query_capacity = cwlogs_query_limit
41 | 
42 |     output_statistics()
43 | 
44 |     # while there are queries to run and we can still run queries
45 |     while len(queue)>0 or cwlogs_query_capacity
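
For readers of the notebook code, the following is a minimal, standalone sketch of the pattern that `obtain_raw_logs` builds on: submitting CloudWatch Logs Insights queries through boto3 while keeping at most a fixed number of them in flight and polling until all results are available. This is not the framework's implementation; the function name, parameter names, and polling interval are hypothetical, and only the `start_query` and `get_query_results` calls are taken from the documented boto3 CloudWatch Logs API.

```python
# Hypothetical sketch: run CloudWatch Logs Insights queries with at most
# `max_concurrent` of them in flight, collecting results as queries finish.
import time
import boto3


def run_queries_with_limit(log_group_name, query_strings, start_time, end_time, max_concurrent=4):
    logs = boto3.client('logs')
    pending = list(query_strings)   # query strings that still need to be submitted
    in_flight = {}                  # query_id -> query string
    results = {}                    # query string -> result rows

    while pending or in_flight:
        # submit further queries while there is spare capacity
        while pending and len(in_flight) < max_concurrent:
            query_string = pending.pop(0)
            response = logs.start_query(
                logGroupName=log_group_name,
                startTime=int(start_time.timestamp()),
                endTime=int(end_time.timestamp()),
                queryString=query_string,
            )
            in_flight[response['queryId']] = query_string

        # poll in-flight queries and collect the ones that have finished
        for query_id, query_string in list(in_flight.items()):
            response = logs.get_query_results(queryId=query_id)
            if response['status'] not in ('Scheduled', 'Running'):
                results[query_string] = response['results']
                del in_flight[query_id]

        if in_flight:
            time.sleep(5)

    return results
```

In this sketch, `start_time` and `end_time` are expected to be datetime objects, mirroring the `start_date`/`stop_date` values that `get_test_details` collects from the Step Functions execution.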