├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── cdk
│   ├── .gitignore
│   ├── bin
│   │   └── cdk.ts
│   ├── cdk.json
│   ├── docker
│   │   ├── Dockerfile
│   │   ├── client-iam.properties
│   │   ├── client-tls.properties
│   │   └── run-kafka-command.sh
│   ├── jest.config.js
│   ├── lambda
│   │   ├── manage-infrastructure.js
│   │   ├── query-test-output.js
│   │   └── test-parameters.py
│   ├── lib
│   │   ├── batch-ami.ts
│   │   ├── cdk-stack.ts
│   │   ├── credit-depletion-sfn.ts
│   │   └── vpc.ts
│   ├── package-lock.json
│   ├── package.json
│   └── tsconfig.json
├── docs
│   ├── create-env.md
│   ├── framework-overview.png
│   └── test-result.png
└── notebooks
    ├── 00000000--empty-template.ipynb
    ├── 00000000--install-dependencies.ipynb
    ├── aggregate_statistics.py
    ├── get_test_details.py
    ├── plot.py
    └── query_experiment_details.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | cdk/cdk.context.json
4 | .DS_Store
5 | .vscode
6 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Performance Testing Framework for Apache Kafka
2 | 
3 | 
4 | *******
5 | WARNING: This code is NOT intended for production clusters. It will delete topics on your clusters. Only use it on dedicated test clusters.
6 | *******
7 | 
8 | The tool is designed to evaluate the maximum throughput of a cluster and compare the put latency of different broker, producer, and consumer configurations. To run a test, you specify the different parameters that should be tested, and the tool iterates through all combinations of the parameters, producing a graph similar to the one below.
9 | 
10 | ![Put latencies for different broker sizes](docs/test-result.png)
11 | 
12 | ## Configure Local Environment
13 | 
14 | The framework is based on the `kafka-producer-perf-test.sh` and `kafka-consumer-perf-test.sh` tools that are part of the Apache Kafka distribution. It builds automation and visualization around these tools, leveraging AWS services such as AWS Step Functions for scheduling and AWS Batch for running individual tests.
15 | 
16 | ![Framework overview](docs/framework-overview.png)
17 | 
18 | You just need to execute the AWS CDK template in the `cdk` folder of this repository to create the required resources to run a performance test. The [create environment](docs/create-env.md) section contains details on how to install the dependencies required to successfully deploy the template. It also explains how to configure an EC2 instance that matches all prerequisites to execute the template successfully.
19 | 
20 | Once your environment is configured, you need to adapt the `defaults` variable in `bin/cdk.ts` so that it matches your account details.
21 | 
22 | ```node
23 | const defaults : { 'env': Environment } = {
24 |   env: {
25 |     account: '123456789012',
26 |     region: 'eu-west-1'
27 |   }
28 | };
29 | ```
30 | 
31 | Finally, you need to install all dependencies of the CDK template by running the following command in the `cdk` folder.
32 | 
33 | ```bash
34 | # Install CDK template dependencies
35 | npm install
36 | ```
37 | 
38 | ## Choose MSK Cluster Configuration
39 | 
40 | The CDK template can either create a new MSK cluster for the performance tests, or you can specify a pre-created cluster. The framework creates and deletes topics on the cluster to run the performance tests, so it's highly recommended to use a dedicated cluster to avoid any data loss.
41 | 
42 | To create a new cluster, you need to specify the respective cluster configuration.
43 | 
44 | ```node
45 | new CdkStack(app, 'stack-prefix--', {
46 |   ...defaults,
47 |   vpc: vpc,
48 |   clusterProps: {
49 |     numberOfBrokerNodes: 1,
50 |     instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.LARGE),
51 |     ebsStorageInfo: {
52 |       volumeSize: 5334
53 |     },
54 |     encryptionInTransit: {
55 |       enableInCluster: false,
56 |       clientBroker: ClientBrokerEncryption.PLAINTEXT
57 |     },
58 |     kafkaVersion: KafkaVersion.V2_8_0,
59 |   }
60 | });
61 | ```
62 | 
63 | Alternatively, you can specify an existing cluster.
64 | 
65 | ```node
66 | new CdkStack(app, 'stack-prefix--', {
67 |   ...defaults,
68 |   vpc: vpc,
69 |   clusterName: '...',
70 |   bootstrapBrokerString: '...',
71 |   sg: '...',
72 | });
73 | ```
74 | 
75 | You can then deploy the CDK template to have it create the respective resources in your account.
76 | 
77 | ```bash
78 | # List the available stacks
79 | cdk list
80 | 
81 | # Deploy a specific stack
82 | cdk deploy stack-name
83 | ```
84 | 
85 | Note that the created resources will start to incur cost, even if no test is executed. `cdk destroy` will remove most resources again, but it will retain the CloudWatch log streams that contain the experiment output, so the raw output of the performance tests is preserved for later visualization. These retained resources will continue to incur cost unless they are manually deleted.
86 | 
87 | 
88 | ## Test Input Configuration
89 | 
90 | Once the infrastructure has been created, you can start a series of performance tests by executing the Step Functions workflow that has been created.
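You can trigger the workflow by pasting a test specification into the Step Functions console as the execution input, or start the execution programmatically. The snippet below is a minimal sketch using boto3; the state machine ARN and the `test-specification.json` file name are placeholders for illustration and are not names created by the framework.

```python
import json

import boto3

# Placeholder ARN; copy the actual value of the deployed state machine from the stack output.
STATE_MACHINE_ARN = 'arn:aws:states:eu-west-1:123456789012:stateMachine:...'

# The execution input is a test specification as described in the remainder of this section.
with open('test-specification.json') as f:
    test_specification = json.load(f)

sfn = boto3.client('stepfunctions')

response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps(test_specification),
)

# Keep the execution ARN; the visualization notebooks described below need it.
print(response['executionArn'])
```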
91 | The following simplified test specification determines the parameters of the individual tests.
92 | ```json
93 | {
94 |   "test_specification": {
95 |     "parameters": {
96 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
97 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
98 |       "num_producers": [ 6 ],
99 |       "client_props": [
100 |         {
101 |           "producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
102 |           "consumer": "security.protocol=PLAINTEXT"
103 |         }
104 |       ],
105 |       "num_partitions": [ 72 ],
106 |       "record_size_byte": [ 1024 ],
107 |       "replication_factor": [ 3 ],
108 |       "duration_sec": [ 3600 ]
109 |     },
110 |     ...
111 |   }
112 | }
113 | ```
114 | 
115 | With this specification, a total of 54 tests will be carried out (18 throughput values times 3 consumer group configurations). All individual tests use `6` producers, the same producer and consumer properties, `72` partitions, a record size of `1024` bytes, and a replication factor of `3`, and they run for 1 hour (`3600` seconds). The tests only differ in the aggregate cluster throughput (between `16` and `88` MB/sec) and the number of consumer groups (`0`, `1`, or `2`, with `6` consumers per group).
116 | 
117 | You can add additional values to anything that is a list, and the framework will automatically iterate through the given options.
118 | 
119 | The framework iterates through different throughput configurations and observes how the put latency changes. This means that it first iterates through the throughput options before it changes any other option.
120 | 
121 | Given the above specification, the test will gradually increase the throughput from `16` to `88` MB/sec with no consumer groups reading from the cluster, then it will increase the throughput from `16` to `88` MB/sec with one consumer group, and finally with two consumer groups.
122 | 
123 | 
124 | ### Stop Conditions
125 | 
126 | As you may not know the maximum throughput of a cluster, you can specify a stop condition that determines when the framework should reset the cluster throughput to the lowest value and choose the next test option (or stop if there is none).
127 | 
128 | ```json
129 | {
130 |   "test_specification": {
131 |     "parameters": {
132 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
133 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
134 |       ...
135 |     },
136 |     "skip_remaining_throughput": {
137 |       "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
138 |     },
139 |     ...
140 |   }
141 | }
142 | ```
143 | 
144 | With the above condition, the framework will skip the remaining throughput values if the throughput that was actually sent into the cluster by the producers falls more than 0.5% short of what the test requested (more precisely, if the ratio of actual sent throughput to requested throughput drops below `0.995`).
145 | 
146 | The first test that is carried out will have `0` consumer groups and will be sending `16` MB/sec into the cluster. The next test again has `0` consumer groups and sends `24` MB/sec, and so on. This continues until the cluster becomes saturated. For example, the cluster may only be able to absorb `81` MB/sec instead of the `82` MB/sec that should be ingested. The skip condition then matches, as the difference is more than 0.5%, so the next test will use `1` consumer group and will again be sending `16` MB/sec into the cluster.
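To make the iteration order and the skip condition more concrete, the following sketch emulates the scheduling behavior in plain Python. This is only an illustration of the logic described above, not code that the framework runs; the assumed saturation point of `81` MB/sec is the hypothetical value from the example.

```python
throughputs = [16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88]
consumer_group_options = [0, 1, 2]  # number of consumer groups, 6 consumers each


def run_test(requested_mb_per_sec, num_groups):
    # Placeholder for a single performance test (an AWS Batch array job in the framework).
    # Pretend the cluster saturates at roughly 81 MB/sec, as in the example above.
    return min(requested_mb_per_sec, 81.0)


# The cluster throughput is the innermost dimension: it is exhausted (or skipped)
# before the next consumer group configuration is selected.
for num_groups in consumer_group_options:
    for requested in throughputs:
        sent = run_test(requested, num_groups)

        # "skip_remaining_throughput": { "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ] }
        if sent / requested < 0.995:
            print(f'saturated at {requested} MB/sec with {num_groups} consumer group(s), '
                  'skipping the remaining throughput values')
            break
```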
147 | 
148 | ### Depleting credits before running performance tests
149 | 
150 | Amazon EC2 networking, Amazon EBS, and Amazon EBS networking all leverage a credit system that allows performance to burst above a given baseline for a certain amount of time. This means that the cluster is able to ingest substantially more throughput than its baseline for a limited period.
151 | 
152 | To obtain stable results for the baseline performance of a cluster, the test framework can deplete these credits before a test is carried out. To this end, it generates traffic that exceeds the peak performance of the cluster and waits until the measured throughput drops below configurable thresholds.
153 | 
154 | ```json
155 | {
156 |   "test_specification": {
157 |     "parameters": {
158 |       "cluster_throughput_mb_per_sec": [ 16, 24, 32, 40, 48, 56, 60, 64, 68, 72, 74, 76, 78, 80, 82, 84, 86, 88 ],
159 |       "consumer_groups" : [ { "num_groups": 0, "size": 6 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
160 |       "replication_factor": [ 3 ],
161 |       ...
162 |     },
163 |     "depletion_configuration": {
164 |       "upper_threshold": {
165 |         "mb_per_sec": 74
166 |       },
167 |       "lower_threshold": {
168 |         "mb_per_sec": 58
169 |       },
170 |       "approximate_timeout_hours": 1
171 |     }
172 |     ...
173 |   }
174 | }
175 | ```
176 | 
177 | With this configuration, the framework will generate a load of 250 MB/sec for credit depletion. To complete the credit depletion, the measured cluster throughput first needs to exceed `74` MB/sec (the upper threshold). Then, the throughput must drop below `58` MB/sec (the lower threshold). Only then will the actual performance test be started.
178 | 
179 | ## Visualize results
180 | 
181 | You can start visualizing the test results through Jupyter notebooks once the Step Functions workflow is running. The CloudFormation stack will output a URL to a notebook that is managed through Amazon SageMaker. Just follow the link to gain access to a notebook that has been preconfigured appropriately.
182 | 
183 | The first test needs to be completed before you can see any output, which can take a couple of hours when credit depletion is enabled.
184 | 
185 | To produce a visualization, start with the [`00000000--empty-template.ipynb`](notebooks/00000000--empty-template.ipynb) notebook. You just need to add the execution ARN of the Step Functions workflow and execute all cells of the notebook.
186 | 
187 | ```python
188 | test_params.extend([
189 |     {'execution_arn': '' }
190 | ])
191 | ```
192 | 
193 | The logic in the notebook will then retrieve the test results from CloudWatch Logs, apply grouping and aggregations, and produce a graph as output.
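If you want to inspect the raw producer summaries outside of the provided notebooks, the sketch below shows roughly how they can be retrieved with boto3. The filter pattern and the MB/sec parsing mirror what `cdk/lambda/query-test-output.js` does; the log group name is a placeholder that you need to replace with the performance test log group created by the CDK stack.

```python
import json
import re

import boto3

LOG_GROUP_NAME = '...'  # placeholder for the performance test log group

logs = boto3.client('logs')
paginator = logs.get_paginator('filter_log_events')

# Producer tasks emit a JSON summary line at the end of each test.
pages = paginator.paginate(
    logGroupName=LOG_GROUP_NAME,
    filterPattern='{$.type="producer" && $.test_summary=* && $.error_count=*}',
)

for page in pages:
    for event in page['events']:
        result = json.loads(event['message'])
        # The summary contains the standard kafka-producer-perf-test.sh output,
        # including the throughput the producer actually achieved.
        match = re.search(r'([0-9.]+) MB/sec', result['test_summary'])
        if match:
            print(event['logStreamName'], float(match.group(1)), 'MB/sec')
```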
194 | 
195 | You can adapt the grouping of different test runs by changing the `partitions` variable. By adapting `row_keys` and `column_keys` you can create a grid of diagrams, and all measurements within the same sub-diagram will have identical test settings for the specified keys.
196 | 
197 | ```python
198 | partitions = {
199 |     'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'provisioned_throughput', 'throughput_series_id', 'brokers_type_numeric', ],
200 |     'title_keys': [ 'kafka_version', 'broker_storage', 'in_cluster_encryption', 'producer.security.protocol', ],
201 |     'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],
202 |     'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],
203 |     'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],
204 | }
205 | ```
206 | 
207 | In this case, all diagrams in the same row will have the same number of producers and consumer groups. Within a column, all diagrams will have identical producer settings and the same number of partitions. And within a diagram, tests that ran against different broker types will have different colors.
208 | 
209 | If you experience errors indicating that libraries such as boto3 cannot be found, execute the cells in the [`00000000--install-dependencies`](notebooks/00000000--install-dependencies.ipynb) notebook.
210 | 
211 | 
212 | ## Security
213 | 
214 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
215 | 
216 | ## License
217 | 
218 | This library is licensed under the MIT-0 License. See the LICENSE file.
--------------------------------------------------------------------------------
/cdk/.gitignore:
--------------------------------------------------------------------------------
1 | *.js
2 | !lambda/*js
3 | !jest.config.js
4 | *.d.ts
5 | node_modules
6 | 
7 | # CDK asset staging directory
8 | .cdk.staging
9 | cdk.out
10 | 
--------------------------------------------------------------------------------
/cdk/bin/cdk.ts:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env node
2 | 
3 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | //
5 | // Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | // this software and associated documentation files (the "Software"), to deal in
7 | // the Software without restriction, including without limitation the rights to
8 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | // the Software, and to permit persons to whom the Software is furnished to do so.
10 | //
11 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 18 | import { App, Environment } from 'aws-cdk-lib'; 19 | import { CdkStack } from '../lib/cdk-stack'; 20 | import { VpcStack } from '../lib/vpc'; 21 | import { InstanceClass, InstanceSize, InstanceType } from 'aws-cdk-lib/aws-ec2'; 22 | import { ClientBrokerEncryption, KafkaVersion } from '@aws-cdk/aws-msk-alpha'; 23 | 24 | 25 | const app = new App(); 26 | 27 | 28 | const defaults : { 'env': Environment } = { 29 | env: { 30 | account: '123456789012', 31 | region: 'eu-west-1' 32 | } 33 | }; 34 | 35 | 36 | const vpc = new VpcStack(app, 'MskPerformanceVpc', defaults).vpc; 37 | 38 | 39 | 40 | const throughputSpec = (consumer: number, protocol: string, batchSize:number, partitions: number[], throughput: number[]) => ({ 41 | "test_specification": { 42 | "parameters": { 43 | "cluster_throughput_mb_per_sec": throughput, 44 | "num_producers": [ 6 ], 45 | "consumer_groups" : [ { "num_groups": consumer, "size": 6 } ], 46 | "client_props": [{ 47 | "producer": `acks=all linger.ms=5 batch.size=${batchSize} buffer.memory=2147483648 security.protocol=${protocol}`, 48 | "consumer": `security.protocol=${protocol}` 49 | }], 50 | "num_partitions": partitions, 51 | "record_size_byte": [ 1024 ], 52 | "replication_factor": [ 3 ], 53 | "duration_sec": [ 3600 ] 54 | }, 55 | "skip_remaining_throughput": { 56 | "less-than": [ "sent_div_requested_mb_per_sec", 0.995 ] 57 | }, 58 | "depletion_configuration": { 59 | "upper_threshold": { 60 | "mb_per_sec": 200 61 | }, 62 | "approximate_timeout_hours": 0.5 63 | } 64 | } 65 | }); 66 | 67 | 68 | const throughput056 = [8, 16, 24, 32, 40, 44, 48, 52, 56]; 69 | const througphut096 = [8, 16, 32, 48, 56, 64, 72, 80, 88, 96]; 70 | const throughput192 = [8, 16, 32, 64, 96, 112, 128, 144, 160, 176, 192]; 71 | const throughput200 = [8, 16, 32, 64, 96, 112, 128, 144, 160, 176, 192, 200]; 72 | const throughput384 = [8, 64, 128, 192, 256, 320, 336, 352, 368, 384] 73 | const throughput672 = [8, 128, 256, 384, 448, 512, 576, 608, 640, 672]; 74 | 75 | 76 | new CdkStack(app, 'm5large-perf-test--', { 77 | ...defaults, 78 | vpc: vpc, 79 | clusterProps: { 80 | numberOfBrokerNodes: 1, 81 | instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.LARGE), 82 | ebsStorageInfo: { 83 | volumeSize: 5334 84 | }, 85 | encryptionInTransit: { 86 | enableInCluster: false, 87 | clientBroker: ClientBrokerEncryption.PLAINTEXT 88 | }, 89 | kafkaVersion: KafkaVersion.V2_8_0, 90 | }, 91 | // initialPerformanceTest: throughputSpec(2, "PLAINTEXT", 262114, [36], throughput056) 92 | }); 93 | 94 | new CdkStack(app, 'm52xlarge-perf-test', { 95 | ...defaults, 96 | vpc: vpc, 97 | clusterProps: { 98 | numberOfBrokerNodes: 1, 99 | instanceType: InstanceType.of(InstanceClass.M5, InstanceSize.XLARGE), 100 | ebsStorageInfo: { 101 | volumeSize: 5334 102 | }, 103 | encryptionInTransit: { 104 | enableInCluster: true, 105 | clientBroker: ClientBrokerEncryption.TLS 106 | }, 107 | kafkaVersion: KafkaVersion.V2_8_0 108 | }, 109 | // initialPerformanceTest: throughputSpec(2, "PLAINTEXT", 262114, [36], throughput192) 110 | }); 111 | -------------------------------------------------------------------------------- /cdk/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "npx ts-node --prefer-ts-exts bin/cdk.ts", 3 | "context": { 4 | "aws-cdk:enableDiffNoFail": "true", 5 | "@aws-cdk/core:stackRelativeExports": "true", 6 | "@aws-cdk/aws-secretsmanager:parseOwnedSecretName": true, 7 | "@aws-cdk/aws-kms:defaultKeyPolicies": true, 8 | "@aws-cdk/aws-s3:grantWriteWithoutAcl": 
true 9 | } 10 | } 11 | -------------------------------------------------------------------------------- /cdk/docker/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | FROM alpine:3.15 17 | 18 | ENV KAFKA_VERSION=2.8.0 19 | ENV SCALA_VERSION=2.12 20 | 21 | RUN apk add --no-cache tini zsh bash parallel mawk procps wget jq py-pip openjdk17-jre-headless \ 22 | && pip install awscli \ 23 | && wget -qO - https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz | tar xz -C /opt/ \ 24 | && wget --no-verbose https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.1/aws-msk-iam-auth-1.1.1-all.jar -P /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}/libs/ 25 | 26 | RUN cp /usr/lib/jvm/java-*/lib/security/cacerts /opt/kafka.client.truststore.jks 27 | 28 | COPY run-kafka-command.sh /opt/ 29 | COPY client-*.properties /opt/ 30 | 31 | ENTRYPOINT [ "tini", "--", "/opt/run-kafka-command.sh" ] -------------------------------------------------------------------------------- /cdk/docker/client-iam.properties: -------------------------------------------------------------------------------- 1 | ssl.truststore.location=/opt/kafka.client.truststore.jks 2 | security.protocol=SASL_SSL 3 | sasl.mechanism=AWS_MSK_IAM 4 | sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required; 5 | sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler 6 | -------------------------------------------------------------------------------- /cdk/docker/client-tls.properties: -------------------------------------------------------------------------------- 1 | security.protocol=SSL 2 | ssl.truststore.location=/opt/kafka.client.truststore.jks 3 | -------------------------------------------------------------------------------- /cdk/docker/run-kafka-command.sh: -------------------------------------------------------------------------------- 1 | #!/bin/zsh -x 2 | 3 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | # this software and associated documentation files (the "Software"), to deal in 7 | # the Software without restriction, including without limitation the rights to 8 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | # the Software, and to permit persons to whom the Software is furnished to do so. 
10 | # 11 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | 18 | 19 | set -eo pipefail 20 | setopt shwordsplit 21 | 22 | signal_handler() { 23 | echo "trap triggered by signal $1" 24 | ps -weo pid,%cpu,%mem,size,vsize,cmd --sort=-%mem 25 | trap - EXIT 26 | exit $2 27 | } 28 | 29 | output_mem_usage() { 30 | while true; do 31 | ps -weo pid,%cpu,%mem,rss,cmd --sort=-%mem 32 | sleep 60 33 | done 34 | } 35 | 36 | wait_for_all_tasks() { 37 | # wait until all jobs/containers are actually running 38 | echo -n "waiting for jobs: " 39 | while [[ ${running_jobs:--1} -lt "$args[--num-jobs]" ]]; do 40 | sleep 1 41 | echo -n "." 42 | running_jobs=$(aws --region $args[--region] --output text batch list-jobs --array-job-id ${AWS_BATCH_JOB_ID%:*} --job-status RUNNING --query 'jobSummaryList | length(@)') || running_jobs=-1 43 | done 44 | echo " done" 45 | } 46 | 47 | producer_command() { 48 | KAFKA_HEAP_OPTS="-Xms1G -Xmx3584m" /opt/kafka_?.??-?.?.?/bin/kafka-producer-perf-test.sh \ 49 | --topic $TOPIC \ 50 | --num-records $(printf '%.0f' $args[--num-records-producer]) \ 51 | --throughput $THROUGHPUT \ 52 | --record-size $(printf '%.0f' $args[--record-size-byte]) \ 53 | --producer-props bootstrap.servers=$PRODUCER_BOOTSTRAP_SERVERS ${(@s/ /)args[--producer-props]} \ 54 | --producer.config /opt/client.properties 2>&1 55 | } 56 | 57 | consumer_command() { 58 | for config in $args[--consumer-props]; do 59 | if [[ $config == *SSL* ]]; then # skip encryption config as it's already part of the properties file 60 | continue 61 | fi 62 | 63 | echo $config >> /opt/client.properties 64 | done 65 | 66 | cat /opt/client.properties 67 | 68 | KAFKA_HEAP_OPTS="-Xms1G -Xmx3584m" /opt/kafka_?.??-?.?.?/bin/kafka-consumer-perf-test.sh \ 69 | --topic $TOPIC \ 70 | --messages $(printf '%.0f' $args[--num-records-consumer]) \ 71 | --broker-list $CONSUMER_BOOTSTRAP_SERVERS \ 72 | --consumer.config /opt/client.properties \ 73 | --group $CONSUMER_GROUP \ 74 | --print-metrics \ 75 | --show-detailed-stats \ 76 | --timeout 16000 2>&1 77 | } 78 | 79 | query_msk_endpoint() { 80 | if [[ $args[--bootstrap-broker] = "-" ]] 81 | then 82 | case $args[--producer-props] in 83 | *SASL_SSL*) 84 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringSaslIam" --output text) 85 | ;; 86 | *SSL*) 87 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringTls" --output text) 88 | ;; 89 | *) 90 | PRODUCER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerString" --output text) 91 | ;; 92 | esac 93 | 94 | case $args[--consumer-props] in 95 | *SASL_SSL*) 96 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringSaslIam" --output text) 97 | ;; 98 | *SSL*) 99 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka 
get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerStringTls" --output text) 100 | ;; 101 | *) 102 | CONSUMER_BOOTSTRAP_SERVERS=$(aws --region $args[--region] kafka get-bootstrap-brokers --cluster-arn $args[--msk-cluster-arn] --query "BootstrapBrokerString" --output text) 103 | ;; 104 | esac 105 | else 106 | PRODUCER_BOOTSTRAP_SERVERS=$args[--bootstrap-broker] 107 | CONSUMER_BOOTSTRAP_SERVERS=$args[--bootstrap-broker] 108 | fi 109 | } 110 | 111 | 112 | trap 'signal_handler SIGTERM 15 $LINENO' SIGTERM 113 | # trap 'signal_handler EXIT $? $LINENO' EXIT 114 | 115 | 116 | # parse aruments 117 | zparseopts -A args -E -- -region: -msk-cluster-arn: -bootstrap-broker: -num-jobs: -command: -replication-factor: -topic-name: -depletion-topic-name: -record-size-byte: -records-per-sec: -num-partitions: -duration-sec: -producer-props: -num-producers: -num-records-producer: -consumer-props: -size-consumer-group: -num-records-consumer: 118 | 119 | date 120 | echo "parsed args: ${(kv)args}" 121 | echo "running command: $args[--command]" 122 | 123 | if [[ -f "$ECS_CONTAINER_METADATA_FILE" ]]; then 124 | CLUSTER=$(cat $ECS_CONTAINER_METADATA_FILE | jq -r '.Cluster') 125 | INSTANCE_ARN=$(cat $ECS_CONTAINER_METADATA_FILE | jq -r '.ContainerInstanceARN') 126 | echo -n "running on instance: " 127 | aws ecs describe-container-instances --region $args[--region] --cluster $CLUSTER --container-instances $INSTANCE_ARN | jq -r '.containerInstances[].ec2InstanceId' 128 | fi 129 | 130 | 131 | # periodically output mem usage for debugging purposes 132 | # output_mem_usage & 133 | 134 | 135 | # create authentication config properties 136 | case $args[--producer-props] in 137 | *SASL_SSL*) 138 | cp /opt/client-iam.properties /opt/client.properties 139 | ;; 140 | *SSL*) 141 | cp /opt/client-tls.properties /opt/client.properties 142 | ;; 143 | *) 144 | touch /opt/client.properties 145 | ;; 146 | esac 147 | 148 | 149 | # obtain cluster endpoint; retry if the command fails, eg, when credentials cannot be obtained from environment 150 | query_msk_endpoint 151 | while [ -z "$PRODUCER_BOOTSTRAP_SERVERS" ]; do 152 | sleep 10 153 | query_msk_endpoint 154 | done 155 | 156 | 157 | case $args[--command] in 158 | create-topics) 159 | # create topics with 5 sec retention for depletion task 160 | 161 | while 162 | ! KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 163 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 164 | --command-config /opt/client.properties \ 165 | --list \ 166 | | grep -q "$args[--depletion-topic-name]" 167 | do 168 | # try to create topic if none exists 169 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" \ 170 | /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 171 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 172 | --create \ 173 | --topic "$args[--depletion-topic-name]" \ 174 | --partitions $args[--num-partitions] \ 175 | --replication-factor $args[--replication-factor] \ 176 | --config retention.ms=5000 \ 177 | --command-config /opt/client.properties 178 | 179 | sleep 5 180 | done 181 | 182 | while 183 | # check if topic already exists 184 | ! 
KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 185 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 186 | --command-config /opt/client.properties \ 187 | --list \ 188 | | grep -q "$args[--topic-name]" 189 | do 190 | # try to create topics for actual performance tests 191 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" \ 192 | /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 193 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 194 | --create \ 195 | --topic "$args[--topic-name]" \ 196 | --partitions $args[--num-partitions] \ 197 | --replication-factor $args[--replication-factor] \ 198 | --command-config /opt/client.properties 199 | 200 | sleep 5 201 | done 202 | 203 | # list the created topics for debugging purposes 204 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 205 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 206 | --command-config /opt/client.properties \ 207 | --list 208 | ;; 209 | 210 | delete-topics) 211 | # delete all topics on the cluster (except for consumer group related ones) 212 | KAFKA_HEAP_OPTS="-Xms128m -Xmx512m" /opt/kafka_?.??-?.?.?/bin/kafka-topics.sh \ 213 | --bootstrap-server $PRODUCER_BOOTSTRAP_SERVERS \ 214 | --command-config /opt/client.properties \ 215 | --delete \ 216 | --topic 'test-id-\d+--throughput-series-id-\d+((--\S+-\d+)+|--depletion)' \ 217 | || : # don't fail if no topic exists 218 | ;; 219 | 220 | deplete-credits) 221 | TOPIC="$args[--depletion-topic-name]" 222 | THROUGHPUT=-1 223 | 224 | wait_for_all_tasks 225 | 226 | if [[ "$AWS_BATCH_JOB_ARRAY_INDEX" -lt $args[--num-producers] ]]; then 227 | producer_command | mawk -W interactive ' 228 | /TimeoutException/ { if(++exceptions % 100000 == 0) {print "Filtered", exceptions, "TimeoutExceptions."} } 229 | !/TimeoutException/ { print $0 } 230 | ' 231 | else 232 | CONSUMER_GROUP="$(((AWS_BATCH_JOB_ARRAY_INDEX - args[--num-producers]) / args[--size-consumer-group]))" 233 | 234 | consumer_command 235 | fi 236 | ;; 237 | 238 | run-performance-test) 239 | TOPIC="$args[--topic-name]" 240 | THROUGHPUT=$(printf '%.0f' $args[--records-per-sec]) 241 | 242 | wait_for_all_tasks 243 | 244 | # build the command for the actual performance test, the array index determines whether to produce or consume with this job 245 | if [[ "$AWS_BATCH_JOB_ARRAY_INDEX" -lt $args[--num-producers] ]]; then 246 | producer_command | mawk -W interactive ' 247 | /TimeoutException/ { if(++timeouts % 100000 == 0) {print "Filtered", timeouts, "TimeoutExceptions."} } 248 | /error/ { errors++ } 249 | !/TimeoutException/ { print $0 } 250 | END { 251 | printf "{ \"type\": \"producer\", \"test_summary\": \"%s\", \"timeout_exception_count\": \"%d\", \"error_count\": \"%d\" }\n", $0, timeouts, errors 252 | } 253 | ' 254 | else 255 | CONSUMER_GROUP="$(((AWS_BATCH_JOB_ARRAY_INDEX - args[--num-producers]) / args[--size-consumer-group]))" 256 | 257 | consumer_command | mawk -W interactive ' 258 | /WARNING: Exiting before consuming the expected number of messages/ { warnings++ } 259 | // { print $0 } 260 | END { 261 | printf "{ \"type\": \"consumer\", \"too_few_messages_count\": \"%d\" }\n", warnings 262 | } 263 | ' 264 | fi 265 | ;; 266 | 267 | *) 268 | print NIL 269 | ;; 270 | esac -------------------------------------------------------------------------------- /cdk/jest.config.js: -------------------------------------------------------------------------------- 1 | module.exports = { 2 | roots: ['/test'], 3 | testMatch: ['**/*.test.ts'], 4 | transform: { 5 | '^.+\\.tsx?$': 'ts-jest' 6 | } 7 | 
}; 8 | -------------------------------------------------------------------------------- /cdk/lambda/manage-infrastructure.js: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | const AWS = require('aws-sdk'); 17 | const batch = new AWS.Batch(); 18 | const cloudwatch = new AWS.CloudWatch(); 19 | 20 | exports.queryMskClusterThroughput = async function(event) { 21 | console.log(JSON.stringify(event)); 22 | 23 | const jobs = { 24 | jobs: [ 25 | event.depletion_job.JobId 26 | ] 27 | }; 28 | 29 | const jobDetails = await batch.describeJobs(jobs).promise(); 30 | 31 | console.log(JSON.stringify(jobDetails)); 32 | 33 | const jobCreatedAt = jobDetails.jobs[0].createdAt; 34 | const statusSummary = jobDetails.jobs[0].arrayProperties.statusSummary; 35 | 36 | // query max cluster throughput in the last 5 min, but not before the job started 37 | const queryEndTime = Date.now() 38 | const queryStartTime = Math.max(queryEndTime - 300_000, jobCreatedAt); 39 | 40 | const params = { 41 | EndTime: new Date(queryEndTime), 42 | StartTime: new Date(queryStartTime), 43 | MetricDataQueries: [ 44 | { 45 | Id: 'bytesInSum', 46 | Expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${process.env.MSK_CLUSTER_NAME}" MetricName="BytesInPerSec"', 'Maximum', 300))`, 47 | Period: '300', 48 | }, 49 | { 50 | Id: 'bytesInStddev', 51 | Expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${process.env.MSK_CLUSTER_NAME}" MetricName="BytesInPerSec"', 'Minimum', 60))`, 52 | Period: '60', 53 | } 54 | ], 55 | }; 56 | 57 | const metrics = await cloudwatch.getMetricData(params).promise(); 58 | 59 | console.log(JSON.stringify(metrics)); 60 | 61 | const result = { 62 | clusterMbInPerSec: metrics.MetricDataResults.find(m => m.Id == 'bytesInSum').Values[0] / 1024**2, 63 | brokerMbInPerSecStddev: metrics.MetricDataResults.find(m => m.Id == 'bytesInStddev').Values.reduce((acc,v) => v>acc ? v : acc, 0) / 1024**2, 64 | succeededPlusFailedJobs: statusSummary.SUCCEEDED + statusSummary.FAILED 65 | }; 66 | 67 | console.log(JSON.stringify(result)); 68 | 69 | return result; 70 | } 71 | 72 | exports.terminateDepletionJob = async function(event) { 73 | console.log(JSON.stringify(event)); 74 | 75 | //fixme: change to query pending jobs of the job queue and terminate them 76 | 77 | const params = { 78 | jobId: event.JobId, 79 | reason: "Batch job terminated by StepFunctions workflow." 
80 | }; 81 | 82 | return batch.terminateJob(params).promise(); 83 | } 84 | 85 | 86 | exports.updateMinCpus = async function(event) { 87 | console.log(JSON.stringify(event)); 88 | 89 | const params = { 90 | computeEnvironment: process.env.COMPUTE_ENVIRONMENT, 91 | computeResources: { 92 | minvCpus: 'current_test' in event ? event.current_test.parameters.num_jobs * 2 : 0 93 | } 94 | }; 95 | 96 | return batch.updateComputeEnvironment(params).promise(); 97 | } -------------------------------------------------------------------------------- /cdk/lambda/query-test-output.js: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | const AWS = require('aws-sdk') 17 | const batch = new AWS.Batch(); 18 | const cloudwatchlogs = new AWS.CloudWatchLogs(); 19 | 20 | exports.queryProducerOutput = async function(event) { 21 | console.log(JSON.stringify(event)); 22 | 23 | const numProducer = event.current_test.parameters.num_producers 24 | const jobId = event.job_result.JobId 25 | 26 | const describeParams = { 27 | jobs: [...Array(numProducer).keys()].map(x => jobId + ":" + x) 28 | } 29 | 30 | const jobDecriptions = await batch.describeJobs(describeParams).promise(); 31 | 32 | const logStreams = jobDecriptions.jobs.map(job => job.attempts[0].container.logStreamName); 33 | 34 | console.log(JSON.stringify(logStreams)); 35 | 36 | const filterParams = { 37 | logGroupName: process.env.LOG_GROUP_NAME, 38 | logStreamNames: logStreams, 39 | filterPattern: '{$.type="producer" && $.test_summary=* && $.error_count=*}', 40 | } 41 | 42 | return queryAndParseLogs(filterParams, numProducer); 43 | } 44 | 45 | 46 | async function queryAndParseLogs(filterParams, numProducer) { 47 | let filteredEvents = await cloudwatchlogs.filterLogEvents(filterParams).promise(); 48 | 49 | console.log(JSON.stringify(filteredEvents)); 50 | 51 | let queryResult = filteredEvents.events; 52 | 53 | // obtain all results through pagination 54 | while ('nextToken' in filteredEvents) { 55 | filteredEvents = await cloudwatchlogs.filterLogEvents({...filterParams, nextToken: filteredEvents.nextToken}).promise(); 56 | 57 | console.log(JSON.stringify(filteredEvents)); 58 | 59 | queryResult = [...queryResult, ...filteredEvents.events] 60 | } 61 | 62 | console.log(JSON.stringify(queryResult)); 63 | 64 | if (queryResult.length < numProducer) { 65 | // if we didn't obtain a result for all producer tasks, fail so that the stepfunctions workflow can retry later 66 | throw new Error(`incomplete query results: 
expected ${numProducer} results, found ${queryResult.length}`) 67 | } else { 68 | const testResults = queryResult.map(e => JSON.parse(e.message)); 69 | 70 | console.log(JSON.stringify(testResults)); 71 | 72 | // aggregate number of timeout errors 73 | const errorCountSum = testResults.reduce((acc,message) => acc + parseInt(message.error_count), 0) 74 | // parse producer throughput 75 | const mbPerSec = testResults 76 | .map(message => message.test_summary.match(/([0-9.]+) MB\/sec/)) 77 | .map(match => parseFloat(match[1])) 78 | 79 | return { 80 | errorCountSum: errorCountSum, 81 | mbPerSecSum: mbPerSec.reduce((acc,v) => acc + v, 0), 82 | mbPerSecMin: mbPerSec.reduce((acc,v) => v evaluate_skip_condition(args[1], event) 57 | elif "less-than" in condition: 58 | args = condition["less-than"] 59 | 60 | return evaluate_skip_condition(args[0], event) < evaluate_skip_condition(args[1], event) 61 | else: 62 | raise Exception("unable to parse condition: " + condition) 63 | elif isinstance(condition, str): 64 | if condition == "sent_div_requested_mb_per_sec": 65 | return event["producer_result"]["Payload"]["mbPerSecSum"] / event["current_test"]["parameters"]["cluster_throughput_mb_per_sec"] 66 | else: 67 | raise Exception("unable to parse condition: " + condition) 68 | elif isinstance(condition, int) or isinstance(condition, float): 69 | return condition 70 | else: 71 | raise Exception("unable to parse condition: " + condition) 72 | 73 | 74 | def increment_test_index(index, event, pos): 75 | parameters = [key for key in event["test_specification"]["parameters"].keys()] 76 | 77 | # if all options to increment the index are exhausted 78 | if not (0 <= pos < len(parameters)): 79 | raise OverflowError 80 | 81 | paramater_name = parameters[pos] 82 | parameter_value = event["current_test"]["index"][paramater_name] 83 | 84 | if (parameter_value+1 < len(event["test_specification"]["parameters"][paramater_name])): 85 | # if there are still values that need to be tested for this parameter, increment the index 86 | return ({**index, paramater_name: parameter_value+1}, event["current_test"]["parameters"]["throughput_series_id"]) 87 | else: 88 | if paramater_name == "cluster_throughput_mb_per_sec": 89 | # if the cluster throughput is exhausted, increment the next index and update the throughput series id 90 | (updated_index, _) = increment_test_index({**index, paramater_name: 0}, event, pos+1) 91 | updated_throughput_series_id = random.randint(0,100000) 92 | 93 | return (updated_index, updated_throughput_series_id) 94 | else: 95 | return increment_test_index({**index, paramater_name: 0}, event, pos+1) 96 | 97 | 98 | 99 | def update_parameters(test_specification, updated_test_index, test_id, throughput_series_id): 100 | if updated_test_index is None: 101 | # all test have been completed 102 | return { 103 | "test_specification": test_specification 104 | } 105 | else: 106 | # get the new parameters for the updated index 107 | updated_parameters = { 108 | index_name: test_specification["parameters"][index_name][index_value] for (index_name,index_value) in updated_test_index.items() 109 | } 110 | 111 | # encode the current index into the topic name, so that the parameters can be extracted later 112 | max_topic_property_length = 15 113 | remove_invalid_chars = lambda x: re.sub(r'[^a-zA-Z0-9-]', '-', x.replace(' ', '-')) 114 | 115 | topic_name = 'test-id-' + str(test_id) + '--' + 'throughput-series-id-' + str(throughput_series_id) + '--' + '--'.join([f"{remove_invalid_chars(k)[:max_topic_property_length]}-{str(v)}" for 
k,v in updated_test_index.items()]) 116 | depletion_topic_name = 'test-id-' + str(test_id) + '--' + 'throughput-series-id-' + str(throughput_series_id) + '--depletion' 117 | 118 | record_size_byte = updated_parameters["record_size_byte"] 119 | 120 | if updated_parameters["cluster_throughput_mb_per_sec"]>0: 121 | producer_throughput_byte = updated_parameters["cluster_throughput_mb_per_sec"] * 1024**2 // updated_parameters["num_producers"] 122 | # one consumer group consumes the entire cluster throughput 123 | consumer_throughput_byte = updated_parameters["cluster_throughput_mb_per_sec"] * 1024**2 // updated_parameters["consumer_groups"]["size"] if updated_parameters["consumer_groups"]["size"] > 0 else 0 124 | 125 | num_records_producer = producer_throughput_byte * updated_parameters["duration_sec"] // record_size_byte 126 | num_records_consumer = consumer_throughput_byte * updated_parameters["duration_sec"] // record_size_byte 127 | else: 128 | # for depletion, publish messages as fast as possible 129 | producer_throughput_byte = -1 130 | consumer_throughput_byte = -1 131 | num_records_producer = 2147483647 132 | num_records_consumer = 2147483647 133 | 134 | # derive additional parameters for convenience 135 | updated_parameters = { 136 | **updated_parameters, 137 | "num_jobs": updated_parameters["num_producers"] + updated_parameters["consumer_groups"]["num_groups"]*updated_parameters["consumer_groups"]["size"], 138 | "producer_throughput": producer_throughput_byte, 139 | "records_per_sec": max(1, producer_throughput_byte // record_size_byte), 140 | "num_records_producer": num_records_producer, 141 | "num_records_consumer": num_records_consumer, 142 | "test_id": test_id, 143 | "throughput_series_id": throughput_series_id, 144 | "topic_name": topic_name, 145 | "depletion_topic_name": depletion_topic_name 146 | } 147 | 148 | updated_event = { 149 | "test_specification": test_specification, 150 | "current_test": { 151 | "index": updated_test_index, 152 | "parameters": updated_parameters 153 | } 154 | } 155 | 156 | return updated_event 157 | 158 | 159 | 160 | def skip_remaining_throughput(index, event): 161 | parameters = [key for key in event["test_specification"]["parameters"].keys()] 162 | throughput_index = parameters.index("cluster_throughput_mb_per_sec") 163 | 164 | # set all parameters up to cluster_throughput_mb_per_sec to 0 165 | reset_index = { 166 | index_name:0 for i, index_name in enumerate(index) if i<=throughput_index 167 | } 168 | 169 | # keep values of other parameters and increase next parameter after cluster_throughput_mb_per_sec 170 | (updated_index, _) = increment_test_index({**index, **reset_index}, event, pos=throughput_index+1) 171 | 172 | # choose new series id, as we are starting with new throughput 173 | throughput_series_id = random.randint(0,100000) 174 | 175 | return (updated_index, throughput_series_id) 176 | 177 | 178 | 179 | def update_parameters_for_depletion(event, context): 180 | test_specification = event["test_specification"] 181 | test_parameters = test_specification["parameters"] 182 | test_index = event["current_test"]["index"] 183 | test_id = event["current_test"]["parameters"]["test_id"] 184 | throughput_series_id = event["current_test"]["parameters"]["throughput_series_id"] 185 | depletion_duration_sec = test_specification["depletion_configuration"]["approximate_timeout_hours"] * 60 * 60 186 | 187 | # overwrite throughput of tests with default value for credit depletion 188 | depletion_parameters = { 189 | **test_parameters, 190 | "duration_sec": 
list(map(lambda _: depletion_duration_sec, test_parameters["duration_sec"])), 191 | "cluster_throughput_mb_per_sec": list(map(lambda _: -1, test_parameters["cluster_throughput_mb_per_sec"])), 192 | } 193 | 194 | # keep index unchanged and determine parameters for depletion 195 | return update_parameters({**test_specification, "parameters": depletion_parameters}, test_index, test_id, throughput_series_id) -------------------------------------------------------------------------------- /cdk/lib/batch-ami.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | 17 | import { Construct } from 'constructs'; 18 | import { Aws, Stack, StackProps } from 'aws-cdk-lib'; 19 | import { aws_ecs as ecs } from 'aws-cdk-lib'; 20 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 21 | import { aws_iam as iam } from 'aws-cdk-lib'; 22 | import { aws_imagebuilder as imagebuilder } from 'aws-cdk-lib'; 23 | 24 | 25 | export class BatchAmiStack extends Stack { 26 | ami: ec2.IMachineImage; 27 | 28 | constructor(scope: Construct, id: string, props?: StackProps) { 29 | super(scope, id, props); 30 | 31 | 32 | const agentComponent = new imagebuilder.CfnComponent(this, 'ImageBuilderComponent', { 33 | name: `${Aws.STACK_NAME}`, 34 | platform: 'Linux', 35 | version: '0.1.0', 36 | data: `name: Custom image for AWS Batch 37 | description: Builds a custom image for AWS with additional debugging tools 38 | schemaVersion: 1.0 39 | 40 | phases: 41 | - name: build 42 | steps: 43 | - name: InstallCloudWatchAgentStep 44 | action: ExecuteBash 45 | inputs: 46 | commands: 47 | - | 48 | cat <<'EOF' >> /etc/ecs/ecs.config 49 | ECS_ENABLE_CONTAINER_METADATA=true 50 | EOF 51 | - rpm -Uvh https://amazoncloudwatch-agent-us-west-2.s3.us-west-2.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm 52 | - | 53 | cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF' 54 | { 55 | "agent":{ 56 | "metrics_collection_interval":10, 57 | "omit_hostname":true 58 | }, 59 | "logs":{ 60 | "logs_collected":{ 61 | "files":{ 62 | "collect_list":[ 63 | { 64 | "file_path":"/var/log/ecs/ecs-agent.log*", 65 | "log_group_name":"/aws-batch/ecs-agent.log", 66 | "log_stream_name":"{instance_id}", 67 | "timezone":"Local" 68 | }, 69 | { 70 | "file_path":"/var/log/ecs/ecs-init.log", 71 | "log_group_name":"/aws-batch/ecs-init.log", 72 | "log_stream_name":"{instance_id}", 73 | "timezone":"Local" 74 | }, 75 | { 76 | "file_path":"/var/log/ecs/audit.log*", 77 | 
"log_group_name":"/aws-batch/audit.log", 78 | "log_stream_name":"{instance_id}", 79 | "timezone":"Local" 80 | }, 81 | { 82 | "file_path":"/var/log/messages", 83 | "log_group_name":"/aws-batch/messages", 84 | "log_stream_name":"{instance_id}", 85 | "timezone":"Local" 86 | } 87 | ] 88 | } 89 | } 90 | } 91 | } 92 | EOF 93 | - /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s` 94 | }); 95 | 96 | const imageReceipe = new imagebuilder.CfnImageRecipe(this, 'ImageBuilderReceipt', { 97 | components: [{ 98 | componentArn: agentComponent.attrArn 99 | }], 100 | name: Aws.STACK_NAME, 101 | parentImage: ecs.EcsOptimizedImage.amazonLinux2().getImage(this).imageId, 102 | version: '0.1.0' 103 | }); 104 | 105 | const imageBuilderRole = new iam.Role(this, 'ImageBuilderRole', { 106 | assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'), 107 | managedPolicies: [ 108 | iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'), 109 | iam.ManagedPolicy.fromAwsManagedPolicyName('EC2InstanceProfileForImageBuilder') 110 | ] 111 | }); 112 | 113 | const imageBuilderProfile = new iam.CfnInstanceProfile(this, 'ImageBuilderInstanceProfile', { 114 | roles: [ imageBuilderRole.roleName ], 115 | }); 116 | 117 | const infrastructureconfiguration = new imagebuilder.CfnInfrastructureConfiguration(this, 'InfrastructureConfiguration', { 118 | name: Aws.STACK_NAME, 119 | instanceProfileName: imageBuilderProfile.ref 120 | }); 121 | 122 | const image = new imagebuilder.CfnImage(this, 'Image', { 123 | imageRecipeArn: imageReceipe.attrArn, 124 | infrastructureConfigurationArn: infrastructureconfiguration.attrArn, 125 | }); 126 | 127 | this.ami = ec2.MachineImage.genericLinux({ 128 | [ this.region ] : image.attrImageId 129 | }); 130 | } 131 | } -------------------------------------------------------------------------------- /cdk/lib/cdk-stack.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | import { Construct } from 'constructs'; 17 | import { Aws, Stack, StackProps, CfnResource, CfnOutput, Duration, RemovalPolicy } from 'aws-cdk-lib'; 18 | import { aws_ecs as ecs } from 'aws-cdk-lib'; 19 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 20 | import { aws_iam as iam } from 'aws-cdk-lib'; 21 | import { aws_logs as logs } from 'aws-cdk-lib'; 22 | import { aws_stepfunctions as sfn } from 'aws-cdk-lib'; 23 | import { aws_stepfunctions_tasks as tasks } from 'aws-cdk-lib'; 24 | import { aws_lambda as lambda } from 'aws-cdk-lib'; 25 | import { aws_cloudwatch as cloudwatch} from 'aws-cdk-lib'; 26 | import { aws_sagemaker as sagemaker } from 'aws-cdk-lib'; 27 | import { custom_resources } from 'aws-cdk-lib'; 28 | 29 | import * as msk from '@aws-cdk/aws-msk-alpha'; 30 | import * as batch from '@aws-cdk/aws-batch-alpha'; 31 | 32 | import { InstanceType } from 'aws-cdk-lib/aws-ec2'; 33 | import { RetentionDays } from 'aws-cdk-lib/aws-logs'; 34 | import { IntegrationPattern, TaskInput } from 'aws-cdk-lib/aws-stepfunctions'; 35 | import { AwsCustomResource } from 'aws-cdk-lib/custom-resources'; 36 | import { CreditDepletion } from './credit-depletion-sfn'; 37 | import { ClusterMonitoringLevel, KafkaVersion } from '@aws-cdk/aws-msk-alpha'; 38 | 39 | 40 | export interface MskClusterParametersNewCluster extends StackProps { 41 | clusterProps: { 42 | numberOfBrokerNodes: number, //brokers per AZ 43 | instanceType: ec2.InstanceType, 44 | ebsStorageInfo: msk.EbsStorageInfo, 45 | encryptionInTransit: msk.EncryptionInTransitConfig, 46 | clientAuthetication?: msk.ClientAuthentication, 47 | kafkaVersion: KafkaVersion, 48 | configurationInfo?: msk.ClusterConfigurationInfo 49 | }, 50 | vpc?: ec2.IVpc, 51 | sg?: ec2.ISecurityGroup, 52 | initialPerformanceTest?: any, 53 | } 54 | 55 | export interface BootstrapBroker extends StackProps { 56 | bootstrapBrokerString: string, 57 | clusterName: string, 58 | vpc?: ec2.IVpc | string, 59 | sg: ec2.ISecurityGroup | string, 60 | initialPerformanceTest?: any, 61 | } 62 | 63 | 64 | 65 | function toStackName(params: MskClusterParametersNewCluster | BootstrapBroker, prefix: string) : string { 66 | if ('bootstrapBrokerString' in params) { 67 | return prefix 68 | } else { 69 | const brokerNodeType = params.clusterProps.instanceType.toString().replace(/[^0-9a-zA-Z]/g, ''); 70 | const inClusterEncryption = params.clusterProps.encryptionInTransit.enableInCluster ? "t" : "f" 71 | const clusterVersion = params.clusterProps.kafkaVersion.version.replace(/[^0-9a-zA-Z]/g, ''); 72 | 73 | const numSubnets = params.vpc ? 
Math.max(params.vpc.privateSubnets.length, 3) : 3; 74 | 75 | const name = `${params.clusterProps.numberOfBrokerNodes*numSubnets}-${brokerNodeType}-${params.clusterProps.ebsStorageInfo.volumeSize}-${inClusterEncryption}-${clusterVersion}`; 76 | 77 | return `${prefix}${name}`; 78 | } 79 | } 80 | 81 | 82 | export class CdkStack extends Stack { 83 | constructor(scope: Construct, id: string, props: MskClusterParametersNewCluster | BootstrapBroker) { 84 | super(scope, toStackName(props, id), { ...props, description: 'Creates resources to run automated performance tests against Amazon MSK clusters (shausma-msk-performance)' }); 85 | 86 | var vpc; 87 | if (props.vpc) { 88 | if (typeof props.vpc === "string") { 89 | vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcId: props.vpc }) 90 | } else { 91 | vpc = props.vpc 92 | } 93 | } else { 94 | vpc = new ec2.Vpc(this, 'Vpc'); 95 | } 96 | 97 | var sg; 98 | if (props.sg) { 99 | if (typeof props.sg === "string") { 100 | sg = ec2.SecurityGroup.fromSecurityGroupId(this, 'SecurityGroup', props.sg) 101 | } else { 102 | sg = props.sg 103 | } 104 | } else { 105 | sg = new ec2.SecurityGroup(this, 'SecurityGroup', { vpc: vpc }); 106 | sg.addIngressRule(sg, ec2.Port.allTraffic()); 107 | } 108 | 109 | var kafka : { cfnCluster?: CfnResource, numBrokers?: number, clusterName: string }; 110 | 111 | 112 | if ('bootstrapBrokerString' in props) { 113 | kafka = { 114 | clusterName: props.clusterName 115 | }; 116 | } else { 117 | const cluster = new msk.Cluster(this, 'KafkaCluster', { 118 | ...props.clusterProps, 119 | monitoring: { 120 | clusterMonitoringLevel: ClusterMonitoringLevel.PER_BROKER 121 | }, 122 | vpc: vpc, 123 | vpcSubnets: { 124 | subnetType: ec2.SubnetType.PRIVATE_WITH_NAT 125 | }, 126 | securityGroups: [ sg ], 127 | clusterName: Aws.STACK_NAME, 128 | removalPolicy: RemovalPolicy.DESTROY, 129 | }); 130 | 131 | kafka = { 132 | cfnCluster: cluster.node.findChild('Resource') as CfnResource, 133 | numBrokers: Math.max(vpc.privateSubnets.length, 3) * props.clusterProps.numberOfBrokerNodes, 134 | clusterName: cluster.clusterName 135 | } 136 | } 137 | 138 | 139 | const defaultLogGroup = new logs.LogGroup(this, 'DefaultLogGroup', { 140 | retention: RetentionDays.ONE_WEEK 141 | }); 142 | 143 | const performanceTestLogGroup = new logs.LogGroup(this, 'PerformanceTestLogGroup', { 144 | retention: RetentionDays.ONE_YEAR 145 | }); 146 | 147 | 148 | const batchEcsInstanceRole = new iam.Role(this, 'BatchEcsInstanceRole', { 149 | assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'), 150 | managedPolicies: [ 151 | iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role'), 152 | iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore') 153 | ] 154 | }); 155 | 156 | const batchEcsInstanceProfile = new iam.CfnInstanceProfile(this, 'BatchEcsInstanceProfile', { 157 | roles: [ batchEcsInstanceRole.roleName ], 158 | }); 159 | 160 | const computeEnvironment = new batch.ComputeEnvironment(this, 'ComputeEnvironment', { 161 | computeEnvironmentName: `${Aws.STACK_NAME}`, 162 | computeResources: { 163 | vpc, 164 | instanceTypes: [ new InstanceType('c5n.large') ], 165 | instanceRole: batchEcsInstanceProfile.attrArn, 166 | securityGroups: [ sg ], 167 | }, 168 | }); 169 | 170 | const jobQueue = new batch.JobQueue(this, 'JobQueue', { 171 | jobQueueName: Aws.STACK_NAME, 172 | computeEnvironments: [ 173 | { 174 | computeEnvironment, 175 | order: 1, 176 | }, 177 | ], 178 | }); 179 | 180 | const jobRole = new iam.Role(this, 'JobRole', { 181 | 
assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'), 182 | managedPolicies: [ 183 | iam.ManagedPolicy.fromAwsManagedPolicyName('ReadOnlyAccess') 184 | ], 185 | inlinePolicies: { 186 | TerminateJob: new iam.PolicyDocument({ 187 | statements: [ 188 | new iam.PolicyStatement({ 189 | actions: ["batch:TerminateJob"], 190 | resources: ["*"] 191 | }), 192 | new iam.PolicyStatement({ 193 | actions: [ 194 | "kafka-cluster:Connect", 195 | "kafka-cluster:DescribeCluster" 196 | ], 197 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:cluster/${kafka.clusterName}/*`] 198 | }), 199 | new iam.PolicyStatement({ 200 | actions: [ 201 | "kafka-cluster:*Topic*", 202 | "kafka-cluster:WriteData", 203 | "kafka-cluster:ReadData" 204 | ], 205 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:topic/${kafka.clusterName}/*`] 206 | }), 207 | new iam.PolicyStatement({ 208 | actions: [ 209 | "kafka-cluster:AlterGroup", 210 | "kafka-cluster:DescribeGroup" 211 | ], 212 | resources: [`arn:${Aws.PARTITION}:kafka:${Aws.REGION}:${Aws.ACCOUNT_ID}:group/${kafka.clusterName}/*`] 213 | }) 214 | ] 215 | }) 216 | } 217 | }); 218 | 219 | 220 | const containerImage = ecs.ContainerImage.fromAsset('./docker'); 221 | 222 | const performanceTestJobDefinition = new batch.JobDefinition(this, 'PerformanceTestJobDefinition', { 223 | container: { 224 | image: containerImage, 225 | vcpus: 2, 226 | memoryLimitMiB: 4900, 227 | logConfiguration: { 228 | logDriver: batch.LogDriver.AWSLOGS, 229 | options: { 230 | 'awslogs-region': Aws.REGION, 231 | 'awslogs-group': performanceTestLogGroup.logGroupName 232 | } 233 | }, 234 | jobRole: jobRole 235 | }, 236 | }); 237 | 238 | const defaultJobDefinition = new batch.JobDefinition(this, 'DefaultJobDefinition', { 239 | container: { 240 | image: containerImage, 241 | vcpus: 2, 242 | memoryLimitMiB: 4900, 243 | logConfiguration: { 244 | logDriver: batch.LogDriver.AWSLOGS, 245 | options: { 246 | 'awslogs-region': Aws.REGION, 247 | 'awslogs-group': defaultLogGroup.logGroupName 248 | } 249 | }, 250 | jobRole: jobRole 251 | }, 252 | }); 253 | 254 | 255 | 256 | const payload = TaskInput.fromObject({ 257 | 'num_jobs.$': 'States.JsonToString($.current_test.parameters.num_jobs)', //hack to convert number to string 258 | 'replication_factor.$': 'States.JsonToString($.current_test.parameters.replication_factor)', 259 | topic_name: sfn.JsonPath.stringAt('$.current_test.parameters.topic_name'), 260 | depletion_topic_name: sfn.JsonPath.stringAt('$.current_test.parameters.depletion_topic_name'), 261 | 'record_size_byte.$': 'States.JsonToString($.current_test.parameters.record_size_byte)', 262 | 'records_per_sec.$': 'States.JsonToString($.current_test.parameters.records_per_sec)', 263 | 'num_partitions.$': 'States.JsonToString($.current_test.parameters.num_partitions)', 264 | 'duration_sec.$': 'States.JsonToString($.current_test.parameters.duration_sec)', 265 | producer_props: sfn.JsonPath.stringAt('$.current_test.parameters.client_props.producer'), //this fails for empty strings :-( 266 | 'num_producers.$': 'States.JsonToString($.current_test.parameters.num_producers)', 267 | 'num_records_producer.$': 'States.JsonToString($.current_test.parameters.num_records_producer)', 268 | consumer_props: sfn.JsonPath.stringAt('$.current_test.parameters.client_props.consumer'), 269 | 'num_consumer_groups.$': 'States.JsonToString($.current_test.parameters.consumer_groups.num_groups)', 270 | 'size_consumer_group.$': 
'States.JsonToString($.current_test.parameters.consumer_groups.size)', 271 | 'num_records_consumer.$': 'States.JsonToString($.current_test.parameters.num_records_consumer)', 272 | }); 273 | 274 | const commandParameters = [ 275 | '--region', Aws.REGION, 276 | '--msk-cluster-arn', kafka.cfnCluster? kafka.cfnCluster.ref : '-', 277 | '--bootstrap-broker', 'bootstrapBrokerString' in props ? props.bootstrapBrokerString : '-', 278 | '--num-jobs', 'Ref::num_jobs', 279 | '--replication-factor', 'Ref::replication_factor', 280 | '--topic-name', 'Ref::topic_name', 281 | '--depletion-topic-name', 'Ref::depletion_topic_name', 282 | '--record-size-byte', 'Ref::record_size_byte', 283 | '--records-per-sec', 'Ref::records_per_sec', 284 | '--num-partitions', 'Ref::num_partitions', 285 | '--duration-sec', 'Ref::duration_sec', 286 | '--producer-props', 'Ref::producer_props', 287 | '--num-producers', 'Ref::num_producers', 288 | '--num-records-producer', 'Ref::num_records_producer', 289 | '--consumer-props', 'Ref::consumer_props', 290 | '--num-consumer-groups', 'Ref::num_consumer_groups', 291 | '--num-records-consumer', 'Ref::num_records_consumer', 292 | '--size-consumer-group', 'Ref::size_consumer_group' 293 | ] 294 | 295 | 296 | const queryProducerResultLambda = new lambda.Function(this, 'QueryProducerResultLambda', { 297 | runtime: lambda.Runtime.NODEJS_14_X, 298 | code: lambda.Code.fromAsset('lambda'), 299 | timeout: Duration.minutes(2), 300 | handler: 'query-test-output.queryProducerOutput', 301 | environment: { 302 | LOG_GROUP_NAME: performanceTestLogGroup.logGroupName 303 | } 304 | }); 305 | 306 | performanceTestLogGroup.grant(queryProducerResultLambda, 'logs:FilterLogEvents') 307 | 308 | queryProducerResultLambda.addToRolePolicy( 309 | new iam.PolicyStatement({ 310 | actions: ['batch:DescribeJobs'], 311 | resources: ['*'] 312 | }) 313 | ); 314 | 315 | const queryProducerResult = new tasks.LambdaInvoke(this, 'QueryProducerResult', { 316 | lambdaFunction: queryProducerResultLambda, 317 | resultPath: '$.producer_result' 318 | }); 319 | 320 | // if the query restult is not available in cloudwatch yet, retry query 321 | queryProducerResult.addRetry({ 322 | interval: Duration.seconds(10), 323 | maxAttempts: 3 324 | }); 325 | 326 | 327 | const runPerformanceTest = new tasks.BatchSubmitJob(this, 'RunPerformanceTest', { 328 | jobName: 'RunPerformanceTest', 329 | jobDefinitionArn: performanceTestJobDefinition.jobDefinitionArn, 330 | jobQueueArn: jobQueue.jobQueueArn, 331 | arraySize: sfn.JsonPath.numberAt('$.current_test.parameters.num_jobs'), 332 | containerOverrides: { 333 | command: [...commandParameters, '--command', 'run-performance-test' ] 334 | }, 335 | payload: payload, 336 | resultPath: '$.job_result', 337 | }); 338 | 339 | runPerformanceTest.next(queryProducerResult); 340 | 341 | 342 | const depleteCreditsConstruct = new CreditDepletion(this, 'DepleteCreditsSfn', { 343 | clusterName: kafka.clusterName, 344 | jobDefinition: defaultJobDefinition, 345 | jobQueue: jobQueue, 346 | commandParameters: commandParameters, 347 | payload: payload 348 | }) 349 | 350 | const depleteCredits = new tasks.StepFunctionsStartExecution(this, 'DepleteCredits', { 351 | stateMachine: depleteCreditsConstruct.stateMachine, 352 | integrationPattern: IntegrationPattern.RUN_JOB, 353 | inputPath: '$', 354 | resultPath: 'DISCARD', 355 | timeout: Duration.hours(6) //fixme: use timeoutPath when it becomes exposed through cdk 356 | }); 357 | 358 | depleteCredits.next(runPerformanceTest); 359 | 360 | 361 | const createTopics = new 
tasks.BatchSubmitJob(this, 'CreateTopics', { 362 | jobName: 'CreateTopics', 363 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 364 | jobQueueArn: jobQueue.jobQueueArn, 365 | containerOverrides: { 366 | command: [...commandParameters, '--command', 'create-topics' ] 367 | }, 368 | payload: payload, 369 | resultPath: 'DISCARD', 370 | }); 371 | 372 | createTopics.next(depleteCredits); 373 | 374 | 375 | const finalState = new sfn.Succeed(this, 'Succeed'); 376 | 377 | const checkCompleted = new sfn.Choice(this, 'CheckAllTestsCompleted') 378 | .when(sfn.Condition.isPresent('$.current_test'), createTopics) 379 | .otherwise(finalState); 380 | 381 | 382 | const updateTestParameterLambda = new lambda.Function(this, 'UpdateTestParameterLambda', { 383 | runtime: lambda.Runtime.PYTHON_3_8, 384 | code: lambda.Code.fromAsset('lambda'), 385 | timeout: Duration.seconds(5), 386 | handler: 'test-parameters.increment_index_and_update_parameters', 387 | }); 388 | 389 | const updateTestParameter = new tasks.LambdaInvoke(this, 'UpdateTestParameter', { 390 | lambdaFunction: updateTestParameterLambda, 391 | outputPath: '$.Payload', 392 | }); 393 | 394 | updateTestParameter.next(checkCompleted); 395 | 396 | 397 | const deleteTopics = new tasks.BatchSubmitJob(this, 'DeleteAllTopics', { 398 | jobName: 'DeleteAllTopics', 399 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 400 | jobQueueArn: jobQueue.jobQueueArn, 401 | containerOverrides: { 402 | command: [...commandParameters, '--command', 'delete-topics' ] 403 | }, 404 | payload: payload, 405 | resultPath: 'DISCARD', 406 | }); 407 | 408 | deleteTopics.next(updateTestParameter); 409 | queryProducerResult.next(deleteTopics); 410 | 411 | 412 | const fail = new sfn.Fail(this, 'Fail'); 413 | 414 | 415 | const cleanup = new tasks.BatchSubmitJob(this, 'DeleteAllTopicsTrap', { 416 | jobName: 'DeleteAllTopics', 417 | jobDefinitionArn: defaultJobDefinition.jobDefinitionArn, 418 | jobQueueArn: jobQueue.jobQueueArn, 419 | containerOverrides: { 420 | command: [ 421 | '--region', Aws.REGION, 422 | '--msk-cluster-arn', kafka.cfnCluster? kafka.cfnCluster.ref : '-', 423 | '--bootstrap-broker', 'bootstrapBrokerString' in props ? 
props.bootstrapBrokerString : '-', 424 | '--producer-props', 'Ref::producer_props', 425 | '--command', 'delete-topics' 426 | ] 427 | }, 428 | payload: TaskInput.fromObject({ 429 | 'producer_props.$': '$.Cause' 430 | }), 431 | resultPath: 'DISCARD' 432 | }); 433 | 434 | cleanup.next(fail); 435 | 436 | 437 | const parallel = new sfn.Parallel(this, 'IterateThroughPerformanceTests'); 438 | 439 | parallel.branch(updateTestParameter); 440 | parallel.addCatch(cleanup); 441 | 442 | const stateMachine = new sfn.StateMachine(this, 'StateMachine', { 443 | definition: parallel, 444 | stateMachineName: `${Aws.STACK_NAME}-main` 445 | }); 446 | 447 | if (props.initialPerformanceTest) { 448 | new AwsCustomResource(this, 'InitialPerformanceTestResource', { 449 | policy: custom_resources.AwsCustomResourcePolicy.fromSdkCalls({ 450 | resources: [ stateMachine.stateMachineArn ] 451 | }), 452 | onCreate: { 453 | action: "startExecution", 454 | service: "StepFunctions", 455 | parameters: { 456 | input: JSON.stringify(props.initialPerformanceTest), 457 | stateMachineArn: stateMachine.stateMachineArn 458 | }, 459 | physicalResourceId: { 460 | id: 'InitialPerformanceTestCall' 461 | } 462 | } 463 | }) 464 | } 465 | 466 | 467 | const bytesInWidget = new cloudwatch.GraphWidget({ 468 | left: [ 469 | new cloudwatch.MathExpression({ 470 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60))/1024/1024`, 471 | label: `cluster ${kafka.clusterName}`, 472 | usingMetrics: {} 473 | }), 474 | new cloudwatch.MathExpression({ 475 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60)/1024/1024`, 476 | label: 'broker', 477 | usingMetrics: {} 478 | }), 479 | ], 480 | leftYAxis: { 481 | min: 0 482 | }, 483 | title: 'throughput in (MB/sec)', 484 | width: 24, 485 | liveData: true 486 | }); 487 | 488 | const bytesInOutStddevWidget = new cloudwatch.GraphWidget({ 489 | left: [ 490 | new cloudwatch.MathExpression({ 491 | expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesInPerSec"', 'Minimum', 60))/1024/1024`, 492 | label: 'in', 493 | usingMetrics: {} 494 | }), 495 | new cloudwatch.MathExpression({ 496 | expression: `STDDEV(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Minimum', 60))/1024/1024`, 497 | label: 'out', 498 | usingMetrics: {} 499 | }), 500 | ], 501 | leftYAxis: { 502 | min: 0, 503 | max: 3 504 | }, 505 | leftAnnotations: [{ 506 | value: 0.75 507 | }], 508 | title: 'stdev broker throughput (MB/sec)', 509 | width: 24, 510 | liveData: true, 511 | }); 512 | 513 | 514 | const bytesOutWidget = new cloudwatch.GraphWidget({ 515 | left: [ 516 | new cloudwatch.MathExpression({ 517 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Average', 60))/1024/1024`, 518 | label: `cluster ${kafka.clusterName}`, 519 | usingMetrics: {} 520 | }), 521 | new cloudwatch.MathExpression({ 522 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BytesOutPerSec"', 'Average', 60)/1024/1024`, 523 | label: 'broker', 524 | usingMetrics: {} 525 | }), 526 | ], 527 | title: 'throughput out (MB/sec)', 528 | width: 24, 529 | liveData: true 530 | }); 531 | 532 | const burstBalance = new 
cloudwatch.GraphWidget({ 533 | left: [ 534 | new cloudwatch.MathExpression({ 535 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BurstBalance"', 'Minimum', 60)`, 536 | label: `EBS volume BurstBalance`, 537 | usingMetrics: {} 538 | }), 539 | new cloudwatch.MathExpression({ 540 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CPUCreditBalance"', 'Minimum', 60)`, 541 | label: `CPUCreditBalance`, 542 | usingMetrics: {} 543 | }), 544 | ], 545 | title: 'burst balances', 546 | width: 24, 547 | liveData: true 548 | }); 549 | 550 | 551 | const trafficShaping = new cloudwatch.GraphWidget({ 552 | left: [ 553 | new cloudwatch.MathExpression({ 554 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BwInAllowanceExceeded"', 'Maximum', 60)`, 555 | label: `BwInAllowanceExceeded`, 556 | usingMetrics: {} 557 | }), 558 | new cloudwatch.MathExpression({ 559 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="BwOutAllowanceExceeded"', 'Maximum', 60)`, 560 | label: `BwOutAllowanceExceeded`, 561 | usingMetrics: {} 562 | }), 563 | new cloudwatch.MathExpression({ 564 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PpsAllowanceExceeded"', 'Maximum', 60)`, 565 | label: `PpsAllowanceExceeded`, 566 | usingMetrics: {} 567 | }), 568 | new cloudwatch.MathExpression({ 569 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="ConntrackAllowanceExceeded"', 'Maximum', 60)`, 570 | label: `ConntrackAllowanceExceeded`, 571 | usingMetrics: {} 572 | }), 573 | ], 574 | title: 'traffic shaping applied', 575 | width: 24, 576 | liveData: true 577 | }); 578 | 579 | 580 | const cpuIdleWidget = new cloudwatch.GraphWidget({ 581 | left: [ 582 | new cloudwatch.MathExpression({ 583 | expression: `AVG(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CpuIdle"', 'Minimum', 60))`, 584 | label: `cluster ${kafka.clusterName}`, 585 | usingMetrics: {} 586 | }), 587 | new cloudwatch.MathExpression({ 588 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="CpuIdle"', 'Minimum', 60)`, 589 | label: 'broker', 590 | usingMetrics: {} 591 | }), 592 | ], 593 | title: 'cpu idle (percent)', 594 | width: 24, 595 | leftYAxis: { 596 | min: 0 597 | }, 598 | liveData: true 599 | }); 600 | 601 | const replicationWidget = new cloudwatch.GraphWidget({ 602 | left: [ 603 | new cloudwatch.MathExpression({ 604 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="UnderReplicatedPartitions"', 'Maximum', 60))`, 605 | label: 'under replicated', 606 | usingMetrics: {} 607 | }), 608 | new cloudwatch.MathExpression({ 609 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="UnderMinIsrPartitionCount"', 'Maximum', 60))`, 610 | label: 'under min isr', 611 | usingMetrics: {} 612 | }), 613 | ], 614 | leftYAxis: { 615 | min: 0 616 | }, 617 | title: 'partition (count)', 618 | width: 24, 619 | liveData: true 620 | }); 621 | 622 | const partitionWidget = new cloudwatch.GraphWidget({ 623 | left: [ 624 | new cloudwatch.MathExpression({ 625 | expression: `SUM(SEARCH('{AWS/Kafka,"Broker ID","Cluster 
Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PartitionCount"', 'Maximum', 60))`, 626 | label: 'partiton count', 627 | usingMetrics: {} 628 | }), 629 | new cloudwatch.MathExpression({ 630 | expression: `SEARCH('{AWS/Kafka,"Broker ID","Cluster Name"} "Cluster Name"="${kafka.clusterName}" MetricName="PartitionCount"', 'Maximum', 60)`, 631 | label: 'partiton count', 632 | usingMetrics: {} 633 | }), 634 | ], 635 | leftYAxis: { 636 | min: 0 637 | }, 638 | title: 'partition (count)', 639 | width: 24, 640 | liveData: true 641 | }); 642 | 643 | const dashboard = new cloudwatch.Dashboard(this, 'Dashboard', { 644 | dashboardName: Aws.STACK_NAME, 645 | widgets: [ 646 | [ bytesInWidget ], 647 | [ bytesOutWidget ], 648 | [ trafficShaping ], 649 | [ burstBalance ], 650 | [ bytesInOutStddevWidget ], 651 | [ cpuIdleWidget ], 652 | [ replicationWidget ], 653 | [ partitionWidget ] 654 | ], 655 | }); 656 | 657 | const sagemakerInstanceRole = new iam.Role(this, 'SagemakerInstanceRole', { 658 | assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'), 659 | inlinePolicies: { 660 | cloudFormation: new iam.PolicyDocument({ 661 | statements: [new iam.PolicyStatement({ 662 | actions: [ 663 | 'cloudformation:DescribeStacks' 664 | ], 665 | resources: [ Aws.STACK_ID ], 666 | })] 667 | }), 668 | cwLogs: new iam.PolicyDocument({ 669 | statements: [ 670 | new iam.PolicyStatement({ 671 | actions: [ 672 | 'logs:StartQuery' 673 | ], 674 | resources: [ performanceTestLogGroup.logGroupArn ], 675 | }), 676 | new iam.PolicyStatement({ 677 | actions: [ 678 | 'logs:GetQueryResults' 679 | ], 680 | resources: [ '*' ], 681 | }) 682 | ] 683 | }), 684 | stepFunctions: new iam.PolicyDocument({ 685 | statements: [new iam.PolicyStatement({ 686 | actions: [ 687 | 'states:DescribeExecution' 688 | ], 689 | resources: [ `arn:${Aws.PARTITION}:states:${Aws.REGION}:${Aws.ACCOUNT_ID}:execution:${stateMachine.stateMachineName}:*` ], 690 | })] 691 | }) 692 | } 693 | }); 694 | 695 | const sagemakerNotebook = new sagemaker.CfnNotebookInstance(this, 'SagemakerNotebookInstance', { 696 | instanceType: 'ml.t3.medium', 697 | roleArn: sagemakerInstanceRole.roleArn, 698 | notebookInstanceName: Aws.STACK_NAME, 699 | defaultCodeRepository: 'https://github.com/aws-samples/performance-testing-framework-for-apache-kafka/' 700 | }); 701 | 702 | new CfnOutput(this, 'SagemakerNotebook', { 703 | value: `https://console.aws.amazon.com/sagemaker/home#/notebook-instances/openNotebook/${sagemakerNotebook.attrNotebookInstanceName}?view=lab` 704 | }); 705 | 706 | new CfnOutput(this, 'LogGroupName', { 707 | value: performanceTestLogGroup.logGroupName 708 | }); 709 | } 710 | } 711 | -------------------------------------------------------------------------------- /cdk/lib/credit-depletion-sfn.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 
8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | 17 | import { Construct } from 'constructs'; 18 | import { Aws, StackProps, Duration } from 'aws-cdk-lib'; 19 | import { aws_iam as iam } from 'aws-cdk-lib'; 20 | import { aws_stepfunctions as sfn } from 'aws-cdk-lib'; 21 | import { aws_stepfunctions_tasks as tasks } from 'aws-cdk-lib'; 22 | import { aws_lambda as lambda } from 'aws-cdk-lib'; 23 | import * as batch from '@aws-cdk/aws-batch-alpha'; 24 | 25 | import { IntegrationPattern } from 'aws-cdk-lib/aws-stepfunctions'; 26 | 27 | export interface CreditDepletionParameters extends StackProps { 28 | clusterName: string, 29 | jobDefinition: batch.IJobDefinition, 30 | jobQueue: batch.IJobQueue, 31 | commandParameters: string[], 32 | payload: sfn.TaskInput 33 | } 34 | 35 | 36 | export class CreditDepletion extends Construct { 37 | 38 | stateMachine: sfn.StateMachine; 39 | 40 | constructor(scope: Construct, id: string, props: CreditDepletionParameters) { 41 | super(scope, id); 42 | 43 | 44 | const fail = new sfn.Fail(this, 'FailDepletion'); 45 | const succeed = new sfn.Succeed(this, 'SucceedDepletion'); 46 | 47 | // the number of jobs is determined before all jobs are manually failed, so if there were already failed or succeeded jobs, the depletion exited early and needs to be retried 48 | const checkAllJobsRunning = new sfn.Choice(this, 'CheckAllJobsRunning') 49 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), fail) 50 | .otherwise(succeed); 51 | 52 | 53 | const terminateCreditDepletionLambda = new lambda.Function(this, 'TerminateCreditDepletionLambda', { 54 | runtime: lambda.Runtime.NODEJS_14_X, 55 | code: lambda.Code.fromAsset('lambda'), 56 | timeout: Duration.seconds(5), 57 | handler: 'manage-infrastructure.terminateDepletionJob', 58 | }); 59 | 60 | terminateCreditDepletionLambda.addToRolePolicy( 61 | new iam.PolicyStatement({ 62 | actions: ['batch:TerminateJob'], 63 | resources: ['*'] 64 | }) 65 | ); 66 | 67 | const terminateCreditDepletion = new tasks.LambdaInvoke(this, 'TerminateCreditDepletion', { 68 | lambdaFunction: terminateCreditDepletionLambda, 69 | inputPath: '$.depletion_job', 70 | resultPath: 'DISCARD' 71 | }); 72 | 73 | terminateCreditDepletion.next(checkAllJobsRunning); 74 | 75 | 76 | const queryClusterThroughputLambda = new lambda.Function(this, 'QueryClusterThroughputLambda', { 77 | runtime: lambda.Runtime.NODEJS_14_X, 78 | code: lambda.Code.fromAsset('lambda'), 79 | timeout: Duration.seconds(5), 80 | handler: 'manage-infrastructure.queryMskClusterThroughput', 81 | environment: { 82 | MSK_CLUSTER_NAME: props.clusterName 83 | } 84 | }); 85 | 86 | queryClusterThroughputLambda.addToRolePolicy( 87 | new iam.PolicyStatement({ 88 | actions: ['cloudwatch:GetMetricData', 'batch:DescribeJobs'], 89 | resources: ['*'] 90 | }) 91 | ); 92 | 93 | const queryClusterThroughputLowerThan = new tasks.LambdaInvoke(this, 'QueryClusterThroughputLowerThan', { 94 | lambdaFunction: queryClusterThroughputLambda, 95 | inputPath: '$', 96 | resultPath: '$.cluster_throughput' 97 | }); 
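// Lower-threshold polling loop (editor's note, added for clarity): once the upper throughput
// threshold has been exceeded, the workflow waits one minute at a time and re-queries the cluster
// throughput until it drops below depletion_configuration.lower_threshold.mb_per_sec, or until a
// depletion job has unexpectedly failed or succeeded; in either case the depletion job is then
// terminated via TerminateCreditDepletion.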
98 | 99 | const waitThroughputLowerThan = new sfn.Wait(this, 'WaitThroughputLowerThan', { 100 | time: sfn.WaitTime.duration(Duration.minutes(1)) 101 | }); 102 | 103 | waitThroughputLowerThan.next(queryClusterThroughputLowerThan); 104 | 105 | 106 | const checkThroughputLowerThan = new sfn.Choice(this, 'CheckThroughputLowerThanThreshold') 107 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), terminateCreditDepletion) 108 | .when( 109 | sfn.Condition.and( 110 | sfn.Condition.numberLessThanJsonPath('$.cluster_throughput.Payload.clusterMbInPerSec', '$.test_specification.depletion_configuration.lower_threshold.mb_per_sec'), 111 | // sfn.Condition.numberLessThanJsonPath('$.cluster_throughput.Payload.brokerMbInPerSecStddev', '$.test_specification.depletion_configuration.lower_threshold.max_broker_stddev') 112 | ), 113 | terminateCreditDepletion 114 | ) 115 | .otherwise(waitThroughputLowerThan); 116 | 117 | queryClusterThroughputLowerThan.next(checkThroughputLowerThan); 118 | 119 | 120 | const queryClusterThroughputExceeded = new tasks.LambdaInvoke(this, 'QueryClusterThroughputExceeded', { 121 | lambdaFunction: queryClusterThroughputLambda, 122 | inputPath: '$', 123 | resultPath: '$.cluster_throughput' 124 | }); 125 | 126 | const waitThroughputExceeded = new sfn.Wait(this, 'WaitThroughputExceeded', { 127 | time: sfn.WaitTime.duration(Duration.minutes(1)) 128 | }); 129 | 130 | waitThroughputExceeded.next(queryClusterThroughputExceeded); 131 | 132 | 133 | const checkLowerThanPresent = new sfn.Choice(this, 'CheckLowerThanPresent') 134 | .when(sfn.Condition.isPresent('$.test_specification.depletion_configuration.lower_threshold.mb_per_sec'), waitThroughputLowerThan) 135 | .otherwise(terminateCreditDepletion); 136 | 137 | 138 | const checkThroughputExceeded = new sfn.Choice(this, 'CheckThroughputExceeded') 139 | .when(sfn.Condition.numberGreaterThan('$.cluster_throughput.Payload.succeededPlusFailedJobs', 0), terminateCreditDepletion) 140 | .when(sfn.Condition.numberGreaterThanJsonPath('$.cluster_throughput.Payload.clusterMbInPerSec', '$.test_specification.depletion_configuration.upper_threshold.mb_per_sec'), checkLowerThanPresent) 141 | .otherwise(waitThroughputExceeded); 142 | 143 | queryClusterThroughputExceeded.next(checkThroughputExceeded); 144 | 145 | 146 | 147 | const submitCreditDepletion = new tasks.BatchSubmitJob(this, 'SubmitCreditDepletion', { 148 | jobName: 'RunCreditDepletion', 149 | jobDefinitionArn: props.jobDefinition.jobDefinitionArn, 150 | jobQueueArn: props.jobQueue.jobQueueArn, 151 | arraySize: sfn.JsonPath.numberAt('$.current_test.parameters.num_jobs'), 152 | containerOverrides: { 153 | command: [...props.commandParameters, '--command', 'deplete-credits' ] 154 | }, 155 | payload: props.payload, 156 | integrationPattern: IntegrationPattern.REQUEST_RESPONSE, 157 | inputPath: '$', 158 | resultPath: '$.depletion_job', 159 | }); 160 | 161 | submitCreditDepletion.next(waitThroughputExceeded); 162 | 163 | 164 | const updateDepletionParameterLambda = new lambda.Function(this, 'UpdateDepletionParameterLambda', { 165 | runtime: lambda.Runtime.PYTHON_3_8, 166 | code: lambda.Code.fromAsset('lambda'), 167 | timeout: Duration.seconds(5), 168 | handler: 'test-parameters.update_parameters_for_depletion', 169 | }); 170 | 171 | const updateDepletionParameter = new tasks.LambdaInvoke(this, 'UpdateDepletionParameter', { 172 | lambdaFunction: updateDepletionParameterLambda, 173 | outputPath: '$.Payload', 174 | }); 175 | 176 | 
updateDepletionParameter.next(submitCreditDepletion); 177 | 178 | const checkDepletion = new sfn.Choice(this, 'CheckDepletionRequired') 179 | .when(sfn.Condition.isNotPresent('$.test_specification.depletion_configuration'), succeed) 180 | .otherwise(updateDepletionParameter); 181 | 182 | this.stateMachine = new sfn.StateMachine(this, 'DepleteCreditsStateMachine', { 183 | definition: checkDepletion, 184 | stateMachineName: `${Aws.STACK_NAME}-credit-depletion`, 185 | }); 186 | } 187 | } 188 | -------------------------------------------------------------------------------- /cdk/lib/vpc.ts: -------------------------------------------------------------------------------- 1 | // Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // 3 | // Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | // this software and associated documentation files (the "Software"), to deal in 5 | // the Software without restriction, including without limitation the rights to 6 | // use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | // the Software, and to permit persons to whom the Software is furnished to do so. 8 | // 9 | // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | // FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | // COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | // IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | // CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import { Construct } from 'constructs'; 17 | import { Stack, StackProps } from 'aws-cdk-lib'; 18 | import { aws_ec2 as ec2 } from 'aws-cdk-lib'; 19 | 20 | export class VpcStack extends Stack { 21 | vpc: ec2.Vpc; 22 | 23 | constructor(scope: Construct, id: string, props?: StackProps) { 24 | super(scope, id, props); 25 | 26 | this.vpc = new ec2.Vpc(this, 'Vpc'); 27 | } 28 | } -------------------------------------------------------------------------------- /cdk/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "tmp", 3 | "version": "0.1.0", 4 | "bin": { 5 | "tmp": "bin/tmp.js" 6 | }, 7 | "scripts": { 8 | "build": "tsc", 9 | "watch": "tsc -w", 10 | "test": "jest", 11 | "cdk": "cdk" 12 | }, 13 | "devDependencies": { 14 | "@types/jest": "^26.0.10", 15 | "@types/node": "10.17.27", 16 | "aws-cdk": "2.15.0", 17 | "jest": "^26.4.2", 18 | "ts-jest": "^26.2.0", 19 | "ts-node": "^9.0.0", 20 | "typescript": "~3.9.7" 21 | }, 22 | "dependencies": { 23 | "@aws-cdk/aws-batch-alpha": "2.15.0-alpha.0", 24 | "@aws-cdk/aws-msk-alpha": "2.15.0-alpha.0", 25 | "aws-cdk-lib": "2.15.0", 26 | "constructs": "^10.0.0", 27 | "source-map-support": "^0.5.16" 28 | } 29 | } 30 | -------------------------------------------------------------------------------- /cdk/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "ES2018", 4 | "module": "commonjs", 5 | "lib": ["es2018"], 6 | "declaration": true, 7 | "strict": true, 8 | "noImplicitAny": true, 9 | "strictNullChecks": true, 10 | "noImplicitThis": true, 11 | "alwaysStrict": true, 12 | "noUnusedLocals": false, 13 | "noUnusedParameters": false, 14 | "noImplicitReturns": true, 15 | "noFallthroughCasesInSwitch": false, 16 | "inlineSourceMap": true, 
17 | "inlineSources": true, 18 | "experimentalDecorators": true, 19 | "strictPropertyInitialization": false, 20 | "typeRoots": ["./node_modules/@types"] 21 | }, 22 | "exclude": ["cdk.out"] 23 | } 24 | -------------------------------------------------------------------------------- /docs/create-env.md: -------------------------------------------------------------------------------- 1 | ## Setup CDK Environment 2 | 3 | To deploy the CDK template, you need to have node, cdk, and docker configured locally. Alternatively, you can use the `amzn2-ami-ecs-hvm-2.0.20210106-x86_64-ebs` AMI with an Amazon EC2 instance and the following steps to configure an environment that satisfies these constraints. 4 | 5 | ```bash 6 | # Install Node Version Manager 7 | curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash 8 | . ~/.nvm/nvm.sh 9 | 10 | # Install Node and CDK 11 | nvm install node 12 | npm -g install typescript 13 | npm install -g aws-cdk 14 | 15 | # Bootstrap CDK envitonment, if you have never executed CDK in this AWS account and region. 16 | cdk bootstrap 17 | 18 | sudo yum install -y git awscli 19 | ``` -------------------------------------------------------------------------------- /docs/framework-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/performance-testing-framework-for-apache-kafka/e38bf5d8392b024351dc3b049646af1019fe2864/docs/framework-overview.png -------------------------------------------------------------------------------- /docs/test-result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/performance-testing-framework-for-apache-kafka/e38bf5d8392b024351dc3b049646af1019fe2864/docs/test-result.png -------------------------------------------------------------------------------- /notebooks/00000000--empty-template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import boto3\n", 10 | "import datetime\n", 11 | "import json\n", 12 | "from dateutil.tz import tzlocal\n", 13 | "\n", 14 | "import get_test_details\n", 15 | "import query_experiment_details\n", 16 | "import aggregate_statistics\n", 17 | "import plot\n", 18 | "\n", 19 | "stepfunctions = boto3.client('stepfunctions')\n", 20 | "cloudwatch_logs = boto3.client('logs')\n", 21 | "cloudformation = boto3.client('cloudformation')\n", 22 | "cloudwatch = boto3.client('cloudwatch')" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Sample test specification. 
Submit is as the input of the StepFunction workflow that ends with `-main` to start a test.\n", 30 | "```\n", 31 | "{\n", 32 | " \"test_specification\": {\n", 33 | " \"parameters\": {\n", 34 | " \"cluster_throughput_mb_per_sec\": [ 8, 16, 24, 32, 40, 44, 48, 52, 56 ],\n", 35 | " \"num_producers\": [ 6 ],\n", 36 | " \"consumer_groups\": [ { \"num_groups\": 2, \"size\": 6 }\n", 37 | " ],\n", 38 | " \"client_props\": [\n", 39 | " {\n", 40 | " \"producer\": \"acks=all linger.ms=5 batch.size=65536 buffer.memory=2147483648 security.protocol=PLAINTEXT\",\n", 41 | " \"consumer\": \"security.protocol=PLAINTEXT\"\n", 42 | " }\n", 43 | " ],\n", 44 | " \"num_partitions\": [ 36 ],\n", 45 | " \"record_size_byte\": [ 1024 ],\n", 46 | " \"replication_factor\": [ 3 ],\n", 47 | " \"duration_sec\": [ 3600 ]\n", 48 | " },\n", 49 | " \"skip_remaining_throughput\": {\n", 50 | " \"less-than\": [\n", 51 | " \"sent_div_requested_mb_per_sec\",\n", 52 | " 0.995\n", 53 | " ]\n", 54 | " },\n", 55 | " \"depletion_configuration\": {\n", 56 | " \"upper_threshold\": {\n", 57 | " \"mb_per_sec\": 200\n", 58 | " },\n", 59 | " \"approximate_timeout_hours\": 0.5\n", 60 | " }\n", 61 | " }\n", 62 | "}\n", 63 | "```" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "tags": [] 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "test_params = []" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "tags": [] 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "test_params.extend([\n", 86 | " {'execution_arn': '' }\n", 87 | "])" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "test_details = get_test_details.get_test_details(test_params, stepfunctions, cloudformation)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "(producer_stats, consumer_stats) = query_experiment_details.query_cw_logs(test_details, cloudwatch_logs)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "partitions = {\n", 115 | " 'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'throughput_series_id', 'brokers_type_numeric', ],\n", 116 | " 'title_keys': [ 'kafka_version', 'broker_storage', 'provisioned_throughput', 'in_cluster_encryption', 'producer.security.protocol', ],\n", 117 | " 'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],\n", 118 | " 'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],\n", 119 | " 'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],\n", 120 | "}\n", 121 | "\n", 122 | "filter_fn = lambda x: True\n", 123 | "filter_agg_fn = lambda x: True\n", 124 | "\n", 125 | "filtered_producer_stats = list(filter(filter_fn, producer_stats))\n", 126 | "filtered_consumer_stats = filter(filter_fn, consumer_stats)\n", 127 | "\n", 128 | "(producer_aggregated_stats, consumer_aggregated_stats, combined_stats) = aggregate_statistics.aggregate_cw_logs(filtered_producer_stats, filtered_consumer_stats, partitions)\n", 129 | "filtered_producer_aggregated_stats = list(filter(filter_agg_fn, producer_aggregated_stats))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | 
"plot.plot_measurements(filtered_producer_aggregated_stats, ['latency_ms_p50_mean', 'latency_ms_p99_mean', ], 'producer put latency (ms)', **partitions, )\n", 139 | "plot.plot_measurements(filtered_producer_aggregated_stats, ['latency_ms_p50_stdev', 'latency_ms_p99_stdev', ], 'producer put latency stdev (ms)', **partitions, )" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "plot.plot_measurements(producer_aggregated_stats, ['sent_div_requested_mb_per_sec'], 'sent / requested throughput (mb/sec)', **partitions, ymin=0.990, ymax=1.01)\n", 149 | "plot.plot_measurements(producer_aggregated_stats, ['actual_duration_div_requested_duration_sec_max'], 'actual test duration/requested test duration (ratio)', **partitions)\n", 150 | "plot.plot_measurements(producer_aggregated_stats, ['num_tests'], 'number of tests (count)', **partitions)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3 (ipykernel)", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.10.2" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 4 182 | } 183 | -------------------------------------------------------------------------------- /notebooks/00000000--install-dependencies.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import sys\n", 10 | "!{sys.executable} -m pip install --upgrade pip\n", 11 | "!{sys.executable} -m pip install boto3\n", 12 | "!{sys.executable} -m pip install numpy\n", 13 | "!{sys.executable} -m pip install more-itertools\n", 14 | "!{sys.executable} -m pip install matplotlib\n", 15 | "!{sys.executable} -m pip install jupyterlab-git\n", 16 | "!{sys.executable} -m pip install nbdime\n", 17 | "!{sys.executable} -m pip install flatten-dict" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [] 26 | } 27 | ], 28 | "metadata": { 29 | "kernelspec": { 30 | "display_name": "conda_amazonei_mxnet_p36", 31 | "language": "python", 32 | "name": "conda_amazonei_mxnet_p36" 33 | }, 34 | "language_info": { 35 | "codemirror_mode": { 36 | "name": "ipython", 37 | "version": 3 38 | }, 39 | "file_extension": ".py", 40 | "mimetype": "text/x-python", 41 | "name": "python", 42 | "nbconvert_exporter": "python", 43 | "pygments_lexer": "ipython3", 44 | "version": "3.6.13" 45 | } 46 | }, 47 | "nbformat": 4, 48 | "nbformat_minor": 4 49 | } 50 | -------------------------------------------------------------------------------- /notebooks/aggregate_statistics.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import re 17 | import pprint 18 | import itertools 19 | import numpy as np 20 | from more_itertools import ilen 21 | from statistics import mean, median, variance, stdev 22 | from dateutil.tz import tzlocal 23 | 24 | 25 | producer_aggregation_fns = [ 26 | # ('sent_records', sum), 27 | # ('sent_records_sec', min), 28 | # ('sent_records_sec', sum), 29 | ('sent_mb_sec', min), 30 | # ('sent_mb_sec', sum), 31 | ('latency_ms_avg', mean), 32 | ('latency_ms_avg', stdev), 33 | ('latency_ms_p50', mean), 34 | ('latency_ms_p50', stdev), 35 | ('latency_ms_p95', mean), 36 | ('latency_ms_p95', stdev), 37 | ('latency_ms_p99', mean), 38 | ('latency_ms_p99', stdev), 39 | ('latency_ms_p999', mean), 40 | ('latency_ms_p999', stdev), 41 | ('latency_ms_max', max), 42 | ('latency_ms_max', stdev), 43 | ('actual_duration_div_requested_duration_sec', max), 44 | ('start_ts', min), 45 | ('end_ts', max) 46 | ] 47 | 48 | consumer_aggregation_fns = [ 49 | ('consumed_mb', sum), 50 | ('consumed_mb_sec', sum), 51 | ('consumed_records', sum), 52 | ('records_lag_max', max), 53 | ('actual_duration_div_requested_duration_sec', max), 54 | ('start_ts', min), 55 | ('end_ts', max) 56 | ] 57 | 58 | def broker_type_to_num(broker): 59 | if "m5large" in broker: 60 | return 0 61 | elif "m5xlarge" in broker: 62 | return 1 63 | else: 64 | return int(re.search('m5(\d+)xlarge', broker).group(1)) 65 | 66 | # aggregate intividual producer/consumer results into one for identical test parameters 67 | 68 | def aggregate_cw_logs(producer_stats, consumer_stats, partitons): 69 | producer_aggregated_stats = [] 70 | consumer_aggregated_stats = [] 71 | 72 | partiton_by_fn = lambda x: tuple([v for k,v in x['test_params'].items() if k not in partitons['ignore_keys']]) 73 | 74 | for params,group in itertools.groupby(sorted(producer_stats, key=partiton_by_fn), key=partiton_by_fn): 75 | # convert iterator to list, as we need to iterate over it several times 76 | lst = list(group) 77 | 78 | # aggregate individual test results into a single result 79 | agg_test_results = { 80 | "{}_{}".format(attr, aggregation_fn.__name__): aggregation_fn(map(lambda x: x['test_results'][attr], lst)) 81 | for (attr,aggregation_fn) in producer_aggregation_fns 82 | } 83 | 84 | # all test params within a group are identical, so just pick the first one 85 | test_params = lst[0]['test_params'] 86 | 87 | producer_aggregated_stats.append({ 88 | 'test_params': { 89 | **test_params, 90 | 'brokers' : "{} {}".format(test_params['num_brokers'], test_params['broker_type']), 91 | 'brokers_type_numeric' : broker_type_to_num(test_params['broker_type']) 92 | }, 
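            # (editor's note, added for clarity) derived metrics below: 'num_tests' normalises the
            # number of raw log entries by the number of producers per test, and
            # 'sent_div_requested_mb_per_sec' compares the slowest producer's actual throughput
            # against its share of the requested cluster throughput; a value close to 1.0 means the
            # requested throughput was actually reached.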
93 | 'test_results': { 94 | **agg_test_results, 95 | 'num_tests': len(lst) / test_params['num_producers'], 96 | 'sent_div_requested_mb_per_sec' : agg_test_results['sent_mb_sec_min'] / (test_params['cluster_throughput_mb_per_sec']/test_params['num_producers']) 97 | } 98 | }) 99 | 100 | 101 | for params,group in itertools.groupby(sorted(consumer_stats, key=partiton_by_fn), key=partiton_by_fn): 102 | # convert iterator to list, as we need to iterate over it several times 103 | lst = list(group) 104 | 105 | # aggregate individual test results into a single result 106 | agg_test_results = { 107 | "{}_{}".format(attr, aggregation_fn.__name__): aggregation_fn(map(lambda x: x['test_results'][attr], lst)) 108 | for (attr,aggregation_fn) in consumer_aggregation_fns 109 | } 110 | 111 | # all test params within a group are identical, so just pick the first one 112 | test_params = lst[0]['test_params'] 113 | 114 | consumer_aggregated_stats.append({ 115 | 'test_params': { 116 | **test_params, 117 | 'brokers' : "{} {}".format(test_params['num_brokers'], test_params['broker_type']), 118 | }, 119 | 'test_results': { 120 | **agg_test_results, 121 | 'num_tests': len(lst) / (test_params['consumer_groups.num_groups'] * test_params['consumer_groups.size']), 122 | } 123 | }) 124 | 125 | 126 | combined_stats = [] 127 | 128 | return (producer_aggregated_stats, consumer_aggregated_stats, combined_stats) 129 | -------------------------------------------------------------------------------- /notebooks/get_test_details.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
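# (editor's note, added for clarity) get_test_details() resolves each Step Functions execution ARN
# to the CloudFormation stack that ran it, reads the test log group name from the stack outputs,
# and parses the cluster properties (broker count and type, storage, in-cluster encryption, Kafka
# version, optional provisioned throughput) from the stack name. For example, a hypothetical stack
# name like "msk-performance--3-m5large-1000-f-282" would be interpreted as 3 brokers of type
# m5large with 1000 GiB storage each, in-cluster encryption disabled, and Kafka version "282"
# (i.e. 2.8.2 with punctuation stripped).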
15 | 16 | import re 17 | import json 18 | import datetime 19 | import collections 20 | 21 | def get_test_details(test_params, stepfunctions, cloudformation): 22 | sfn_status = [] 23 | test_details = [] 24 | 25 | for test_param in test_params: 26 | if 'execution_arn' not in test_param: 27 | # if the execution_arn is not included, we cannot query them and just assume that the details are complete and correct 28 | test_details.append(test_param) 29 | continue 30 | 31 | execution_arn = test_param['execution_arn'].strip() 32 | execution_details = stepfunctions.describe_execution(executionArn=execution_arn) 33 | 34 | if 'stopDate' in execution_details: 35 | stop_date = execution_details['stopDate'] 36 | else: 37 | # if the workflow is still running, just query everything up until now 38 | stop_date = datetime.datetime.now() 39 | 40 | # the name of the state machine includes the CFN stack name 41 | stack_name = test_param['execution_arn'].split(':')[-2].split('-main')[0] 42 | 43 | # obtain the name of the log group for containing all test results from cloud watch 44 | stack_details = cloudformation.describe_stacks(StackName=stack_name) 45 | log_group_name = next(filter(lambda x: x['OutputKey']=='LogGroupName', stack_details['Stacks'][0]['Outputs']))['OutputValue'] 46 | 47 | # extract the cluster parameters from the stack name 48 | cluster_properties_p = re.compile(r'(\S+)--(\d+)-([a-zA-Z0-9]+)-(\d+)-([tf])-(\d+)(?:-(\d+))?.*') 49 | cluster_properties_m = cluster_properties_p.search(stack_name) 50 | 51 | cluster_properties = { 52 | 'cluster_name': stack_name if cluster_properties_m else "n/a", 53 | 'cluster_id': cluster_properties_m.group(1) if cluster_properties_m else "n/a", 54 | 'num_brokers': int(cluster_properties_m.group(2)) if cluster_properties_m else "n/a", 55 | 'broker_type': cluster_properties_m.group(3) if cluster_properties_m else "n/a", 56 | 'broker_storage': int(cluster_properties_m.group(4)) if cluster_properties_m else "n/a", 57 | 'in_cluster_encryption': cluster_properties_m.group(5) if cluster_properties_m else "n/a", 58 | 'kafka_version': cluster_properties_m.group(6) if cluster_properties_m else "n/a", 59 | 'provisioned_throughput': int(cluster_properties_m.group(7)) if cluster_properties_m and cluster_properties_m.group(7) is not None else 0, 60 | } 61 | 62 | status = { 63 | 'log_group_name': log_group_name, 64 | 'start_date': execution_details['startDate'], 65 | 'stop_date': stop_date, 66 | 'test_parameters': json.loads(execution_details['input'])['test_specification']['parameters'], 67 | 'cluster_properties': cluster_properties 68 | } 69 | 70 | # if the test is still running, add the execution arn, so that the details can be queried again 71 | if execution_details['status'] == 'RUNNING': 72 | status['execution_arn'] = test_param['execution_arn'] 73 | 74 | test_details.append(status) 75 | sfn_status.append(execution_details['status']) 76 | 77 | if len(sfn_status) > 0: 78 | print("sfn workflow status: ", dict(collections.Counter(sfn_status))) 79 | print() 80 | for test_detail in test_details: 81 | print("test_params.extend([", test_detail, "])") 82 | print() 83 | 84 | return test_details -------------------------------------------------------------------------------- /notebooks/plot.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | # this software and associated documentation files (the "Software"), to deal in 5 | # the Software without restriction, including without limitation the rights to 6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | # the Software, and to permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import re 17 | import pprint 18 | import itertools 19 | import numpy as np 20 | import matplotlib.pyplot as plt 21 | import matplotlib.colors as mcolors 22 | import warnings 23 | 24 | key_by_fn = lambda key_by: lambda stat: tuple(stat['test_params'][key] for key in key_by) 25 | map_to_throughput_fn = lambda x: x['test_params']['cluster_throughput_mb_per_sec'] 26 | 27 | 28 | colors = list(mcolors.TABLEAU_COLORS.values()) 29 | markers = ['o', 'x', '.', '.'] 30 | # linestyles = ['-', '--', '-.', ':'] 31 | 32 | 33 | def plot_measurements(data, metric_line_style_names, ylabel, 34 | xlogscale=False, ylogscale=False, ymin=0, ymax=None, xmin=0, xmax=None, exclude=[], indicate_max=False, 35 | ignore_keys = [ 'topic_id', 'cluster_id', 'test_id', 'cluster_name', 'num_producers', ], 36 | title_keys = [ 'brokers', 'producers', 'at_rest', 'in_transit', 'duration_sec' ], 37 | row_keys = [ 'replication_factor', 'num_partitions' ], 38 | column_keys = [ 'num_consumer_groups', 'consumer_props', 'num_producers', 'producer_props' ], 39 | metric_color_keys = ['message_size_byte'], 40 | legend_replace_str = r"(buffer.memory=\d*|_mean)", 41 | linestyles = ['-', '--', ':', '-.']): 42 | 43 | rows_key_by = key_by_fn(row_keys) 44 | columns_key_by = key_by_fn(column_keys) 45 | metric_color_key_by = key_by_fn(metric_color_keys) 46 | 47 | num_rows = len(set(map(rows_key_by, data))) 48 | num_columns = len(set(map(columns_key_by, data))) 49 | 50 | fig, ax = plt.subplots(num_rows, num_columns, figsize=(num_columns*10, num_rows*7), sharex='all', sharey='all', squeeze=False) 51 | fig.subplots_adjust(hspace=.4) 52 | 53 | row_index = 0 54 | for row_key,row_group in itertools.groupby(sorted(data, key=rows_key_by), key=rows_key_by): 55 | column_index = 0 56 | for column_key,column_group in itertools.groupby(sorted(row_group, key=columns_key_by), key=columns_key_by): 57 | max_throughput_values = set() 58 | metric_color_index = 0 59 | for metric_color_key,metric_color_group in itertools.groupby(sorted(column_group, key=metric_color_key_by), key=metric_color_key_by): 60 | metric_line_style_index = 0 61 | 62 | subplot = ax[row_index][column_index] 63 | 64 | results = sorted(metric_color_group, key=map_to_throughput_fn) 65 | filtered_results = [result for result in results if result['test_params'] not in exclude] 66 | 67 | test_params = next(map(lambda x: x['test_params'], filtered_results), None) 68 | 69 | if not test_params: 70 | continue 71 | 72 | for metric_name in metric_line_style_names: 73 | data_x = np.array(list(map(map_to_throughput_fn, filtered_results))) 74 | data_y = 
75 | 
76 |                     if len(np.unique(data_x)) != len(data_x):
77 |                         warnings.warn("Datapoints with identical throughput detected: check the partitions parameter")
78 | 
79 |                     label = str(dict((k,v) for k,v in zip(metric_color_keys, metric_color_key) if k not in ignore_keys))
80 |                     simplified_label = re.sub(legend_replace_str, "", "{} {}".format(label, metric_name))
81 | 
82 | 
83 |                     subplot.plot(data_x, data_y, label=simplified_label, marker=markers[metric_line_style_index], linestyle=linestyles[metric_line_style_index], color=colors[metric_color_index % len(colors)])
84 | 
85 |                     max_x = max(data_x)
86 |                     metric_line_style_index += 1
87 | 
88 |                 subplot.grid(True)
89 |                 subplot.tick_params(labelleft=True, labelbottom=True)
90 |                 # subplot.ticklabel_format(useOffset=False)
91 |                 subplot.set_xlabel("requested aggregated producer throughput (mb/sec)")
92 |                 subplot.set_ylabel(ylabel)
93 |                 subplot.legend()
94 | 
95 | 
96 |                 title = "{}\n{}\n{}\n".format(
97 |                     str(dict((k, test_params[k]) for k in title_keys if k in test_params and k not in ignore_keys)),
98 |                     str(dict((k,v) for k,v in zip(row_keys, row_key) if k not in ignore_keys)),
99 |                     str(dict((k,v) for k,v in zip(column_keys, column_key) if k not in ignore_keys))
100 |                 )
101 |                 subplot.set_title(re.sub(legend_replace_str, "_", title))
102 | 
103 |                 if xmax or xmin:
104 |                     subplot.set_xlim(left=xmin, right=xmax)
105 | 
106 |                 if ymax or ymin:
107 |                     subplot.set_ylim(bottom=ymin, top=ymax)
108 | 
109 |                 if xlogscale:
110 |                     subplot.set_xscale('log')
111 | 
112 |                 if ylogscale:
113 |                     subplot.set_yscale('log')
114 | 
115 |                 if indicate_max:
116 |                     max_throughput_values.add(max_x)
117 |                     subplot.axvline(x=max_x, linestyle='-.', color=colors[metric_color_index% len(colors)], alpha=.5)
118 | 
119 |                 metric_color_index += 1
120 | 
121 |             if num_rows<=1 or num_columns<=1:
122 |                 subplot.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1))
123 | 
124 |             column_index += 1
125 |         row_index += 1
126 | 
127 |     if indicate_max and len(ax)>0:
128 |         subplot = ax[0][0]
129 |         ticks = list(subplot.get_xticks())
130 | 
131 |         for max_throughput_value in max_throughput_values:
132 |             closest_tick = min(ticks, key=lambda x:abs(x-max_throughput_value))
133 |             ticks = [max_throughput_value if x==closest_tick else x for x in ticks]
134 | 
135 |         subplot.set_xticks(ticks)
136 | 
137 |         for plots in ax:
138 |             for plot in plots:
139 |                 plot.axvline(x=indicate_max, linestyle=':', color=mcolors.CSS4_COLORS['dimgray'])
--------------------------------------------------------------------------------
/notebooks/query_experiment_details.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | # this software and associated documentation files (the "Software"), to deal in
5 | # the Software without restriction, including without limitation the rights to
6 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | # the Software, and to permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | # FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
16 | import re
17 | import sys
18 | import time
19 | import itertools
20 | import collections
21 | import dateutil.parser
22 | from flatten_dict import flatten
23 | from collections import defaultdict
24 | 
25 | 
26 | cwlogs_query_limit = 4
27 | 
28 | def obtain_raw_logs(test_details, cloudwatch_logs):
29 |     # query cw logs to get the raw logs during the period the step functions workflow was running
30 |     # cw logs limits the number of concurrent queries, so the number of running queries always needs to stay at or below cwlogs_query_limit
31 | 
32 |     # filter queries that are currently running (query metadata is added to the test details as cwlogs_query once a query is submitted)
33 |     incomplete_query_results_iter = lambda: filter(lambda x: 'cwlogs_query' in x and ('cwlogs_query_result' not in x or x['cwlogs_query_result']['status'] == 'Running'), test_details)
34 | 
35 |     # print a summary of running, completed, and failed queries
36 |     output_statistics = lambda: print("query status:", dict(collections.Counter(map(lambda x: x['cwlogs_query_result']['status'] if 'cwlogs_query_result' in x else 'Pending', test_details))))
37 | 
38 |     # populate the queue with all queries that need to be executed
39 |     queue = test_details.copy()
40 |     cwlogs_query_capacity = cwlogs_query_limit
41 | 
42 |     output_statistics()
43 | 
44 |     # while there are queries to run and we can still run queries
45 |     while len(queue)>0 or cwlogs_query_capacity
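
For readers of the notebook code, the following is a minimal, standalone sketch of the pattern that `obtain_raw_logs` builds on: submitting CloudWatch Logs Insights queries through boto3 while keeping at most a fixed number of them in flight and polling until all results are available. This is not the framework's implementation; the function name, parameter names, and polling interval are hypothetical, and only the `start_query` and `get_query_results` calls are taken from the documented boto3 CloudWatch Logs API.

```python
# Hypothetical sketch: run CloudWatch Logs Insights queries with at most
# `max_concurrent` of them in flight, collecting results as queries finish.
import time
import boto3


def run_queries_with_limit(log_group_name, query_strings, start_time, end_time, max_concurrent=4):
    logs = boto3.client('logs')
    pending = list(query_strings)   # query strings that still need to be submitted
    in_flight = {}                  # query_id -> query string
    results = {}                    # query string -> result rows

    while pending or in_flight:
        # submit further queries while there is spare capacity
        while pending and len(in_flight) < max_concurrent:
            query_string = pending.pop(0)
            response = logs.start_query(
                logGroupName=log_group_name,
                startTime=int(start_time.timestamp()),
                endTime=int(end_time.timestamp()),
                queryString=query_string,
            )
            in_flight[response['queryId']] = query_string

        # poll in-flight queries and collect the ones that have finished
        for query_id, query_string in list(in_flight.items()):
            response = logs.get_query_results(queryId=query_id)
            if response['status'] not in ('Scheduled', 'Running'):
                results[query_string] = response['results']
                del in_flight[query_id]

        if in_flight:
            time.sleep(5)

    return results
```

In this sketch, `start_time` and `end_time` are expected to be datetime objects, mirroring the `start_date`/`stop_date` values that `get_test_details` collects from the Step Functions execution.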