├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── assets └── architecture.png ├── code.zip ├── deploy └── template.yml ├── pipeline.yml ├── src ├── requirements.txt └── sample.py └── tests ├── conftest.py ├── requirements-test.txt └── test_sample.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | 3 | .vscode/ -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 4 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 5 | opensource-codeofconduct@amazon.com with any additional questions or comments. 6 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | ## Reporting Bugs/Feature Requests 10 | 11 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 12 | 13 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 14 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 15 | 16 | * A reproducible test case or series of steps 17 | * The version of our code being used 18 | * Any modifications you've made relevant to the bug 19 | * Anything unusual about your environment or deployment 20 | 21 | ## Contributing via Pull Requests 22 | 23 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 24 | 25 | 1. You are working against the latest source on the *main* branch. 26 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 27 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 28 | 29 | To send us a pull request, please: 30 | 31 | 1. Fork the repository. 32 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 33 | 3. Ensure local tests pass. 34 | 4. Commit to your fork using clear commit messages. 35 | 5. Send us a pull request, answering any default questions in the pull request interface. 36 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 37 | 38 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 39 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 40 | 41 | ## Finding contributions to work on 42 | 43 | Looking at the existing issues is a great way to find something to contribute to. 
As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 44 | 45 | ## Code of Conduct 46 | 47 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 48 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 49 | opensource-codeofconduct@amazon.com with any additional questions or comments. 50 | 51 | ## Security issue notifications 52 | 53 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 54 | 55 | ## Licensing 56 | 57 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 58 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # aws-glue-jobs-unit-testing 2 | 3 | This demo illustrates the execution of PyTest unit test cases for AWS Glue jobs in AWS CodePipeline using AWS CodeBuild projects. The solution uses a container image from the Amazon ECR Public Gallery as the runtime environment for executing the test cases in an AWS CodeBuild project. The [aws-glue-libs](https://gallery.ecr.aws/amazon/aws-glue-libs) public ECR repository contains images for all versions of AWS Glue. 4 | 5 | AWS services used for the CI/CD portion of the solution: 6 | 7 | - [AWS Glue](https://aws.amazon.com/glue/) 8 | - [AWS CodeBuild](https://aws.amazon.com/codebuild/) 9 | - [AWS CloudFormation](https://aws.amazon.com/cloudformation/) 10 | - [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) 11 | 12 | The AWS CloudFormation template provided in this sample does not work in the ap-southeast-1, af-south-1, ap-northeast-3, ap-southeast-3, me-south-1 and us-gov-east-1 AWS Regions due to service limitations. 
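The same container image can also be used to run the unit tests locally before changes are pushed through the pipeline. The snippet below is a minimal sketch, not part of the pipeline: it assumes Docker is available, reuses the image tag configured for the CodeBuild project in `pipeline.yml`, and mirrors the install/test commands from that project's buildspec; the mount path and `-c` entrypoint flag follow the aws-glue-libs local-development examples and may need adjusting for your setup.

```bash
# Illustrative local test run inside the Glue 4.0 container image.
# Mount the repository as the container workspace and run the same
# commands the CodeBuild project runs.
docker run --rm \
  -v "$(pwd):/home/glue_user/workspace/" \
  -e DISABLE_SSL=true \
  public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 \
  -c "pip3 install -r tests/requirements-test.txt -r src/requirements.txt && python3 -m pytest tests/"
```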
13 | 14 | ## Architecture 15 | 16 | ![Architecture Diagram](assets/architecture.png) 17 | 18 | ## Stack deployment 19 | 20 | The CloudFormation stack that sets up the pipeline can be deployed from the CloudFormation console or with the AWS CLI, as shown below. 21 | 22 | First, zip the sample AWS Glue job and its PyTest test case code, and upload the archive to an S3 bucket: 23 | 24 | ```bash 25 | zip -r code.zip src/sample.py src/requirements.txt tests/* deploy/*.yml 26 | ``` 27 | 28 | ```bash 29 | aws s3 cp code.zip s3://aws-glue-artifacts-us-east-1/ 30 | ``` 31 | 32 | Then trigger the CloudFormation stack creation, pointing it at the ZIP file uploaded to that S3 bucket: 33 | 34 | ```bash 35 | aws cloudformation create-stack --stack-name glue-unit-testing-pipeline --template-body file://pipeline.yml --parameters ParameterKey=ApplicationStackName,ParameterValue=glue-codepipeline-app ParameterKey=BucketName,ParameterValue=aws-glue-artifacts-us-east-1 ParameterKey=CodeZipFile,ParameterValue=code.zip ParameterKey=TestReportGroupName,ParameterValue=glue-unittest-report --capabilities CAPABILITY_NAMED_IAM --region us-east-1 36 | ``` 37 | 38 | ## Components details 39 | 40 | [deploy/template.yml](deploy/template.yml) - AWS CloudFormation template demonstrating the deployment of the AWS Glue job and related resources. 41 | 42 | [src/sample.py](src/sample.py) - Python code for a sample AWS Glue job that transforms data in Parquet files in an S3 bucket. 43 | 44 | [src/requirements.txt](src/requirements.txt) - Text file listing the AWS Glue job's dependencies, for use by the Python package manager pip. 45 | 46 | [tests/conftest.py](tests/conftest.py) - PyTest configuration for initializing the Glue context and Spark session. 47 | 48 | [tests/test_sample.py](tests/test_sample.py) - PyTest test cases for the AWS Glue job [src/sample.py](src/sample.py). 49 | 50 | [code.zip](code.zip) - Compressed ZIP file containing the `src`, `tests` and `deploy` folders. 51 | 52 | [pipeline.yml](pipeline.yml) - AWS CloudFormation template for deploying the pipeline that builds, tests and deploys the AWS Glue job in the `src` folder. 53 | 54 | ## License 55 | 56 | This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. 57 | -------------------------------------------------------------------------------- /assets/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-jobs-unit-testing/cde3f862027d485316ef2e4933766fd9a7ff26b6/assets/architecture.png -------------------------------------------------------------------------------- /code.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-jobs-unit-testing/cde3f862027d485316ef2e4933766fd9a7ff26b6/code.zip -------------------------------------------------------------------------------- /deploy/template.yml: -------------------------------------------------------------------------------- 1 | --- 2 | AWSTemplateFormatVersion: 2010-09-09 3 | Description: "**WARNING** This template creates IAM Role, AWS Glue job and related resources. You will be billed for the AWS resources used if you create a stack from this template." 
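# Note (added for clarity): this template is not deployed directly. The pipeline's
# CloudFormation deploy action in pipeline.yml creates or updates this stack and passes
# the S3Bucketname and KMSKey parameters below as parameter overrides.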
4 | Parameters: 5 | S3Bucketname: 6 | Description: Name of the existing Artifact store S3 bucket creation 7 | Type: String 8 | KMSKey: 9 | Description: KMS Key used to encrypt the bucket 10 | Type: String 11 | Resources: 12 | GlueJobRole: 13 | Type: AWS::IAM::Role 14 | Metadata: 15 | cdk_nag: 16 | rules_to_suppress: 17 | - id: AwsSolutions-IAM5 18 | reason: "Wild card in policy is required for matching S3 objects" 19 | Properties: 20 | AssumeRolePolicyDocument: 21 | Version: "2012-10-17" 22 | Statement: 23 | - Effect: Allow 24 | Principal: 25 | Service: 26 | - glue.amazonaws.com 27 | Action: 28 | - sts:AssumeRole 29 | Path: / 30 | Policies: 31 | - PolicyName: root 32 | PolicyDocument: 33 | Version: "2012-10-17" 34 | Statement: 35 | - Effect: Allow 36 | Action: 37 | - s3:GetObject 38 | - s3:PutObject 39 | - s3:ListBucket 40 | - s3:DeleteObject 41 | Resource: 42 | - !Sub "arn:${AWS::Partition}:s3:::${S3Bucketname}" 43 | - !Sub "arn:${AWS::Partition}:s3:::${S3Bucketname}/*" 44 | - Effect: Allow 45 | Action: 46 | - kms:Decrypt 47 | - kms:Encrypt 48 | - kms:GenerateDataKey 49 | - kms:DescribeKey 50 | Resource: 51 | - !Ref "KMSKey" 52 | GlueJob: 53 | Type: AWS::Glue::Job 54 | Properties: 55 | Command: 56 | Name: glueetl 57 | ScriptLocation: !Sub "s3://${S3Bucketname}/GlueJobs/sample.py" 58 | DefaultArguments: 59 | --job-bookmark-option: job-bookmark-enable 60 | GlueVersion: "4.0" 61 | ExecutionProperty: 62 | MaxConcurrentRuns: 2 63 | MaxRetries: 0 64 | Name: samplejob 65 | Role: !Ref "GlueJobRole" 66 | WorkerType: G.2X 67 | NumberOfWorkers: 10 68 | SecurityConfiguration: !Ref "GlueSecurityConfiguration" 69 | DeletionPolicy: Delete 70 | UpdateReplacePolicy: Delete 71 | GlueSecurityConfiguration: 72 | Type: AWS::Glue::SecurityConfiguration 73 | Properties: 74 | Name: DefaultSecurityConfiguration 75 | EncryptionConfiguration: 76 | CloudWatchEncryption: 77 | CloudWatchEncryptionMode: SSE-KMS 78 | KmsKeyArn: !Ref "KMSKey" 79 | JobBookmarksEncryption: 80 | JobBookmarksEncryptionMode: CSE-KMS 81 | KmsKeyArn: !Ref "KMSKey" 82 | S3Encryptions: 83 | - KmsKeyArn: !Ref "KMSKey" 84 | S3EncryptionMode: SSE-KMS 85 | DeletionPolicy: Delete 86 | UpdateReplacePolicy: Delete 87 | -------------------------------------------------------------------------------- /pipeline.yml: -------------------------------------------------------------------------------- 1 | --- 2 | AWSTemplateFormatVersion: 2010-09-09 3 | Description: 4 | "**WARNING** This template creates IAM Roles, AWS CodeBuild projects and report groups, AWS CodePipeline and related resources. You will be billed for the AWS resources 5 | used if you create a stack from this template." 
6 | Parameters: 7 | CodeZipFile: 8 | Description: S3 bucket key value of Zip file downloaded from GitHub 9 | Type: String 10 | Default: code.zip 11 | BucketName: 12 | Description: Name of the existing Artifact store S3 bucket creation 13 | Type: String 14 | Default: aws-glue-artifacts-us-east-1 15 | AllowedPattern: ^[0-9a-zA-Z]+([0-9a-zA-Z-]*[0-9a-zA-Z])*$ 16 | ApplicationStackName: 17 | Description: Glue job deployment application stack name 18 | Type: String 19 | Default: glue-codepipeline-app 20 | AllowedPattern: "[A-Za-z0-9-]+" 21 | TestReportGroupName: 22 | Description: Glue application unit test report group name 23 | Type: String 24 | Default: glue-unittest-report 25 | AllowedPattern: "[A-Za-z0-9-]+" 26 | Metadata: 27 | cdk_nag: 28 | rules_to_suppress: 29 | - id: AwsSolutions-IAM5 30 | reason: "Wild card in policy is required for matching S3 objects" 31 | Resources: 32 | KMSKey: 33 | Type: AWS::KMS::Key 34 | Properties: 35 | Description: Key for encrypting artifacts in S3 Buckets 36 | Enabled: true 37 | EnableKeyRotation: true 38 | KeyPolicy: 39 | Version: "2012-10-17" 40 | Statement: 41 | - Sid: Enable IAM User Permissions for root 42 | Effect: Allow 43 | Principal: 44 | AWS: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:root" 45 | Action: kms:* 46 | Resource: "*" 47 | KeyUsage: ENCRYPT_DECRYPT 48 | DeletionPolicy: Delete 49 | UpdateReplacePolicy: Delete 50 | CodePipelineServiceRole: 51 | Type: AWS::IAM::Role 52 | Properties: 53 | AssumeRolePolicyDocument: 54 | Version: "2012-10-17" 55 | Statement: 56 | Effect: Allow 57 | Principal: 58 | Service: 59 | - codepipeline.amazonaws.com 60 | Action: sts:AssumeRole 61 | Policies: 62 | - PolicyName: AWS-CodePipeline-Service-Policy 63 | PolicyDocument: 64 | Version: "2012-10-17" 65 | Statement: 66 | - Effect: Allow 67 | Action: 68 | - cloudformation:CreateStack 69 | - cloudformation:DeleteStack 70 | - cloudformation:DescribeStacks 71 | - cloudformation:UpdateStack 72 | - cloudformation:CreateChangeSet 73 | - cloudformation:DeleteChangeSet 74 | - cloudformation:DescribeChangeSet 75 | - cloudformation:ExecuteChangeSet 76 | - cloudformation:SetStackPolicy 77 | - cloudformation:ValidateTemplate 78 | Resource: 79 | - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${ApplicationStackName}/*" 80 | - Effect: Allow 81 | Action: 82 | - s3:Get* 83 | - s3:PutObject 84 | - s3:ListBucket 85 | - s3:DeleteObject 86 | Resource: 87 | - !Sub "arn:${AWS::Partition}:s3:::${BucketName}" 88 | - !Sub "arn:${AWS::Partition}:s3:::${BucketName}/*" 89 | - Effect: Allow 90 | Action: 91 | - codebuild:BatchGetBuilds 92 | - codebuild:StartBuild 93 | Resource: 94 | - !GetAtt "TestBuild.Arn" 95 | - !GetAtt "PublishBuild.Arn" 96 | - Effect: Allow 97 | Action: 98 | - iam:PassRole 99 | Resource: 100 | - !GetAtt "CloudformationRole.Arn" 101 | - Effect: Allow 102 | Action: 103 | - kms:Decrypt 104 | - kms:Encrypt 105 | - kms:GenerateDataKey 106 | - kms:DescribeKey 107 | Resource: 108 | - !GetAtt "KMSKey.Arn" 109 | DeletionPolicy: Delete 110 | UpdateReplacePolicy: Delete 111 | GlueCodePipeline: 112 | Type: AWS::CodePipeline::Pipeline 113 | Properties: 114 | ExecutionMode: QUEUED 115 | PipelineType: V2 116 | Name: !Sub "${ApplicationStackName}-pipeline" 117 | ArtifactStore: 118 | Type: S3 119 | Location: !Ref "BucketName" 120 | RestartExecutionOnUpdate: true 121 | RoleArn: !GetAtt "CodePipelineServiceRole.Arn" 122 | Stages: 123 | - Name: Source 124 | Actions: 125 | - Name: SourceAction 126 | ActionTypeId: 127 | Category: Source 128 | Owner: AWS 
129 | Provider: S3 130 | Version: 1 131 | OutputArtifacts: 132 | - Name: SourceCode 133 | Configuration: 134 | S3Bucket: !Ref BucketName 135 | S3ObjectKey: !Ref CodeZipFile 136 | PollForSourceChanges: false 137 | RunOrder: 1 138 | - Name: Build_and_Publish 139 | Actions: 140 | - Name: Test_and_Build 141 | ActionTypeId: 142 | Category: Build 143 | Owner: AWS 144 | Provider: CodeBuild 145 | Version: "1" 146 | InputArtifacts: 147 | - Name: SourceCode 148 | OutputArtifacts: 149 | - Name: BuiltCode 150 | Configuration: 151 | ProjectName: !Sub "${ApplicationStackName}-build" 152 | RunOrder: 1 153 | - Name: Publish 154 | ActionTypeId: 155 | Category: Build 156 | Owner: AWS 157 | Provider: CodeBuild 158 | Version: "1" 159 | InputArtifacts: 160 | - Name: BuiltCode 161 | Configuration: 162 | ProjectName: !Sub "${ApplicationStackName}-publish" 163 | RunOrder: 2 164 | - Name: Deploy 165 | Actions: 166 | - Name: CloudFormationDeploy 167 | ActionTypeId: 168 | Category: Deploy 169 | Owner: AWS 170 | Provider: CloudFormation 171 | Version: "1" 172 | InputArtifacts: 173 | - Name: BuiltCode 174 | Configuration: 175 | ActionMode: CREATE_UPDATE 176 | Capabilities: CAPABILITY_IAM 177 | RoleArn: !GetAtt "CloudformationRole.Arn" 178 | StackName: !Ref "ApplicationStackName" 179 | TemplatePath: BuiltCode::deploy/template.yml 180 | ParameterOverrides: !Sub '{"S3Bucketname": "${BucketName}", "KMSKey": "${KMSKey.Arn}"}' 181 | RunOrder: 1 182 | DeletionPolicy: Delete 183 | UpdateReplacePolicy: Delete 184 | CloudformationRole: 185 | Type: AWS::IAM::Role 186 | Properties: 187 | AssumeRolePolicyDocument: 188 | Version: "2012-10-17" 189 | Statement: 190 | Effect: Allow 191 | Principal: 192 | Service: cloudformation.amazonaws.com 193 | Action: sts:AssumeRole 194 | Policies: 195 | - PolicyName: AWS-Cloudformation-Service-Policy 196 | PolicyDocument: 197 | Version: "2012-10-17" 198 | Statement: 199 | - Effect: Allow 200 | Action: 201 | - glue:CreateSecurityConfiguration 202 | - glue:DeleteSecurityConfiguration 203 | Resource: "*" 204 | - Effect: Allow 205 | Action: 206 | - glue:CreateJob 207 | - glue:UpdateJob 208 | - glue:DeleteJob 209 | Resource: 210 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:job/samplejob" 211 | - Effect: Allow 212 | Action: 213 | - iam:CreateRole 214 | - iam:DeleteRole 215 | - iam:PutRolePolicy 216 | - iam:GetRolePolicy 217 | - iam:DeleteRolePolicy 218 | - iam:PassRole 219 | Resource: 220 | - !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/glue-codepipeline-app-GlueJobRole-*" 221 | DeletionPolicy: Delete 222 | UpdateReplacePolicy: Delete 223 | TestBuild: 224 | Type: AWS::CodeBuild::Project 225 | Properties: 226 | Artifacts: 227 | Type: CODEPIPELINE 228 | BadgeEnabled: false 229 | Environment: 230 | ComputeType: BUILD_GENERAL1_LARGE 231 | Image: public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 232 | ImagePullCredentialsType: CODEBUILD 233 | PrivilegedMode: false 234 | Type: LINUX_CONTAINER 235 | Name: !Sub "${ApplicationStackName}-build" 236 | ServiceRole: !GetAtt "CodeBuildRole.Arn" 237 | EncryptionKey: !GetAtt "KMSKey.Arn" 238 | Source: 239 | BuildSpec: !Sub | 240 | version: 0.2 241 | env: 242 | shell: bash 243 | variables: 244 | GLUEJOBS_PATH: "src" 245 | JUNIT_XML: junit_coverage.xml 246 | phases: 247 | pre_build: 248 | commands: 249 | - echo "Install dependencies" 250 | - python3 --version 251 | - pip3 install -r tests/requirements-test.txt 252 | - cd "$CODEBUILD_SRC_DIR/$GLUEJOBS_PATH" 253 | - pip3 install -r requirements.txt 254 | build: 255 | commands: 
256 | - cd "$CODEBUILD_SRC_DIR/" 257 | - echo "Running Cloudformation templates linter" 258 | - cfn-lint -c INCLUDE_CHECKS I3042 -- deploy/template.yml 259 | - echo "Running Python linter" 260 | - flake8 --ignore F403,F405,E501 261 | - echo "Running unit test cases" 262 | - python3 -m pytest tests/ --junitxml=$CODEBUILD_SRC_DIR/$JUNIT_XML 263 | reports: 264 | ${TestReportGroup.Arn}: 265 | files: 266 | - $JUNIT_XML 267 | base-directory: $CODEBUILD_SRC_DIR 268 | discard-paths: yes 269 | file-format: JunitXml 270 | artifacts: 271 | files: 272 | - $GLUEJOBS_PATH/* 273 | - deploy/* 274 | base-directory: $CODEBUILD_SRC_DIR 275 | discard-paths: no 276 | Type: CODEPIPELINE 277 | TimeoutInMinutes: 15 278 | DeletionPolicy: Delete 279 | UpdateReplacePolicy: Delete 280 | PublishBuild: 281 | Type: AWS::CodeBuild::Project 282 | Properties: 283 | Artifacts: 284 | Type: CODEPIPELINE 285 | BadgeEnabled: false 286 | Environment: 287 | ComputeType: BUILD_GENERAL1_LARGE 288 | Image: aws/codebuild/standard:5.0 289 | ImagePullCredentialsType: CODEBUILD 290 | PrivilegedMode: false 291 | Type: LINUX_CONTAINER 292 | Name: !Sub "${ApplicationStackName}-publish" 293 | ServiceRole: !GetAtt "CodeBuildRole.Arn" 294 | EncryptionKey: !GetAtt "KMSKey.Arn" 295 | Source: 296 | BuildSpec: !Sub | 297 | version: 0.2 298 | env: 299 | shell: bash 300 | variables: 301 | GLUEJOBS_PATH: "src" 302 | phases: 303 | build: 304 | commands: 305 | - cd "$CODEBUILD_SRC_DIR/$GLUEJOBS_PATH" 306 | - echo "Publish GlueJob to S3" 307 | - aws s3 cp ./ s3://${BucketName}/GlueJobs/ --sse AES256 --recursive --exclude "*.txt" --exclude "./tests/" 308 | Type: CODEPIPELINE 309 | TimeoutInMinutes: 15 310 | DeletionPolicy: Delete 311 | UpdateReplacePolicy: Delete 312 | TestReportGroup: 313 | Type: AWS::CodeBuild::ReportGroup 314 | Properties: 315 | DeleteReports: true 316 | ExportConfig: 317 | ExportConfigType: S3 318 | S3Destination: 319 | Bucket: !Ref "BucketName" 320 | Path: test-report 321 | Name: !Ref "TestReportGroupName" 322 | Type: TEST 323 | DeletionPolicy: Delete 324 | UpdateReplacePolicy: Delete 325 | CodeBuildRole: 326 | Type: AWS::IAM::Role 327 | Properties: 328 | AssumeRolePolicyDocument: 329 | Version: "2012-10-17" 330 | Statement: 331 | Effect: Allow 332 | Principal: 333 | Service: codebuild.amazonaws.com 334 | Action: sts:AssumeRole 335 | Policies: 336 | - PolicyName: AWS-CodeBuild-Service-Policy 337 | PolicyDocument: 338 | Version: "2012-10-17" 339 | Statement: 340 | - Effect: Allow 341 | Action: 342 | - s3:GetObject 343 | - s3:PutObject 344 | - s3:GetObjectVersion 345 | - s3:GetBucketAcl 346 | - s3:GetBucketLocation 347 | Resource: 348 | - !Sub "arn:${AWS::Partition}:s3:::${BucketName}" 349 | - !Sub "arn:${AWS::Partition}:s3:::${BucketName}/*" 350 | - Effect: Allow 351 | Action: 352 | - codebuild:CreateReportGroup 353 | - codebuild:CreateReport 354 | - codebuild:UpdateReport 355 | - codebuild:BatchPutTestCases 356 | - codebuild:BatchPutCodeCoverages 357 | Resource: 358 | - !GetAtt "TestReportGroup.Arn" 359 | - Effect: Allow 360 | Action: 361 | - logs:CreateLogGroup 362 | - logs:CreateLogStream 363 | - logs:PutLogEvents 364 | Resource: 365 | - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${ApplicationStackName}-*" 366 | - Effect: Allow 367 | Action: 368 | - kms:Decrypt 369 | - kms:Encrypt 370 | - kms:GenerateDataKey 371 | - kms:DescribeKey 372 | Resource: 373 | - !GetAtt "KMSKey.Arn" 374 | DeletionPolicy: Delete 375 | UpdateReplacePolicy: Delete 376 | 
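# Note (added for clarity): the S3 source action above sets PollForSourceChanges to false and this
# template defines no EventBridge rule, so the pipeline does not start automatically when code.zip
# changes. After uploading a new archive, start a run manually; the names below are illustrative:
#   aws s3 cp code.zip s3://<artifact-bucket>/code.zip
#   aws codepipeline start-pipeline-execution --name <application-stack-name>-pipeline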
-------------------------------------------------------------------------------- /src/requirements.txt: -------------------------------------------------------------------------------- 1 | boto3 -------------------------------------------------------------------------------- /src/sample.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from pyspark.context import SparkContext 3 | from pyspark.sql import SparkSession, DataFrame 4 | from pyspark.sql.functions import * 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | from awsglue.transforms import * 8 | from awsglue.utils import getResolvedOptions 9 | 10 | 11 | def transform(spark: SparkSession, df: DataFrame) -> DataFrame: 12 | """ 13 | Function to extract and transform dataframe columns with date 14 | to get day, month and year. 15 | 16 | Args: 17 | spark (SparkSession): PySpark session object 18 | df (DataFrame): Dataframe object containing the data before transform 19 | 20 | Returns: 21 | DataFrame: Dataframe object containing the data after transform 22 | """ 23 | return ( 24 | df.withColumn("test", year(col("dt"))) 25 | .withColumn("data", month(col("dt"))) 26 | .withColumn("msg", dayofmonth(col("dt"))) 27 | ) 28 | 29 | 30 | def process_data(spark: SparkSession, source_path, table_name): 31 | """ 32 | Function to read and process data from parquet file 33 | 34 | Args: 35 | spark (SparkSession): PySpark session object 36 | source_path (String): Data file path 37 | table_name (String): Output Table name 38 | """ 39 | df = spark.read.parquet(f"s3://data-s3/{source_path}") 40 | 41 | df_transform = transform(spark, df) 42 | df_transform.write.mode("append").partitionBy( 43 | "test", "data", "msg" 44 | ).parquet(f"s3://data-s3/{table_name}") 45 | 46 | 47 | if __name__ == "__main__": 48 | args = getResolvedOptions( 49 | sys.argv, ["JOB_NAME", "table_name", "source_path"] 50 | ) 51 | glueContext = GlueContext(SparkContext.getOrCreate()) 52 | spark = glueContext.spark_session 53 | job = Job(glueContext) 54 | job.init(args["JOB_NAME"], args) 55 | 56 | spark._jsc.hadoopConfiguration().set( 57 | "fs.s3.useRequesterPaysHeader", "true" 58 | ) 59 | 60 | process_data(spark, args["source_path"], args["table_name"]) 61 | 62 | job.commit() 63 | -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext 2 | from awsglue.context import GlueContext 3 | import pytest 4 | 5 | 6 | @pytest.fixture(scope="session") 7 | def glueContext(): 8 | """ 9 | Function to setup test environment for PySpark and Glue 10 | """ 11 | spark_context = SparkContext() 12 | glueContext = GlueContext(spark_context) 13 | yield glueContext 14 | spark_context.stop() 15 | -------------------------------------------------------------------------------- /tests/requirements-test.txt: -------------------------------------------------------------------------------- 1 | cfn-lint 2 | flake8 3 | pytest-cov 4 | pytest 5 | pandas==2.2.0 6 | moto[all]==5.0.20 7 | flask==3.0.2 8 | flask_cors 9 | pyarrow==15.0.0 -------------------------------------------------------------------------------- /tests/test_sample.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import signal 4 | import subprocess # nosec B404 5 | from src.sample import transform 6 | from awsglue.context import GlueContext 7 | 
from pyspark.sql import SparkSession, DataFrame 8 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 9 | 10 | 11 | SOURCE_NAME = "data.parquet" 12 | TABLE_NAME = "dummy" 13 | S3_BUCKET_NAME = "data-s3" 14 | ENDPOINT_URL = "http://127.0.0.1:5000/" 15 | 16 | 17 | def initialize_test(spark: SparkSession): 18 | """ 19 | Function to setup and initialize test case execution 20 | 21 | Args: 22 | spark (SparkSession): PySpark session object 23 | 24 | Returns: 25 | process: Process object for the moto server that was started 26 | """ 27 | process = subprocess.Popen( # nosec B607 28 | "moto_server -p5000", 29 | stdout=subprocess.PIPE, 30 | shell=True, # nosec B602 31 | preexec_fn=os.setsid, 32 | ) 33 | 34 | s3 = boto3.resource( # nosec B106 35 | "s3", 36 | endpoint_url=ENDPOINT_URL, 37 | aws_access_key_id="FakeKey", 38 | aws_secret_access_key="FakeSecretKey", 39 | aws_session_token="FakeSessionToken", 40 | region_name="us-east-1", 41 | ) 42 | s3.create_bucket( 43 | Bucket=S3_BUCKET_NAME, 44 | ) 45 | 46 | hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration() 47 | hadoop_conf.set("fs.s3a.access.key", "dummy-value") 48 | hadoop_conf.set("fs.s3a.secret.key", "dummy-value") 49 | hadoop_conf.set("fs.s3a.endpoint", ENDPOINT_URL) 50 | hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 51 | 52 | values = [ 53 | ("sam", "1962-05-25"), 54 | ("let", "1999-05-21"), 55 | ("nick", "1996-04-03"), 56 | ] 57 | columns = ["name", "dt"] 58 | df = spark.createDataFrame(values, columns) 59 | df.write.parquet(f"s3://{S3_BUCKET_NAME}/{SOURCE_NAME}") 60 | return process 61 | 62 | 63 | def compare_schema(schema_a: StructType, schema_b: StructType) -> bool: 64 | """ 65 | Utility method to compare two schema and return the results of comparison 66 | 67 | Args: 68 | schema_a (StructType): Schema for comparison 69 | schema_b (StructType): Schema for comparison 70 | 71 | Returns: 72 | bool: Result of schema comparison 73 | """ 74 | return len(schema_a) == len(schema_b) and all( 75 | (a.name, a.dataType) == (b.name, b.dataType) for a, b in zip(schema_a, schema_b) 76 | ) 77 | 78 | 79 | # Test to verify data transformation 80 | def test_transform(glueContext: GlueContext): 81 | """ 82 | Test case to test the transform function 83 | 84 | Args: 85 | glueContext (GlueContext): Test Glue context object 86 | """ 87 | spark = glueContext.spark_session 88 | input_data = spark.createDataFrame( 89 | [("sam", "1962-05-25"), ("let", "1999-05-21"), ("nick", "1996-04-03")], 90 | ["name", "dt"], 91 | ) 92 | output_schema = StructType( 93 | [ 94 | StructField("name", StringType(), False), 95 | StructField("dt", StringType(), False), 96 | StructField("test", IntegerType(), False), 97 | StructField("data", IntegerType(), False), 98 | StructField("msg", IntegerType(), False), 99 | ] 100 | ) 101 | real_output = transform(spark, input_data) 102 | assert compare_schema(real_output.schema, output_schema) # nosec assert_used 103 | 104 | 105 | # Test to verify data present in valid partitioned format 106 | def test_process_data_record(glueContext: GlueContext): 107 | """ 108 | Test case to test the process_data function for 109 | valid partitioned data output 110 | 111 | Args: 112 | glueContext (GlueContext): Test Glue context object 113 | """ 114 | spark = glueContext.spark_session 115 | process = initialize_test(spark) 116 | 117 | try: 118 | from src.sample import process_data 119 | 120 | process_data(spark, SOURCE_NAME, TABLE_NAME) 121 | df = spark.read.parquet( 122 | 
f"s3a://{S3_BUCKET_NAME}/{TABLE_NAME}/test=1962/data=5/msg=25" 123 | ) 124 | assert isinstance(df, DataFrame) # nosec assert_used 125 | finally: 126 | os.killpg(os.getpgid(process.pid), signal.SIGTERM) 127 | 128 | 129 | # Test to verify number of records 130 | def test_process_data_record_count(glueContext: GlueContext): 131 | """ 132 | Test case to test the process_data function for 133 | number of records in input and output 134 | 135 | Args: 136 | glueContext (GlueContext): Test Glue context object 137 | """ 138 | spark = glueContext.spark_session 139 | process = initialize_test(spark) 140 | 141 | try: 142 | from src.sample import process_data 143 | 144 | process_data(spark, SOURCE_NAME, TABLE_NAME) 145 | 146 | df = spark.read.parquet(f"s3a://{S3_BUCKET_NAME}/{TABLE_NAME}") 147 | assert df.count() == 3 # nosec assert_used 148 | finally: 149 | os.killpg(os.getpgid(process.pid), signal.SIGTERM) 150 | --------------------------------------------------------------------------------