├── img
    ├── img1.png
    ├── img2.png
    ├── img3.png
    ├── img4.png
    ├── img5.png
    └── img6.png
├── README.md
└── code
    └── CFN-SagemakerEMRNoAuthProductWithStudio-v3.yaml


/img/img1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img1.png


--------------------------------------------------------------------------------
/img/img2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img2.png


--------------------------------------------------------------------------------
/img/img3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img3.png


--------------------------------------------------------------------------------
/img/img4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img4.png


--------------------------------------------------------------------------------
/img/img5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img5.png


--------------------------------------------------------------------------------
/img/img6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/debnsuma/sagemaker-studio-emr-spark/HEAD/img/img6.png


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Scalable data preparation & ML using Apache Spark on AWS

Analyzing, transforming and preparing large amounts of data is a foundational step
of any data science and ML workflow. This session shows how to build end-to-end data preparation and machine learning (ML) workflows. We explain how to connect from Amazon SageMaker Studio to Apache Spark, for fast data preparation, in your data processing environments on Amazon EMR and AWS Glue interactive sessions. You will also see how to access data governed by AWS Lake Formation to interactively query, explore, and visualize data, and how to run and debug Spark jobs as you prepare large-scale data for use in ML.

For the code walk-through, refer to the session **Scalable data preparation & ML using Apache Spark on AWS** at [AWS Innovate 2023 (Data and Machine Learning)](https://aws.amazon.com/events/aws-innovate/apj/aiml-data/).

## Introduction

While some organizations see data science, data engineering, and data analytics as separate, siloed functions, many of our customers increasingly treat data preparation and analytics as foundational components of ML workflows.

For example, although organizations have data engineering teams to clean and prepare data for analytics and ML, the specific data that a data scientist needs to train a particular model may not be available in the repository that the data engineering team has prepared.

![Intro](img/img1.png)

## Problem statement

Let's take a problem and try to solve it. Air pollution in cities can be an acute problem, with damaging effects on people, animals, plants and property.

![Intro](img/img2.png)

We need to build a machine learning model that can help predict the amount of NO2 in an area based on weather conditions.

Ultimately, we would like an ML model to which we feed the weather details of a particular city at a given time.
These details would be mean temperature, maximum temperature, minimum temperature, and so on.

The model should then predict the NO2 (nitrogen dioxide) concentration level at that time.

![Intro](img/img3.png)

## Solution

So, what are the tasks at hand?

- We first need to clean and prepare the data for ML training, and we are going to do that using `Apache Spark`.

- We then need to train and finally deploy the model using `Amazon SageMaker`.

![Intro](img/img4.png)

We are going to use Amazon SageMaker for ML training and model hosting, `Pandas` for data analysis, and `Amazon EMR` for data processing. Our training dataset will be stored in an `Amazon S3` bucket.

But we don’t have to worry about juggling these multiple services and tools.

## Amazon SageMaker Studio

We are going to use Amazon SageMaker Studio, which is the first fully integrated development environment, or IDE, for machine learning. SageMaker Studio lets users visually browse and connect to Amazon EMR clusters right from the Studio notebook. Additionally, you can now provision and terminate EMR clusters directly from Studio.

![Intro](img/img5.png)

## Steps to follow

In this section we walk you through creating the environment, launching a Studio Notebook, and performing the data processing, model training, and deployment using an Amazon SageMaker Studio Notebook and Amazon EMR.

### Create the environment

In this section, we supply an AWS CloudFormation template and example notebooks to get started in a demonstration SageMaker domain.

The following stack provides an end-to-end CloudFormation template that stands up a private VPC, a SageMaker domain attached to that VPC, and a SageMaker user with visibility to the pre-created AWS Service Catalog product.
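
As a rough sketch of how a stack like this could be created from a script instead of the console, the snippet below builds the arguments for CloudFormation's `create_stack` call and submits them with boto3. The helper names `build_create_stack_args` and `deploy` are hypothetical and introduced only for illustration; `SageMakerDomainName` and its default value come from the template itself. Because the template creates IAM roles with explicit names, the request presumably needs the `CAPABILITY_NAMED_IAM` capability.

```python
from pathlib import Path


def build_create_stack_args(stack_name, template_body, domain_name="SageMakerEMRDomain"):
    """Return keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            # SageMakerDomainName is the only parameter the template declares.
            {"ParameterKey": "SageMakerDomainName", "ParameterValue": domain_name},
        ],
        # The template defines IAM roles with explicit RoleName values.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(stack_name, template_path):
    """Create the stack and block until CloudFormation reports completion."""
    import boto3  # imported lazily so the helper above works without AWS access

    cfn = boto3.client("cloudformation")
    args = build_create_stack_args(stack_name, Path(template_path).read_text())
    cfn.create_stack(**args)
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
```

With AWS credentials configured, calling `deploy("sagemaker-emr-demo", "code/CFN-SagemakerEMRNoAuthProductWithStudio-v3.yaml")` from the repository root would start the deployment; the stack name here is an arbitrary example.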

Use this [CloudFormation template](/code/CFN-SagemakerEMRNoAuthProductWithStudio-v3.yaml) to deploy the environment.

### Launch the SageMaker Studio Notebook

Once the stack is deployed, perform the following steps to launch a SageMaker Studio Notebook:

1. Open the **Amazon SageMaker** console.
2. Click on `Domains` in the left menu.
3. Click on the domain created by the stack (`StudioDomain` in the template; its name defaults to `SageMakerEMRDomain`).
4. Click on the `Launch` button for the `studio-user`.

![Intro](img/img6.png)

### Air Quality Predictions with Amazon SageMaker and Amazon EMR

Download this [Jupyter Notebook](/code/demo-sm-emr.ipynb), import it into the Studio Notebook environment, and follow the instructions in the notebook.


--------------------------------------------------------------------------------
/code/CFN-SagemakerEMRNoAuthProductWithStudio-v3.yaml:
--------------------------------------------------------------------------------
---
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This cloudformation template enables SageMaker Studio to launch and connect to EMR clusters.
  The EMR cluster is launched via Service Catalog.
  This template creates a demonstration SageMaker Studio Domain & SageMaker User Profile. It populates Service Catalog
  with a Product that consists of another cloudformation template for launching EMR.
  It creates the Studio Domain in a private VPC and establishes connectivity with EMR via No-Auth as described in
  "https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/"

Mappings:
  VpcConfigurations:
    cidr:
      Vpc: 10.0.0.0/16
      PublicSubnet1: 10.0.10.0/24
      PrivateSubnet1: 10.0.20.0/24
  ClusterConfigurations:
    emr:
      BootStrapScriptFile: installpylibs-v2.sh
      StepScriptFile: configurekdc.sh
    s3params:
      BlogS3Bucket: aws-ml-blog
      S3Key: artifacts/sma-milestone1/

Parameters:
  SageMakerDomainName:
    Type: String
    Description: Name of the Studio Domain to Create
    Default: SageMakerEMRDomain

Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Join [ "-", [ "sagemaker-emr-template-cfn", !Select [ 2, !Split [ "/", !Ref AWS::StackId ] ] ] ]

  VPC:
    Type: 'AWS::EC2::VPC'
    Properties:
      CidrBlock: !FindInMap
        - VpcConfigurations
        - cidr
        - Vpc
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: "for-use-with-amazon-emr-managed-policies"
          Value: "true"
        - Key: Name
          Value: !Sub '${AWS::StackName}-VPC'

  InternetGateway:
    Type: 'AWS::EC2::InternetGateway'
    Properties:
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-IGW'

  InternetGatewayAttachment:
    Type: 'AWS::EC2::VPCGatewayAttachment'
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref VPC

  PublicSubnet1:
    Type: 'AWS::EC2::Subnet'
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select
        - 0
        - !GetAZs ''
      CidrBlock: !FindInMap
        - VpcConfigurations
        - cidr
        - PublicSubnet1
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName} Public Subnet (AZ1)'

  PrivateSubnet1:
    Type: 'AWS::EC2::Subnet'
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select
        - 0
        - !GetAZs ''
      CidrBlock: !FindInMap
        - VpcConfigurations
        - cidr
        - PrivateSubnet1
      MapPublicIpOnLaunch: false
      Tags:
        - Key: "for-use-with-amazon-emr-managed-policies"
          Value: "true"
        - Key: Name
          Value: !Sub '${AWS::StackName} Private Subnet (AZ1)'

  NatGateway1EIP:
    Type: 'AWS::EC2::EIP'
    DependsOn: InternetGatewayAttachment
    Properties:
      Domain: vpc

  NatGateway1:
    Type: 'AWS::EC2::NatGateway'
    Properties:
      AllocationId: !GetAtt
        - NatGateway1EIP
        - AllocationId
      SubnetId: !Ref PublicSubnet1

  PublicRouteTable:
    Type: 'AWS::EC2::RouteTable'
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName} Public Routes'

  DefaultPublicRoute:
    Type: 'AWS::EC2::Route'
    DependsOn: InternetGatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnet1RouteTableAssociation:
    Type: 'AWS::EC2::SubnetRouteTableAssociation'
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet1

  PrivateRouteTable1:
    Type: 'AWS::EC2::RouteTable'
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName} Private Routes (AZ1)'

  PrivateSubnet1RouteTableAssociation:
    Type: 'AWS::EC2::SubnetRouteTableAssociation'
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet1InternetRoute:
    Type: 'AWS::EC2::Route'
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGateway1

  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      VpcEndpointType: Gateway
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - '*'
            Resource:
              - '*'
      VpcId: !Ref VPC
      RouteTableIds:
        - !Ref PrivateRouteTable1

  SageMakerInstanceSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      Tags:
        - Key: "for-use-with-amazon-emr-managed-policies"
          Value: "true"
      GroupName: SMSG
      GroupDescription: Security group with no ingress rule
      SecurityGroupEgress:
        - IpProtocol: -1
          FromPort: -1
          ToPort: -1
          CidrIp: 0.0.0.0/0
      VpcId: !Ref VPC
  SageMakerInstanceSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref SageMakerInstanceSecurityGroup
      SourceSecurityGroupId: !Ref SageMakerInstanceSecurityGroup
  VPCEndpointSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow TLS for VPC Endpoint
      SecurityGroupEgress:
        - IpProtocol: -1
          FromPort: -1
          ToPort: -1
          CidrIp: 0.0.0.0/0
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}-endpoint-security-group
  EndpointSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref VPCEndpointSecurityGroup
      SourceSecurityGroupId: !Ref SageMakerInstanceSecurityGroup
  SageMakerExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Sub "${AWS::StackName}-EMR-SageMakerExecutionRole"
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - sagemaker.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      Policies:
        -
          PolicyName: !Sub '${AWS::StackName}-sageemr'
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - elasticmapreduce:ListInstances
                  - elasticmapreduce:DescribeCluster
                  - elasticmapreduce:DescribeSecurityConfiguration
                  - elasticmapreduce:CreatePersistentAppUI
                  - elasticmapreduce:DescribePersistentAppUI
                  - elasticmapreduce:GetPersistentAppUIPresignedURL
                  - elasticmapreduce:GetOnClusterAppUIPresignedURL
                  - elasticmapreduce:ListClusters
                  - iam:GetRole
                Resource: '*'
              - Effect: Allow
                Action:
                  - elasticmapreduce:DescribeCluster
                  - elasticmapreduce:ListInstanceGroups
                Resource: !Sub "arn:${AWS::Partition}:elasticmapreduce:*:*:cluster/*"
              - Effect: Allow
                Action:
                  - elasticmapreduce:ListClusters
                Resource: '*'
      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSageMakerFullAccess"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonS3ReadOnlyAccess"

  VPCEndpointSagemakerAPI:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.sagemaker.api'
      VpcId: !Ref VPC
  VPCEndpointSageMakerRuntime:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub
        'com.amazonaws.${AWS::Region}.sagemaker.runtime'
      VpcId: !Ref VPC
  VPCEndpointSTS:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.sts'
      VpcId: !Ref VPC
  VPCEndpointCW:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.monitoring'
      VpcId: !Ref VPC
  VPCEndpointCWL:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.logs'
      VpcId: !Ref VPC
  VPCEndpointECR:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.ecr.dkr'
      VpcId: !Ref VPC
  VPCEndpointECRAPI:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: '*'
            Resource: '*'
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnet1
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.ecr.api'
      VpcId: !Ref VPC


  StudioDomain:
    Type: AWS::SageMaker::Domain
    Properties:
      DomainName: !Ref SageMakerDomainName
      AppNetworkAccessType: VpcOnly
      AuthMode: IAM
      VpcId: !Ref VPC
      SubnetIds:
        - !Ref PrivateSubnet1
      DefaultUserSettings:
        ExecutionRole: !GetAtt SageMakerExecutionRole.Arn
        SecurityGroups:
          - !Ref SageMakerInstanceSecurityGroup

  StudioUserProfile:
    Type: AWS::SageMaker::UserProfile
    Properties:
      DomainId: !Ref StudioDomain
      UserProfileName: studio-user
      UserSettings:
        ExecutionRole: !GetAtt SageMakerExecutionRole.Arn

  # Products populated to Service Catalog
  ###################################################

  SageMakerStudioEMRNoAuthProduct:
    Type: AWS::ServiceCatalog::CloudFormationProduct
    Properties:
      Owner: AWS
      Name: SageMaker Studio Domain No Auth EMR
      ProvisioningArtifactParameters:
        - Name: SageMaker Studio Domain No Auth EMR
          Description: Provisions a SageMaker domain and No Auth EMR Cluster
          Info:
            LoadTemplateFromURL: https://aws-ml-blog.s3.amazonaws.com/artifacts/astra-m4-sagemaker/end-to-end/CFN-EMR-NoStudioNoAuthTemplate-v3.yaml
      Tags:
        - Key: "sagemaker:studio-visibility:emr"
          Value: "true"

  SageMakerStudioEMRNoAuthProductPortfolio:
    Type: AWS::ServiceCatalog::Portfolio
    Properties:
      ProviderName: AWS
      DisplayName: SageMaker Product Portfolio

  SageMakerStudioEMRNoAuthProductPortfolioAssociation:
    Type: AWS::ServiceCatalog::PortfolioProductAssociation
    Properties:
      PortfolioId: !Ref SageMakerStudioEMRNoAuthProductPortfolio
      ProductId: !Ref SageMakerStudioEMRNoAuthProduct

  EMRNoAuthLaunchConstraint:
    Type: 'AWS::IAM::Role'
    Properties:
      Policies:
        - PolicyDocument:
            Statement:
              - Action:
                  - s3:*
                Effect: Allow
                Resource:
                  - !Sub "arn:${AWS::Partition}:s3:::sagemaker-emr-template-cfn-*/*"
                  - !Sub "arn:${AWS::Partition}:s3:::sagemaker-emr-template-cfn-*"
              - Action:
                  - s3:GetObject
                Effect: Allow
                Resource: "*"
                Condition:
                  StringEquals:
                    s3:ExistingObjectTag/servicecatalog:provisioning: 'true'
          PolicyName: !Sub ${AWS::StackName}-${AWS::Region}-S3-Policy
        - PolicyDocument:
            Statement:
              - Action:
                  - "sns:Publish"
                Effect: Allow
                Resource: !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:*"
            Version: "2012-10-17"
          PolicyName: SNSPublishPermissions
        - PolicyDocument:
            Statement:
              - Action:
                  - "ec2:CreateSecurityGroup"
                  - "ec2:RevokeSecurityGroupEgress"
                  - "ec2:DeleteSecurityGroup"
                  - "ec2:createTags"
                  - "iam:TagRole"
                  - "ec2:AuthorizeSecurityGroupEgress"
                  - "ec2:AuthorizeSecurityGroupIngress"
                  - "ec2:RevokeSecurityGroupIngress"
                Effect: Allow
                Resource: "*"
            Version: "2012-10-17"
          PolicyName: EC2Permissions
        - PolicyDocument:
            Statement:
              - Action:
                  - "elasticmapreduce:RunJobFlow"
                Effect: Allow
                Resource: !Sub "arn:${AWS::Partition}:elasticmapreduce:${AWS::Region}:${AWS::AccountId}:cluster/*"
            Version: "2012-10-17"
          PolicyName: EMRRunJobFlowPermissions
        - PolicyDocument:
            Statement:
              - Action:
                  - "iam:PassRole"
                Effect: Allow
                Resource:
                  - !GetAtt EMRClusterinstanceProfileRole.Arn
                  - !GetAtt EMRClusterServiceRole.Arn
              - Action:
                  - "iam:CreateInstanceProfile"
                  - "iam:RemoveRoleFromInstanceProfile"
                  - "iam:DeleteInstanceProfile"
                  - "iam:AddRoleToInstanceProfile"
                Effect: Allow
                Resource: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:instance-profile/SC-*"
            Version: "2012-10-17"
          PolicyName: IAMPermissions
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "servicecatalog.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - "Fn::Sub": "arn:${AWS::Partition}:iam::aws:policy/AWSServiceCatalogAdminFullAccess"
        - "Fn::Sub": "arn:${AWS::Partition}:iam::aws:policy/AmazonEMRFullAccessPolicy_v2"

  # Sets the principal who can initiate provisioning from Service Studio
  #######################################################################

  SageMakerStudioEMRNoAuthProductPortfolioPrincipalAssociation:
    Type: AWS::ServiceCatalog::PortfolioPrincipalAssociation
    Properties:
      PrincipalARN: !GetAtt SageMakerExecutionRole.Arn
      PortfolioId: !Ref SageMakerStudioEMRNoAuthProductPortfolio
      PrincipalType: IAM

  SageMakerStudioPortfolioLaunchRoleConstraint:
    Type: AWS::ServiceCatalog::LaunchRoleConstraint
    Properties:
      PortfolioId: !Ref SageMakerStudioEMRNoAuthProductPortfolio
      ProductId: !Ref SageMakerStudioEMRNoAuthProduct
      RoleArn: !GetAtt EMRNoAuthLaunchConstraint.Arn
      Description: Role used for provisioning

  # EMR IAM Roles
  ########################################################################
  EMRClusterServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - sts:AssumeRole
            Effect: Allow
            Principal:
              Service:
                - elasticmapreduce.amazonaws.com
        Version: '2012-10-17'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
      Path: "/"
      Policies:
        - PolicyName:
            Fn::Sub: AllowEMRInstnaceProfilePolicy-${AWS::StackName}
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: "iam:PassRole"
                Resource: !GetAtt EMRClusterinstanceProfileRole.Arn

  # Users should consider using role-based access control, now that it is available, to pass the SageMaker execution role
  # to the cluster instead.
  EMRClusterinstanceProfileRole:
    Properties:
      RoleName:
        Fn::Sub: "${AWS::StackName}-EMRClusterinstanceProfileRole"
      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - sts:AssumeRole
            Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
        Version: '2012-10-17'
      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSageMakerFullAccess"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonS3ReadOnlyAccess"
      Path: "/"
    Type: AWS::IAM::Role

  # Manage EMR Log and Artifacts S3 Bucket
  ########################################################################
  CopyZips:
    Type: Custom::CopyZips
    DependsOn: CleanUpBucketonDelete
    Properties:
      ServiceToken:
        Fn::GetAtt: CopyZipsFunction.Arn
      DestBucket:
        Ref: S3Bucket
      SourceBucket:
        Fn::FindInMap:
          - ClusterConfigurations
          - s3params
          - BlogS3Bucket
      Prefix:
        Fn::FindInMap:
          - ClusterConfigurations
          - s3params
          - S3Key
      Objects:
        - Fn::FindInMap:
            - ClusterConfigurations
            - emr
            - BootStrapScriptFile
        - Fn::FindInMap:
            - ClusterConfigurations
            - emr
            - StepScriptFile
  BucketManagementRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect:
              Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Path: "/"
      Policies:
        - PolicyName:
            Fn::Sub: BucketManagementLambdaPolicy-${AWS::StackName}
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                Resource: "*"
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:DeleteObject
                Resource:
                  - Fn::Sub: arn:aws:s3:::${S3Bucket}/*
  CopyZipsFunction:
    Type: AWS::Lambda::Function
    Properties:
      Description: Copies objects from a source S3 bucket to a destination
      Handler: index.handler
      Runtime: python3.8
      Role:
        Fn::GetAtt: BucketManagementRole.Arn
      Timeout: 900
      Code:
        ZipFile: |
          import json
          import logging
          import threading
          import boto3
          import cfnresponse
          def copy_objects(source_bucket, dest_bucket, prefix, objects):
              s3 = boto3.client('s3')
              for o in objects:
                  key = prefix + o
                  copy_source = {
                      'Bucket': source_bucket,
                      'Key': key
                  }
                  print('copy_source: %s' % copy_source)
                  print('dest_bucket = %s'%dest_bucket)
                  print('key = %s' %key)
                  s3.copy_object(CopySource=copy_source, Bucket=dest_bucket,
                        Key=key)
          def delete_objects(bucket, prefix, objects):
              s3 = boto3.client('s3')
              objects = {'Objects': [{'Key': prefix + o} for o in objects]}
              s3.delete_objects(Bucket=bucket, Delete=objects)
          def timeout(event, context):
              logging.error('Execution is about to time out, sending failure response to CloudFormation')
              cfnresponse.send(event, context, cfnresponse.FAILED, {}, None)
          def handler(event, context):
              # make sure we send a failure to CloudFormation if the function
              # is going to timeout
              timer = \
                  threading.Timer((context.get_remaining_time_in_millis()
                      / 1000.00) - 0.5, timeout, args=[event, context])
              timer.start()
              print('Received event: %s' % json.dumps(event))
              status = cfnresponse.SUCCESS
              try:
                  source_bucket = event['ResourceProperties']['SourceBucket']
                  dest_bucket = event['ResourceProperties']['DestBucket']
                  prefix = event['ResourceProperties']['Prefix']
                  objects = event['ResourceProperties']['Objects']
                  if event['RequestType'] == 'Delete':
                      delete_objects(dest_bucket, prefix, objects)
                  else:
                      copy_objects(source_bucket, dest_bucket, prefix, objects)
              except Exception as e:
                  logging.error('Exception: %s' % e, exc_info=True)
                  status = cfnresponse.FAILED
              finally:
                  timer.cancel()
                  cfnresponse.send(event, context, status, {}, None)
  CleanUpBucketonDelete:
    Type: Custom::emptybucket
    Properties:
      ServiceToken:
        Fn::GetAtt:
          - CleanUpBucketonDeleteLambda
          - Arn
      BucketName:
        Ref: S3Bucket
  CleanUpBucketonDeleteLambda:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile:
          !Sub |
            import json, boto3, logging
            import cfnresponse
            logger = logging.getLogger()
            logger.setLevel(logging.INFO)

            def lambda_handler(event, context):
                logger.info("event: {}".format(event))
                try:
                    bucket = event['ResourceProperties']['BucketName']
                    logger.info("bucket: {}, event['RequestType']: {}".format(bucket,event['RequestType']))
                    if event['RequestType'] == 'Delete':
                        s3 = boto3.resource('s3')
                        bucket = s3.Bucket(bucket)
                        for obj in bucket.objects.filter():
                            logger.info("delete obj: {}".format(obj))
                            s3.Object(bucket.name, obj.key).delete()

                    sendResponseCfn(event, context, cfnresponse.SUCCESS)
                except Exception as e:
                    logger.info("Exception: {}".format(e))
                    sendResponseCfn(event, context,
                                    cfnresponse.FAILED)

            def sendResponseCfn(event, context, responseStatus):
                responseData = {}
                responseData['Data'] = {}
                cfnresponse.send(event, context, responseStatus, responseData, "CustomResourcePhysicalID")
      Handler: "index.lambda_handler"
      Runtime: python3.7
      MemorySize: 128
      Timeout: 60
      Role: !GetAtt BucketManagementRole.Arn

# Stack Outputs
###########################################################################
Outputs:
  SageMakerEMRDemoCloudformationVPCId:
    Description: The ID of the Sagemaker Studio VPC
    Value: !Ref VPC
    Export:
      Name: "SageMakerEMRDemoCloudformationVPCId"

  SageMakerEMRDemoCloudformationSubnetId:
    Description: The Subnet Id of Sagemaker Studio
    Value: !Ref PrivateSubnet1
    Export:
      Name: "SageMakerEMRDemoCloudformationSubnetId"

  SageMakerEMRDemoCloudformationSecurityGroup:
    Description: The Security group of Sagemaker Studio instance
    Value: !Ref SageMakerInstanceSecurityGroup
    Export:
      Name: "SageMakerEMRDemoCloudformationSecurityGroup"

  SageMakerEMRDemoCloudformationEMRClusterinstanceProfileRole:
    Description: Role for EMR Cluster's InstanceProfile
    Value: !Ref EMRClusterinstanceProfileRole
    Export:
      Name: "SageMakerEMRDemoCloudformationEMRClusterinstanceProfileRole"

  SageMakerEMRDemoCloudformationEMRClusterServiceRole:
    Description: Role for EMR Cluster's Service Role
    Value: !Ref EMRClusterServiceRole
    Export:
      Name: "SageMakerEMRDemoCloudformationEMRClusterServiceRole"

  SageMakerEMRDemoCloudformationS3BucketName:
    Description: Bucket Name for Amazon S3 bucket
    Value:
      Ref: S3Bucket
    Export:
      Name: "SageMakerEMRDemoCloudformationS3BucketName"

--------------------------------------------------------------------------------