├── Chapter02 ├── README.md ├── datascience-vpc.yaml ├── notebook-instance-environment.yaml ├── studio-domain.yaml └── studio-environment.yaml ├── Chapter03 ├── LabelData.ipynb ├── utils │ ├── cognito-helper.py │ └── sagemaker-helper.py └── workflow │ ├── post-multiple.py │ ├── post.py │ ├── pre.py │ └── ui.liquid.html ├── Chapter04 ├── PrepareData.ipynb ├── scripts │ └── preprocess.py └── spark-example.ipynb ├── Chapter05 ├── feature_store_apis.ipynb └── feature_store_train_deploy_models.ipynb ├── Chapter06 ├── DistributedTraining │ └── distrbuted-training.ipynb ├── Experiments │ └── Experiments.ipynb ├── HPO │ └── HPO.ipynb ├── code │ ├── csv_loader.py │ ├── csv_loader_pd.py │ ├── csv_loader_simple.py │ ├── model_pytorch.py │ ├── train_pytorch-dist.py │ ├── train_pytorch.py │ ├── train_pytorch_dist.py │ └── train_pytorch_model_dist.py └── train.ipynb ├── Chapter07 ├── code │ ├── csv_loader.py │ ├── csv_loader_pd.py │ ├── csv_loader_simple.py │ ├── model_pytorch.py │ ├── train_pytorch-dist.py │ ├── train_pytorch.py │ └── train_pytorch_dist.py ├── images │ ├── LossNotDecreasingRule.png │ └── RulesSummary.png ├── rules │ └── custom_rule.py ├── weather-prediction-debugger-profiler-with-rules.ipynb ├── weather-prediction-debugger-profiler.ipynb └── weather-prediction-debugger.ipynb ├── Chapter08 └── register-model.ipynb ├── Chapter09 └── a_b_deployment_with_production_variants.ipynb ├── Chapter10 ├── optimize.ipynb └── scripts │ └── preprocess_param.py ├── Chapter11 ├── README.md ├── bias_drift_monitoring │ ├── WeatherPredictionBiasDrift.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ ├── WeatherPredictionFeatureAttributionDrift.ipynb │ │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ │ ├── model │ │ └── weather-prediction-model.tar.gz │ │ └── weather-prediction-model.tar.gz ├── data_drift_monitoring │ ├── .DS_Store │ ├── WeatherPredictionDataDriftModelMonitoring.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ └── weather-prediction-model.tar.gz ├── feature_attribution_drift_monitoring │ ├── WeatherPredictionFeatureAttributionDrift.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ └── weather-prediction-model.tar.gz ├── model │ └── weather-prediction-model.tar.gz └── model_quality_monitoring │ ├── WeatherPredictionModelQualityMonitoring (1).ipynb │ ├── WeatherPredictionModelQualityMonitoring.ipynb │ ├── data │ ├── model-quality-baseline-data.csv │ ├── t_file.csv │ └── validation_data.csv │ ├── images │ ├── r2_Alarm.png │ └── r2_InsufficientData.png │ ├── model │ └── weather-prediction-model.tar.gz │ └── model_quality_churn_sdk.ipynb ├── LICENSE └── README.md /Chapter02/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This repository contains the example CloudFormation templates that can be used to setup a data science environment using either Amazon SageMaker Studio or Amazon SageMaker Notebook Instances. The pre-requisities are noted where applicable for either option. There is also a section included for those unfamiliar with using AWS CloudFormation on how to launch the CloudFormation templates provided. 4 | 5 | # Prerequisites 6 | 7 | ## Clone Repository 8 | 9 | To use the CloudFormation templates provided in this repository, you must first clone this repository creating a local copy. 
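For example, from a terminal with Git installed (the repository URL below is a placeholder; substitute this repository's actual clone URL):

    git clone https://github.com/<repository-owner>/<this-repository>.git
    cd <this-repository>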
These local files will be used to upload the appropriate template to AWS CloudFormation to create your data science environment. 10 | 11 | ## Create or Identify VPC 12 | 13 | The templates in this repository are set up to use a VPC, so you must either (1) have an existing VPC that can be used or (2) create a new VPC; the VPC details are referenced as parameters when launching the CloudFormation stacks. 14 | 15 | If you do not have a VPC, a CloudFormation template is provided in this directory, [datascience-vpc.yaml](datascience-vpc.yaml), to get you up and running quickly. 16 | 17 | ## Create a KMS Key for Encryption 18 | 19 | The templates provided accept a KMS key as an input parameter, used to encrypt the storage directly attached to your environments. 20 | 21 | You can use an existing key or create your own symmetric encryption key using the following instructions: https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html 22 | 23 | Make sure you note the KMS key ID and ARN so you can provide them as input parameters to your CloudFormation templates. 24 | 25 | 26 | ## Using AWS CloudFormation 27 | 28 | To launch the provided CloudFormation templates, you can use the CLI, an SDK, or the AWS Console. The instructions below use the AWS Console. 29 | 30 | 1. Sign in to the AWS Management Console and open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation 31 | 32 | 2. In the **Stacks** section, select **Create stack** and select **With new resources (standard)** from the dropdown. 33 | 34 | 3. Under **Prerequisite - Prepare template**, choose **Template is ready** 35 | 36 | 4. Under **Specify template**, choose **Upload a template file** 37 | 38 | 5. This allows you to upload the appropriate CloudFormation template that will be used to create your data science environment. Because the input parameters, additional prerequisites, and post-launch tasks vary between the two environment types, they are covered specifically in the sections below. 39 | 40 | # Studio Environment 41 | 42 | ## Studio Environment Prerequisites 43 | 1. **Existing Studio Domain**: Because it's common to add new users to an existing Studio Domain, the CloudFormation templates to create a domain and to add a new user are typically separated. Creating a Studio Domain is a one-time setup activity per AWS Account / AWS Region. The CloudFormation template that creates a new user assumes an existing Studio Domain. If you do not have a Studio Domain, a CloudFormation template is provided to create a new domain as a prerequisite: [studio-domain.yaml](studio-domain.yaml). 44 | 45 | ## Creating the Stack & Template Usage 46 | 47 | Follow the instructions above under **Using AWS CloudFormation** to launch the [studio-environment.yaml](studio-environment.yaml) CloudFormation template. 48 | 49 | ## Post Launch Tasks 50 | 51 | We need to copy the dataset we'll be using throughout the chapters from a public S3 bucket to the bucket created by the CloudFormation template. There are multiple ways to do this, but for simplicity we'll execute a command with the AWS Command Line Interface (CLI) from the terminal available in your Studio environment. 52 | 53 | To do this: 54 | 1. Sign in to the AWS Management Console and open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation 55 | 2. Go to the stack provisioned above, open the **Outputs** tab, and capture the name of the S3 bucket that was created 56 | 3. 
Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker, select **Amazon SageMaker Studio**, then select **Open Studio** for the user name you provided. 57 | 4. Once you're in your Studio environment, go to **File** -> **New** -> **Terminal** 58 | 5. Copy and paste the command below, replacing `<s3 bucket>` with the value from #2 above: 59 | 60 | nohup aws s3 cp s3://openaq-fetches/ s3://<s3 bucket>/data/ --recursive & 61 | 62 | # Notebook Instance Environment 63 | 64 | ## Notebook Instance Environment Prerequisites 65 | 66 | None, other than what is already noted above under Prerequisites. 67 | 68 | ## Creating the Stack & Template Usage 69 | 70 | Follow the instructions above under **Using AWS CloudFormation** to launch the [notebook-instance-environment.yaml](notebook-instance-environment.yaml) CloudFormation template. 71 | 72 | ## Post Launch Tasks 73 | 74 | There are no post-launch tasks. Once stack creation completes successfully, your data science environment is available for use in the AWS console under **Amazon SageMaker** -> **Notebook** -> **Notebook instances**. 75 | -------------------------------------------------------------------------------- /Chapter02/datascience-vpc.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: 3 | Creates a VPC with public and private subnets for a given AWS Account. 4 | Parameters: 5 | VpcCidrParam: 6 | Type: String 7 | Default: 10.0.0.0/16 8 | Description: VPC CIDR. For more info, see http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html#VPC_Sizing 9 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 10 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 11 | 12 | # Public Subnets 13 | PublicAZASubnetBlock: 14 | Type: String 15 | Default: 10.0.0.0/24 16 | Description: Subnet CIDR for first Availability Zone 17 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 18 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 19 | 20 | PublicAZBSubnetBlock: 21 | Type: String 22 | Default: 10.0.1.0/24 23 | Description: Subnet CIDR for second Availability Zone 24 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 25 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 26 | 27 | 28 | # Private Subnets 29 | PrivateAZASubnetBlock: 30 | Type: String 31 | Default: 10.0.2.0/24 32 | Description: Subnet CIDR for first Availability Zone (e.g. us-west-2a, us-east-1b) 33 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 34 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 35 | 36 | PrivateAZBSubnetBlock: 37 | Type: String 38 | Default: 10.0.3.0/24 39 | Description: Subnet CIDR for second Availability Zone (e.g. us-west-2b, us-east-1c) 40 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 41 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 
42 | 43 | Outputs: 44 | VpcId: 45 | Description: VPC Id 46 | Value: !Ref Vpc 47 | Export: 48 | Name: !Sub "${AWS::StackName}-vpc-id" 49 | 50 | PublicRouteTableId: 51 | Description: Route Table for public subnets 52 | Value: !Ref PublicRouteTable 53 | Export: 54 | Name: !Sub "${AWS::StackName}-public-rtb" 55 | 56 | PublicAZASubnetId: 57 | Description: Availability Zone A public subnet Id 58 | Value: !Ref PublicAZASubnet 59 | Export: 60 | Name: !Sub "${AWS::StackName}-public-az-a-subnet" 61 | 62 | PublicAZBSubnetId: 63 | Description: Availability Zone B public subnet Id 64 | Value: !Ref PublicAZBSubnet 65 | Export: 66 | Name: !Sub "${AWS::StackName}-public-az-b-subnet" 67 | 68 | PrivateAZASubnetId: 69 | Description: Availability Zone A private subnet Id 70 | Value: !Ref PrivateAZASubnet 71 | Export: 72 | Name: !Sub "${AWS::StackName}-private-az-a-subnet" 73 | 74 | PrivateAZBSubnetId: 75 | Description: Availability Zone B private subnet Id 76 | Value: !Ref PrivateAZBSubnet 77 | Export: 78 | Name: !Sub "${AWS::StackName}-private-az-b-subnet" 79 | 80 | PrivateAZARouteTableId: 81 | Description: Route table for private subnets in AZ A 82 | Value: !Ref PrivateAZARouteTable 83 | Export: 84 | Name: !Sub "${AWS::StackName}-private-az-a-rtb" 85 | 86 | PrivateAZBRouteTableId: 87 | Description: Route table for private subnets in AZ B 88 | Value: !Ref PrivateAZBRouteTable 89 | Export: 90 | Name: !Sub "${AWS::StackName}-private-az-b-rtb" 91 | 92 | 93 | Resources: 94 | Vpc: 95 | Type: AWS::EC2::VPC 96 | Properties: 97 | CidrBlock: !Ref VpcCidrParam 98 | Tags: 99 | - Key: Name 100 | Value: !Sub ${AWS::StackName} 101 | 102 | InternetGateway: 103 | Type: AWS::EC2::InternetGateway 104 | Properties: 105 | Tags: 106 | - Key: Name 107 | Value: !Sub ${AWS::StackName} 108 | 109 | VPCGatewayAttachment: 110 | Type: AWS::EC2::VPCGatewayAttachment 111 | Properties: 112 | InternetGatewayId: !Ref InternetGateway 113 | VpcId: !Ref Vpc 114 | 115 | # Public Subnets - Route Table 116 | PublicRouteTable: 117 | Type: AWS::EC2::RouteTable 118 | Properties: 119 | VpcId: !Ref Vpc 120 | Tags: 121 | - Key: Name 122 | Value: !Sub ${AWS::StackName}-public 123 | - Key: Type 124 | Value: public 125 | 126 | PublicSubnetsRoute: 127 | Type: AWS::EC2::Route 128 | Properties: 129 | RouteTableId: !Ref PublicRouteTable 130 | DestinationCidrBlock: 0.0.0.0/0 131 | GatewayId: !Ref InternetGateway 132 | DependsOn: VPCGatewayAttachment 133 | 134 | # Public Subnets 135 | # First Availability Zone 136 | PublicAZASubnet: 137 | Type: AWS::EC2::Subnet 138 | Properties: 139 | VpcId: !Ref Vpc 140 | CidrBlock: !Ref PublicAZASubnetBlock 141 | AvailabilityZone: !Select [0, !GetAZs ""] 142 | MapPublicIpOnLaunch: true 143 | Tags: 144 | - Key: Name 145 | Value: !Sub 146 | - ${AWS::StackName}-public-${AZ} 147 | - { AZ: !Select [0, !GetAZs ""] } 148 | - Key: Type 149 | Value: public 150 | 151 | PublicAZASubnetRouteTableAssociation: 152 | Type: AWS::EC2::SubnetRouteTableAssociation 153 | Properties: 154 | SubnetId: !Ref PublicAZASubnet 155 | RouteTableId: !Ref PublicRouteTable 156 | 157 | # Second Availability Zone 158 | PublicAZBSubnet: 159 | Type: AWS::EC2::Subnet 160 | Properties: 161 | VpcId: !Ref Vpc 162 | CidrBlock: !Ref PublicAZBSubnetBlock 163 | AvailabilityZone: !Select [1, !GetAZs ""] 164 | MapPublicIpOnLaunch: true 165 | Tags: 166 | - Key: Name 167 | Value: !Sub 168 | - ${AWS::StackName}-public-${AZ} 169 | - { AZ: !Select [1, !GetAZs ""] } 170 | - Key: Type 171 | Value: public 172 | 173 | PublicAZBSubnetRouteTableAssociation: 174 | Type: 
AWS::EC2::SubnetRouteTableAssociation 175 | Properties: 176 | SubnetId: !Ref PublicAZBSubnet 177 | RouteTableId: !Ref PublicRouteTable 178 | 179 | # Private Subnets - NAT Gateways 180 | # First Availability Zone 181 | AZANatGatewayEIP: 182 | Type: AWS::EC2::EIP 183 | Properties: 184 | Domain: vpc 185 | DependsOn: VPCGatewayAttachment 186 | 187 | AZANatGateway: 188 | Type: AWS::EC2::NatGateway 189 | Properties: 190 | AllocationId: !GetAtt AZANatGatewayEIP.AllocationId 191 | SubnetId: !Ref PublicAZASubnet 192 | 193 | # Private Subnets 194 | # First Availability Zone 195 | PrivateAZASubnet: 196 | Type: AWS::EC2::Subnet 197 | Properties: 198 | VpcId: !Ref Vpc 199 | CidrBlock: !Ref PrivateAZASubnetBlock 200 | AvailabilityZone: !Select [0, !GetAZs ""] 201 | Tags: 202 | - Key: Name 203 | Value: !Sub 204 | - ${AWS::StackName}-private-${AZ} 205 | - { AZ: !Select [0, !GetAZs ""] } 206 | - Key: Type 207 | Value: private 208 | 209 | PrivateAZARouteTable: 210 | Type: AWS::EC2::RouteTable 211 | Properties: 212 | VpcId: !Ref Vpc 213 | Tags: 214 | - Key: Name 215 | Value: !Sub 216 | - ${AWS::StackName}-private-${AZ} 217 | - { AZ: !Select [0, !GetAZs ""] } 218 | - Key: Type 219 | Value: private 220 | 221 | PrivateAZARoute: 222 | Type: AWS::EC2::Route 223 | Properties: 224 | RouteTableId: !Ref PrivateAZARouteTable 225 | DestinationCidrBlock: 0.0.0.0/0 226 | NatGatewayId: !Ref AZANatGateway 227 | 228 | PrivateAZARouteTableAssociation: 229 | Type: AWS::EC2::SubnetRouteTableAssociation 230 | Properties: 231 | SubnetId: !Ref PrivateAZASubnet 232 | RouteTableId: !Ref PrivateAZARouteTable 233 | 234 | # # Second Availability Zone 235 | PrivateAZBSubnet: 236 | Type: AWS::EC2::Subnet 237 | Properties: 238 | VpcId: !Ref Vpc 239 | CidrBlock: !Ref PrivateAZBSubnetBlock 240 | AvailabilityZone: !Select [1, !GetAZs ""] 241 | Tags: 242 | - Key: Name 243 | Value: !Sub 244 | - ${AWS::StackName}-private-${AZ} 245 | - { AZ: !Select [1, !GetAZs ""] } 246 | - Key: Type 247 | Value: private 248 | 249 | PrivateAZBRouteTable: 250 | Type: AWS::EC2::RouteTable 251 | Properties: 252 | VpcId: !Ref Vpc 253 | Tags: 254 | - Key: Name 255 | Value: !Sub 256 | - ${AWS::StackName}-private-${AZ} 257 | - { AZ: !Select [1, !GetAZs ""] } 258 | - Key: Type 259 | Value: private 260 | 261 | 262 | PrivateAZBRoute: 263 | Type: AWS::EC2::Route 264 | Properties: 265 | RouteTableId: !Ref PrivateAZBRouteTable 266 | DestinationCidrBlock: 0.0.0.0/0 267 | NatGatewayId: !Ref AZANatGateway 268 | 269 | PrivateAZBRouteTableAssociation: 270 | Type: AWS::EC2::SubnetRouteTableAssociation 271 | Properties: 272 | SubnetId: !Ref PrivateAZBSubnet 273 | RouteTableId: !Ref PrivateAZBRouteTable 274 | 275 | S3VPCEndpoint: 276 | Type: "AWS::EC2::VPCEndpoint" 277 | Properties: 278 | RouteTableIds: 279 | - !Ref PublicRouteTable 280 | - !Ref PrivateAZARouteTable 281 | - !Ref PrivateAZBRouteTable 282 | ServiceName: !Join 283 | - "" 284 | - - com.amazonaws. 285 | - !Ref "AWS::Region" 286 | - .s3 287 | VpcId: !Ref Vpc -------------------------------------------------------------------------------- /Chapter02/notebook-instance-environment.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'Example data science environment creating a new SageMaker Notebook Instance using an existing VPC. This template also includes the creation of an Amazon S3 Bucket and IAM Role. 
A lifecycle policy is also included to pull the dataset that will be used in future book chapters.' 5 | Parameters: #These are configuration parameters that are passed in as input on stack creation 6 | NotebookInstanceName: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | ConstraintDescription: Maximum of 63 alphanumeric characters. Can include hyphens but not spaces. 9 | Description: SageMaker Notebook instance name 10 | MaxLength: '63' 11 | MinLength: '1' 12 | Type: String 13 | Default: 'myNotebook' 14 | NotebookInstanceType: 15 | AllowedValues: 16 | - ml.t2.large 17 | - ml.t2.xlarge 18 | - ml.t3.large 19 | - ml.t3.xlarge 20 | ConstraintDescription: Must select a valid notebook instance type. 21 | Default: ml.t3.large 22 | Description: Select Instance type for the SageMaker Notebook 23 | Type: String 24 | VPCSubnetIds: 25 | Description: The ID of the subnet in a VPC 26 | Type: String 27 | Default: 'Replace with your VPC subnet' 28 | VPCSecurityGroupIds: 29 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 30 | Type: CommaDelimitedList 31 | Default: 'Replace with the security group id(s) for your VPC' 32 | KMSKeyId: 33 | Description: The ARN of the KMS Key to use for encrypting storage attached to notebook 34 | Type: String 35 | Default: 'Replace with your KMS Key ARN' 36 | NotebookVolumeSize: 37 | Description: The size of the ML Storage Volume attached to your notebook instance. 38 | Type: Number 39 | Default: 75 40 | Resources: 41 | SageMakerRole: 42 | Type: AWS::IAM::Role 43 | Properties: 44 | AssumeRolePolicyDocument: 45 | Version: 2012-10-17 46 | Statement: 47 | - Effect: Allow 48 | Principal: 49 | Service: 50 | - "sagemaker.amazonaws.com" 51 | Action: 52 | - "sts:AssumeRole" 53 | ManagedPolicyArns: 54 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 55 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 56 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 57 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 58 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 59 | - "arn:aws:iam::aws:policy/AmazonCognitoPowerUser" 60 | SageMakerLifecycleConfig: 61 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig 62 | Properties: 63 | OnCreate: 64 | - Content: 65 | Fn::Base64: !Sub "nohup aws s3 cp s3://openaq-fetches/ s3://${S3Bucket}/data/ --recursive &" 66 | DependsOn: S3Bucket 67 | SageMakerNotebookInstance: 68 | Type: "AWS::SageMaker::NotebookInstance" 69 | Properties: 70 | KmsKeyId: !Ref KMSKeyId 71 | NotebookInstanceName: !Ref NotebookInstanceName 72 | InstanceType: !Ref NotebookInstanceType 73 | RoleArn: !GetAtt SageMakerRole.Arn 74 | SubnetId: !Ref VPCSubnetIds 75 | SecurityGroupIds: !Ref VPCSecurityGroupIds 76 | LifecycleConfigName: !GetAtt SageMakerLifecycleConfig.NotebookInstanceLifecycleConfigName 77 | VolumeSizeInGB: !Ref NotebookVolumeSize 78 | S3Bucket: 79 | Type: AWS::S3::Bucket 80 | Properties: 81 | BucketName: 82 | Fn::Join: 83 | - '-' 84 | - - datascience-environment-notebookinstance- 85 | - Fn::Select: 86 | - 4 87 | - Fn::Split: 88 | - '-' 89 | - Fn::Select: 90 | - 2 91 | - Fn::Split: 92 | - / 93 | - Ref: AWS::StackId 94 | 95 | Outputs: 96 | SageMakerNoteBookURL: 97 | Description: "URL for the SageMaker Notebook Instance" 98 | Value: !Sub 'https://${AWS::Region}.console.aws.amazon.com/sagemaker/home?region=${AWS::Region}#/notebook-instances/openNotebook/${NotebookInstanceName}' 99 | SageMakerNotebookInstanceARN: 100 | Description: "ARN for the SageMaker Notebook Instance" 101 | Value: !Ref SageMakerNotebookInstance 102 | S3BucketARN: 103 | 
Description: "ARN for the S3 Bucket" 104 | Value: !Ref S3Bucket 105 | 106 | -------------------------------------------------------------------------------- /Chapter02/studio-domain.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'CloudFormation to create new Studio Domain if one does not exist' 5 | Parameters: 6 | DomainName: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | ConstraintDescription: Maximum of 63 alphanumeric characters. Can include hyphens but not spaces. 9 | Description: SageMaker Studio Domain Name 10 | MaxLength: '63' 11 | MinLength: '1' 12 | Type: String 13 | Default: 'StudioDomain' 14 | VPCId: 15 | Description: The ID of the VPC that Studio uses for communication 16 | Type: String 17 | Default: 'Replace with your VPC ID' 18 | VPCSubnetIds: 19 | Description: Choose which subnets Studio should use 20 | Type: 'List' 21 | Default: 'subnet-1,subnet-2' 22 | VPCSecurityGroupIds: 23 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 24 | Type: CommaDelimitedList 25 | Default: 'Replace with the security group id(s) for your VPC' 26 | KMSKeyId: 27 | Description: The ARN of the KMS Key to use for encrypting storage attached to notebook 28 | Type: String 29 | Default: 'Replace with your KMS Key ARN' 30 | Resources: 31 | StudioDomain: 32 | Type: AWS::SageMaker::Domain 33 | Properties: 34 | AppNetworkAccessType: VpcOnly 35 | AuthMode: IAM 36 | DefaultUserSettings: 37 | ExecutionRole: !GetAtt SageMakerRole.Arn 38 | SecurityGroups: !Ref VPCSecurityGroupIds 39 | DomainName: !Ref DomainName 40 | KmsKeyId: !Ref KMSKeyId 41 | SubnetIds: !Ref VPCSubnetIds 42 | VpcId: !Ref VPCId 43 | SageMakerRole: 44 | Type: AWS::IAM::Role 45 | Properties: 46 | AssumeRolePolicyDocument: 47 | Version: 2012-10-17 48 | Statement: 49 | - Effect: Allow 50 | Principal: 51 | Service: 52 | - "sagemaker.amazonaws.com" 53 | Action: 54 | - "sts:AssumeRole" 55 | ManagedPolicyArns: 56 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 57 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 58 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 59 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 60 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 61 | - "arn:aws:iam::aws:policy/AmazonCognitoPowerUser" 62 | Outputs: 63 | SageMakerDomainID: 64 | Description: "ID for the Studio Domain" 65 | Value: !Ref StudioDomain 66 | -------------------------------------------------------------------------------- /Chapter02/studio-environment.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'Example data science environment creating a new SageMaker Studio User in an existing Studio Domain using an existing VPC. This template also includes the creation of an Amazon S3 Bucket and IAM Role.' 5 | Parameters: 6 | StudioDomainID: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | Description: ID of the Studio Domain where user should be created (ex. 
d-xxxnxxnxxnxn) 9 | Default: d-xxxnxxnxxnxn 10 | Type: String 11 | Team: 12 | AllowedValues: 13 | - weatherproduct 14 | - weatherresearch 15 | Description: Team name for user working in associated environment 16 | Default: weatherproduct 17 | Type: String 18 | UserProfileName: 19 | Description: User profile name 20 | AllowedPattern: '^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}' 21 | Type: String 22 | Default: 'UserName' 23 | VPCSecurityGroupIds: 24 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 25 | Type: CommaDelimitedList 26 | Default: 'Replace with the security group id(s) for your VPC' 27 | Resources: 28 | StudioUser: 29 | Type: AWS::SageMaker::UserProfile 30 | Properties: 31 | DomainId: !Ref StudioDomainID 32 | Tags: 33 | - Key: "Environment" 34 | Value: "Development" 35 | - Key: "Team" 36 | Value: !Ref Team 37 | UserProfileName: !Ref UserProfileName 38 | UserSettings: 39 | ExecutionRole: !GetAtt SageMakerRole.Arn 40 | SecurityGroups: !Ref VPCSecurityGroupIds 41 | 42 | SageMakerRole: 43 | Type: AWS::IAM::Role 44 | Properties: 45 | AssumeRolePolicyDocument: 46 | Version: 2012-10-17 47 | Statement: 48 | - Effect: Allow 49 | Principal: 50 | Service: 51 | - "sagemaker.amazonaws.com" 52 | Action: 53 | - "sts:AssumeRole" 54 | ManagedPolicyArns: 55 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 56 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 57 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 58 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 59 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 60 | 61 | S3Bucket: 62 | Type: AWS::S3::Bucket 63 | Properties: 64 | BucketName: 65 | Fn::Join: 66 | - '-' 67 | - - datascience-environment-studio 68 | - !Ref UserProfileName 69 | - Fn::Select: 70 | - 4 71 | - Fn::Split: 72 | - '-' 73 | - Fn::Select: 74 | - 2 75 | - Fn::Split: 76 | - / 77 | - Ref: AWS::StackId 78 | 79 | Outputs: 80 | S3BucketARN: 81 | Description: "ARN for the S3 Bucket" 82 | Value: !Ref S3Bucket 83 | 84 | -------------------------------------------------------------------------------- /Chapter03/LabelData.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "# Chapter 2: Data Labeling with SageMaker Ground Truth: Custom Labeling\n", 6 | "\n", 7 | "In this notebook we'll perform the following steps:\n", 8 | "\n", 9 | "* Create a private workforce backed by a Cognito user pool.\n", 10 | "* Create a manifest file that lists the items we want to label\n", 11 | "* Define a custom Ground Truth labeling workflow, consisting of two Lambda functions and a UI template, and launch a labeling job\n", 12 | "* Add a second worker to our private workforce\n", 13 | "* Adjust the post-processing part of the workflow to handle input from multiple workers, and launch another labeling job\n", 14 | "\n" 15 | ], 16 | "cell_type": "markdown", 17 | "metadata": {} 18 | }, 19 | { 20 | "source": [ 21 | "## Create a private workforce\n", 22 | "\n", 23 | "Before executing the code in this section, review and set the following variables:\n", 24 | "\n", 25 | "* `PoolName`: The name for the user pool in Cognito\n", 26 | "* `ClientName`: The name for the Cognito user pool client\n", 27 | "* `IdentityPoolName`: The name for the Cognito identity pool\n", 28 | "* `Region`: The name of the AWS region you're working in\n", 29 | "* `IamRolePrefix`: A prefix to use when naming new IAM roles\n", 30 | "* `GroupName`: Name for the Cognito user group\n", 31 | "* `DomainName`: Domain name for the Cognito 
authentication page\n", 32 | "* `WorkteamName`: Name for the private work team\n", 33 | "* `UserEmail`: User name to use (use a fake email address)\n", 34 | "* `Password`: Use a password with at least one upper case character, one symbol, and one number" 35 | ], 36 | "cell_type": "markdown", 37 | "metadata": {} 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "# Constants\n", 46 | "\n", 47 | "PoolName = 'MyUserPool'\n", 48 | "ClientName = 'MyUserPoolClient'\n", 49 | "IdentityPoolName = 'MyIdentityPool'\n", 50 | "Region = 'us-east-1'\n", 51 | "IamRolePrefix = 'MyRole'\n", 52 | "GroupName = 'MyGroup'\n", 53 | "DomainName = 'MyDomain'\n", 54 | "WorkteamName = 'MyTeam'\n", 55 | "UserEmail = \"me@foo.com\"\n", 56 | "Password = 'PwTest123!'" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 88, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "from utils.cognito-helper import CognitoHelper\n", 66 | "cognito_helper = CognitoHelper(Region, IamRolePrefix)\n", 67 | "cognito_helper.create_user_pool(PoolName)\n", 68 | "cognito_helper.create_user_pool_client(ClientName)\n", 69 | "cognito_helper.create_identity_pool(IdentityPoolName)\n", 70 | "cognito_helper.create_group(GroupName)\n", 71 | "cognito_helper.create_user_pool_domain(DomainName)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 48, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "arn:aws:sagemaker:us-east-1:102165494304:workteam/private-crowd/rdtest\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "from util.sagemaker-helper import SagemakerHelper\n", 89 | "sagemaker_helper = SagemakerHelper(Region, IamRolePrefix)\n", 90 | "sagemaker_helper.create_workteam(WorkteamName, \n", 91 | " cognito_helper.user_pool_id, \n", 92 | " cognito_helper.group_name, \n", 93 | " cognito_helper.user_pool_client_id)\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "cognito_helper.update_client(sagemaker_helper.get_workforce_domain())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 76, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "{'UserConfirmed': False,\n", 114 | " 'UserSub': 'dac2455f-692c-4e38-b83b-6afbcd6a57ef',\n", 115 | " 'ResponseMetadata': {'RequestId': '057a69f9-609b-4979-ad47-83475ec05eac',\n", 116 | " 'HTTPStatusCode': 200,\n", 117 | " 'HTTPHeaders': {'date': 'Fri, 26 Mar 2021 00:02:48 GMT',\n", 118 | " 'content-type': 'application/x-amz-json-1.1',\n", 119 | " 'content-length': '72',\n", 120 | " 'connection': 'keep-alive',\n", 121 | " 'x-amzn-requestid': '057a69f9-609b-4979-ad47-83475ec05eac'},\n", 122 | " 'RetryAttempts': 0}}" 123 | ] 124 | }, 125 | "execution_count": 76, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "cognito_helper.add_user(UserEmail, Password)" 132 | ] 133 | }, 134 | { 135 | "source": [ 136 | "## Create a manifest file\n", 137 | "\n", 138 | "In this section, you'll need to define:\n", 139 | "\n", 140 | "* The name of your S3 bucket\n", 141 | "* The folder (prefix) where you stored the _OpenAQ_ data set.\n", 142 | "* The folder (prefix) where you want to store the manifest." 
143 | ], 144 | "cell_type": "markdown", 145 | "metadata": {} 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 49, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "s3_bucket = 'MyS3Bucket'\n", 154 | "s3_prefix = 'openaq/realtime/'\n", 155 | "s3_prefix_manifest = 'inventory'" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 105, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Processing openaq/realtime/2013-11-27/2013-11-27.ndjson\n", 168 | "Got 40 manifest entries\n" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "sagemaker_helper.create_manifest(s3_bucket, s3_prefix, s3_prefix_manifest)" 174 | ] 175 | }, 176 | { 177 | "source": [ 178 | "## Create custom workflow\n", 179 | "\n", 180 | "In this section, you must define:\n", 181 | "\n", 182 | "* The folder (prefix) where you want to store the workflow files.\n", 183 | "* The name prefix for your Lambda functions.\n", 184 | "* The folder (prefix) where you want to store the labeling output." 185 | ], 186 | "cell_type": "markdown", 187 | "metadata": {} 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "s3_prefix_workflow = 'workflow'\n", 196 | "fn_prefix = 'MyFn'\n", 197 | "s3_prefix_labels = 'labels'" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 114, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "sagemaker_helper.create_workflow(s3_bucket, s3_prefix_workflow, fn_prefix, s3_prefix_labels)" 207 | ] 208 | }, 209 | { 210 | "source": [ 211 | "## Add another worker\n", 212 | "\n", 213 | "In this section you'll need to define:\n", 214 | "\n", 215 | "* `UserEmail2`: User name to use for second worker (use a fake email address)\n", 216 | "* `Password2`: Use a password with at least one upper case character, one symbol, and one number" 217 | ], 218 | "cell_type": "markdown", 219 | "metadata": {} 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "cognito_helper.add_user(UserEmail2, Password2)" 228 | ] 229 | }, 230 | { 231 | "source": [ 232 | "## Launch labeling job for multiple workers" 233 | ], 234 | "cell_type": "markdown", 235 | "metadata": {} 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "sagemaker_helper.create_workflow_multiple_workers(s3_bucket, s3_prefix_workflow, fn_prefix, \n", 244 | " s3_prefix_labels, s3_prefix_labels)" 245 | ] 246 | } 247 | ], 248 | "metadata": { 249 | "instance_type": "ml.m5.large", 250 | "kernelspec": { 251 | "name": "python3", 252 | "display_name": "Python 3.8.5 64-bit ('.venv': venv)", 253 | "metadata": { 254 | "interpreter": { 255 | "hash": "cb064b2fb9a88e905f0b275d2c13a9378a141d23a43c9eb5ab74845797a0d104" 256 | } 257 | } 258 | }, 259 | "language_info": { 260 | "codemirror_mode": { 261 | "name": "ipython", 262 | "version": 3 263 | }, 264 | "file_extension": ".py", 265 | "mimetype": "text/x-python", 266 | "name": "python", 267 | "nbconvert_exporter": "python", 268 | "pygments_lexer": "ipython3", 269 | "version": "3.8.5-final" 270 | } 271 | }, 272 | "nbformat": 4, 273 | "nbformat_minor": 4 274 | } -------------------------------------------------------------------------------- /Chapter03/utils/cognito-helper.py: 
-------------------------------------------------------------------------------- 1 | import boto3 2 | import logging 3 | import sys 4 | import hmac 5 | import hashlib 6 | import base64 7 | 8 | class CognitoHelper: 9 | def __init__(self, region, iam_role_prefix): 10 | logging.basicConfig(level=logging.INFO) 11 | self.logger = logging.getLogger('CognitoHelper') 12 | self.cognito = boto3.client('cognito-idp') 13 | self.cognitoid = boto3.client('cognito-identity') 14 | self.iam = boto3.client('iam') 15 | self.region = region 16 | self.iam_role_prefix = iam_role_prefix 17 | 18 | def create_user_pool(self, PoolName): 19 | response = self.cognito.create_user_pool(PoolName=PoolName) 20 | self.user_pool_id = response['UserPool']['Id'] 21 | self.user_pool_arn = response['UserPool']['Arn'] 22 | self.logger.info(f"Created user pool with ID: {self.user_pool_id}; ARN: {self.user_pool_arn}") 23 | 24 | def create_user_pool_client(self, ClientName): 25 | response = self.cognito.create_user_pool_client( 26 | UserPoolId=self.user_pool_id, 27 | ClientName=ClientName, 28 | GenerateSecret=True, 29 | SupportedIdentityProviders = ['COGNITO'], 30 | ExplicitAuthFlows=[ 31 | 'ADMIN_NO_SRP_AUTH' 32 | ] 33 | ) 34 | self.user_pool_client_id = response['UserPoolClient']['ClientId'] 35 | self.logger.info(f"Created user pool client with ID: {self.user_pool_client_id}") 36 | 37 | def create_identity_pool(self, IdentityPoolName): 38 | response = self.cognitoid.create_identity_pool( 39 | IdentityPoolName=IdentityPoolName, 40 | AllowUnauthenticatedIdentities=False, 41 | CognitoIdentityProviders=[ 42 | { 43 | 'ProviderName': f"cognito-idp.{self.region}.amazonaws.com/{self.user_pool_id}", 44 | 'ClientId': self.user_pool_client_id 45 | }, 46 | ] 47 | ) 48 | self.id_pool_id = response['IdentityPoolId'] 49 | self.logger.info(f"Created identity pool {self.id_pool_id}") 50 | 51 | def create_group(self, GroupName): 52 | assume_role_doc = """{ 53 | "Version": "2012-10-17", 54 | "Statement": [ 55 | { 56 | "Effect": "Allow", 57 | "Principal": { 58 | "Federated": "cognito-identity.amazonaws.com" 59 | }, 60 | "Action": "sts:AssumeRoleWithWebIdentity", 61 | "Condition": { 62 | "StringEquals": { 63 | "cognito-identity.amazonaws.com:aud": """ + '"' + self.id_pool_id + '"' + """ 64 | } 65 | } 66 | }, 67 | { 68 | "Effect": "Allow", 69 | "Principal": { 70 | "Federated": "cognito-identity.amazonaws.com" 71 | }, 72 | "Action": "sts:AssumeRoleWithWebIdentity", 73 | "Condition": { 74 | "ForAnyValue:StringLike": { 75 | "cognito-identity.amazonaws.com:amr": "authenticated" 76 | } 77 | } 78 | } 79 | ] 80 | } 81 | """ 82 | response = self.iam.create_role( 83 | RoleName=f"{self.iam_role_prefix}-worker-group-role", 84 | AssumeRolePolicyDocument=assume_role_doc 85 | ) 86 | self.group_role_id = response['Role']['RoleId'] 87 | self.group_role_arn = response['Role']['Arn'] 88 | 89 | response = self.cognito.create_group( 90 | GroupName=GroupName, 91 | UserPoolId=self.user_pool_id, 92 | RoleArn=self.group_role_arn, 93 | Precedence=1 94 | ) 95 | self.group_name = response['Group']['GroupName'] 96 | print(f"Created worker group {self.group_name}") 97 | 98 | def create_user_pool_domain(self, DomainName): 99 | self.cognito.create_user_pool_domain( 100 | Domain=DomainName, 101 | UserPoolId=self.user_pool_id 102 | ) 103 | 104 | def get_client_secret(self): 105 | response = self.cognito.describe_user_pool_client( 106 | UserPoolId=self.user_pool_id, 107 | ClientId=self.user_pool_client_id 108 | ) 109 | self.client_secret = response['UserPoolClient']['ClientSecret'] 
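    # The app client is created with GenerateSecret=True, so Cognito requires a
    # SECRET_HASH when signing up users; add_user() below uses this client secret
    # to compute it (HMAC-SHA256 of username + client id, base64-encoded).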
110 | 111 | def update_client(self, labeling_domain): 112 | self.cognito.update_user_pool_client( 113 | UserPoolId=self.user_pool_id, 114 | ClientId=self.user_pool_client_id, 115 | CallbackURLs=['https://{labeling_domain}/oauth2/idpresponse'], 116 | LogoutURLs=['https://{labeling_domain}/logout'], 117 | AllowedOAuthFlows=['code','implicit'], 118 | AllowedOAuthScopes=['email','profile','openid'] 119 | ) 120 | 121 | def add_user(self, UserEmail, Password): 122 | self.get_client_secret() 123 | dig = hmac.new(bytearray(self.client_secret, encoding='utf-8'), 124 | msg=f"{UserEmail}{self.user_pool_client_id}".encode('utf-8'), 125 | digestmod=hashlib.sha256).digest() 126 | secret_hash = base64.b64encode(dig).decode() 127 | self.cognito.sign_up( 128 | ClientId=self.user_pool_client_id, 129 | Username=UserEmail, 130 | Password=Password, 131 | SecretHash=secret_hash, 132 | UserAttributes=[ 133 | { 134 | 'Name': 'email', 135 | 'Value': UserEmail 136 | }, 137 | { 138 | 'Name': 'phone_number', 139 | 'Value': '+12485551212' 140 | } 141 | ] 142 | ) 143 | self.cognito.admin_confirm_sign_up( 144 | UserPoolId=self.user_pool_id, 145 | Username=UserEmail 146 | ) 147 | self.cognito.admin_add_user_to_group( 148 | UserPoolId=self.user_pool_id, 149 | Username=UserEmail, 150 | GroupName=self.group_name 151 | ) 152 | 153 | if __name__ == "__main__": 154 | logging.warn("No main function defined") 155 | sys.exit(0) -------------------------------------------------------------------------------- /Chapter03/utils/sagemaker-helper.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from zipfile import ZipFile 3 | import logging 4 | import sys 5 | import json 6 | 7 | class SagemakerHelper: 8 | def __init__(self, region, iam_role_prefix): 9 | logging.basicConfig(level=logging.INFO) 10 | self.logger = logging.getLogger('SagemakerHelper') 11 | self.sagemaker = boto3.client('sagemaker') 12 | self.iam = boto3.client('iam') 13 | self.s3 = boto3.client('s3') 14 | self.lambdac = boto3.client('lambda') 15 | self.region = region 16 | self.iam_role_prefix = iam_role_prefix 17 | 18 | def create_workteam(self, WorkteamName, user_pool_id, group_name, user_pool_client_id): 19 | response = self.sagemaker.create_workteam( 20 | WorkteamName=WorkteamName, 21 | MemberDefinitions=[ 22 | { 23 | 'CognitoMemberDefinition': { 24 | 'UserPool': user_pool_id, 25 | 'UserGroup': group_name, 26 | 'ClientId': user_pool_client_id 27 | } 28 | } 29 | ], 30 | Description = WorkteamName 31 | ) 32 | self.workteam_arn = response['WorkteamArn'] 33 | self.workteam_name = WorkteamName 34 | self.logger.info(f"Created workteam {self.workteam_arn}") 35 | 36 | def get_workforce_domain(self): 37 | response = self.sagemaker.describe_workteam( 38 | WorkteamName=self.workteam_name 39 | ) 40 | self.workforce_domain = response['Workteam']['SubDomain'] 41 | return self.workforce_domain 42 | 43 | def create_manifest(self, s3_bucket, s3_prefix, s3_prefix_manifest, max_entries = 20): 44 | self.s3_prefix_manifest = s3_prefix_manifest 45 | manifest_file_local = 'manifest.txt' 46 | manifests = [] 47 | response = self.s3.list_objects_v2( 48 | Bucket=s3_bucket, 49 | Prefix=s3_prefix 50 | ) 51 | r = response['Contents'][0] 52 | self.s3.download_file(s3_bucket, r['Key'], 'temp.json') 53 | self.logger.debug("Processing " + r['Key']) 54 | with open('temp.json', 'r') as F: 55 | for l in F.readlines(): 56 | if len(manifests) > max_entries: 57 | break 58 | j = json.loads(l) 59 | 
manifests.append(f"{j['parameter']},{j['value']},{j['unit']},{j['coordinates']['latitude']},{j['coordinates']['longitude']}") 60 | self.logger.debug(f"Got {len(manifests)} manifest entries") 61 | with open(manifest_file_local, 'wt') as F: 62 | for m in manifests: 63 | F.write('{"source": "' + m + '"}' + "\n") 64 | self.s3.upload_file(manifest_file_local, s3_bucket, f"{self.s3_prefix_manifest}/openaq.manifest") 65 | 66 | label_file_local = 'label.txt' 67 | with open(label_file_local, 'wt') as F: 68 | F.write('{' + "\n") 69 | F.write('"document-version": "2018-11-28",' + "\n") 70 | F.write('"labels": [{"label": "good"},{"label": "bad"}]' + "\n") 71 | F.write('}' + "\n") 72 | self.s3.upload_file(label_file_local, s3_bucket, f"{self.s3_prefix_manifest}/openaq.labels") 73 | 74 | def create_role(self, service_for, name, policies = []): 75 | role_doc = """{ 76 | "Version": "2012-10-17", 77 | "Statement": [ 78 | { 79 | "Effect": "Allow", 80 | "Principal": { 81 | "Service": f"{service_for}.amazonaws.com" 82 | }, 83 | "Action": "sts:AssumeRole" 84 | } 85 | ] 86 | } 87 | """ 88 | role =f"{self.iam_role_prefix}-{name}-role", 89 | response = self.iam.create_role( 90 | RoleName=role, 91 | AssumeRolePolicyDocument=role_doc 92 | ) 93 | role_arn, role_name = response['Role']['Arn'],response['Role']['RoleName'] 94 | for p in policies: 95 | self.iam.attach_role_policy( 96 | RoleName=role_name, 97 | PolicyArn=p 98 | ) 99 | 100 | return role_arn, role_name 101 | 102 | def create_fn(self, fn_prefix, fname, l_prefix, role_arn, handler): 103 | fzip = f"{fname}.zip" 104 | with ZipFile(fzip,'w') as zip: 105 | zip.write(f"workflow/{fname}") 106 | with open(fzip, 'rb') as file_data: 107 | f_bytes = file_data.read() 108 | 109 | response = self.lambdac.create_function( 110 | FunctionName=f"{fn_prefix}-{l_prefix}-LabelingFunction", 111 | Runtime='python3.7', 112 | Role=role_arn, 113 | Handler=handler, 114 | Code={ 115 | 'ZipFile': f_bytes 116 | }, 117 | Description=f"{fn_prefix}-{l_prefix}-LabelingFunction", 118 | Timeout=300, 119 | MemorySize=1024, 120 | Publish=True, 121 | PackageType='Zip' 122 | ) 123 | return response['FunctionArn'] 124 | 125 | 126 | def create_workflow(self, s3_bucket, s3_prefix_workflow, fn_prefix, label_prefix, s3_prefix_labels): 127 | self.workflow_role_arn, self.workflow_role_name = self.create_role("sagemaker", "workflow", 128 | policies = ['arn:aws:iam::aws:policy/AmazonS3FullAccess', 129 | 'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'] 130 | ) 131 | self.lambda_role_arn, self.lambda_role_name = self.create_role("lambda", "lambda", 132 | policies = ['arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole', 133 | 'arn:aws:iam::aws:policy/AmazonS3FullAccess'] 134 | ) 135 | 136 | self.s3.upload_file('workflow/ui.liquid.html', s3_bucket, f"{s3_prefix_workflow}/openaq.liquid.html") 137 | self.pre_arn = self.create_fn(fn_prefix, 'pre.py', "pre", self.lambda_role_arn, "pre.handler") 138 | self.post_arn = self.create_fn(fn_prefix, 'post.py', "post", self.lambda_role_arn, "post.handler") 139 | 140 | self.sagemaker.create_labeling_job( 141 | LabelingJobName=fn_prefix, 142 | LabelAttributeName='badair', 143 | InputConfig={ 144 | 'DataSource': { 145 | 'S3DataSource': { 146 | 'ManifestS3Uri': f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.manifest" 147 | } 148 | } 149 | }, 150 | OutputConfig={ 151 | 'S3OutputPath': f"s3://{s3_bucket}/{s3_prefix_labels}/openaq" 152 | }, 153 | RoleArn=self.workflow_role_arn, 154 | 
LabelCategoryConfigS3Uri=f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.labels", 155 | StoppingConditions={ 156 | 'MaxHumanLabeledObjectCount': 10, 157 | 'MaxPercentageOfInputDatasetLabeled': 5 158 | }, 159 | HumanTaskConfig={ 160 | 'WorkteamArn': self.workteam_arn, 161 | 'UiConfig': { 162 | 'UiTemplateS3Uri': f"s3://{s3_bucket}/{s3_prefix_workflow}/openaq.liquid.html" 163 | }, 164 | 'PreHumanTaskLambdaArn': self.pre_arn, 165 | 'TaskTitle': 'Label Air Quality', 166 | 'TaskDescription': 'Was it a good air day?', 167 | 'NumberOfHumanWorkersPerDataObject': 1, 168 | 'TaskTimeLimitInSeconds': 3600, 169 | 'AnnotationConsolidationConfig': { 170 | 'AnnotationConsolidationLambdaArn': self.post_arn 171 | } 172 | } 173 | ) 174 | 175 | def create_workflow_multiple_workers(self, s3_bucket, s3_prefix_workflow, fn_prefix, label_prefix, s3_prefix_labels): 176 | self.mpost_arn = self.create_fn(fn_prefix, 'post-multiple.py', "mpost", self.lambda_role_arn, "post.handler") 177 | self.sagemaker.create_labeling_job( 178 | LabelingJobName=fn_prefix, 179 | LabelAttributeName='badair', 180 | InputConfig={ 181 | 'DataSource': { 182 | 'S3DataSource': { 183 | 'ManifestS3Uri': f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.manifest" 184 | } 185 | } 186 | }, 187 | OutputConfig={ 188 | 'S3OutputPath': f"s3://{s3_bucket}/{s3_prefix_labels}/openaq" 189 | }, 190 | RoleArn=self.workflow_role_arn, 191 | LabelCategoryConfigS3Uri=f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.labels", 192 | StoppingConditions={ 193 | 'MaxHumanLabeledObjectCount': 10, 194 | 'MaxPercentageOfInputDatasetLabeled': 5 195 | }, 196 | HumanTaskConfig={ 197 | 'WorkteamArn': self.workteam_arn, 198 | 'UiConfig': { 199 | 'UiTemplateS3Uri': f"s3://{s3_bucket}/{s3_prefix_workflow}/openaq.liquid.html" 200 | }, 201 | 'PreHumanTaskLambdaArn': self.pre_arn, 202 | 'TaskTitle': 'Label Air Quality', 203 | 'TaskDescription': 'Was it a good air day?', 204 | 'NumberOfHumanWorkersPerDataObject': 2, 205 | 'TaskTimeLimitInSeconds': 3600, 206 | 'AnnotationConsolidationConfig': { 207 | 'AnnotationConsolidationLambdaArn': self.mpost_arn 208 | } 209 | } 210 | ) 211 | 212 | if __name__ == "__main__": 213 | logging.warn("No main function defined") 214 | sys.exit(0) -------------------------------------------------------------------------------- /Chapter03/workflow/post-multiple.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | 4 | """ 5 | Input: 6 | 7 | { 8 | "version": "2018-10-16", 9 | "labelingJobArn": , 10 | "labelCategories": [], 11 | "labelAttributeName": , 12 | "roleArn" : "string", 13 | "payload": { 14 | "s3Uri": 15 | } 16 | } 17 | 18 | Contents of payload: 19 | 20 | [ 21 | { 22 | "datasetObjectId": , 23 | "dataObject": { 24 | "s3Uri": , 25 | "content": 26 | }, 27 | "annotations": [{ 28 | "workerId": , 29 | "annotationData": { 30 | "content": , 31 | "s3Uri": 32 | } 33 | }] 34 | } 35 | ] 36 | 37 | Output: 38 | 39 | [ 40 | { 41 | "datasetObjectId": , 42 | "consolidatedAnnotation": { 43 | "content": { 44 | "": { 45 | # ... label content 46 | } 47 | } 48 | } 49 | }, 50 | { 51 | "datasetObjectId": , 52 | "consolidatedAnnotation": { 53 | "content": { 54 | "": { 55 | # ... 
label content 56 | } 57 | } 58 | } 59 | } 60 | ] 61 | 62 | """ 63 | def handler(event, context): 64 | input_uri = event["payload"]['s3Uri'] 65 | parts = input_uri.split('/') 66 | s3 = boto3.client('s3') 67 | s3.download_file(parts[2], "/".join(parts[3:]), '/tmp/input.json') 68 | 69 | with open('/tmp/input.json', 'r') as F: 70 | input_data = json.load(F) 71 | 72 | output_data = [] 73 | for p in range(len(input_data)): 74 | d_id = input_data[p]['datasetObjectId'] 75 | 76 | annotations = input_data[p]['annotations'] 77 | annotation = annotations[len(annotations)-1]['annotationData']['content'] 78 | 79 | response = { 80 | "datasetObjectId": d_id, 81 | "consolidatedAnnotation": { 82 | "content": annotation 83 | } 84 | } 85 | 86 | output_data.append(response) 87 | 88 | # Perform consolidation 89 | return output_data -------------------------------------------------------------------------------- /Chapter03/workflow/post.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | 4 | """ 5 | Input: 6 | 7 | { 8 | "version": "2018-10-16", 9 | "labelingJobArn": , 10 | "labelCategories": [], 11 | "labelAttributeName": , 12 | "roleArn" : "string", 13 | "payload": { 14 | "s3Uri": 15 | } 16 | } 17 | 18 | Contents of payload: 19 | 20 | [ 21 | { 22 | "datasetObjectId": , 23 | "dataObject": { 24 | "s3Uri": , 25 | "content": 26 | }, 27 | "annotations": [{ 28 | "workerId": , 29 | "annotationData": { 30 | "content": , 31 | "s3Uri": 32 | } 33 | }] 34 | } 35 | ] 36 | 37 | Output: 38 | 39 | [ 40 | { 41 | "datasetObjectId": , 42 | "consolidatedAnnotation": { 43 | "content": { 44 | "": { 45 | # ... label content 46 | } 47 | } 48 | } 49 | }, 50 | { 51 | "datasetObjectId": , 52 | "consolidatedAnnotation": { 53 | "content": { 54 | "": { 55 | # ... 
label content 56 | } 57 | } 58 | } 59 | } 60 | ] 61 | 62 | """ 63 | def handler(event, context): 64 | input_uri = event["payload"]['s3Uri'] 65 | parts = input_uri.split('/') 66 | s3 = boto3.client('s3') 67 | s3.download_file(parts[2], "/".join(parts[3:]), '/tmp/input.json') 68 | 69 | with open('/tmp/input.json', 'r') as F: 70 | input_data = json.load(F) 71 | 72 | output_data = [] 73 | for p in range(len(input_data)): 74 | d_id = input_data[p]['datasetObjectId'] 75 | 76 | annotation = input_data[p]['annotations'][0]['annotationData']['content'] 77 | 78 | response = { 79 | "datasetObjectId": d_id, 80 | "consolidatedAnnotation": { 81 | "content": annotation 82 | } 83 | } 84 | 85 | output_data.append(response) 86 | 87 | # Perform consolidation 88 | return output_data -------------------------------------------------------------------------------- /Chapter03/workflow/pre.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | """ 4 | The input event looks like this: 5 | 6 | { 7 | "version":"2018-10-16", 8 | "labelingJobArn":"", 9 | "dataObject":{ 10 | "source":"metric type,metric value,metric unit,lat,lon" 11 | } 12 | } 13 | 14 | The output should look like this: 15 | 16 | { 17 | "taskInput":{ 18 | "metric": "PM2.5 = 30.0", 19 | "lat": 0.0, 20 | "lon": 0.0 21 | }, 22 | "isHumanAnnotationRequired":"true" 23 | } 24 | """ 25 | def handler(event, context): 26 | sourceText = event['dataObject']['source'] 27 | parts = sourceText.split(',') 28 | 29 | output = { 30 | "taskInput": { 31 | "metric": f"{parts[0]}={parts[1]} {parts[2]}", 32 | "lat": parts[3], 33 | "lon": parts[4] 34 | }, 35 | "isHumanAnnotationRequired": "true" 36 | } 37 | 38 | return output -------------------------------------------------------------------------------- /Chapter03/workflow/ui.liquid.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | 9 | 10 |
{{ task.input.metric }}
Choose the most relevant label for this air measurement.
Good air quality, safe for outdoor exercise for those with asthma
Bad air quality, not safe for outdoor exercise for those with asthma
(Only these text fragments of the worker UI template survived extraction; the surrounding HTML markup is not recoverable.)
44 | 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /Chapter04/PrepareData.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "# Chapter 3: Data preparation at scale using Amazon SageMaker Data Wrangler and Amazon SageMaker Processing\n", 6 | "\n", 7 | "In this notebook we'll perform the following steps:\n", 8 | "\n", 9 | "* Create a table in the Glue catalog for our data steps\n", 10 | "* Run a SageMaker Processing job to prepare the full data set\n", 11 | "\n", 12 | "You need to define the following variables:\n", 13 | "\n", 14 | "* `s3_bucket`: Bucket with the data set\n", 15 | "* `glue_db_name`: Glue database name\n", 16 | "* `glue_tbl_name`: Glue table name\n", 17 | "* `s3_prefix_parquet`: Location of the Parquet tables in the S3 bucket\n", 18 | "* `s3_output_prefix`: Location to store the prepared data in the S3 bucket\n", 19 | "* `s3_prefix`: Location of the JSON data in the S3 bucket\n" 20 | ], 21 | "cell_type": "markdown", 22 | "metadata": {} 23 | }, 24 | { 25 | "source": [ 26 | "## Glue Catalog" 27 | ], 28 | "cell_type": "markdown", 29 | "metadata": {} 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "s3_bucket = 'MyBucket'\n", 38 | "glue_db_name = 'MyDatabase'\n", 39 | "glue_tbl_name = 'openaq'\n", 40 | "s3_prefix = 'openaq/realtime'\n", 41 | "s3_prefix_parquet = 'openaq/realtime-parquet-gzipped/tables'\n", 42 | "s3_output_prefix = 'prepared'\n", 43 | "\n", 44 | "import boto3\n", 45 | "s3 = boto3.client('s3')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "glue = boto3.client('glue')\n", 55 | "response = glue.create_database(\n", 56 | " DatabaseInput={\n", 57 | " 'Name': glue_db_name,\n", 58 | " }\n", 59 | ")" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "response = glue.create_table(\n", 69 | " DatabaseName=glue_db_name,\n", 70 | " TableInput={\n", 71 | " 'Name': glue_tbl_name,\n", 72 | " 'StorageDescriptor': {\n", 73 | " 'Columns': [\n", 74 | " {\n", 75 | " \"Name\": \"date\",\n", 76 | " \"Type\": \"struct\"\n", 77 | " },\n", 78 | " {\n", 79 | " \"Name\": \"parameter\",\n", 80 | " \"Type\": \"string\"\n", 81 | " },\n", 82 | " {\n", 83 | " \"Name\": \"location\",\n", 84 | " \"Type\": \"string\"\n", 85 | " },\n", 86 | " {\n", 87 | " \"Name\": \"value\",\n", 88 | " \"Type\": \"double\"\n", 89 | " },\n", 90 | " {\n", 91 | " \"Name\": \"unit\",\n", 92 | " \"Type\": \"string\"\n", 93 | " },\n", 94 | " {\n", 95 | " \"Name\": \"city\",\n", 96 | " \"Type\": \"string\"\n", 97 | " },\n", 98 | " {\n", 99 | " \"Name\": \"attribution\",\n", 100 | " \"Type\": \"array>\"\n", 101 | " },\n", 102 | " {\n", 103 | " \"Name\": \"averagingperiod\",\n", 104 | " \"Type\": \"struct\"\n", 105 | " },\n", 106 | " {\n", 107 | " \"Name\": \"coordinates\",\n", 108 | " \"Type\": \"struct\"\n", 109 | " },\n", 110 | " {\n", 111 | " \"Name\": \"country\",\n", 112 | " \"Type\": \"string\"\n", 113 | " },\n", 114 | " {\n", 115 | " \"Name\": \"sourcename\",\n", 116 | " \"Type\": \"string\"\n", 117 | " },\n", 118 | " {\n", 119 | " \"Name\": \"sourcetype\",\n", 120 | " \"Type\": \"string\"\n", 121 | " },\n", 122 | " {\n", 123 | " \"Name\": \"mobile\",\n", 124 | " \"Type\": \"boolean\"\n", 125 | " }\n", 126 
| " ],\n", 127 | " 'Location': 's3://' + s3_bucket + '/' + s3_prefix + '/',\n", 128 | " 'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',\n", 129 | " 'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',\n", 130 | " 'Compressed': False,\n", 131 | " 'SerdeInfo': {\n", 132 | " 'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',\n", 133 | " \"Parameters\": {\n", 134 | " \"paths\": \"attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value\"\n", 135 | " }\n", 136 | " },\n", 137 | " 'Parameters': {\n", 138 | " \"classification\": \"json\",\n", 139 | " \"compressionType\": \"none\",\n", 140 | " },\n", 141 | " 'StoredAsSubDirectories': False,\n", 142 | " },\n", 143 | " 'PartitionKeys': [\n", 144 | " {\n", 145 | " \"Name\": \"aggdate\",\n", 146 | " \"Type\": \"string\"\n", 147 | " },\n", 148 | " ],\n", 149 | " 'TableType': 'EXTERNAL_TABLE',\n", 150 | " 'Parameters': {\n", 151 | " \"classification\": \"json\",\n", 152 | " \"compressionType\": \"none\",\n", 153 | " }\n", 154 | " \n", 155 | " }\n", 156 | ")" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "partitions_to_add = []\n", 166 | "response = s3.list_objects_v2(\n", 167 | " Bucket=s3_bucket,\n", 168 | " Prefix=s3_prefix + '/'\n", 169 | ")\n", 170 | "for r in response['Contents']:\n", 171 | " partitions_to_add.append(r['Key'])\n", 172 | "while response['IsTruncated']:\n", 173 | " token = response['NextContinuationToken']\n", 174 | " response = s3.list_objects_v2(\n", 175 | " Bucket=s3_bucket,\n", 176 | " Prefix=s3_prefix,\n", 177 | " ContinuationToken=token\n", 178 | " ) \n", 179 | " for r in response['Contents']:\n", 180 | " partitions_to_add.append(r['Key'])\n", 181 | " if response['IsTruncated']:\n", 182 | " oken = response['NextContinuationToken']\n", 183 | " print(\"Getting next batch\")" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "print(f\"Need to add {len(partitions_to_add)} partitions\")" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "def chunks(lst, n):\n", 202 | " \"\"\"Yield successive n-sized chunks from lst.\"\"\"\n", 203 | " for i in range(0, len(lst), n):\n", 204 | " yield lst[i:i + n]" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "def get_part_def(p):\n", 214 | " part_value = p.split('/')[-2]\n", 215 | " return {\n", 216 | " 'Values': [\n", 217 | " part_value\n", 218 | " ],\n", 219 | " 'StorageDescriptor': {\n", 220 | " 'Columns': [\n", 221 | " {\n", 222 | " \"Name\": \"date\",\n", 223 | " \"Type\": \"struct\"\n", 224 | " },\n", 225 | " {\n", 226 | " \"Name\": \"parameter\",\n", 227 | " \"Type\": \"string\"\n", 228 | " },\n", 229 | " {\n", 230 | " \"Name\": \"location\",\n", 231 | " \"Type\": \"string\"\n", 232 | " },\n", 233 | " {\n", 234 | " \"Name\": \"value\",\n", 235 | " \"Type\": \"double\"\n", 236 | " },\n", 237 | " {\n", 238 | " \"Name\": \"unit\",\n", 239 | " \"Type\": \"string\"\n", 240 | " },\n", 241 | " {\n", 242 | " \"Name\": \"city\",\n", 243 | " \"Type\": \"string\"\n", 244 | " },\n", 245 | " {\n", 246 | " \"Name\": \"attribution\",\n", 247 | " \"Type\": \"array>\"\n", 248 | " },\n", 249 | " {\n", 250 
| " \"Name\": \"averagingperiod\",\n", 251 | " \"Type\": \"struct\"\n", 252 | " },\n", 253 | " {\n", 254 | " \"Name\": \"coordinates\",\n", 255 | " \"Type\": \"struct\"\n", 256 | " },\n", 257 | " {\n", 258 | " \"Name\": \"country\",\n", 259 | " \"Type\": \"string\"\n", 260 | " },\n", 261 | " {\n", 262 | " \"Name\": \"sourcename\",\n", 263 | " \"Type\": \"string\"\n", 264 | " },\n", 265 | " {\n", 266 | " \"Name\": \"sourcetype\",\n", 267 | " \"Type\": \"string\"\n", 268 | " },\n", 269 | " {\n", 270 | " \"Name\": \"mobile\",\n", 271 | " \"Type\": \"boolean\"\n", 272 | " }\n", 273 | " ],\n", 274 | " 'Location': f\"s3://{s3_bucket}/{s3_prefix}/{part_value}/\",\n", 275 | " 'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',\n", 276 | " 'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',\n", 277 | " 'Compressed': False,\n", 278 | " 'SerdeInfo': {\n", 279 | " 'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',\n", 280 | " \"Parameters\": {\n", 281 | " \"paths\": \"attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value\"\n", 282 | " }\n", 283 | " },\n", 284 | " 'StoredAsSubDirectories': False\n", 285 | " },\n", 286 | " 'Parameters': {\n", 287 | " \"classification\": \"json\",\n", 288 | " \"compressionType\": \"none\",\n", 289 | " },\n", 290 | "\n", 291 | "\n", 292 | " }" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "for batch in chunks(partitions_to_add, 100):\n", 302 | " response = glue.batch_create_partition(\n", 303 | " DatabaseName=glue_db_name,\n", 304 | " TableName=glue_tbl_name,\n", 305 | " PartitionInputList=[get_part_def(p) for p in batch]\n", 306 | " )" 307 | ] 308 | }, 309 | { 310 | "source": [ 311 | "## Processing Job" 312 | ], 313 | "cell_type": "markdown", 314 | "metadata": {} 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "import logging\n", 323 | "import sagemaker\n", 324 | "from time import gmtime, strftime\n", 325 | "\n", 326 | "sagemaker_logger = logging.getLogger(\"sagemaker\")\n", 327 | "sagemaker_logger.setLevel(logging.INFO)\n", 328 | "sagemaker_logger.addHandler(logging.StreamHandler())\n", 329 | "\n", 330 | "sagemaker_session = sagemaker.Session()\n", 331 | "role = sagemaker.get_execution_role()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "from sagemaker.spark.processing import PySparkProcessor\n", 341 | "\n", 342 | "spark_processor = PySparkProcessor(\n", 343 | " base_job_name=\"spark-preprocessor\",\n", 344 | " framework_version=\"3.0\",\n", 345 | " role=role,\n", 346 | " instance_count=15,\n", 347 | " instance_type=\"ml.m5.4xlarge\",\n", 348 | " max_runtime_in_seconds=7200,\n", 349 | ")\n", 350 | "\n", 351 | "configuration = [\n", 352 | " {\n", 353 | " \"Classification\": \"spark-defaults\",\n", 354 | " \"Properties\": {\"spark.executor.memory\": \"18g\", \n", 355 | " \"spark.yarn.executor.memoryOverhead\": \"3g\",\n", 356 | " \"spark.driver.memory\": \"18g\",\n", 357 | " \"spark.yarn.driver.memoryOverhead\": \"3g\",\n", 358 | " \"spark.executor.cores\": \"5\", \n", 359 | " \"spark.driver.cores\": \"5\",\n", 360 | " \"spark.executor.instances\": \"44\",\n", 361 | " \"spark.default.parallelism\": \"440\",\n", 362 | " \"spark.dynamicAllocation.enabled\": \"false\"\n", 
363 | " },\n", 364 | " },\n", 365 | " {\n", 366 | " \"Classification\": \"yarn-site\",\n", 367 | " \"Properties\": {\"yarn.nodemanager.vmem-check-enabled\": \"false\", \n", 368 | " \"yarn.nodemanager.mmem-check-enabled\": \"false\"},\n", 369 | " }\n", 370 | "]\n", 371 | "\n", 372 | "spark_processor.run(\n", 373 | " submit_app=\"scripts/preprocess.py\",\n", 374 | " submit_jars=[\"s3://crawler-public/json/serde/json-serde.jar\"],\n", 375 | " arguments=['--s3_input_bucket', s3_bucket,\n", 376 | " '--s3_input_key_prefix', s3_prefix_parquet,\n", 377 | " '--s3_output_bucket', s3_bucket,\n", 378 | " '--s3_output_key_prefix', s3_output_prefix],\n", 379 | " spark_event_logs_s3_uri=\"s3://{}/{}/spark_event_logs\".format(s3_bucket, 'sparklogs'),\n", 380 | " logs=True,\n", 381 | " configuration=configuration\n", 382 | ")" 383 | ] 384 | } 385 | ], 386 | "metadata": { 387 | "instance_type": "ml.t3.medium", 388 | "kernelspec": { 389 | "display_name": "Python 3 (Data Science)", 390 | "language": "python", 391 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 392 | }, 393 | "language_info": { 394 | "codemirror_mode": { 395 | "name": "ipython", 396 | "version": 3 397 | }, 398 | "file_extension": ".py", 399 | "mimetype": "text/x-python", 400 | "name": "python", 401 | "nbconvert_exporter": "python", 402 | "pygments_lexer": "ipython3", 403 | "version": "3.7.10" 404 | } 405 | }, 406 | "nbformat": 4, 407 | "nbformat_minor": 4 408 | } -------------------------------------------------------------------------------- /Chapter04/scripts/preprocess.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import unicode_literals 3 | 4 | import argparse 5 | import csv 6 | import os 7 | import shutil 8 | import sys 9 | import time 10 | import logging 11 | import boto3 12 | 13 | import pyspark 14 | from pyspark.sql import SparkSession 15 | from pyspark import SparkContext 16 | from pyspark.ml import Pipeline 17 | from pyspark.ml.linalg import Vectors 18 | from pyspark.ml.feature import ( 19 | StringIndexer, 20 | VectorAssembler, 21 | VectorIndexer, 22 | StandardScaler, 23 | OneHotEncoder 24 | ) 25 | from pyspark.sql.functions import * 26 | from pyspark.sql.functions import round as round_ 27 | from pyspark.sql.types import ( 28 | DoubleType, 29 | StringType, 30 | StructField, 31 | StructType, 32 | BooleanType, 33 | IntegerType 34 | ) 35 | 36 | def get_tables(): 37 | 38 | tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2'] 39 | #tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2', 40 | # '1645e09b-0919-439a-b07f-cd7532069f10', 41 | # '80d785aa-6da8-4b37-8632-7386b1d535f3', 42 | # '8375c185-ea3e-44b7-b61c-68534e33ddf7', 43 | # '9a6e24db-cffc-42de-a77f-7ab96c487022', 44 | # 'bc981da3-bc9d-435a-8bf2-4107a8fb2676', 45 | # 'cf5cf814-c9e3-4a2b-8811-e4fc6481a1fe', 46 | # 'd3b8f1ab-f3e5-4fc9-84ab-8568edd8a03d'] 47 | 48 | 49 | return tables 50 | 51 | 52 | def isBadAir(v, p): 53 | if p == 'pm10': 54 | if v > 50: 55 | return 1 56 | else: 57 | return 0 58 | elif p == 'pm25': 59 | if v > 25: 60 | return 1 61 | else: 62 | return 0 63 | elif p == 'so2': 64 | if v > 20: 65 | return 1 66 | else: 67 | return 0 68 | elif p == 'no2': 69 | if v > 200: 70 | return 1 71 | else: 72 | return 0 73 | elif p == 'o3': 74 | if v > 100: 75 | return 1 76 | else: 77 | return 0 78 | else: 79 | return 0 80 | 81 | def extract(row): 82 | return (row.value, row.ismobile, row.year, row.month, row.quarter, row.day, 
row.isBadAir, 83 | row.indexed_location, row.indexed_city, row.indexed_country, row.indexed_sourcename, 84 | row.indexed_sourcetype) + tuple(row.vec_parameter.toArray().tolist()) 85 | 86 | """ 87 | Schema on disk: 88 | 89 | |-- date_utc: string (nullable = true) 90 | |-- date_local: string (nullable = true) 91 | |-- location: string (nullable = true) 92 | |-- country: string (nullable = true) 93 | |-- value: float (nullable = true) 94 | |-- unit: string (nullable = true) 95 | |-- city: string (nullable = true) 96 | |-- attribution: array (nullable = true) 97 | | |-- element: struct (containsNull = true) 98 | | | |-- name: string (nullable = true) 99 | | | |-- url: string (nullable = true) 100 | |-- averagingperiod: struct (nullable = true) 101 | | |-- unit: string (nullable = true) 102 | | |-- value: float (nullable = true) 103 | |-- coordinates: struct (nullable = true) 104 | | |-- latitude: float (nullable = true) 105 | | |-- longitude: float (nullable = true) 106 | |-- sourcename: string (nullable = true) 107 | |-- sourcetype: string (nullable = true) 108 | |-- mobile: string (nullable = true) 109 | |-- parameter: string (nullable = true) 110 | 111 | Example output: 112 | 113 | date_utc='2015-10-31T07:00:00.000Z' 114 | date_local='2015-10-31T04:00:00-03:00' 115 | location='Quintero Centro' 116 | country='CL' 117 | value=19.81999969482422 118 | unit='µg/m³' 119 | city='Quintero' 120 | attribution=[Row(name='SINCA', url='http://sinca.mma.gob.cl/'), Row(name='CENTRO QUINTERO', url=None)] 121 | averagingperiod=None 122 | coordinates=Row(latitude=-32.786170959472656, longitude=-71.53143310546875) 123 | sourcename='Chile - SINCA' 124 | sourcetype=None 125 | mobile=None 126 | parameter='o3' 127 | 128 | Transformations: 129 | 130 | * Featurize date_utc 131 | * Drop date_local 132 | * Encode location 133 | * Encode country 134 | * Scale value 135 | * Drop unit 136 | * Encode city 137 | * Drop attribution 138 | * Drop averaging period 139 | * Drop coordinates 140 | * Encode source name 141 | * Encode source type 142 | * Convert mobile to integer 143 | * Encode parameter 144 | 145 | * Add label for good/bad air quality 146 | 147 | """ 148 | def main(): 149 | parser = argparse.ArgumentParser(description="Preprocessing configuration") 150 | parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket") 151 | parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix") 152 | parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket") 153 | parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix") 154 | args = parser.parse_args() 155 | 156 | logging.basicConfig(level=logging.INFO) 157 | logger = logging.getLogger('Preprocess') 158 | 159 | spark = SparkSession.builder.appName("Preprocessor").getOrCreate() 160 | 161 | logger.info("Reading data set") 162 | tables = get_tables() 163 | df = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{tables[0]}/") 164 | for t in tables[1:]: 165 | df_new = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{t}/") 166 | df = df.union(df_new) 167 | 168 | # Drop columns 169 | logger.info("Dropping columns") 170 | df = df.drop('date_local').drop('unit').drop('attribution').drop('averagingperiod').drop('coordinates') 171 | 172 | # Mobile field to int 173 | logger.info("Casting mobile field to int") 174 | df = df.withColumn("ismobile",col("mobile").cast(IntegerType())).drop('mobile') 175 | 176 | # scale value 177 | logger.info("Scaling value") 
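    # The next block standardizes the raw sensor reading. Spark ML's StandardScaler
    # operates on Vector columns, so the scalar "value" column is first wrapped into a
    # one-element vector with VectorAssembler; by default StandardScaler divides by the
    # column's standard deviation (withStd=True) without centering it (withMean=False).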
178 | value_assembler = VectorAssembler(inputCols=["value"], outputCol="value_vec") 179 | value_scaler = StandardScaler(inputCol="value_vec", outputCol="value_scaled") 180 | value_pipeline = Pipeline(stages=[value_assembler, value_scaler]) 181 | value_model = value_pipeline.fit(df) 182 | xform_df = value_model.transform(df) 183 | 184 | # featurize date 185 | logger.info("Featurizing date") 186 | xform_df = xform_df.withColumn('aggdt', 187 | to_date(unix_timestamp(col('date_utc'), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp"))) 188 | xform_df = xform_df.withColumn('year',year(xform_df.aggdt)) \ 189 | .withColumn('month',month(xform_df.aggdt)) \ 190 | .withColumn('quarter',quarter(xform_df.aggdt)) 191 | xform_df = xform_df.withColumn("day", date_format(col("aggdt"), "d")) 192 | 193 | # Automatically assign good/bad labels 194 | logger.info("Simulating good/bad air labels") 195 | isBadAirUdf = udf(isBadAir, IntegerType()) 196 | xform_df = xform_df.withColumn('isBadAir', isBadAirUdf('value', 'parameter')) 197 | 198 | # Categorical encodings. 199 | logger.info("Categorical encoding") 200 | parameter_indexer = StringIndexer(inputCol="parameter", outputCol="indexed_parameter", handleInvalid='keep') 201 | location_indexer = StringIndexer(inputCol="location", outputCol="indexed_location", handleInvalid='keep') 202 | city_indexer = StringIndexer(inputCol="city", outputCol="indexed_city", handleInvalid='keep') 203 | country_indexer = StringIndexer(inputCol="country", outputCol="indexed_country", handleInvalid='keep') 204 | sourcename_indexer = StringIndexer(inputCol="sourcename", outputCol="indexed_sourcename", handleInvalid='keep') 205 | sourcetype_indexer = StringIndexer(inputCol="sourcetype", outputCol="indexed_sourcetype", handleInvalid='keep') 206 | enc_est = OneHotEncoder(inputCols=["indexed_parameter"], outputCols=["vec_parameter"]) 207 | enc_pipeline = Pipeline(stages=[parameter_indexer, location_indexer, 208 | city_indexer, country_indexer, sourcename_indexer, 209 | sourcetype_indexer, enc_est]) 210 | enc_model = enc_pipeline.fit(xform_df) 211 | enc_df = enc_model.transform(xform_df) 212 | param_cols = enc_df.schema.fields[17].metadata['ml_attr']['vals'] 213 | 214 | # Clean up data set 215 | logger.info("Final cleanup") 216 | final_df = enc_df.drop('parameter').drop('location') \ 217 | .drop('city').drop('country').drop('sourcename') \ 218 | .drop('sourcetype').drop('date_utc') \ 219 | .drop('value_vec').drop('aggdt').drop('indexed_parameter') 220 | firstelement=udf(lambda v:str(v[0]),StringType()) 221 | final_df = final_df.withColumn('value_str', firstelement('value_scaled')) 222 | final_df = final_df.withColumn("value",final_df.value_str.cast(DoubleType())).drop('value_str').drop('value_scaled') 223 | schema = StructType([ 224 | StructField("value", DoubleType(), True), 225 | StructField("ismobile", StringType(), True), 226 | StructField("year", StringType(), True), 227 | StructField("month", StringType(), True), 228 | StructField("quarter", StringType(), True), 229 | StructField("day", StringType(), True), 230 | StructField("isBadAir", StringType(), True), 231 | StructField("location", StringType(), True), 232 | StructField("city", StringType(), True), 233 | StructField("country", StringType(), True), 234 | StructField("sourcename", StringType(), True), 235 | StructField("sourcetype", StringType(), True), 236 | StructField("o3", StringType(), True), 237 | StructField("no2", StringType(), True), 238 | StructField("so2", StringType(), True), 239 | StructField("co", StringType(), True), 
240 | StructField("pm10", StringType(), True), 241 | StructField("pm25", StringType(), True), 242 | StructField("bc", StringType(), True), 243 | ]) 244 | final_df = final_df.rdd.map(extract).toDF(schema=schema) 245 | 246 | # Replace missing values 247 | final_df = final_df.na.fill("0") 248 | 249 | # Round the value 250 | final_df = final_df.withColumn("value", round_(final_df["value"], 4)) 251 | 252 | # Split sets 253 | logger.info("Splitting data set") 254 | (train_df, validation_df, test_df) = final_df.randomSplit([0.7, 0.2, 0.1]) 255 | 256 | # Drop value from test set 257 | test_df = test_df.drop('value') 258 | 259 | # Save to S3 260 | logger.info("Saving to S3") 261 | train_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 262 | args.s3_output_key_prefix, 'train/')) 263 | validation_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 264 | args.s3_output_key_prefix, 'validation/')) 265 | test_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 266 | args.s3_output_key_prefix, 'test/')) 267 | 268 | if __name__ == "__main__": 269 | main() 270 | -------------------------------------------------------------------------------- /Chapter06/Experiments/Experiments.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6a503428", 6 | "metadata": {}, 7 | "source": [ 8 | "### Tracking and organizing training and tuning jobs with Amazon SageMaker Experiments\n", 9 | "\n", 10 | "This notebook demonstrates using SageMaker Experiment capability to organize, track, compare, and evaluate your machine learning (ML) model training experiments.\n", 11 | "\n", 12 | "\n", 13 | "### Overview\n", 14 | "\n", 15 | "1. Set up\n", 16 | "2. Create a SageMaker Experiment\n", 17 | "3. Train XGBoost regression model as part of the Experiment\n", 18 | "4. Visualize results from the Experiment." 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "776e3d5b", 24 | "metadata": {}, 25 | "source": [ 26 | "### 1. 
Set up" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 18, 32 | "id": "ba1209de", 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "Requirement already satisfied: sagemaker-experiments in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.1.34)\n", 40 | "Requirement already satisfied: boto3>=1.16.27 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker-experiments) (1.17.99)\n", 41 | "Requirement already satisfied: s3transfer<0.5.0,>=0.4.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (0.4.2)\n", 42 | "Requirement already satisfied: botocore<1.21.0,>=1.20.99 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (1.20.99)\n", 43 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (0.10.0)\n", 44 | "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (1.26.5)\n", 45 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (2.8.1)\n", 46 | "Requirement already satisfied: six>=1.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (1.15.0)\n", 47 | "\u001b[33mWARNING: You are using pip version 21.1.2; however, version 21.2.3 is available.\n", 48 | "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "#Install the sagemaker experiments SDK\n", 54 | "!pip install sagemaker-experiments" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "51f53efa", 60 | "metadata": {}, 61 | "source": [ 62 | "#### 1.1 Import libraries" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 19, 68 | "id": "cb5c2578", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "import time\n", 73 | "\n", 74 | "import boto3\n", 75 | "import numpy as np\n", 76 | "import pandas as pd\n", 77 | "from IPython.display import set_matplotlib_formats\n", 78 | "from matplotlib import pyplot as plt\n", 79 | "import datetime\n", 80 | "\n", 81 | "import sagemaker\n", 82 | "from sagemaker import get_execution_role\n", 83 | "from sagemaker.session import Session\n", 84 | "from sagemaker.analytics import ExperimentAnalytics\n", 85 | "from sagemaker.inputs import TrainingInput\n", 86 | "\n", 87 | "from smexperiments.experiment import Experiment\n", 88 | "from smexperiments.trial import Trial\n", 89 | "from smexperiments.trial_component import TrialComponent\n", 90 | "from smexperiments.tracker import Tracker\n", 91 | "\n", 92 | "region = 'us-west-2'\n", 93 | "\n", 94 | "set_matplotlib_formats('retina')" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 20, 100 | "id": "8f981167", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "sess = boto3.Session()\n", 105 | "sm = sess.client('sagemaker')\n", 106 | "role = get_execution_role()" 107 | ] 108 | }, 109 
| { 110 | "cell_type": "markdown", 111 | "id": "f97dd9a8", 112 | "metadata": {}, 113 | "source": [ 114 | "#### 1.2 S3 paths to training and validation data and output paths" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 21, 120 | "id": "ffc6e083", 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "# define the data type and paths to the training and validation datasets\n", 125 | "content_type = \"csv\"\n", 126 | "\n", 127 | "#s3_bucket = 'bestpractices-bucket-sm'\n", 128 | "#s3_prefix = 'prepared_parquet4'\n", 129 | "\n", 130 | "#Set the s3_bucket to the correct bucket name created in your datascience environment\n", 131 | "s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'\n", 132 | "s3_prefix = 'prepared'\n", 133 | "\n", 134 | "train_input = TrainingInput(\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')\n", 135 | "validation_input = TrainingInput(\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "d68d8e57", 141 | "metadata": {}, 142 | "source": [ 143 | "Now lets track the parameters from the training step. " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 22, 149 | "id": "908dfbc8", 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "with Tracker.create(display_name=\"Training\", sagemaker_boto_client=sm) as tracker:\n", 154 | " tracker.log_parameters({\"learning_rate\": 1.0, \"dropout\": 0.5})\n", 155 | " \n", 156 | " # we can log the location of the training dataset\n", 157 | " tracker.log_input(name=\"weather-training-dataset\", media_type=\"s3/uri\", value=\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'train'))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "id": "bdef5991", 163 | "metadata": {}, 164 | "source": [ 165 | "### 2. Set up the Experiment" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "8cea3276", 171 | "metadata": {}, 172 | "source": [ 173 | "Create an experiment to track all the model training iterations. Use Experiments to organize your data science work." 
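For reference, the sagemaker-experiments SDK organizes work as Experiment → Trial → TrialComponent. The sketch below is illustrative rather than a cell from this notebook; the experiment and trial names are placeholders, and it assumes the `sm` client and the `tracker` from the cell above are in scope.

```python
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Reload an experiment created earlier (the name here is a placeholder).
existing = Experiment.load(experiment_name="weather-experiment-1234567890",
                           sagemaker_boto_client=sm)

# Attach the tracker's trial component to a trial so its logged parameters and
# inputs appear alongside the training runs for that trial.
trial = Trial.create(trial_name="example-trial",
                     experiment_name=existing.experiment_name,
                     sagemaker_boto_client=sm)
trial.add_trial_component(tracker.trial_component)
```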
174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "id": "1901f35b", 179 | "metadata": {}, 180 | "source": [ 181 | "#### 2.1 Create an Experiment" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 23, 187 | "id": "37e37739", 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "Experiment(sagemaker_boto_client=,experiment_name='weather-experiment-1628392649',description='Prediction of weather quality',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment/weather-experiment-1628392649',response_metadata={'RequestId': '3ae94205-d193-460a-acab-4a89d547bc2e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3ae94205-d193-460a-acab-4a89d547bc2e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '101', 'date': 'Sun, 08 Aug 2021 03:17:29 GMT'}, 'RetryAttempts': 0})\n" 195 | ] 196 | } 197 | ], 198 | "source": [ 199 | "weather_experiment = Experiment.create(\n", 200 | " experiment_name=f\"weather-experiment-{int(time.time())}\",\n", 201 | " description=\"Prediction of weather quality\",\n", 202 | " sagemaker_boto_client=sm)\n", 203 | "print(weather_experiment)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "3d1b785d", 209 | "metadata": {}, 210 | "source": [ 211 | "#### 2.2 Track Experiment" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "7aeb9892", 217 | "metadata": {}, 218 | "source": [ 219 | "Now create a Trial for each training run to track the it's inputs, parameters, and metrics." 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "fdd69d6e", 225 | "metadata": {}, 226 | "source": [ 227 | "While training the XGBoost model on SageMaker, we will experiment with several values for the number of hidden channel in the model. We will create a Trial to track each training job run. We will also create a TrialComponent from the tracker we created before, and add to the Trial.\n", 228 | "\n", 229 | "Note the execution of the following code takes a while." 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 24, 235 | "id": "e547917c", 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "##Keep track of the trails\n", 240 | "max_depth_trial_name_map = {}\n", 241 | "##Keep track of the training jobs launched to check if they are complete before analyzing the experiment results.\n", 242 | "training_jobs =[]\n" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "91a9d011", 248 | "metadata": {}, 249 | "source": [ 250 | "### 3. Train XGBoost regression model as part of the Experiment" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 25, 256 | "id": "a5af3e86", 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stderr", 261 | "output_type": "stream", 262 | "text": [ 263 | "INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.\n", 264 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 265 | "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.\n", 266 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 267 | "INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392650\n", 268 | "INFO:sagemaker.image_uris:Same images used for training and inference. 
Defaulting to image scope: inference.\n", 269 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 270 | "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.\n", 271 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 272 | "INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392652\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "training_instance_type='ml.m5.12xlarge'\n", 278 | "#Explore two different values for the max_depth hyerparameter for XGBoost model\n", 279 | "for i, max_depth in enumerate([2, 5]):\n", 280 | " # create trial\n", 281 | " trial_name = f\"xgboost-training-job-trial-{max_depth}-max-depth-{int(time.time())}\"\n", 282 | " xgboost_trial = Trial.create(\n", 283 | " trial_name=trial_name, \n", 284 | " experiment_name=weather_experiment.experiment_name,\n", 285 | " sagemaker_boto_client=sm,\n", 286 | " )\n", 287 | " max_depth_trial_name_map[max_depth] = trial_name\n", 288 | " \n", 289 | " # initialize hyperparameters\n", 290 | " hyperparameters = {\n", 291 | " \"max_depth\": max_depth,\n", 292 | " \"eta\":\"0.2\",\n", 293 | " \"gamma\":\"4\",\n", 294 | " \"min_child_weight\":\"6\",\n", 295 | " \"subsample\":\"0.7\",\n", 296 | " \"objective\":\"reg:squarederror\",\n", 297 | " \"num_round\":\"5\"}\n", 298 | "\n", 299 | " #set an output path where the trained model will be saved\n", 300 | " output_prefix = 'weather-experiments'\n", 301 | " output_path = 's3://{}/{}/{}/output'.format(s3_bucket, output_prefix, 'xgboost')\n", 302 | "\n", 303 | " # This line automatically looks for the XGBoost image URI and builds an XGBoost container.\n", 304 | " # specify the repo_version depending on your preference.\n", 305 | " xgboost_container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.2-1\")\n", 306 | "\n", 307 | " # construct a SageMaker estimator that calls the xgboost-container\n", 308 | " estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, \n", 309 | " hyperparameters=hyperparameters,\n", 310 | " role=sagemaker.get_execution_role(),\n", 311 | " instance_count=1, \n", 312 | " instance_type=training_instance_type, \n", 313 | " volume_size=200, # 5 GB \n", 314 | " output_path=output_path)\n", 315 | "\n", 316 | " xgboost_training_job_name = \"xgboost-training-job-{}\".format(int(time.time()))\n", 317 | " \n", 318 | " training_jobs.append(xgboost_training_job_name)\n", 319 | " \n", 320 | " # Now associate the estimator with the Experiment and Trial\n", 321 | " estimator.fit(\n", 322 | " inputs={'train': train_input}, \n", 323 | " job_name=xgboost_training_job_name,\n", 324 | " experiment_config={\n", 325 | " \"TrialName\": xgboost_trial.trial_name,\n", 326 | " \"TrialComponentDisplayName\": \"Training\"\n", 327 | " },\n", 328 | " wait=False, #Don't wait for the training job to be completed\n", 329 | " )\n", 330 | " \n", 331 | " # Wait before launching the next training job\n", 332 | " time.sleep(2)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 26, 338 | "id": "1e192d08", 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "{2: 'xgboost-training-job-trial-2-max-depth-1628392649',\n", 345 | " 5: 'xgboost-training-job-trial-5-max-depth-1628392652'}" 346 | ] 347 | }, 348 | "execution_count": 26, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "max_depth_trial_name_map" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | 
"execution_count": 27, 360 | "id": "7eda9ed7", 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "TrialSummary(trial_name='xgboost-training-job-trial-5-max-depth-1628392652',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-5-max-depth-1628392652',display_name='xgboost-training-job-trial-5-max-depth-1628392652',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()))\n", 368 | "TrialSummary(trial_name='xgboost-training-job-trial-2-max-depth-1628392649',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-2-max-depth-1628392649',display_name='xgboost-training-job-trial-2-max-depth-1628392649',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()))\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "##Quick check of the trails of the experiment\n", 374 | "trails = weather_experiment.list_trials()\n", 375 | "type(trails)\n", 376 | "for trial in trails:\n", 377 | " print(trial)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "id": "fea9947d", 384 | "metadata": {}, 385 | "outputs": [ 386 | { 387 | "name": "stdout", 388 | "output_type": "stream", 389 | "text": [ 390 | "Training job name: xgboost-training-job-1628392650\n", 391 | "Status : InProgress\n", 392 | "Status InProgress\n", 393 | "Status InProgress\n", 394 | "Status InProgress\n", 395 | "Status InProgress\n", 396 | "Status InProgress\n" 397 | ] 398 | } 399 | ], 400 | "source": [ 401 | "##Wait till the training jobs are complete.\n", 402 | "for training_job in training_jobs:\n", 403 | " print(\"Training job name: \" + training_job)\n", 404 | " description = sm.describe_training_job(TrainingJobName=training_job)\n", 405 | " print(\"Status : \" + description[\"TrainingJobStatus\"])\n", 406 | " \n", 407 | " while description[\"TrainingJobStatus\"] != \"Completed\" and description[\"TrainingJobStatus\"] != \"Failed\":\n", 408 | " description = sm.describe_training_job(TrainingJobName=training_job)\n", 409 | " primary_status = description[\"TrainingJobStatus\"]\n", 410 | " print(\"Status {}\".format(primary_status))\n", 411 | " time.sleep(15)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "id": "1f43971e", 417 | "metadata": {}, 418 | "source": [ 419 | "### 4. Visualize results from the Experiment.\n", 420 | "Compare the model training runs of an experiment using the analytics capabilities of Python SDK to query and compare the training runs for identifying the best model produced by our experiment. You can retrieve trial components by using a search expression." 
421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "id": "53d4348e", 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "experiment_name = weather_experiment.experiment_name\n", 431 | "experiment_name" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "id": "d2d57858", 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "from sagemaker.analytics import ExperimentAnalytics\n", 442 | "sess = boto3.Session()\n", 443 | "sm = sess.client(\"sagemaker\")\n", 444 | "sagemaker_session = Session(sess)\n", 445 | "\n", 446 | "trial_component_analytics = ExperimentAnalytics(\n", 447 | " sagemaker_session=sagemaker_session, experiment_name=experiment_name\n", 448 | ")\n", 449 | "trial_comp_ds_jobs = trial_component_analytics.dataframe()\n", 450 | "trial_comp_ds_jobs" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "id": "26b7d585", 456 | "metadata": {}, 457 | "source": [ 458 | "Results show the RMSE metrics for the various hyperparameters tried as part of the Experiment" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "id": "1d36c6a5", 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [] 468 | } 469 | ], 470 | "metadata": { 471 | "kernelspec": { 472 | "display_name": "conda_python3", 473 | "language": "python", 474 | "name": "conda_python3" 475 | }, 476 | "language_info": { 477 | "codemirror_mode": { 478 | "name": "ipython", 479 | "version": 3 480 | }, 481 | "file_extension": ".py", 482 | "mimetype": "text/x-python", 483 | "name": "python", 484 | "nbconvert_exporter": "python", 485 | "pygments_lexer": "ipython3", 486 | "version": "3.6.13" 487 | } 488 | }, 489 | "nbformat": 4, 490 | "nbformat_minor": 5 491 | } 492 | -------------------------------------------------------------------------------- /Chapter06/code/csv_loader.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.INFO) 12 | logger.addHandler(logging.StreamHandler(sys.stdout)) 13 | 14 | class CsvDataset(Dataset): 15 | 16 | def __init__(self, csv_path): 17 | 18 | self.csv_path = csv_path 19 | if os.path.isfile(csv_path): 20 | self.folder = False 21 | self.count, fmap, self.line_offset = self.get_line_count(csv_path) 22 | logger.debug(f"For {csv_path}, count = {self.count}") 23 | else: 24 | self.folder = True 25 | self.count, fmap, self.line_offset = self.get_folder_line_count(csv_path) 26 | 27 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 28 | 29 | def get_folder_line_count(self, d): 30 | cnt = 0 31 | all_map = {} 32 | all_lc = {} 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, _, line_offset = self.get_line_count(f) 35 | cnt = cnt + fcnt 36 | all_map[cnt] = f 37 | all_lc.update(line_offset) 38 | return cnt, all_map, all_lc 39 | 40 | def get_line_count(self, f): 41 | with open(f) as F: 42 | line_offset = [] 43 | offset = 0 44 | count = 0 45 | for line in F: 46 | line_offset.append(offset) 47 | offset += len(line) 48 | count = count + 1 49 | 50 | return count, {count: f}, {f: line_offset} 51 | 52 | def __len__(self): 53 | return self.count 54 | 55 | def __getitem__(self, idx): 56 | 57 | if torch.is_tensor(idx): 58 | idx = idx.tolist() 59 | logger.debug(f"Indices: {idx}") 60 | 61 | # 
This gives us the index in the line counts greater than or equal to the desired index. 62 | # The map value for this line count is the file name containing that row. 63 | klist = list(self.fmap.keys()) 64 | idx_m = bisect.bisect_left(klist, idx+1) 65 | 66 | # Grab the ending count of thisl file 67 | cur_idx = klist[idx_m] 68 | 69 | # grab the ending count of the previous file 70 | if idx_m > 0: 71 | prev_idx = klist[idx_m-1] 72 | else: 73 | prev_idx = 0 74 | 75 | # grab the file name for the desired row count 76 | fname = self.fmap[cur_idx] 77 | 78 | loff = self.line_offset[fname] 79 | with open(fname) as F: 80 | F.seek(loff[idx - prev_idx]) 81 | idx_line = F.readline() 82 | 83 | idx_parts = idx_line.split(',') 84 | 85 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) -------------------------------------------------------------------------------- /Chapter06/code/csv_loader_pd.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.DEBUG) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | def _make_gen(reader): 19 | b = reader(1024 * 1024) 20 | while b: 21 | yield b 22 | b = reader(1024*1024) 23 | 24 | def rawgencount(filename): 25 | f = open(filename, 'rb') 26 | f_gen = _make_gen(f.raw.read) 27 | return sum( buf.count(b'\n') for buf in f_gen ) 28 | 29 | 30 | class CsvDatasetPd(Dataset): 31 | 32 | def __init__(self, csv_path, max_files=5): 33 | 34 | self.csv_path = csv_path 35 | if os.path.isfile(csv_path): 36 | self.folder = False 37 | self.count, fmap = self.get_line_count(csv_path) 38 | logger.debug(f"For {csv_path}, count = {self.count}") 39 | else: 40 | self.folder = True 41 | self.count, fmap = self.get_folder_line_count(csv_path, max_files) 42 | 43 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 44 | self.max_files = max_files 45 | 46 | def get_folder_line_count(self, d, max_files): 47 | cnt = 0 48 | all_map = {} 49 | file_cnt = 0 50 | for f in glob.glob(os.path.join(d, '*.csv')): 51 | fcnt, _ = self.get_line_count(f) 52 | cnt = cnt + fcnt 53 | all_map[cnt] = f 54 | file_cnt = file_cnt + 1 55 | if file_cnt > max_files: 56 | break 57 | 58 | return cnt, all_map 59 | 60 | def get_line_count(self, f): 61 | count = rawgencount(f) 62 | return count, {count: f} 63 | 64 | def __len__(self): 65 | return self.count 66 | 67 | def __getitem__(self, idx): 68 | 69 | if torch.is_tensor(idx): 70 | idx = idx.tolist() 71 | logger.debug(f"Indices: {idx}") 72 | 73 | # This gives us the index in the line counts greater than or equal to the desired index. 74 | # The map value for this line count is the file name containing that row. 
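        # Worked example (hypothetical numbers): if self.fmap == {100: 'a.csv', 250: 'b.csv'}
        # (cumulative row counts), then for idx == 170, bisect_left([100, 250], 171) returns 1,
        # so the row lives in 'b.csv'; prev_idx == 100 and the local row offset is 170 - 100 = 70.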
75 | klist = list(self.fmap.keys()) 76 | idx_m = bisect.bisect_left(klist, idx+1) 77 | 78 | # Grab the ending count of thisl file 79 | cur_idx = klist[idx_m] 80 | 81 | # grab the ending count of the previous file 82 | if idx_m > 0: 83 | prev_idx = klist[idx_m-1] 84 | else: 85 | prev_idx = 0 86 | 87 | # grab the file name for the desired row count 88 | fname = self.fmap[cur_idx] 89 | 90 | #with open(fname) as F: 91 | # lines = list(islice(F, idx-prev_idx, idx-prev_idx+1)) 92 | idx_line = getline(fname, idx - prev_idx +1) 93 | 94 | idx_parts = idx_line.split(',') 95 | 96 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) 97 | 98 | -------------------------------------------------------------------------------- /Chapter06/code/csv_loader_simple.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.INFO) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | class CsvDatasetSimple(Dataset): 19 | 20 | def __init__(self, csv_path, max_files=5): 21 | 22 | self.csv_path = csv_path 23 | if os.path.isfile(csv_path): 24 | self.count, self.tensors = self.get_line_data(csv_path) 25 | logger.debug(f"For {csv_path}, count = {self.count}") 26 | else: 27 | self.count, self.tensors = self.get_folder_line_data(csv_path, max_files) 28 | 29 | def get_folder_line_data(self, d, max_files): 30 | cnt = 0 31 | file_cnt = 0 32 | tensors = [] 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, ftensors = self.get_line_data(f) 35 | cnt = cnt + fcnt 36 | tensors = tensors + ftensors 37 | file_cnt = file_cnt + 1 38 | if file_cnt > max_files: 39 | break 40 | 41 | return cnt, tensors 42 | 43 | def get_line_data(self, f): 44 | cnt = 0 45 | tensors = [] 46 | with open(f) as F: 47 | f_lines = F.readlines() 48 | for l in f_lines: 49 | cnt = cnt + 1 50 | parts = l.split(',') 51 | tensors.append(tuple([torch.tensor( [float(f) for f in parts[1:]] ), torch.tensor(float(parts[0]))])) 52 | 53 | return cnt, tensors 54 | 55 | def __len__(self): 56 | return self.count 57 | 58 | def __getitem__(self, idx): 59 | 60 | if torch.is_tensor(idx): 61 | idx = idx.tolist() 62 | logger.debug(f"Indices: {idx}") 63 | 64 | return self.tensors[idx] 65 | 66 | -------------------------------------------------------------------------------- /Chapter06/code/model_pytorch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import logging 5 | import sys 6 | 7 | logger = logging.getLogger(__name__) 8 | logger.setLevel(logging.INFO) 9 | logger.addHandler(logging.StreamHandler(sys.stdout)) 10 | 11 | class TabularNet(nn.Module): 12 | 13 | def __init__(self, n_cont, n_cat, emb_sz = 100, dropout_p = 0.1, layers=[200,100], cat_mask=[], cat_dim=[], y_min = 0., y_max = 1.): 14 | 15 | super(TabularNet, self).__init__() 16 | 17 | self.cat_mask = cat_mask 18 | self.cat_dim = cat_dim 19 | self.y_min = y_min 20 | self.y_max = y_max 21 | self.n_cat = n_cat 22 | self.n_cont = n_cont 23 | 24 | emb_dim = [] 25 | for ii in range(len(cat_mask)): 26 | if cat_mask[ii]: 27 | c_dim = cat_dim[ii] 28 | 
emb_dim.append(c_dim) 29 | #emb = nn.Embedding(c_dim, emb_sz) 30 | #self.embeddings.append(emb) 31 | #setattr(self, 'emb_{}'.format(ii), emb) 32 | 33 | self.embeddings = nn.ModuleList([nn.Embedding(c_dim, emb_sz) for c_dim in emb_dim]) 34 | 35 | modules = [] 36 | prev_size = n_cont + n_cat * emb_sz 37 | for l in layers: 38 | modules.append(nn.Linear(prev_size, l)) 39 | modules.append(nn.BatchNorm1d(l)) 40 | modules.append(nn.Dropout(dropout_p)) 41 | modules.append(nn.ReLU(inplace=True)) 42 | prev_size = l 43 | modules.append(nn.BatchNorm1d(prev_size)) 44 | modules.append(nn.Dropout(dropout_p)) 45 | modules.append(nn.Linear(prev_size, 1)) 46 | modules.append(nn.Sigmoid()) 47 | 48 | self.m_seq = nn.Sequential(*modules) 49 | self.emb_drop = nn.Dropout(dropout_p) 50 | self.bn_cont = nn.BatchNorm1d(n_cont) 51 | 52 | def forward(self, x_in): 53 | 54 | logger.debug(f"Forward pass on {x_in.shape}") 55 | x = None 56 | ee = 0 57 | for ii in range(len(self.cat_mask)): 58 | 59 | if self.cat_mask[ii]: 60 | logger.debug(f"Embedding: {self.embeddings[ee]}, input: {x_in[:,ii]}") 61 | logger.debug(f"cat Device for x_in: {x_in.get_device()}") 62 | logger.debug(f"cat Device for x_in slice: {x_in[:,ii].get_device()}") 63 | logger.debug(f"cat Device for embed: {next(self.embeddings[ee].parameters()).get_device()}") 64 | x_e = self.embeddings[ee](x_in[:,ii].to(device = x_in.get_device(), dtype= torch.long)) 65 | logger.debug(f"cat Device for x_e: {x_e.get_device()}") 66 | logger.debug(f"cat x_e = {x_e.shape}") 67 | if x is None: 68 | x = x_e 69 | else: 70 | x = torch.cat([x, x_e], 1) 71 | logger.debug(f"cat Device for x: {x.get_device()}") 72 | x = self.emb_drop(x) 73 | logger.debug(f"cat Device for x: {x.get_device()}") 74 | logger.debug(f"cat x = {x.shape}") 75 | ee = ee + 1 76 | else: 77 | logger.debug(f"cont Device for x_in: {x_in.get_device()}") 78 | x_cont = x_in[:, ii] # self.bn_cont(x_in[:, ii]) 79 | logger.debug(f"cont Device for x_cont: {x_cont.get_device()}") 80 | logger.debug(f"cont x_cont = {x_cont.shape}") 81 | if x is None: 82 | x = torch.unsqueeze(x_cont, 1) 83 | else: 84 | x = torch.cat([x, torch.unsqueeze(x_cont, 1)], 1) 85 | logger.debug(f"cont Device for x: {x.get_device()}") 86 | logger.debug(f"cont x = {x.shape}") 87 | 88 | return self.m_seq(x) * (self.y_max - self.y_min) + self.y_min -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch-dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 14 | 15 | 16 | logger = logging.getLogger(__name__) 17 | logger.setLevel(logging.INFO) 18 | logger.addHandler(logging.StreamHandler(sys.stdout)) 19 | 20 | class CUDANotFoundException(Exception): 21 | pass 22 | 23 | 24 | dist.init_process_group() 25 | 26 | 27 | def parse_args(): 28 | """ 29 | Parse arguments passed from the SageMaker API 30 | to the container 31 | """ 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | # Hyperparameters sent by the client are passed as command-line arguments to the script 36 | parser.add_argument("--epochs", type=int, default=1) 37 | 
parser.add_argument("--batch_size", type=int, default=64) 38 | parser.add_argument("--learning_rate", type=float, default=0.01) 39 | 40 | # Data directories 41 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 42 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 43 | 44 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 45 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 46 | 47 | return parser.parse_known_args() 48 | 49 | def train(model, device): 50 | """ 51 | Train the PyTorch model 52 | """ 53 | 54 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 55 | train_ds = CsvDatasetSimple(args.train) 56 | test_ds = CsvDatasetSimple(args.test) 57 | 58 | batch_size = args.batch_size 59 | epochs = args.epochs 60 | learning_rate = args.learning_rate 61 | logger.info( 62 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 63 | ) 64 | 65 | train_sampler = torch.utils.data.distributed.DistributedSampler( 66 | train_ds, num_replicas=args.world_size, rank=args.rank 67 | ) 68 | 69 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, drop_last=True, sampler=train_sampler) 70 | 71 | model = TabularNet(n_cont=9, n_cat=9, 72 | cat_mask = cat_mask, 73 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 74 | y_min = 0., y_max = 1., device=device) 75 | logger.debug(model) 76 | model = DDP(model).to(device) 77 | torch.cuda.set_device(args.local_rank) 78 | model.cuda(args.local_rank) 79 | 80 | criterion = nn.MSELoss() 81 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 82 | 83 | model.train() 84 | for epoch in range(epochs): 85 | batch_no = 0 86 | for x_train_batch, y_train_batch in train_dl: 87 | logger.debug(f"Training on shape {x_train_batch.shape}") 88 | y = model(x_train_batch.float()) 89 | loss = criterion(y.flatten(), y_train_batch.float().to(device)) 90 | if batch_no % 50 == 0: 91 | logger.info(f"batch {batch_no} -> loss: {loss}") 92 | optimizer.zero_grad() 93 | loss.backward() 94 | optimizer.step() 95 | batch_no +=1 96 | epoch += 1 97 | logger.info(f"epoch: {epoch} -> loss: {loss}") 98 | 99 | # evalutate on test set 100 | if args.rank == 0: 101 | model.eval() 102 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 103 | with torch.no_grad(): 104 | mse = 0. 105 | for x_test_batch, y_test_batch in test_dl: 106 | y = model(x_test_batch.float()) 107 | mse = mse + ((y - y_test_batch.to(device)) ** 2).sum() / x_test_batch.shape[0] 108 | 109 | mse = mse / len(test_dl.dataset) 110 | logger.info(f"Test MSE: {mse}") 111 | 112 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 113 | # PyTorch requires that the inference script must 114 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 
115 | inference_code_path = args.model_dir + "/code/" 116 | 117 | if not os.path.exists(inference_code_path): 118 | os.mkdir(inference_code_path) 119 | logger.info("Created a folder at {}!".format(inference_code_path)) 120 | 121 | shutil.copy("train_pytorch.py", inference_code_path) 122 | shutil.copy("model_pytorch.py", inference_code_path) 123 | shutil.copy("csv_loader.py", inference_code_path) 124 | logger.info("Saving models files to {}".format(inference_code_path)) 125 | 126 | 127 | if __name__ == "__main__": 128 | 129 | args, _ = parse_args() 130 | args.world_size = dist.get_world_size() 131 | args.rank = rank = dist.get_rank() 132 | args.local_rank = local_rank = dist.get_local_rank() 133 | args.batch_size //= args.world_size // 8 134 | args.batch_size = max(args.batch_size, 1) 135 | 136 | if not torch.cuda.is_available(): 137 | raise CUDANotFoundException( 138 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 139 | ) 140 | 141 | torch.manual_seed(1) 142 | 143 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 144 | 145 | train() 146 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.INFO) 17 | logger.addHandler(logging.StreamHandler(sys.stdout)) 18 | 19 | 20 | def parse_args(): 21 | """ 22 | Parse arguments passed from the SageMaker API 23 | to the container 24 | """ 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | # Hyperparameters sent by the client are passed as command-line arguments to the script 29 | parser.add_argument("--epochs", type=int, default=1) 30 | parser.add_argument("--batch_size", type=int, default=64) 31 | parser.add_argument("--learning_rate", type=float, default=0.01) 32 | 33 | # Data directories 34 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 35 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 36 | 37 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 38 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 39 | 40 | return parser.parse_known_args() 41 | 42 | def train(): 43 | """ 44 | Train the PyTorch model 45 | """ 46 | 47 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 48 | train_ds = CsvDatasetSimple(args.train) 49 | test_ds = CsvDatasetSimple(args.test) 50 | 51 | batch_size = args.batch_size 52 | epochs = args.epochs 53 | learning_rate = args.learning_rate 54 | logger.info( 55 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 56 | ) 57 | 58 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 59 | num_workers=0, 60 | pin_memory=True, 61 | drop_last=True) 62 | 63 | model = TabularNet(n_cont=9, n_cat=9, 64 | cat_mask = cat_mask, 65 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 66 | y_min = 0., y_max = 1.) 
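    # TabularNet arguments (see model_pytorch.py): cat_mask flags which of the 18 input
    # columns are categorical; cat_dim gives the embedding vocabulary size for each
    # categorical column (0 for continuous positions); n_cont/n_cat are the counts of
    # continuous and categorical columns; y_min/y_max rescale the sigmoid output range.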
67 | logger.debug(model) 68 | model = model.to(device) 69 | criterion = nn.MSELoss() 70 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 71 | 72 | model.train() 73 | for epoch in range(epochs): 74 | batch_no = 0 75 | for x_train_batch, y_train_batch in train_dl: 76 | x_train_batch_d = x_train_batch.to(device) 77 | y_train_batch_d = y_train_batch.to(device) 78 | optimizer.zero_grad() 79 | 80 | logger.debug(f"Training on shape {x_train_batch.shape}") 81 | y = model(x_train_batch_d.float()) 82 | loss = criterion(y.flatten(), y_train_batch_d.float()) 83 | if batch_no % 50 == 0: 84 | logger.info(f"batch {batch_no} -> loss: {loss}") 85 | 86 | loss.backward() 87 | optimizer.step() 88 | batch_no +=1 89 | epoch += 1 90 | logger.info(f"epoch: {epoch} -> loss: {loss}") 91 | 92 | # evalutate on test set 93 | model.eval() 94 | test_dl = DataLoader(test_ds, batch_size, shuffle=False, drop_last=True) 95 | with torch.no_grad(): 96 | mse = 0. 97 | for x_test_batch, y_test_batch in test_dl: 98 | x_test_batch_d = x_test_batch.to(device) 99 | y_test_batch_d = y_test_batch.to(device) 100 | 101 | y = model(x_test_batch_d.float()) 102 | mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 103 | 104 | mse = mse / len(test_dl.dataset) 105 | logger.info(f"Test MSE: {mse}") 106 | 107 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 108 | # PyTorch requires that the inference script must 109 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 110 | inference_code_path = args.model_dir + "/code/" 111 | 112 | if not os.path.exists(inference_code_path): 113 | os.mkdir(inference_code_path) 114 | logger.info("Created a folder at {}!".format(inference_code_path)) 115 | 116 | #shutil.copy("train_pytorch.py", inference_code_path) 117 | #shutil.copy("model_pytorch.py", inference_code_path) 118 | #shutil.copy("csv_loader.py", inference_code_path) 119 | logger.info("Saving models files to {}".format(inference_code_path)) 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | args, _ = parse_args() 125 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 126 | 127 | train() 128 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | import torch.nn.functional as F 12 | 13 | from model_pytorch import TabularNet 14 | from csv_loader_simple import CsvDatasetSimple 15 | 16 | ##Necessary to use SM distributed libraries 17 | import smdistributed.dataparallel.torch.distributed as dist 18 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 19 | 20 | 21 | logger = logging.getLogger(__name__) 22 | logger.setLevel(logging.INFO) 23 | logger.addHandler(logging.StreamHandler(sys.stdout)) 24 | 25 | class CUDANotFoundException(Exception): 26 | pass 27 | 28 | ##Initialize the distributed library 29 | dist.init_process_group() 30 | 31 | 32 | def parse_args(): 33 | """ 34 | Parse arguments passed from the SageMaker API 35 | to the container 36 | """ 37 | 38 | parser = argparse.ArgumentParser() 39 | 40 | # Hyperparameters sent by the client are passed as command-line arguments to the script 41 | parser.add_argument("--epochs", type=int, 
default=1) 42 | parser.add_argument("--batch_size", type=int, default=64) 43 | parser.add_argument("--learning_rate", type=float, default=0.01) 44 | 45 | # Data directories 46 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 47 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 48 | 49 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 50 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 51 | 52 | parser.add_argument( 53 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 54 | ) 55 | 56 | return parser.parse_known_args() 57 | 58 | def train(args): 59 | """ 60 | Train the PyTorch model 61 | """ 62 | 63 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 64 | train_ds = CsvDatasetSimple(args.train) 65 | test_ds = CsvDatasetSimple(args.test) 66 | 67 | batch_size = args.batch_size 68 | epochs = args.epochs 69 | learning_rate = args.learning_rate 70 | logger.info( 71 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 72 | ) 73 | logger.info(f"World size: {args.world_size}") 74 | logger.info(f"Rank: {args.rank}") 75 | logger.info(f"Local Rank: {args.local_rank}") 76 | 77 | train_sampler = torch.utils.data.distributed.DistributedSampler( 78 | train_ds, num_replicas=args.world_size, rank=args.rank 79 | ) 80 | logger.debug(f"Created distributed sampler") 81 | 82 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 83 | num_workers=0, 84 | pin_memory=True, 85 | drop_last=True, sampler=train_sampler) 86 | logger.debug(f"Created train data loader") 87 | 88 | model = TabularNet(n_cont=9, n_cat=9, 89 | cat_mask = cat_mask, 90 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 91 | y_min = 0., y_max = 1.) 
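    # The wrapping below follows the smdistributed data-parallel examples: the model is
    # moved to the GPU and wrapped in DDP; broadcast_buffers=False skips re-broadcasting
    # buffers (e.g. BatchNorm running statistics) from rank 0 on every forward pass, and
    # torch.cuda.set_device(args.local_rank) pins this process to its own GPU.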
92 | logger.debug("Created model") 93 | logger.debug(model) 94 | model = DDP(model.to(device), broadcast_buffers=False) 95 | logger.debug("created DDP") 96 | torch.cuda.set_device(args.local_rank) 97 | logger.debug("Set device on CUDA") 98 | model.cuda(args.local_rank) 99 | logger.debug(f"Set model CUDA to {args.local_rank}") 100 | 101 | criterion = nn.MSELoss() 102 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 103 | logger.debug("Created loss fn and optimizer") 104 | 105 | model.train() 106 | for epoch in range(epochs): 107 | batch_no = 0 108 | for x_train_batch, y_train_batch in train_dl: 109 | logger.debug(f"Working on batch {batch_no}") 110 | x_train_batch_d = x_train_batch.to(device) 111 | y_train_batch_d = y_train_batch.to(device) 112 | 113 | optimizer.zero_grad() 114 | logger.debug("Did optimizer grad") 115 | 116 | logger.debug(f"Training on shape {x_train_batch.shape}") 117 | y = model(x_train_batch_d.float()) 118 | logger.debug("Did forward pass") 119 | loss = criterion(y.flatten(), y_train_batch_d.float()) 120 | logger.debug("Got loss") 121 | if batch_no % 50 == 0: 122 | logger.info(f"batch {batch_no} -> loss: {loss}") 123 | 124 | loss.backward() 125 | logger.debug("Did backward step") 126 | optimizer.step() 127 | logger.debug(f"Did optimizer step, batch {batch_no}") 128 | batch_no +=1 129 | epoch += 1 130 | logger.info(f"epoch: {epoch} -> loss: {loss}") 131 | 132 | # evalutate on test set 133 | if args.rank == 0: 134 | logger.info(f"Starting test eval on rank 0") 135 | model.eval() 136 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 137 | logger.info(f"Loaded test data set") 138 | with torch.no_grad(): 139 | mse = 0. 140 | batch_no = 0 141 | for x_test_batch, y_test_batch in test_dl: 142 | x_test_batch_d = x_test_batch.to(device) 143 | y_test_batch_d = y_test_batch.to(device) 144 | 145 | y = model(x_test_batch_d.float()) 146 | mse += F.mse_loss(y, y_test_batch_d, reduction="sum") 147 | #mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 148 | 149 | if batch_no % 50 == 0: 150 | logger.info(f"batch {batch_no} -> MSE: {mse}") 151 | batch_no +=1 152 | 153 | mse = mse / len(test_dl.dataset) 154 | logger.info(f"Test MSE: {mse}") 155 | 156 | if args.rank == 0: 157 | logger.info("Saving model on rank 0") 158 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 159 | 160 | 161 | if __name__ == "__main__": 162 | 163 | args, _ = parse_args() 164 | args.world_size = dist.get_world_size() 165 | args.rank = rank = dist.get_rank() 166 | args.local_rank = local_rank = dist.get_local_rank() 167 | args.batch_size //= args.world_size // 8 168 | args.batch_size = max(args.batch_size, 1) 169 | 170 | if not torch.cuda.is_available(): 171 | raise CUDANotFoundException( 172 | "Must run smdistributed.dataparallel on CUDA-capable devices." 
173 | ) 174 | 175 | torch.manual_seed(1) 176 | 177 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 178 | 179 | train(args) 180 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch_model_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | import torch.nn.functional as F 11 | import torch.optim as optim 12 | #from torchnet.dataset import SplitDataset 13 | from torchvision import datasets 14 | 15 | from torch.utils.data import DataLoader, TensorDataset 16 | from model_pytorch import TabularNet 17 | from csv_loader_simple import CsvDatasetSimple 18 | 19 | # SMP: Import and initialize SMP API. 20 | import smdistributed.modelparallel.torch as smp 21 | 22 | 23 | logger = logging.getLogger(__name__) 24 | logger.setLevel(logging.INFO) 25 | logger.addHandler(logging.StreamHandler(sys.stdout)) 26 | 27 | smp.init() 28 | 29 | 30 | def parse_args(): 31 | """ 32 | Parse arguments passed from the SageMaker API 33 | to the container 34 | """ 35 | 36 | parser = argparse.ArgumentParser() 37 | 38 | # Hyperparameters sent by the client are passed as command-line arguments to the script 39 | parser.add_argument("--epochs", type=int, default=1) 40 | parser.add_argument("--batch_size", type=int, default=64) 41 | parser.add_argument("--learning_rate", type=float, default=0.01) 42 | 43 | # Data directories 44 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 45 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 46 | 47 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 48 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 49 | 50 | return parser.parse_known_args() 51 | 52 | 53 | # smdistributed: Define smp.step. Return any tensors needed outside. 54 | @smp.step 55 | def train_step(model, data, target): 56 | #logger.info("**** TRAIN_STEP method target is {} ".format(target)) 57 | #print("target is : ", target) 58 | output = model(data) 59 | long_target = target.long() 60 | #loss = F.nll_loss(output, target, reduction="mean") 61 | loss = F.nll_loss(output, long_target, reduction="mean") 62 | model.backward(loss) 63 | return output, loss 64 | 65 | def train(model, device, train_loader, optimizer): 66 | print("Into init_train") 67 | model.train() 68 | for batch_idx, (data, target) in enumerate(train_loader): 69 | #logger.info("**** TRAIN method target is {}".format(target)) 70 | # smdistributed: Move input tensors to the GPU ID used by the current process, 71 | # based on the set_device call. 72 | data, target = data.to(device), target.to(device) 73 | optimizer.zero_grad() 74 | # Return value, loss_mb is a StepOutput object 75 | _, loss_mb = train_step(model, data, target) 76 | 77 | # smdistributed: Average the loss across microbatches. 
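        # Because train_step is decorated with @smp.step, the model-parallel runtime
        # splits each batch into microbatches and pipelines them across the model
        # partitions; its return values are StepOutput objects rather than plain
        # tensors. reduce_mean() below collapses the per-microbatch losses into a
        # single averaged loss tensor (the backward pass itself already ran inside
        # train_step via model.backward(loss)).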
78 | loss = loss_mb.reduce_mean() 79 | 80 | optimizer.step() 81 | 82 | 83 | def init_train(): 84 | """ 85 | Train the PyTorch model 86 | """ 87 | 88 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 89 | train_ds = CsvDatasetSimple(args.train) 90 | test_ds = CsvDatasetSimple(args.test) 91 | 92 | batch_size = args.batch_size 93 | epochs = args.epochs 94 | learning_rate = args.learning_rate 95 | 96 | logger.info( 97 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 98 | ) 99 | 100 | # smdistributed: initialize the backend 101 | smp.init() 102 | 103 | # smdistributed: Set the device to the GPU ID used by the current process. 104 | # Input tensors should be transferred to this device. 105 | torch.cuda.set_device(smp.local_rank()) 106 | device = torch.device("cuda") 107 | 108 | # smdistributed: Download only on a single process per instance. 109 | # When this is not present, the file is corrupted by multiple processes trying 110 | # to download and extract at the same time 111 | #dataset = datasets.MNIST("../data", train=True, download=False) 112 | dataset = train_ds 113 | 114 | # smdistributed: Shard the dataset based on data-parallel ranks 115 | if smp.dp_size() > 1: 116 | partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())} 117 | dataset = SplitDataset(dataset, partitions=partitions_dict) 118 | dataset.select(f"{smp.dp_rank()}") 119 | 120 | 121 | # smdistributed: Set drop_last=True to ensure that batch size is always divisible 122 | # by the number of microbatches 123 | train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True) 124 | 125 | model = TabularNet(n_cont=9, n_cat=9, 126 | cat_mask = cat_mask, 127 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 128 | y_min = 0., y_max = 1.) 129 | 130 | logger.debug(model) 131 | 132 | optimizer = optim.Adadelta(model.parameters(), lr=4.0) 133 | 134 | 135 | # SMP: Instantiate DistributedModel object using the model. 136 | # This handles distributing the model among multiple ranks 137 | # behind the scenes 138 | # If horovod is enabled this will do an overlapping_all_reduce by 139 | # default. 140 | 141 | # smdistributed: Use the DistributedModel container to provide the model 142 | # to be partitioned across different ranks. For the rest of the script, 143 | # the returned DistributedModel object should be used in place of 144 | # the model provided for DistributedModel class instantiation. 
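    # After this point the smp-wrapped model and optimizer are used in place of the
    # originals. Note that the data-parallel sharding branch above relies on
    # SplitDataset, but the `from torchnet.dataset import SplitDataset` import is
    # commented out at the top of this file, so running with smp.dp_size() > 1 would
    # raise a NameError until that import is restored.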
145 | model = smp.DistributedModel(model) 146 | 147 | optimizer = smp.DistributedOptimizer(optimizer) 148 | 149 | train(model, device, train_loader, optimizer) 150 | 151 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 152 | 153 | 154 | if __name__ == "__main__": 155 | 156 | args, _ = parse_args() 157 | 158 | init_train() 159 | -------------------------------------------------------------------------------- /Chapter07/code/csv_loader.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.INFO) 12 | logger.addHandler(logging.StreamHandler(sys.stdout)) 13 | 14 | class CsvDataset(Dataset): 15 | 16 | def __init__(self, csv_path): 17 | 18 | self.csv_path = csv_path 19 | if os.path.isfile(csv_path): 20 | self.folder = False 21 | self.count, fmap, self.line_offset = self.get_line_count(csv_path) 22 | logger.debug(f"For {csv_path}, count = {self.count}") 23 | else: 24 | self.folder = True 25 | self.count, fmap, self.line_offset = self.get_folder_line_count(csv_path) 26 | 27 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 28 | 29 | def get_folder_line_count(self, d): 30 | cnt = 0 31 | all_map = {} 32 | all_lc = {} 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, _, line_offset = self.get_line_count(f) 35 | cnt = cnt + fcnt 36 | all_map[cnt] = f 37 | all_lc.update(line_offset) 38 | return cnt, all_map, all_lc 39 | 40 | def get_line_count(self, f): 41 | with open(f) as F: 42 | line_offset = [] 43 | offset = 0 44 | count = 0 45 | for line in F: 46 | line_offset.append(offset) 47 | offset += len(line) 48 | count = count + 1 49 | 50 | return count, {count: f}, {f: line_offset} 51 | 52 | def __len__(self): 53 | return self.count 54 | 55 | def __getitem__(self, idx): 56 | 57 | if torch.is_tensor(idx): 58 | idx = idx.tolist() 59 | logger.debug(f"Indices: {idx}") 60 | 61 | # This gives us the index in the line counts greater than or equal to the desired index. 62 | # The map value for this line count is the file name containing that row. 
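        # Worked example with hypothetical numbers (not from the original data): if
        # self.fmap is {1000: 'a.csv', 2500: 'b.csv'} (cumulative line counts) and
        # idx is 1700, then bisect_left(klist, 1701) returns 1, so cur_idx = 2500,
        # fname = 'b.csv', and prev_idx = 1000; we then seek to the stored byte
        # offset of line 1700 - 1000 = 700 within b.csv and read that single row.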
63 | klist = list(self.fmap.keys()) 64 | idx_m = bisect.bisect_left(klist, idx+1) 65 | 66 | # Grab the ending count of thisl file 67 | cur_idx = klist[idx_m] 68 | 69 | # grab the ending count of the previous file 70 | if idx_m > 0: 71 | prev_idx = klist[idx_m-1] 72 | else: 73 | prev_idx = 0 74 | 75 | # grab the file name for the desired row count 76 | fname = self.fmap[cur_idx] 77 | 78 | loff = self.line_offset[fname] 79 | with open(fname) as F: 80 | F.seek(loff[idx - prev_idx]) 81 | idx_line = F.readline() 82 | 83 | idx_parts = idx_line.split(',') 84 | 85 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) -------------------------------------------------------------------------------- /Chapter07/code/csv_loader_pd.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.DEBUG) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | def _make_gen(reader): 19 | b = reader(1024 * 1024) 20 | while b: 21 | yield b 22 | b = reader(1024*1024) 23 | 24 | def rawgencount(filename): 25 | f = open(filename, 'rb') 26 | f_gen = _make_gen(f.raw.read) 27 | return sum( buf.count(b'\n') for buf in f_gen ) 28 | 29 | 30 | class CsvDatasetPd(Dataset): 31 | 32 | def __init__(self, csv_path, max_files=5): 33 | 34 | self.csv_path = csv_path 35 | if os.path.isfile(csv_path): 36 | self.folder = False 37 | self.count, fmap = self.get_line_count(csv_path) 38 | logger.debug(f"For {csv_path}, count = {self.count}") 39 | else: 40 | self.folder = True 41 | self.count, fmap = self.get_folder_line_count(csv_path, max_files) 42 | 43 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 44 | self.max_files = max_files 45 | 46 | def get_folder_line_count(self, d, max_files): 47 | cnt = 0 48 | all_map = {} 49 | file_cnt = 0 50 | for f in glob.glob(os.path.join(d, '*.csv')): 51 | fcnt, _ = self.get_line_count(f) 52 | cnt = cnt + fcnt 53 | all_map[cnt] = f 54 | file_cnt = file_cnt + 1 55 | if file_cnt > max_files: 56 | break 57 | 58 | return cnt, all_map 59 | 60 | def get_line_count(self, f): 61 | count = rawgencount(f) 62 | return count, {count: f} 63 | 64 | def __len__(self): 65 | return self.count 66 | 67 | def __getitem__(self, idx): 68 | 69 | if torch.is_tensor(idx): 70 | idx = idx.tolist() 71 | logger.debug(f"Indices: {idx}") 72 | 73 | # This gives us the index in the line counts greater than or equal to the desired index. 74 | # The map value for this line count is the file name containing that row. 
75 | klist = list(self.fmap.keys()) 76 | idx_m = bisect.bisect_left(klist, idx+1) 77 | 78 | # Grab the ending count of thisl file 79 | cur_idx = klist[idx_m] 80 | 81 | # grab the ending count of the previous file 82 | if idx_m > 0: 83 | prev_idx = klist[idx_m-1] 84 | else: 85 | prev_idx = 0 86 | 87 | # grab the file name for the desired row count 88 | fname = self.fmap[cur_idx] 89 | 90 | #with open(fname) as F: 91 | # lines = list(islice(F, idx-prev_idx, idx-prev_idx+1)) 92 | idx_line = getline(fname, idx - prev_idx +1) 93 | 94 | idx_parts = idx_line.split(',') 95 | 96 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) 97 | 98 | -------------------------------------------------------------------------------- /Chapter07/code/csv_loader_simple.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.INFO) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | class CsvDatasetSimple(Dataset): 19 | 20 | def __init__(self, csv_path, max_files=5): 21 | 22 | self.csv_path = csv_path 23 | if os.path.isfile(csv_path): 24 | self.count, self.tensors = self.get_line_data(csv_path) 25 | logger.debug(f"For {csv_path}, count = {self.count}") 26 | else: 27 | self.count, self.tensors = self.get_folder_line_data(csv_path, max_files) 28 | 29 | def get_folder_line_data(self, d, max_files): 30 | cnt = 0 31 | file_cnt = 0 32 | tensors = [] 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, ftensors = self.get_line_data(f) 35 | cnt = cnt + fcnt 36 | tensors = tensors + ftensors 37 | file_cnt = file_cnt + 1 38 | if file_cnt > max_files: 39 | break 40 | 41 | return cnt, tensors 42 | 43 | def get_line_data(self, f): 44 | cnt = 0 45 | tensors = [] 46 | with open(f) as F: 47 | f_lines = F.readlines() 48 | for l in f_lines: 49 | cnt = cnt + 1 50 | parts = l.split(',') 51 | tensors.append(tuple([torch.tensor( [float(f) for f in parts[1:]] ), torch.tensor(float(parts[0]))])) 52 | 53 | return cnt, tensors 54 | 55 | def __len__(self): 56 | return self.count 57 | 58 | def __getitem__(self, idx): 59 | 60 | if torch.is_tensor(idx): 61 | idx = idx.tolist() 62 | logger.debug(f"Indices: {idx}") 63 | 64 | return self.tensors[idx] 65 | 66 | -------------------------------------------------------------------------------- /Chapter07/code/model_pytorch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import logging 5 | import sys 6 | 7 | logger = logging.getLogger(__name__) 8 | logger.setLevel(logging.INFO) 9 | logger.addHandler(logging.StreamHandler(sys.stdout)) 10 | 11 | class TabularNet(nn.Module): 12 | 13 | def __init__(self, n_cont, n_cat, emb_sz = 100, dropout_p = 0.1, layers=[200,100], cat_mask=[], cat_dim=[], y_min = 0., y_max = 1.): 14 | 15 | super(TabularNet, self).__init__() 16 | 17 | self.cat_mask = cat_mask 18 | self.cat_dim = cat_dim 19 | self.y_min = y_min 20 | self.y_max = y_max 21 | self.n_cat = n_cat 22 | self.n_cont = n_cont 23 | 24 | emb_dim = [] 25 | for ii in range(len(cat_mask)): 26 | if cat_mask[ii]: 27 | c_dim = cat_dim[ii] 28 | 
emb_dim.append(c_dim) 29 | #emb = nn.Embedding(c_dim, emb_sz) 30 | #self.embeddings.append(emb) 31 | #setattr(self, 'emb_{}'.format(ii), emb) 32 | 33 | self.embeddings = nn.ModuleList([nn.Embedding(c_dim, emb_sz) for c_dim in emb_dim]) 34 | 35 | modules = [] 36 | prev_size = n_cont + n_cat * emb_sz 37 | for l in layers: 38 | modules.append(nn.Linear(prev_size, l)) 39 | modules.append(nn.BatchNorm1d(l)) 40 | modules.append(nn.Dropout(dropout_p)) 41 | modules.append(nn.ReLU(inplace=True)) 42 | prev_size = l 43 | modules.append(nn.BatchNorm1d(prev_size)) 44 | modules.append(nn.Dropout(dropout_p)) 45 | modules.append(nn.Linear(prev_size, 1)) 46 | modules.append(nn.Sigmoid()) 47 | 48 | self.m_seq = nn.Sequential(*modules) 49 | self.emb_drop = nn.Dropout(dropout_p) 50 | self.bn_cont = nn.BatchNorm1d(n_cont) 51 | 52 | def forward(self, x_in): 53 | 54 | logger.debug(f"Forward pass on {x_in.shape}") 55 | x = None 56 | ee = 0 57 | for ii in range(len(self.cat_mask)): 58 | 59 | if self.cat_mask[ii]: 60 | logger.debug(f"Embedding: {self.embeddings[ee]}, input: {x_in[:,ii]}") 61 | logger.debug(f"cat Device for x_in: {x_in.get_device()}") 62 | logger.debug(f"cat Device for x_in slice: {x_in[:,ii].get_device()}") 63 | logger.debug(f"cat Device for embed: {next(self.embeddings[ee].parameters()).get_device()}") 64 | x_e = self.embeddings[ee](x_in[:,ii].to(device = x_in.get_device(), dtype= torch.long)) 65 | logger.debug(f"cat Device for x_e: {x_e.get_device()}") 66 | logger.debug(f"cat x_e = {x_e.shape}") 67 | if x is None: 68 | x = x_e 69 | else: 70 | x = torch.cat([x, x_e], 1) 71 | logger.debug(f"cat Device for x: {x.get_device()}") 72 | x = self.emb_drop(x) 73 | logger.debug(f"cat Device for x: {x.get_device()}") 74 | logger.debug(f"cat x = {x.shape}") 75 | ee = ee + 1 76 | else: 77 | logger.debug(f"cont Device for x_in: {x_in.get_device()}") 78 | x_cont = x_in[:, ii] # self.bn_cont(x_in[:, ii]) 79 | logger.debug(f"cont Device for x_cont: {x_cont.get_device()}") 80 | logger.debug(f"cont x_cont = {x_cont.shape}") 81 | if x is None: 82 | x = torch.unsqueeze(x_cont, 1) 83 | else: 84 | x = torch.cat([x, torch.unsqueeze(x_cont, 1)], 1) 85 | logger.debug(f"cont Device for x: {x.get_device()}") 86 | logger.debug(f"cont x = {x.shape}") 87 | 88 | return self.m_seq(x) * (self.y_max - self.y_min) + self.y_min -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch-dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 14 | 15 | 16 | logger = logging.getLogger(__name__) 17 | logger.setLevel(logging.INFO) 18 | logger.addHandler(logging.StreamHandler(sys.stdout)) 19 | 20 | class CUDANotFoundException(Exception): 21 | pass 22 | 23 | 24 | dist.init_process_group() 25 | 26 | 27 | def parse_args(): 28 | """ 29 | Parse arguments passed from the SageMaker API 30 | to the container 31 | """ 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | # Hyperparameters sent by the client are passed as command-line arguments to the script 36 | parser.add_argument("--epochs", type=int, default=1) 37 | 
parser.add_argument("--batch_size", type=int, default=64) 38 | parser.add_argument("--learning_rate", type=float, default=0.01) 39 | 40 | # Data directories 41 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 42 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 43 | 44 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 45 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 46 | 47 | return parser.parse_known_args() 48 | 49 | def train(model, device): 50 | """ 51 | Train the PyTorch model 52 | """ 53 | 54 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 55 | train_ds = CsvDatasetSimple(args.train) 56 | test_ds = CsvDatasetSimple(args.test) 57 | 58 | batch_size = args.batch_size 59 | epochs = args.epochs 60 | learning_rate = args.learning_rate 61 | logger.info( 62 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 63 | ) 64 | 65 | train_sampler = torch.utils.data.distributed.DistributedSampler( 66 | train_ds, num_replicas=args.world_size, rank=args.rank 67 | ) 68 | 69 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, drop_last=True, sampler=train_sampler) 70 | 71 | model = TabularNet(n_cont=9, n_cat=9, 72 | cat_mask = cat_mask, 73 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 74 | y_min = 0., y_max = 1., device=device) 75 | logger.debug(model) 76 | model = DDP(model).to(device) 77 | torch.cuda.set_device(args.local_rank) 78 | model.cuda(args.local_rank) 79 | 80 | criterion = nn.MSELoss() 81 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 82 | 83 | model.train() 84 | for epoch in range(epochs): 85 | batch_no = 0 86 | for x_train_batch, y_train_batch in train_dl: 87 | logger.debug(f"Training on shape {x_train_batch.shape}") 88 | y = model(x_train_batch.float()) 89 | loss = criterion(y.flatten(), y_train_batch.float().to(device)) 90 | if batch_no % 50 == 0: 91 | logger.info(f"batch {batch_no} -> loss: {loss}") 92 | optimizer.zero_grad() 93 | loss.backward() 94 | optimizer.step() 95 | batch_no +=1 96 | epoch += 1 97 | logger.info(f"epoch: {epoch} -> loss: {loss}") 98 | 99 | # evalutate on test set 100 | if args.rank == 0: 101 | model.eval() 102 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 103 | with torch.no_grad(): 104 | mse = 0. 105 | for x_test_batch, y_test_batch in test_dl: 106 | y = model(x_test_batch.float()) 107 | mse = mse + ((y - y_test_batch.to(device)) ** 2).sum() / x_test_batch.shape[0] 108 | 109 | mse = mse / len(test_dl.dataset) 110 | logger.info(f"Test MSE: {mse}") 111 | 112 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 113 | # PyTorch requires that the inference script must 114 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 
115 | inference_code_path = args.model_dir + "/code/" 116 | 117 | if not os.path.exists(inference_code_path): 118 | os.mkdir(inference_code_path) 119 | logger.info("Created a folder at {}!".format(inference_code_path)) 120 | 121 | shutil.copy("train_pytorch.py", inference_code_path) 122 | shutil.copy("model_pytorch.py", inference_code_path) 123 | shutil.copy("csv_loader.py", inference_code_path) 124 | logger.info("Saving models files to {}".format(inference_code_path)) 125 | 126 | 127 | if __name__ == "__main__": 128 | 129 | args, _ = parse_args() 130 | args.world_size = dist.get_world_size() 131 | args.rank = rank = dist.get_rank() 132 | args.local_rank = local_rank = dist.get_local_rank() 133 | args.batch_size //= args.world_size // 8 134 | args.batch_size = max(args.batch_size, 1) 135 | 136 | if not torch.cuda.is_available(): 137 | raise CUDANotFoundException( 138 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 139 | ) 140 | 141 | torch.manual_seed(1) 142 | 143 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 144 | 145 | train() 146 | -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.INFO) 17 | logger.addHandler(logging.StreamHandler(sys.stdout)) 18 | 19 | 20 | def parse_args(): 21 | """ 22 | Parse arguments passed from the SageMaker API 23 | to the container 24 | """ 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | # Hyperparameters sent by the client are passed as command-line arguments to the script 29 | parser.add_argument("--epochs", type=int, default=1) 30 | parser.add_argument("--batch_size", type=int, default=64) 31 | parser.add_argument("--learning_rate", type=float, default=0.01) 32 | 33 | # Data directories 34 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 35 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 36 | 37 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 38 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 39 | 40 | return parser.parse_known_args() 41 | 42 | def train(): 43 | """ 44 | Train the PyTorch model 45 | """ 46 | 47 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 48 | train_ds = CsvDatasetSimple(args.train) 49 | test_ds = CsvDatasetSimple(args.test) 50 | 51 | batch_size = args.batch_size 52 | epochs = args.epochs 53 | learning_rate = args.learning_rate 54 | logger.info( 55 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 56 | ) 57 | 58 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 59 | num_workers=0, 60 | pin_memory=True, 61 | drop_last=True) 62 | 63 | model = TabularNet(n_cont=9, n_cat=9, 64 | cat_mask = cat_mask, 65 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 66 | y_min = 0., y_max = 1.) 
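    # Notes on the arguments above: n_cont/n_cat are the counts of continuous and
    # categorical input columns, cat_mask flags which of the 18 columns are
    # categorical, and cat_dim gives each categorical column's cardinality (its
    # embedding-table size in model_pytorch.py). y_min/y_max bound the regression
    # output, since TabularNet.forward rescales a sigmoid:
    #     output = sigmoid(head(x)) * (y_max - y_min) + y_min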
67 | logger.debug(model) 68 | model = model.to(device) 69 | criterion = nn.MSELoss() 70 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 71 | 72 | model.train() 73 | for epoch in range(epochs): 74 | batch_no = 0 75 | for x_train_batch, y_train_batch in train_dl: 76 | x_train_batch_d = x_train_batch.to(device) 77 | y_train_batch_d = y_train_batch.to(device) 78 | optimizer.zero_grad() 79 | 80 | logger.debug(f"Training on shape {x_train_batch.shape}") 81 | y = model(x_train_batch_d.float()) 82 | loss = criterion(y.flatten(), y_train_batch_d.float()) 83 | if batch_no % 50 == 0: 84 | logger.info(f"batch {batch_no} -> loss: {loss}") 85 | 86 | loss.backward() 87 | optimizer.step() 88 | batch_no +=1 89 | epoch += 1 90 | logger.info(f"epoch: {epoch} -> loss: {loss}") 91 | 92 | # evalutate on test set 93 | model.eval() 94 | test_dl = DataLoader(test_ds, batch_size, shuffle=False, drop_last=True) 95 | with torch.no_grad(): 96 | mse = 0. 97 | for x_test_batch, y_test_batch in test_dl: 98 | x_test_batch_d = x_test_batch.to(device) 99 | y_test_batch_d = y_test_batch.to(device) 100 | 101 | y = model(x_test_batch_d.float()) 102 | mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 103 | 104 | mse = mse / len(test_dl.dataset) 105 | logger.info(f"Test MSE: {mse}") 106 | 107 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 108 | # PyTorch requires that the inference script must 109 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 110 | inference_code_path = args.model_dir + "/code/" 111 | 112 | if not os.path.exists(inference_code_path): 113 | os.mkdir(inference_code_path) 114 | logger.info("Created a folder at {}!".format(inference_code_path)) 115 | 116 | #shutil.copy("train_pytorch.py", inference_code_path) 117 | #shutil.copy("model_pytorch.py", inference_code_path) 118 | #shutil.copy("csv_loader.py", inference_code_path) 119 | logger.info("Saving models files to {}".format(inference_code_path)) 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | args, _ = parse_args() 125 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 126 | 127 | train() 128 | -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | import torch.nn.functional as F 12 | 13 | from model_pytorch import TabularNet 14 | from csv_loader_simple import CsvDatasetSimple 15 | 16 | import smdistributed.dataparallel.torch.distributed as dist 17 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 18 | 19 | 20 | logger = logging.getLogger(__name__) 21 | logger.setLevel(logging.INFO) 22 | logger.addHandler(logging.StreamHandler(sys.stdout)) 23 | 24 | class CUDANotFoundException(Exception): 25 | pass 26 | 27 | 28 | dist.init_process_group() 29 | 30 | 31 | def parse_args(): 32 | """ 33 | Parse arguments passed from the SageMaker API 34 | to the container 35 | """ 36 | 37 | parser = argparse.ArgumentParser() 38 | 39 | # Hyperparameters sent by the client are passed as command-line arguments to the script 40 | parser.add_argument("--epochs", type=int, default=1) 41 | parser.add_argument("--batch_size", type=int, default=64) 42 | 
parser.add_argument("--learning_rate", type=float, default=0.01) 43 | 44 | # Data directories 45 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 46 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 47 | 48 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 49 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 50 | 51 | parser.add_argument( 52 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 53 | ) 54 | 55 | return parser.parse_known_args() 56 | 57 | def train(args): 58 | """ 59 | Train the PyTorch model 60 | """ 61 | 62 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 63 | train_ds = CsvDatasetSimple(args.train) 64 | test_ds = CsvDatasetSimple(args.test) 65 | 66 | batch_size = args.batch_size 67 | epochs = args.epochs 68 | learning_rate = args.learning_rate 69 | logger.info( 70 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 71 | ) 72 | logger.info(f"World size: {args.world_size}") 73 | logger.info(f"Rank: {args.rank}") 74 | logger.info(f"Local Rank: {args.local_rank}") 75 | 76 | train_sampler = torch.utils.data.distributed.DistributedSampler( 77 | train_ds, num_replicas=args.world_size, rank=args.rank 78 | ) 79 | logger.debug(f"Created distributed sampler") 80 | 81 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 82 | num_workers=0, 83 | pin_memory=True, 84 | drop_last=True, sampler=train_sampler) 85 | logger.debug(f"Created train data loader") 86 | 87 | model = TabularNet(n_cont=9, n_cat=9, 88 | cat_mask = cat_mask, 89 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 90 | y_min = 0., y_max = 1.) 91 | logger.debug("Created model") 92 | logger.debug(model) 93 | model = DDP(model.to(device), broadcast_buffers=False) 94 | logger.debug("created DDP") 95 | torch.cuda.set_device(args.local_rank) 96 | logger.debug("Set device on CUDA") 97 | model.cuda(args.local_rank) 98 | logger.debug(f"Set model CUDA to {args.local_rank}") 99 | 100 | criterion = nn.MSELoss() 101 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 102 | logger.debug("Created loss fn and optimizer") 103 | 104 | model.train() 105 | for epoch in range(epochs): 106 | batch_no = 0 107 | for x_train_batch, y_train_batch in train_dl: 108 | logger.debug(f"Working on batch {batch_no}") 109 | x_train_batch_d = x_train_batch.to(device) 110 | y_train_batch_d = y_train_batch.to(device) 111 | 112 | optimizer.zero_grad() 113 | logger.debug("Did optimizer grad") 114 | 115 | logger.debug(f"Training on shape {x_train_batch.shape}") 116 | y = model(x_train_batch_d.float()) 117 | logger.debug("Did forward pass") 118 | loss = criterion(y.flatten(), y_train_batch_d.float()) 119 | logger.debug("Got loss") 120 | if batch_no % 50 == 0: 121 | logger.info(f"batch {batch_no} -> loss: {loss}") 122 | 123 | loss.backward() 124 | logger.debug("Did backward step") 125 | optimizer.step() 126 | logger.debug(f"Did optimizer step, batch {batch_no}") 127 | batch_no +=1 128 | epoch += 1 129 | logger.info(f"epoch: {epoch} -> loss: {loss}") 130 | 131 | # evalutate on test set 132 | if args.rank == 0: 133 | logger.info(f"Starting test eval on rank 0") 134 | model.eval() 135 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 136 | logger.info(f"Loaded test data set") 137 | with torch.no_grad(): 138 | mse = 0. 
139 | batch_no = 0 140 | for x_test_batch, y_test_batch in test_dl: 141 | x_test_batch_d = x_test_batch.to(device) 142 | y_test_batch_d = y_test_batch.to(device) 143 | 144 | y = model(x_test_batch_d.float()) 145 | mse += F.mse_loss(y, y_test_batch_d, reduction="sum") 146 | #mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 147 | 148 | if batch_no % 50 == 0: 149 | logger.info(f"batch {batch_no} -> MSE: {mse}") 150 | batch_no +=1 151 | 152 | mse = mse / len(test_dl.dataset) 153 | logger.info(f"Test MSE: {mse}") 154 | 155 | if args.rank == 0: 156 | logger.info("Saving model on rank 0") 157 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 158 | 159 | 160 | if __name__ == "__main__": 161 | 162 | args, _ = parse_args() 163 | args.world_size = dist.get_world_size() 164 | args.rank = rank = dist.get_rank() 165 | args.local_rank = local_rank = dist.get_local_rank() 166 | args.batch_size //= args.world_size // 8 167 | args.batch_size = max(args.batch_size, 1) 168 | 169 | if not torch.cuda.is_available(): 170 | raise CUDANotFoundException( 171 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 172 | ) 173 | 174 | torch.manual_seed(1) 175 | 176 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 177 | 178 | train(args) 179 | -------------------------------------------------------------------------------- /Chapter07/images/LossNotDecreasingRule.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter07/images/LossNotDecreasingRule.png -------------------------------------------------------------------------------- /Chapter07/images/RulesSummary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter07/images/RulesSummary.png -------------------------------------------------------------------------------- /Chapter07/rules/custom_rule.py: -------------------------------------------------------------------------------- 1 | # First Party 2 | from smdebug.rules.rule import Rule 3 | 4 | 5 | class CustomGradientRule(Rule): 6 | def __init__(self, base_trial, threshold=10.0): 7 | super().__init__(base_trial) 8 | self.threshold = float(threshold) 9 | 10 | def invoke_at_step(self, step): 11 | for tname in self.base_trial.tensor_names(collection="gradients"): 12 | t = self.base_trial.tensor(tname) 13 | abs_mean = t.reduction_value(step, "mean", abs=True) 14 | if abs_mean > self.threshold: 15 | return True 16 | return False -------------------------------------------------------------------------------- /Chapter08/register-model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Register Candidate Model to SageMaker Model Registry\n", 8 | "\n", 9 | "When you have a candidate model that is performing well according to your objective metric, you can register that version of your model to SageMaker Model Registry. From there, it can be used for deployment for either batch or real-time use cases. 
\n", 10 | "\n", 11 | "A model version can be registered using the Studio Console, Boto3, or as a step in SageMaker Pipelines (which will be covered in a later chapter).\n", 12 | "\n", 13 | "In this notebook, you'll perform the following tasks: \n", 14 | " \n", 15 | " 1. Create a Model Group which is group of versioned models\n", 16 | " 2. Register a model version into the Model Group. \n", 17 | " \n", 18 | "For this, exercise we'll register the XGBoost model previously created in Chapter 5. The same steps apply for the PyTorch model as well. \n", 19 | " \n", 20 | "First, you need to import the boto3 packages required. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import time\n", 30 | "import os\n", 31 | "from sagemaker import get_execution_role, session, image_uris\n", 32 | "import boto3\n", 33 | "\n", 34 | "region = boto3.Session().region_name\n", 35 | "\n", 36 | "role = get_execution_role()\n", 37 | "\n", 38 | "sm_client = boto3.client('sagemaker', region_name=region)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 1. Create a Model Group" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import time\n", 55 | "model_package_group_name = \"air-quality-\" + str(round(time.time()))\n", 56 | "model_package_group_input_dict = {\n", 57 | " \"ModelPackageGroupName\" : model_package_group_name,\n", 58 | " \"ModelPackageGroupDescription\" : \"model package group for air quality models\",\n", 59 | " \"Tags\": [\n", 60 | " {\n", 61 | " \"Key\": \"MLProject\",\n", 62 | " \"Value\": \"weather\"\n", 63 | " }\n", 64 | " ] \n", 65 | "}\n", 66 | "create_model_pacakge_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)\n", 67 | "print('ModelPackageGroup Arn : {}'.format(create_model_pacakge_group_response['ModelPackageGroupArn']))" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## 2. Register a Model Version\n", 75 | "\n", 76 | "First, we need to find the model_url that will be used as input to register the model version. 
Typically this is included as part of a pipeline; however, in this case we are registering the model outside of a pipeline so we need to pull the data from our previous training job that resulted in a candidate model that is performing well according to our objective metric.\n", 77 | "\n", 78 | "**Replace the variable below with the name of the training job from Chapter 5**" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "#Example: \n", 88 | "#training_job = 'sagemaker-xgboost-2021-07-28-02-43-50-684'\n", 89 | "training_job = 'REPLACE WITH NAME OF TRAINING JOB'" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "training_job_response = sm_client.describe_training_job(\n", 99 | " TrainingJobName=training_job\n", 100 | ")" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "model_url=training_job_response['ModelArtifacts']['S3ModelArtifacts']\n", 110 | "print('Model Data URL', model_url)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# this line automatically looks for the XGBoost image URI and builds an XGBoost container.\n", 120 | "# specify the repo_version depending on your preference.\n", 121 | "xgboost_container = image_uris.retrieve(region=boto3.Session().region_name,\n", 122 | " framework='xgboost', \n", 123 | " version='1.2-1')\n", 124 | "\n", 125 | "print('XGBoost Container for Inference:', xgboost_container)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "modelpackage_inference_specification = {\n", 135 | " \"InferenceSpecification\": {\n", 136 | " \"Containers\": [\n", 137 | " {\n", 138 | " \"Image\": xgboost_container,\n", 139 | " \"ModelDataUrl\": model_url\n", 140 | " }\n", 141 | " ],\n", 142 | " \"SupportedContentTypes\": [ \"text/csv\" ],\n", 143 | " \"SupportedResponseMIMETypes\": [ \"text/csv\" ],\n", 144 | " }\n", 145 | " }\n", 146 | "\n", 147 | "create_model_package_input_dict = {\n", 148 | " \"ModelPackageGroupName\" : model_package_group_name,\n", 149 | " \"ModelPackageDescription\" : \"Model to predict air quality ratings using XGBoost\",\n", 150 | " \"ModelApprovalStatus\" : \"PendingManualApproval\" \n", 151 | "}\n", 152 | "create_model_package_input_dict.update(modelpackage_inference_specification)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)\n", 162 | "model_package_arn = create_mode_package_response[\"ModelPackageArn\"]\n", 163 | "print('ModelPackage Version ARN : {}'.format(model_package_arn))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Let's view the detailed of our registered model..." 
171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "sm_client.list_model_packages(ModelPackageGroupName=model_package_group_name)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [] 188 | } 189 | ], 190 | "metadata": { 191 | "instance_type": "ml.t3.medium", 192 | "kernelspec": { 193 | "display_name": "Python 3 (Data Science)", 194 | "language": "python", 195 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 196 | }, 197 | "language_info": { 198 | "codemirror_mode": { 199 | "name": "ipython", 200 | "version": 3 201 | }, 202 | "file_extension": ".py", 203 | "mimetype": "text/x-python", 204 | "name": "python", 205 | "nbconvert_exporter": "python", 206 | "pygments_lexer": "ipython3", 207 | "version": "3.7.10" 208 | } 209 | }, 210 | "nbformat": 4, 211 | "nbformat_minor": 4 212 | } 213 | -------------------------------------------------------------------------------- /Chapter10/scripts/preprocess_param.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import unicode_literals 3 | 4 | import argparse 5 | import csv 6 | import os 7 | import shutil 8 | import sys 9 | import time 10 | import logging 11 | import boto3 12 | 13 | import pyspark 14 | from pyspark.sql import SparkSession 15 | from pyspark import SparkContext 16 | from pyspark.ml import Pipeline 17 | from pyspark.ml.linalg import Vectors 18 | from pyspark.ml.feature import ( 19 | StringIndexer, 20 | VectorAssembler, 21 | VectorIndexer, 22 | StandardScaler, 23 | OneHotEncoder 24 | ) 25 | from pyspark.sql.functions import * 26 | from pyspark.sql.functions import round as round_ 27 | from pyspark.sql.types import ( 28 | DoubleType, 29 | StringType, 30 | StructField, 31 | StructType, 32 | BooleanType, 33 | IntegerType 34 | ) 35 | 36 | def get_tables(): 37 | 38 | tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2'] 39 | #tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2', 40 | # '1645e09b-0919-439a-b07f-cd7532069f10', 41 | # '80d785aa-6da8-4b37-8632-7386b1d535f3', 42 | # '8375c185-ea3e-44b7-b61c-68534e33ddf7', 43 | # '9a6e24db-cffc-42de-a77f-7ab96c487022', 44 | # 'bc981da3-bc9d-435a-8bf2-4107a8fb2676', 45 | # 'cf5cf814-c9e3-4a2b-8811-e4fc6481a1fe', 46 | # 'd3b8f1ab-f3e5-4fc9-84ab-8568edd8a03d'] 47 | 48 | 49 | return tables 50 | 51 | 52 | def isBadAir(v, p): 53 | if p == 'pm10': 54 | if v > 50: 55 | return 1 56 | else: 57 | return 0 58 | elif p == 'pm25': 59 | if v > 25: 60 | return 1 61 | else: 62 | return 0 63 | elif p == 'so2': 64 | if v > 20: 65 | return 1 66 | else: 67 | return 0 68 | elif p == 'no2': 69 | if v > 200: 70 | return 1 71 | else: 72 | return 0 73 | elif p == 'o3': 74 | if v > 100: 75 | return 1 76 | else: 77 | return 0 78 | else: 79 | return 0 80 | 81 | def extract(row): 82 | return (row.value, row.ismobile, row.year, row.month, row.quarter, row.day, row.isBadAir, 83 | row.indexed_location, row.indexed_city, row.indexed_country, row.indexed_sourcename, 84 | row.indexed_sourcetype) 85 | 86 | """ 87 | Schema on disk: 88 | 89 | |-- date_utc: string (nullable = true) 90 | |-- date_local: string (nullable = true) 91 | |-- location: string (nullable = true) 92 | |-- country: string (nullable = true) 93 | |-- value: float (nullable = true) 94 | |-- unit: string (nullable = true) 95 | |-- city: string 
(nullable = true) 96 | |-- attribution: array (nullable = true) 97 | | |-- element: struct (containsNull = true) 98 | | | |-- name: string (nullable = true) 99 | | | |-- url: string (nullable = true) 100 | |-- averagingperiod: struct (nullable = true) 101 | | |-- unit: string (nullable = true) 102 | | |-- value: float (nullable = true) 103 | |-- coordinates: struct (nullable = true) 104 | | |-- latitude: float (nullable = true) 105 | | |-- longitude: float (nullable = true) 106 | |-- sourcename: string (nullable = true) 107 | |-- sourcetype: string (nullable = true) 108 | |-- mobile: string (nullable = true) 109 | |-- parameter: string (nullable = true) 110 | 111 | Example output: 112 | 113 | date_utc='2015-10-31T07:00:00.000Z' 114 | date_local='2015-10-31T04:00:00-03:00' 115 | location='Quintero Centro' 116 | country='CL' 117 | value=19.81999969482422 118 | unit='µg/m³' 119 | city='Quintero' 120 | attribution=[Row(name='SINCA', url='http://sinca.mma.gob.cl/'), Row(name='CENTRO QUINTERO', url=None)] 121 | averagingperiod=None 122 | coordinates=Row(latitude=-32.786170959472656, longitude=-71.53143310546875) 123 | sourcename='Chile - SINCA' 124 | sourcetype=None 125 | mobile=None 126 | parameter='o3' 127 | 128 | Transformations: 129 | 130 | * Featurize date_utc 131 | * Drop date_local 132 | * Encode location 133 | * Encode country 134 | * Scale value 135 | * Drop unit 136 | * Encode city 137 | * Drop attribution 138 | * Drop averaging period 139 | * Drop coordinates 140 | * Encode source name 141 | * Encode source type 142 | * Convert mobile to integer 143 | * Encode parameter 144 | 145 | * Add label for good/bad air quality 146 | 147 | """ 148 | def main(): 149 | parser = argparse.ArgumentParser(description="Preprocessing configuration") 150 | parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket") 151 | parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix") 152 | parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket") 153 | parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix") 154 | parser.add_argument("--parameter", type=str, help="parameter filter") 155 | args = parser.parse_args() 156 | 157 | logging.basicConfig(level=logging.INFO) 158 | logger = logging.getLogger('Preprocess') 159 | 160 | spark = SparkSession.builder.appName("Preprocessor").getOrCreate() 161 | 162 | logger.info("Reading data set") 163 | tables = get_tables() 164 | df = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{tables[0]}/") 165 | for t in tables[1:]: 166 | df_new = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{t}/") 167 | df = df.union(df_new) 168 | 169 | # Filter on parameter 170 | df = df.filter(df.parameter == args.parameter) 171 | 172 | # Drop columns 173 | logger.info("Dropping columns") 174 | df = df.drop('date_local').drop('unit').drop('attribution').drop('averagingperiod').drop('coordinates') 175 | 176 | # Mobile field to int 177 | logger.info("Casting mobile field to int") 178 | df = df.withColumn("ismobile",col("mobile").cast(IntegerType())).drop('mobile') 179 | 180 | # scale value 181 | logger.info("Scaling value") 182 | value_assembler = VectorAssembler(inputCols=["value"], outputCol="value_vec") 183 | value_scaler = StandardScaler(inputCol="value_vec", outputCol="value_scaled") 184 | value_pipeline = Pipeline(stages=[value_assembler, value_scaler]) 185 | value_model = value_pipeline.fit(df) 186 | xform_df = value_model.transform(df) 187 | 188 | 
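    # Spark ML's StandardScaler operates on Vector columns, so the scalar `value`
    # column is first packed into `value_vec` by VectorAssembler and then
    # standardized into `value_scaled` by the small Pipeline above; the date
    # featurization that follows derives year/month/quarter/day columns from date_utc.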
# featurize date 189 | logger.info("Featurizing date") 190 | xform_df = xform_df.withColumn('aggdt', 191 | to_date(unix_timestamp(col('date_utc'), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp"))) 192 | xform_df = xform_df.withColumn('year',year(xform_df.aggdt)) \ 193 | .withColumn('month',month(xform_df.aggdt)) \ 194 | .withColumn('quarter',quarter(xform_df.aggdt)) 195 | xform_df = xform_df.withColumn("day", date_format(col("aggdt"), "d")) 196 | 197 | # Automatically assign good/bad labels 198 | logger.info("Simulating good/bad air labels") 199 | isBadAirUdf = udf(isBadAir, IntegerType()) 200 | xform_df = xform_df.withColumn('isBadAir', isBadAirUdf('value', 'parameter')) 201 | xform_df = xform_df.drop('parameter') 202 | 203 | # Categorical encodings. 204 | logger.info("Categorical encoding") 205 | #parameter_indexer = StringIndexer(inputCol="parameter", outputCol="indexed_parameter", handleInvalid='keep') 206 | location_indexer = StringIndexer(inputCol="location", outputCol="indexed_location", handleInvalid='keep') 207 | city_indexer = StringIndexer(inputCol="city", outputCol="indexed_city", handleInvalid='keep') 208 | country_indexer = StringIndexer(inputCol="country", outputCol="indexed_country", handleInvalid='keep') 209 | sourcename_indexer = StringIndexer(inputCol="sourcename", outputCol="indexed_sourcename", handleInvalid='keep') 210 | sourcetype_indexer = StringIndexer(inputCol="sourcetype", outputCol="indexed_sourcetype", handleInvalid='keep') 211 | #enc_est = OneHotEncoder(inputCols=["indexed_parameter"], outputCols=["vec_parameter"]) 212 | enc_pipeline = Pipeline(stages=[location_indexer, 213 | city_indexer, country_indexer, sourcename_indexer, 214 | sourcetype_indexer]) 215 | enc_model = enc_pipeline.fit(xform_df) 216 | enc_df = enc_model.transform(xform_df) 217 | #param_cols = enc_df.schema.fields[17].metadata['ml_attr']['vals'] 218 | 219 | # Clean up data set 220 | logger.info("Final cleanup") 221 | final_df = enc_df.drop('location') \ 222 | .drop('city').drop('country').drop('sourcename') \ 223 | .drop('sourcetype').drop('date_utc') \ 224 | .drop('value_vec').drop('aggdt') 225 | firstelement=udf(lambda v:str(v[0]),StringType()) 226 | final_df = final_df.withColumn('value_str', firstelement('value_scaled')) 227 | final_df = final_df.withColumn("value",final_df.value_str.cast(DoubleType())).drop('value_str').drop('value_scaled') 228 | schema = StructType([ 229 | StructField("value", DoubleType(), True), 230 | StructField("ismobile", StringType(), True), 231 | StructField("year", StringType(), True), 232 | StructField("month", StringType(), True), 233 | StructField("quarter", StringType(), True), 234 | StructField("day", StringType(), True), 235 | StructField("isBadAir", StringType(), True), 236 | StructField("location", StringType(), True), 237 | StructField("city", StringType(), True), 238 | StructField("country", StringType(), True), 239 | StructField("sourcename", StringType(), True), 240 | StructField("sourcetype", StringType(), True) 241 | ]) 242 | final_df = final_df.rdd.map(extract).toDF(schema=schema) 243 | 244 | # Replace missing values 245 | final_df = final_df.na.fill("0") 246 | 247 | # Round the value 248 | final_df = final_df.withColumn("value", round_(final_df["value"], 4)) 249 | 250 | # Split sets 251 | logger.info("Splitting data set") 252 | (train_df, validation_df, test_df) = final_df.randomSplit([0.7, 0.2, 0.1]) 253 | 254 | # Drop value from test set 255 | test_df = test_df.drop('value') 256 | 257 | # Save to S3 258 | logger.info("Saving to S3") 259 | 
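    # The random 70/20/10 split above produces the train/validation/test sets. The
    # CSVs below are written without a header row and with `value` (the regression
    # target) as the first column, matching the input layout the SageMaker built-in
    # XGBoost algorithm expects (no header, label first); the test split has `value`
    # dropped so it can be sent for inference.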
train_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 260 | args.s3_output_key_prefix, 'train/')) 261 | validation_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 262 | args.s3_output_key_prefix, 'validation/')) 263 | test_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 264 | args.s3_output_key_prefix, 'test/')) 265 | 266 | if __name__ == "__main__": 267 | main() -------------------------------------------------------------------------------- /Chapter11/README.md: -------------------------------------------------------------------------------- 1 | # Monitor Production Models using Amazon SageMaker Model Monitor and Amazon SageMaker Clarify 2 | 3 | Examples in this chapter demonstrate how to use SageMaker capabilities for monitoring weather prediction regression model for data drift, model quality dift, bias drift and feature attribution drift. 4 | 5 | * data_drift_monitoring/WeatherPredictionDataDriftModelMonitoring.ipynb 6 | 7 | Monitor and detect data drift using SageMaker Model Monitor 8 | 9 | * model_quality_monitoring/WeatherPredictionModelQualityMonitoring.ipynb 10 | 11 | Monitor and detect model quality drift using SageMaker Model Monitor 12 | 13 | * bias_drift_monitoring/WeatherPredictionBiasAndFeatureAttributionDrift.ipynb 14 | 15 | Monitor and detect bias and feature attribution drift using SageMaker Clarify 16 | 17 | * feature_attribute_drift_monitoring/WeatherPredictionFeatureAttributionDrift.ipynb 18 | 19 | Monitor and detect bias and feature attribution drift using SageMaker Clarify 20 | -------------------------------------------------------------------------------- /Chapter11/bias_drift_monitoring/model/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/bias_drift_monitoring/model/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/bias_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/bias_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/data_drift_monitoring/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/data_drift_monitoring/.DS_Store -------------------------------------------------------------------------------- /Chapter11/data_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/data_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/feature_attribution_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/feature_attribution_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/data/model-quality-baseline-data.csv: -------------------------------------------------------------------------------- 1 | probability,prediction,label 2 | -4.902510643005371,-4.902510643005371,-7.535634515882377 3 | -4.902510643005371,-4.902510643005371,-7.535634515882377 4 | -4.902510643005371,-4.902510643005371,-7.535634515882377 5 | -4.902510643005371,-4.902510643005371,-7.535634515882377 6 | -4.902510643005371,-4.902510643005371,-7.535634515882377 7 | -4.902510643005371,-4.902510643005371,-7.535634515882377 8 | -4.902510643005371,-4.902510643005371,-7.535634515882377 9 | -4.902510643005371,-4.902510643005371,-7.535634515882377 10 | -4.902510643005371,-4.902510643005371,-7.535634515882377 11 | -4.902510643005371,-4.902510643005371,-7.535634515882377 12 | -4.902510643005371,-4.902510643005371,-7.535634515882377 13 | -4.902510643005371,-4.902510643005371,-7.535634515882377 14 | -4.902510643005371,-4.902510643005371,-7.535634515882377 15 | -4.902510643005371,-4.902510643005371,-7.535634515882377 16 | -4.902510643005371,-4.902510643005371,-7.535634515882377 17 | -4.902510643005371,-4.902510643005371,-7.535634515882377 18 | -4.902510643005371,-4.902510643005371,-7.535634515882377 19 | -4.902510643005371,-4.902510643005371,-7.535634515882377 20 | -4.902510643005371,-4.902510643005371,-7.535634515882377 21 | -4.902510643005371,-4.902510643005371,-7.535634515882377 22 | -4.902510643005371,-4.902510643005371,-7.535634515882377 23 | -4.902510643005371,-4.902510643005371,-7.535634515882377 24 | -4.902510643005371,-4.902510643005371,-7.535634515882377 25 | -4.902510643005371,-4.902510643005371,-7.535634515882377 26 | -4.902510643005371,-4.902510643005371,-7.535634515882377 27 | -4.902510643005371,-4.902510643005371,-7.535634515882377 28 | -4.902510643005371,-4.902510643005371,-7.535634515882377 29 | -4.902510643005371,-4.902510643005371,-7.535634515882377 30 | -4.902510643005371,-4.902510643005371,-7.535634515882377 31 | -4.902510643005371,-4.902510643005371,-7.535634515882377 32 | -4.902510643005371,-4.902510643005371,-7.535634515882377 33 | -4.902510643005371,-4.902510643005371,-7.535634515882377 34 | -4.902510643005371,-4.902510643005371,-7.535634515882377 35 | -4.902510643005371,-4.902510643005371,-7.535634515882377 36 | -4.902510643005371,-4.902510643005371,-7.535634515882377 37 | -4.902510643005371,-4.902510643005371,-7.535634515882377 38 | -4.902510643005371,-4.902510643005371,-7.535634515882377 39 | -4.902510643005371,-4.902510643005371,-7.535634515882377 40 | -4.902510643005371,-4.902510643005371,-7.535634515882377 41 | -4.902510643005371,-4.902510643005371,-7.535634515882377 42 | -4.902510643005371,-4.902510643005371,-7.535634515882377 43 | -4.902510643005371,-4.902510643005371,-7.535634515882377 44 | -4.902510643005371,-4.902510643005371,-7.535634515882377 45 | 
-4.902510643005371,-4.902510643005371,-7.535634515882377 46 | -4.902510643005371,-4.902510643005371,-7.535634515882377 47 | -4.902510643005371,-4.902510643005371,-7.535634515882377 48 | -4.902510643005371,-4.902510643005371,-7.535634515882377 49 | -4.902510643005371,-4.902510643005371,-7.535634515882377 50 | -4.902510643005371,-4.902510643005371,-7.535634515882377 51 | -4.902510643005371,-4.902510643005371,-7.535634515882377 52 | -4.902510643005371,-4.902510643005371,-7.535634515882377 53 | -4.902510643005371,-4.902510643005371,-7.535634515882377 54 | -4.902510643005371,-4.902510643005371,-7.535634515882377 55 | -4.902510643005371,-4.902510643005371,-7.535634515882377 56 | -4.902510643005371,-4.902510643005371,-7.535634515882377 57 | -4.902510643005371,-4.902510643005371,-7.535634515882377 58 | -4.902510643005371,-4.902510643005371,-7.535634515882377 59 | -4.902510643005371,-4.902510643005371,-7.535634515882377 60 | -4.902510643005371,-4.902510643005371,-7.535634515882377 61 | -4.902510643005371,-4.902510643005371,-7.535634515882377 62 | -4.902510643005371,-4.902510643005371,-7.535634515882377 63 | -4.902510643005371,-4.902510643005371,-7.535634515882377 64 | -4.902510643005371,-4.902510643005371,-7.535634515882377 65 | -4.902510643005371,-4.902510643005371,-7.535634515882377 66 | -4.902510643005371,-4.902510643005371,-7.535634515882377 67 | -4.902510643005371,-4.902510643005371,-7.535634515882377 68 | -4.902510643005371,-4.902510643005371,-7.535634515882377 69 | -4.902510643005371,-4.902510643005371,-7.535634515882377 70 | -4.902510643005371,-4.902510643005371,-7.535634515882377 71 | -4.902510643005371,-4.902510643005371,-7.535634515882377 72 | -4.902510643005371,-4.902510643005371,-7.535634515882377 73 | -4.902510643005371,-4.902510643005371,-7.535634515882377 74 | -4.902510643005371,-4.902510643005371,-7.535634515882377 75 | -4.902510643005371,-4.902510643005371,-7.535634515882377 76 | -4.902510643005371,-4.902510643005371,-7.535634515882377 77 | -4.902510643005371,-4.902510643005371,-7.535634515882377 78 | -4.902510643005371,-4.902510643005371,-7.535634515882377 79 | -4.902510643005371,-4.902510643005371,-7.535634515882377 80 | -4.902510643005371,-4.902510643005371,-7.535634515882377 81 | -4.902510643005371,-4.902510643005371,-7.535634515882377 82 | -4.902510643005371,-4.902510643005371,-7.535634515882377 83 | -4.902510643005371,-4.902510643005371,-7.535634515882377 84 | -4.902510643005371,-4.902510643005371,-7.535634515882377 85 | -4.902510643005371,-4.902510643005371,-7.535634515882377 86 | -4.902510643005371,-4.902510643005371,-7.535634515882377 87 | -4.902510643005371,-4.902510643005371,-7.535634515882377 88 | -4.902510643005371,-4.902510643005371,-7.535634515882377 89 | -4.902510643005371,-4.902510643005371,-7.535634515882377 90 | -4.902510643005371,-4.902510643005371,-7.535634515882377 91 | -4.902510643005371,-4.902510643005371,-7.535634515882377 92 | -4.902510643005371,-4.902510643005371,-7.535634515882377 93 | -4.902510643005371,-4.902510643005371,-7.535634515882377 94 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 95 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 96 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 97 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 98 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 99 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 100 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 101 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 102 | 
-2.0710370540618896,-2.0710370540618896,-7.535634515882377 103 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 104 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 105 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 106 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 107 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 108 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 109 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 110 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 111 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 112 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 113 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 114 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 115 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 116 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 117 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 118 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 119 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 120 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 121 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 122 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 123 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 124 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 125 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 126 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 127 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 128 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 129 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 130 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 131 | -3.218202590942383,-3.218202590942383,-7.535634515882377 132 | -3.218202590942383,-3.218202590942383,-7.535634515882377 133 | -3.218202590942383,-3.218202590942383,-7.535634515882377 134 | -3.218202590942383,-3.218202590942383,-7.535634515882377 135 | -3.218202590942383,-3.218202590942383,-7.535634515882377 136 | -3.218202590942383,-3.218202590942383,-7.535634515882377 137 | -3.218202590942383,-3.218202590942383,-7.535634515882377 138 | -3.218202590942383,-3.218202590942383,-7.535634515882377 139 | -3.218202590942383,-3.218202590942383,-7.535634515882377 140 | -3.218202590942383,-3.218202590942383,-7.535634515882377 141 | -3.218202590942383,-3.218202590942383,-7.535634515882377 142 | -3.218202590942383,-3.218202590942383,-7.535634515882377 143 | -3.218202590942383,-3.218202590942383,-7.535634515882377 144 | -3.218202590942383,-3.218202590942383,-7.535634515882377 145 | -3.218202590942383,-3.218202590942383,-7.535634515882377 146 | -3.218202590942383,-3.218202590942383,-7.535634515882377 147 | -3.218202590942383,-3.218202590942383,-7.535634515882377 148 | -3.218202590942383,-3.218202590942383,-7.535634515882377 149 | -3.218202590942383,-3.218202590942383,-7.535634515882377 150 | -3.218202590942383,-3.218202590942383,-7.535634515882377 151 | -3.218202590942383,-3.218202590942383,-7.535634515882377 152 | -3.218202590942383,-3.218202590942383,-7.535634515882377 153 | -3.218202590942383,-3.218202590942383,-7.535634515882377 154 | -3.218202590942383,-3.218202590942383,-7.535634515882377 155 | -3.218202590942383,-3.218202590942383,-7.535634515882377 156 | -3.218202590942383,-3.218202590942383,-7.535634515882377 157 | 
-0.9505436420440674,-0.9505436420440674,-7.535634515882377 158 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 159 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 160 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 161 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 162 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 163 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 164 | -3.218202590942383,-3.218202590942383,-7.535634515882377 165 | -3.218202590942383,-3.218202590942383,-7.535634515882377 166 | -3.218202590942383,-3.218202590942383,-7.535634515882377 167 | -3.218202590942383,-3.218202590942383,-7.535634515882377 168 | -3.218202590942383,-3.218202590942383,-7.535634515882377 169 | -3.218202590942383,-3.218202590942383,-7.535634515882377 170 | -3.218202590942383,-3.218202590942383,-7.535634515882377 171 | -3.218202590942383,-3.218202590942383,-7.535634515882377 172 | -3.218202590942383,-3.218202590942383,-7.535634515882377 173 | -3.218202590942383,-3.218202590942383,-7.535634515882377 174 | -3.218202590942383,-3.218202590942383,-7.535634515882377 175 | -3.218202590942383,-3.218202590942383,-7.535634515882377 176 | -3.218202590942383,-3.218202590942383,-7.535634515882377 177 | -3.218202590942383,-3.218202590942383,-7.535634515882377 178 | -3.218202590942383,-3.218202590942383,-7.535634515882377 179 | -3.218202590942383,-3.218202590942383,-7.535634515882377 180 | -3.218202590942383,-3.218202590942383,-7.535634515882377 181 | -3.218202590942383,-3.218202590942383,-7.535634515882377 182 | -3.218202590942383,-3.218202590942383,-7.535634515882377 183 | -3.218202590942383,-3.218202590942383,-7.535634515882377 184 | -3.218202590942383,-3.218202590942383,-7.535634515882377 185 | -3.218202590942383,-3.218202590942383,-7.535634515882377 186 | -3.218202590942383,-3.218202590942383,-7.535634515882377 187 | -3.218202590942383,-3.218202590942383,-7.535634515882377 188 | -3.218202590942383,-3.218202590942383,-7.535634515882377 189 | -3.218202590942383,-3.218202590942383,-7.535634515882377 190 | -3.218202590942383,-3.218202590942383,-7.535634515882377 191 | -3.218202590942383,-3.218202590942383,-7.535634515882377 192 | -3.218202590942383,-3.218202590942383,-7.535634515882377 193 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 194 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 195 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 196 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 197 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 198 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 199 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 200 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 201 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 202 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 203 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 204 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 205 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 206 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 207 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 208 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 209 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 210 | -3.218202590942383,-3.218202590942383,-7.535634515882377 211 | -3.218202590942383,-3.218202590942383,-7.535634515882377 212 | 
-3.218202590942383,-3.218202590942383,-7.535634515882377 213 | -3.218202590942383,-3.218202590942383,-7.535634515882377 214 | -3.218202590942383,-3.218202590942383,-7.535634515882377 215 | -3.218202590942383,-3.218202590942383,-7.535634515882377 216 | -3.218202590942383,-3.218202590942383,-7.535634515882377 217 | -3.218202590942383,-3.218202590942383,-7.535634515882377 218 | -3.218202590942383,-3.218202590942383,-7.535634515882377 219 | -3.218202590942383,-3.218202590942383,-7.535634515882377 220 | -3.218202590942383,-3.218202590942383,-7.535634515882377 221 | -3.218202590942383,-3.218202590942383,-7.535634515882377 222 | -3.218202590942383,-3.218202590942383,-7.535634515882377 223 | -3.218202590942383,-3.218202590942383,-7.535634515882377 224 | -3.218202590942383,-3.218202590942383,-7.535634515882377 225 | -3.218202590942383,-3.218202590942383,-7.535634515882377 226 | -3.218202590942383,-3.218202590942383,-7.535634515882377 227 | -3.218202590942383,-3.218202590942383,-7.535634515882377 228 | -3.218202590942383,-3.218202590942383,-7.535634515882377 229 | -3.218202590942383,-3.218202590942383,-7.535634515882377 230 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 231 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 232 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 233 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 234 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 235 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 236 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 237 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 238 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 239 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 240 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 241 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 242 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 243 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 244 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 245 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 246 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 247 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 248 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 249 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 250 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 251 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 252 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 253 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 254 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 255 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 256 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 257 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 258 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 259 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 260 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 261 | -4.902510643005371,-4.902510643005371,-7.535634515882377 262 | -4.902510643005371,-4.902510643005371,-7.535634515882377 263 | -4.902510643005371,-4.902510643005371,-7.535634515882377 264 | -4.902510643005371,-4.902510643005371,-7.535634515882377 265 | -4.902510643005371,-4.902510643005371,-7.535634515882377 266 | 
-4.902510643005371,-4.902510643005371,-7.535634515882377 267 | -4.902510643005371,-4.902510643005371,-7.535634515882377 268 | -4.902510643005371,-4.902510643005371,-7.535634515882377 269 | -4.902510643005371,-4.902510643005371,-7.535634515882377 270 | -4.902510643005371,-4.902510643005371,-7.535634515882377 271 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 272 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 273 | -1.1011035442352295,-1.1011035442352295,-7.535634515882377 274 | -3.218202590942383,-3.218202590942383,-7.535634515882377 275 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 276 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 277 | -3.218202590942383,-3.218202590942383,-7.535634515882377 278 | -3.218202590942383,-3.218202590942383,-7.535634515882377 279 | -3.218202590942383,-3.218202590942383,-7.535634515882377 280 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 281 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 282 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 283 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 284 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 285 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 286 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 287 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 288 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 289 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 290 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 291 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 292 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 293 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 294 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 295 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 296 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 297 | 0.10094326734542847,0.10094326734542847,-0.752131537838845 298 | 0.10094326734542847,0.10094326734542847,-0.752131537838845 299 | 0.19079375267028809,0.19079375267028809,-0.0041450134850838155 300 | 0.17314249277114868,0.17314249277114868,-0.003768194077348923 301 | -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/images/r2_Alarm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/images/r2_Alarm.png -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/images/r2_InsufficientData.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/images/r2_InsufficientData.png -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/model/weather-prediction-model.tar.gz 
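The model quality baseline file above pairs each prediction with its ground truth label so that SageMaker Model Monitor can derive regression metrics (r2, RMSE, MAE) and constraints to compare against live endpoint traffic. As a minimal sketch only, assuming a SageMaker execution role and a placeholder S3 bucket (neither taken from the Chapter11 notebooks), a baselining job could be launched with the SageMaker Python SDK roughly as follows:

```
# Minimal sketch (placeholder bucket/role values, not from the Chapter11 notebooks):
# suggest a model quality baseline from a CSV with probability,prediction,label columns.
import sagemaker
from sagemaker.model_monitor import ModelQualityMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

session = sagemaker.Session()
role = sagemaker.get_execution_role()                           # assumes a SageMaker notebook/Studio role
baseline_output = "s3://<your-bucket>/model-quality-baseline/"  # placeholder output location

model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=session,
)

# Computes baseline regression metrics (r2, rmse, mae) and constraints from the CSV.
model_quality_monitor.suggest_baseline(
    baseline_dataset="data/model-quality-baseline-data.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="Regression",
    inference_attribute="prediction",
    ground_truth_attribute="label",
    output_s3_uri=baseline_output,
)
```

From there, the monitor's `create_monitoring_schedule` call attaches the suggested constraints to a deployed endpoint along with an S3 location for ground truth labels; the full walkthrough, including the r2 alarm captured in the images above, is in model_quality_monitoring/WeatherPredictionModelQualityMonitoring.ipynb.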
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Amazon SageMaker Best Practices 5 | 6 | Book Name 7 | 8 | This is the code repository for [Amazon SageMaker Best Practices](https://www.packtpub.com/in/data/amazon-sagemaker-best-practices), published by Packt. 9 | 10 | **Proven tips and tricks to build successful machine learning solutions on Amazon SageMaker** 11 | 12 | ## What is this book about? 13 | Amazon SageMaker is a fully managed AWS service that provides the ability to build, train, deploy, and monitor machine learning models. The book begins with a high-level overview of Amazon SageMaker capabilities that map to the various phases of the machine learning process to help set the right foundation. You'll learn efficient tactics to address data science challenges such as processing data at scale, data preparation, connecting to big data pipelines, identifying data bias, running A/B tests, and model explainability using Amazon SageMaker. 14 | 15 | This book covers the following exciting features: 16 | * Perform data bias detection with AWS Data Wrangler and SageMaker Clarify 17 | * Speed up data processing with SageMaker Feature Store 18 | * Overcome labeling bias with SageMaker Ground Truth 19 | * Improve training time with the monitoring and profiling capabilities of SageMaker Debugger 20 | * Address the challenge of model deployment automation with CI/CD using the SageMaker model registry 21 | * Explore SageMaker Neo for model optimization 22 | 23 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/1801070520) today! 24 | 25 | https://www.packtpub.com/ 26 | 27 | ## Instructions and Navigations 28 | All of the code is organized into folders. For example, Chapter04. 
29 | 30 | The code will look like the following: 31 | 32 | ``` 33 | for batch in chunks(partitions_to_add, 100): 34 | response = glue.batch_create_partition( 35 | DatabaseName=glue_db_name, 36 | TableName=glue_tbl_name, 37 | PartitionInputList=[get_part_def(p) for p in batch] 38 | ) 39 | 40 | ``` 41 | 42 | **Following is what you need for this book:** 43 | This book is for expert data scientists responsible for building machine learning applications using Amazon SageMaker. Working knowledge of Amazon SageMaker, machine learning, deep learning, and experience using Jupyter Notebooks and Python is expected. Basic knowledge of AWS related to data, security, and monitoring will help you make the most of the book. 44 | 45 | With the following software and hardware list, you can run all code files present in the book (Chapters 1-14). 46 | 47 | ### Software and Hardware List 48 | 49 | | Chapter | Software required | OS required | 50 | | -------- | -------------------------------------------------------------------- | ---------------------------------- | 51 | | 1-14 | AWS Account, Amazon SageMaker, Amazon SageMaker Studio, Amazon Athena | Windows, Mac OS X, and Linux (Any) | 52 | 53 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://static.packt-cdn.com/downloads/9781801070522_ColorImages.pdf). 54 | 55 | ### Related products 56 | * Learn Amazon SageMaker [[Packt]](https://www.packtpub.com/product/learn-amazon-sagemaker/9781800208919) [[Amazon]](https://www.amazon.com/Learn-Amazon-SageMaker-developers-scientists/dp/180020891X) 57 | 58 | ## Errata 59 | * Page 142 (Code snippet under the line "Now, we must configure the PyTorch estimator by using the infrastructure and profiler configuration as parameters:"): The object name **pt_estimator** _should be_ **estimator** 60 | 61 | ## Get to Know the Authors 62 | **Sireesha Muppala** 63 | She is a Principal Enterprise Solutions Architect, AI/ML at Amazon Web Services (AWS). Sireesha holds a PhD in computer science and a post-doctorate from the University of Colorado. She is a prolific content creator in the ML space with multiple journal articles, blogs, and public speaking engagements. Sireesha is a co-creator and instructor of the Practical Data Science specialization on Coursera. 64 | 65 | **Randy DeFauw** 66 | He is a Principal Solution Architect at AWS. He holds an MSEE from the University of Michigan, where his graduate thesis focused on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. 67 | 68 | **Shelbee Eigenbrode** 69 | She is a Principal AI and ML Specialist Solutions Architect at AWS. She holds six AWS certifications and has been in technology for 23 years, spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background to deliver and manage ML workloads at scale. 70 | ### Download a free PDF 71 | 72 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
73 |

https://packt.link/free-ebook/9781801070522

--------------------------------------------------------------------------------