├── Chapter02 ├── README.md ├── datascience-vpc.yaml ├── notebook-instance-environment.yaml ├── studio-domain.yaml └── studio-environment.yaml ├── Chapter03 ├── LabelData.ipynb ├── utils │ ├── cognito-helper.py │ └── sagemaker-helper.py └── workflow │ ├── post-multiple.py │ ├── post.py │ ├── pre.py │ └── ui.liquid.html ├── Chapter04 ├── PrepareData.ipynb ├── scripts │ └── preprocess.py └── spark-example.ipynb ├── Chapter05 ├── feature_store_apis.ipynb └── feature_store_train_deploy_models.ipynb ├── Chapter06 ├── DistributedTraining │ └── distrbuted-training.ipynb ├── Experiments │ └── Experiments.ipynb ├── HPO │ └── HPO.ipynb ├── code │ ├── csv_loader.py │ ├── csv_loader_pd.py │ ├── csv_loader_simple.py │ ├── model_pytorch.py │ ├── train_pytorch-dist.py │ ├── train_pytorch.py │ ├── train_pytorch_dist.py │ └── train_pytorch_model_dist.py └── train.ipynb ├── Chapter07 ├── code │ ├── csv_loader.py │ ├── csv_loader_pd.py │ ├── csv_loader_simple.py │ ├── model_pytorch.py │ ├── train_pytorch-dist.py │ ├── train_pytorch.py │ └── train_pytorch_dist.py ├── images │ ├── LossNotDecreasingRule.png │ └── RulesSummary.png ├── rules │ └── custom_rule.py ├── weather-prediction-debugger-profiler-with-rules.ipynb ├── weather-prediction-debugger-profiler.ipynb └── weather-prediction-debugger.ipynb ├── Chapter08 └── register-model.ipynb ├── Chapter09 └── a_b_deployment_with_production_variants.ipynb ├── Chapter10 ├── optimize.ipynb └── scripts │ └── preprocess_param.py ├── Chapter11 ├── README.md ├── bias_drift_monitoring │ ├── WeatherPredictionBiasDrift.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ ├── WeatherPredictionFeatureAttributionDrift.ipynb │ │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ │ ├── model │ │ └── weather-prediction-model.tar.gz │ │ └── weather-prediction-model.tar.gz ├── data_drift_monitoring │ ├── .DS_Store │ ├── WeatherPredictionDataDriftModelMonitoring.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ └── weather-prediction-model.tar.gz ├── feature_attribution_drift_monitoring │ ├── WeatherPredictionFeatureAttributionDrift.ipynb │ ├── data │ │ ├── data-drift-baseline-data.csv │ │ └── t_file.csv │ └── model │ │ └── weather-prediction-model.tar.gz ├── model │ └── weather-prediction-model.tar.gz └── model_quality_monitoring │ ├── WeatherPredictionModelQualityMonitoring (1).ipynb │ ├── WeatherPredictionModelQualityMonitoring.ipynb │ ├── data │ ├── model-quality-baseline-data.csv │ ├── t_file.csv │ └── validation_data.csv │ ├── images │ ├── r2_Alarm.png │ └── r2_InsufficientData.png │ ├── model │ └── weather-prediction-model.tar.gz │ └── model_quality_churn_sdk.ipynb ├── LICENSE └── README.md /Chapter02/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This repository contains the example CloudFormation templates that can be used to setup a data science environment using either Amazon SageMaker Studio or Amazon SageMaker Notebook Instances. The pre-requisities are noted where applicable for either option. There is also a section included for those unfamiliar with using AWS CloudFormation on how to launch the CloudFormation templates provided. 4 | 5 | # Prerequisites 6 | 7 | ## Clone Repository 8 | 9 | To use the CloudFormation templates provided in this repository, you must first clone this repository creating a local copy. 
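For example, from a terminal with Git installed (the repository URL below is a placeholder; substitute this repository's actual clone URL):

    git clone https://github.com/<repository-owner>/<this-repository>.git
    cd <this-repository>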
These local files will be used to upload the appropriate template to AWS CloudFormation to create your data science environment. 10 | 11 | ## Create or Identify VPC 12 | 13 | The templates in this repository are set up to use a VPC, so you must either (1) have an existing VPC that can be used or (2) create a new VPC; the VPC details are referenced as parameters when launching the CloudFormation stacks. 14 | 15 | If you do not have a VPC, a CloudFormation template is provided in this directory, [datascience-vpc.yaml](datascience-vpc.yaml), to get you up and running quickly. 16 | 17 | ## Create a KMS Key for Encryption 18 | 19 | The templates provided accept a KMS key as an input parameter, used to encrypt the storage directly attached to your environments. 20 | 21 | You can use an existing key or create your own symmetric encryption key using the following instructions: https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html 22 | 23 | Make sure you note the KMS key ID and ARN so you can provide them as input parameters to your CloudFormation templates. 24 | 25 | 26 | ## Using AWS CloudFormation 27 | 28 | To launch the provided CloudFormation templates, you can use the CLI, an SDK, or the AWS Console. The instructions below use the AWS Console. 29 | 30 | 1. Sign in to the AWS Management Console and open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation 31 | 32 | 2. In the **Stacks** section, select **Create stack** and select **With new resources (standard)** from the dropdown. 33 | 34 | 3. Under **Prerequisite - Prepare template**, choose **Template is ready** 35 | 36 | 4. Under **Specify template**, choose **Upload a template file** 37 | 38 | 5. This allows you to upload the appropriate CloudFormation template that will be used to create your data science environment. Because the input parameters, additional prerequisites, and post-launch tasks vary between the two environment types, they are covered specifically in the sections below. 39 | 40 | # Studio Environment 41 | 42 | ## Studio Environment Prerequisites 43 | 1. **Existing Studio Domain**: Because it's common to add new users to an existing Studio Domain, the CloudFormation templates to create a domain and to add a new user are typically separated. Creating a Studio Domain is a one-time setup activity per AWS Account / AWS Region. The CloudFormation template that creates a new user assumes an existing Studio Domain. If you do not have a Studio Domain, a CloudFormation template is provided to create a new domain as a prerequisite: [studio-domain.yaml](studio-domain.yaml). 44 | 45 | ## Creating the Stack & Template Usage 46 | 47 | Follow the instructions above under **Using AWS CloudFormation** to launch the [studio-environment.yaml](studio-environment.yaml) CloudFormation template. 48 | 49 | ## Post Launch Tasks 50 | 51 | We need to copy the dataset we'll be using throughout the chapters from a public S3 bucket to the bucket created by the CloudFormation template. There are multiple ways to do this, but for simplicity we'll execute a command with the AWS Command Line Interface (CLI) from the terminal available in your Studio environment. 52 | 53 | To do this: 54 | 1. Sign in to the AWS Management Console and open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation 55 | 2. Go to the stack provisioned above, open the **Outputs** tab, and capture the name of the S3 bucket that was created 56 | 3. 
Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker, select **Amazon SageMaker Studio**, then select **Open Studio** for the user name you provided. 57 | 4. Once you're in your Studio environment, go to **File** -> **New** -> **Terminal** 58 | 5. Copy and paste the command below, replacing `<s3 bucket>` with the value from #2 above: 59 | 60 | nohup aws s3 cp s3://openaq-fetches/ s3://<s3 bucket>/data/ --recursive & 61 | 62 | # Notebook Instance Environment 63 | 64 | ## Notebook Instance Environment Prerequisites 65 | 66 | None, other than what is already noted above under Prerequisites. 67 | 68 | ## Creating the Stack & Template Usage 69 | 70 | Follow the instructions above under **Using AWS CloudFormation** to launch the [notebook-instance-environment.yaml](notebook-instance-environment.yaml) CloudFormation template. 71 | 72 | ## Post Launch Tasks 73 | 74 | There are no post-launch tasks. Once stack creation completes successfully, your data science environment is available for use in the AWS console under **Amazon SageMaker** -> **Notebook** -> **Notebook instances**. 75 | -------------------------------------------------------------------------------- /Chapter02/datascience-vpc.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: 3 | Creates a VPC with public and private subnets for a given AWS Account. 4 | Parameters: 5 | VpcCidrParam: 6 | Type: String 7 | Default: 10.0.0.0/16 8 | Description: VPC CIDR. For more info, see http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html#VPC_Sizing 9 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 10 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 11 | 12 | # Public Subnets 13 | PublicAZASubnetBlock: 14 | Type: String 15 | Default: 10.0.0.0/24 16 | Description: Subnet CIDR for first Availability Zone 17 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 18 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 19 | 20 | PublicAZBSubnetBlock: 21 | Type: String 22 | Default: 10.0.1.0/24 23 | Description: Subnet CIDR for second Availability Zone 24 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 25 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 26 | 27 | 28 | # Private Subnets 29 | PrivateAZASubnetBlock: 30 | Type: String 31 | Default: 10.0.2.0/24 32 | Description: Subnet CIDR for first Availability Zone (e.g. us-west-2a, us-east-1b) 33 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 34 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 35 | 36 | PrivateAZBSubnetBlock: 37 | Type: String 38 | Default: 10.0.3.0/24 39 | Description: Subnet CIDR for second Availability Zone (e.g. us-west-2b, us-east-1c) 40 | AllowedPattern: "^(10|172|192)\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/(16|17|18|19|20|21|22|23|24|25|26|27|28)$" 41 | ConstraintDescription: must be valid IPv4 CIDR block (/16 to /28) from the private address ranges defined in RFC 1918. 
42 | 43 | Outputs: 44 | VpcId: 45 | Description: VPC Id 46 | Value: !Ref Vpc 47 | Export: 48 | Name: !Sub "${AWS::StackName}-vpc-id" 49 | 50 | PublicRouteTableId: 51 | Description: Route Table for public subnets 52 | Value: !Ref PublicRouteTable 53 | Export: 54 | Name: !Sub "${AWS::StackName}-public-rtb" 55 | 56 | PublicAZASubnetId: 57 | Description: Availability Zone A public subnet Id 58 | Value: !Ref PublicAZASubnet 59 | Export: 60 | Name: !Sub "${AWS::StackName}-public-az-a-subnet" 61 | 62 | PublicAZBSubnetId: 63 | Description: Availability Zone B public subnet Id 64 | Value: !Ref PublicAZBSubnet 65 | Export: 66 | Name: !Sub "${AWS::StackName}-public-az-b-subnet" 67 | 68 | PrivateAZASubnetId: 69 | Description: Availability Zone A private subnet Id 70 | Value: !Ref PrivateAZASubnet 71 | Export: 72 | Name: !Sub "${AWS::StackName}-private-az-a-subnet" 73 | 74 | PrivateAZBSubnetId: 75 | Description: Availability Zone B private subnet Id 76 | Value: !Ref PrivateAZBSubnet 77 | Export: 78 | Name: !Sub "${AWS::StackName}-private-az-b-subnet" 79 | 80 | PrivateAZARouteTableId: 81 | Description: Route table for private subnets in AZ A 82 | Value: !Ref PrivateAZARouteTable 83 | Export: 84 | Name: !Sub "${AWS::StackName}-private-az-a-rtb" 85 | 86 | PrivateAZBRouteTableId: 87 | Description: Route table for private subnets in AZ B 88 | Value: !Ref PrivateAZBRouteTable 89 | Export: 90 | Name: !Sub "${AWS::StackName}-private-az-b-rtb" 91 | 92 | 93 | Resources: 94 | Vpc: 95 | Type: AWS::EC2::VPC 96 | Properties: 97 | CidrBlock: !Ref VpcCidrParam 98 | Tags: 99 | - Key: Name 100 | Value: !Sub ${AWS::StackName} 101 | 102 | InternetGateway: 103 | Type: AWS::EC2::InternetGateway 104 | Properties: 105 | Tags: 106 | - Key: Name 107 | Value: !Sub ${AWS::StackName} 108 | 109 | VPCGatewayAttachment: 110 | Type: AWS::EC2::VPCGatewayAttachment 111 | Properties: 112 | InternetGatewayId: !Ref InternetGateway 113 | VpcId: !Ref Vpc 114 | 115 | # Public Subnets - Route Table 116 | PublicRouteTable: 117 | Type: AWS::EC2::RouteTable 118 | Properties: 119 | VpcId: !Ref Vpc 120 | Tags: 121 | - Key: Name 122 | Value: !Sub ${AWS::StackName}-public 123 | - Key: Type 124 | Value: public 125 | 126 | PublicSubnetsRoute: 127 | Type: AWS::EC2::Route 128 | Properties: 129 | RouteTableId: !Ref PublicRouteTable 130 | DestinationCidrBlock: 0.0.0.0/0 131 | GatewayId: !Ref InternetGateway 132 | DependsOn: VPCGatewayAttachment 133 | 134 | # Public Subnets 135 | # First Availability Zone 136 | PublicAZASubnet: 137 | Type: AWS::EC2::Subnet 138 | Properties: 139 | VpcId: !Ref Vpc 140 | CidrBlock: !Ref PublicAZASubnetBlock 141 | AvailabilityZone: !Select [0, !GetAZs ""] 142 | MapPublicIpOnLaunch: true 143 | Tags: 144 | - Key: Name 145 | Value: !Sub 146 | - ${AWS::StackName}-public-${AZ} 147 | - { AZ: !Select [0, !GetAZs ""] } 148 | - Key: Type 149 | Value: public 150 | 151 | PublicAZASubnetRouteTableAssociation: 152 | Type: AWS::EC2::SubnetRouteTableAssociation 153 | Properties: 154 | SubnetId: !Ref PublicAZASubnet 155 | RouteTableId: !Ref PublicRouteTable 156 | 157 | # Second Availability Zone 158 | PublicAZBSubnet: 159 | Type: AWS::EC2::Subnet 160 | Properties: 161 | VpcId: !Ref Vpc 162 | CidrBlock: !Ref PublicAZBSubnetBlock 163 | AvailabilityZone: !Select [1, !GetAZs ""] 164 | MapPublicIpOnLaunch: true 165 | Tags: 166 | - Key: Name 167 | Value: !Sub 168 | - ${AWS::StackName}-public-${AZ} 169 | - { AZ: !Select [1, !GetAZs ""] } 170 | - Key: Type 171 | Value: public 172 | 173 | PublicAZBSubnetRouteTableAssociation: 174 | Type: 
AWS::EC2::SubnetRouteTableAssociation 175 | Properties: 176 | SubnetId: !Ref PublicAZBSubnet 177 | RouteTableId: !Ref PublicRouteTable 178 | 179 | # Private Subnets - NAT Gateways 180 | # First Availability Zone 181 | AZANatGatewayEIP: 182 | Type: AWS::EC2::EIP 183 | Properties: 184 | Domain: vpc 185 | DependsOn: VPCGatewayAttachment 186 | 187 | AZANatGateway: 188 | Type: AWS::EC2::NatGateway 189 | Properties: 190 | AllocationId: !GetAtt AZANatGatewayEIP.AllocationId 191 | SubnetId: !Ref PublicAZASubnet 192 | 193 | # Private Subnets 194 | # First Availability Zone 195 | PrivateAZASubnet: 196 | Type: AWS::EC2::Subnet 197 | Properties: 198 | VpcId: !Ref Vpc 199 | CidrBlock: !Ref PrivateAZASubnetBlock 200 | AvailabilityZone: !Select [0, !GetAZs ""] 201 | Tags: 202 | - Key: Name 203 | Value: !Sub 204 | - ${AWS::StackName}-private-${AZ} 205 | - { AZ: !Select [0, !GetAZs ""] } 206 | - Key: Type 207 | Value: private 208 | 209 | PrivateAZARouteTable: 210 | Type: AWS::EC2::RouteTable 211 | Properties: 212 | VpcId: !Ref Vpc 213 | Tags: 214 | - Key: Name 215 | Value: !Sub 216 | - ${AWS::StackName}-private-${AZ} 217 | - { AZ: !Select [0, !GetAZs ""] } 218 | - Key: Type 219 | Value: private 220 | 221 | PrivateAZARoute: 222 | Type: AWS::EC2::Route 223 | Properties: 224 | RouteTableId: !Ref PrivateAZARouteTable 225 | DestinationCidrBlock: 0.0.0.0/0 226 | NatGatewayId: !Ref AZANatGateway 227 | 228 | PrivateAZARouteTableAssociation: 229 | Type: AWS::EC2::SubnetRouteTableAssociation 230 | Properties: 231 | SubnetId: !Ref PrivateAZASubnet 232 | RouteTableId: !Ref PrivateAZARouteTable 233 | 234 | # # Second Availability Zone 235 | PrivateAZBSubnet: 236 | Type: AWS::EC2::Subnet 237 | Properties: 238 | VpcId: !Ref Vpc 239 | CidrBlock: !Ref PrivateAZBSubnetBlock 240 | AvailabilityZone: !Select [1, !GetAZs ""] 241 | Tags: 242 | - Key: Name 243 | Value: !Sub 244 | - ${AWS::StackName}-private-${AZ} 245 | - { AZ: !Select [1, !GetAZs ""] } 246 | - Key: Type 247 | Value: private 248 | 249 | PrivateAZBRouteTable: 250 | Type: AWS::EC2::RouteTable 251 | Properties: 252 | VpcId: !Ref Vpc 253 | Tags: 254 | - Key: Name 255 | Value: !Sub 256 | - ${AWS::StackName}-private-${AZ} 257 | - { AZ: !Select [1, !GetAZs ""] } 258 | - Key: Type 259 | Value: private 260 | 261 | 262 | PrivateAZBRoute: 263 | Type: AWS::EC2::Route 264 | Properties: 265 | RouteTableId: !Ref PrivateAZBRouteTable 266 | DestinationCidrBlock: 0.0.0.0/0 267 | NatGatewayId: !Ref AZANatGateway 268 | 269 | PrivateAZBRouteTableAssociation: 270 | Type: AWS::EC2::SubnetRouteTableAssociation 271 | Properties: 272 | SubnetId: !Ref PrivateAZBSubnet 273 | RouteTableId: !Ref PrivateAZBRouteTable 274 | 275 | S3VPCEndpoint: 276 | Type: "AWS::EC2::VPCEndpoint" 277 | Properties: 278 | RouteTableIds: 279 | - !Ref PublicRouteTable 280 | - !Ref PrivateAZARouteTable 281 | - !Ref PrivateAZBRouteTable 282 | ServiceName: !Join 283 | - "" 284 | - - com.amazonaws. 285 | - !Ref "AWS::Region" 286 | - .s3 287 | VpcId: !Ref Vpc -------------------------------------------------------------------------------- /Chapter02/notebook-instance-environment.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'Example data science environment creating a new SageMaker Notebook Instance using an existing VPC. This template also includes the creation of an Amazon S3 Bucket and IAM Role. 
A lifecycle policy is also included to pull the dataset that will be used in future book chapters.' 5 | Parameters: #These are configuration parameters that are passed in as input on stack creation 6 | NotebookInstanceName: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | ConstraintDescription: Maximum of 63 alphanumeric characters. Can include hyphens but not spaces. 9 | Description: SageMaker Notebook instance name 10 | MaxLength: '63' 11 | MinLength: '1' 12 | Type: String 13 | Default: 'myNotebook' 14 | NotebookInstanceType: 15 | AllowedValues: 16 | - ml.t2.large 17 | - ml.t2.xlarge 18 | - ml.t3.large 19 | - ml.t3.xlarge 20 | ConstraintDescription: Must select a valid notebook instance type. 21 | Default: ml.t3.large 22 | Description: Select Instance type for the SageMaker Notebook 23 | Type: String 24 | VPCSubnetIds: 25 | Description: The ID of the subnet in a VPC 26 | Type: String 27 | Default: 'Replace with your VPC subnet' 28 | VPCSecurityGroupIds: 29 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 30 | Type: CommaDelimitedList 31 | Default: 'Replace with the security group id(s) for your VPC' 32 | KMSKeyId: 33 | Description: The ARN of the KMS Key to use for encrypting storage attached to notebook 34 | Type: String 35 | Default: 'Replace with your KMS Key ARN' 36 | NotebookVolumeSize: 37 | Description: The size of the ML Storage Volume attached to your notebook instance. 38 | Type: Number 39 | Default: 75 40 | Resources: 41 | SageMakerRole: 42 | Type: AWS::IAM::Role 43 | Properties: 44 | AssumeRolePolicyDocument: 45 | Version: 2012-10-17 46 | Statement: 47 | - Effect: Allow 48 | Principal: 49 | Service: 50 | - "sagemaker.amazonaws.com" 51 | Action: 52 | - "sts:AssumeRole" 53 | ManagedPolicyArns: 54 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 55 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 56 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 57 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 58 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 59 | - "arn:aws:iam::aws:policy/AmazonCognitoPowerUser" 60 | SageMakerLifecycleConfig: 61 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig 62 | Properties: 63 | OnCreate: 64 | - Content: 65 | Fn::Base64: !Sub "nohup aws s3 cp s3://openaq-fetches/ s3://${S3Bucket}/data/ --recursive &" 66 | DependsOn: S3Bucket 67 | SageMakerNotebookInstance: 68 | Type: "AWS::SageMaker::NotebookInstance" 69 | Properties: 70 | KmsKeyId: !Ref KMSKeyId 71 | NotebookInstanceName: !Ref NotebookInstanceName 72 | InstanceType: !Ref NotebookInstanceType 73 | RoleArn: !GetAtt SageMakerRole.Arn 74 | SubnetId: !Ref VPCSubnetIds 75 | SecurityGroupIds: !Ref VPCSecurityGroupIds 76 | LifecycleConfigName: !GetAtt SageMakerLifecycleConfig.NotebookInstanceLifecycleConfigName 77 | VolumeSizeInGB: !Ref NotebookVolumeSize 78 | S3Bucket: 79 | Type: AWS::S3::Bucket 80 | Properties: 81 | BucketName: 82 | Fn::Join: 83 | - '-' 84 | - - datascience-environment-notebookinstance- 85 | - Fn::Select: 86 | - 4 87 | - Fn::Split: 88 | - '-' 89 | - Fn::Select: 90 | - 2 91 | - Fn::Split: 92 | - / 93 | - Ref: AWS::StackId 94 | 95 | Outputs: 96 | SageMakerNoteBookURL: 97 | Description: "URL for the SageMaker Notebook Instance" 98 | Value: !Sub 'https://${AWS::Region}.console.aws.amazon.com/sagemaker/home?region=${AWS::Region}#/notebook-instances/openNotebook/${NotebookInstanceName}' 99 | SageMakerNotebookInstanceARN: 100 | Description: "ARN for the SageMaker Notebook Instance" 101 | Value: !Ref SageMakerNotebookInstance 102 | S3BucketARN: 103 | 
Description: "ARN for the S3 Bucket" 104 | Value: !Ref S3Bucket 105 | 106 | -------------------------------------------------------------------------------- /Chapter02/studio-domain.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'CloudFormation to create new Studio Domain if one does not exist' 5 | Parameters: 6 | DomainName: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | ConstraintDescription: Maximum of 63 alphanumeric characters. Can include hyphens but not spaces. 9 | Description: SageMaker Studio Domain Name 10 | MaxLength: '63' 11 | MinLength: '1' 12 | Type: String 13 | Default: 'StudioDomain' 14 | VPCId: 15 | Description: The ID of the VPC that Studio uses for communication 16 | Type: String 17 | Default: 'Replace with your VPC ID' 18 | VPCSubnetIds: 19 | Description: Choose which subnets Studio should use 20 | Type: 'List' 21 | Default: 'subnet-1,subnet-2' 22 | VPCSecurityGroupIds: 23 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 24 | Type: CommaDelimitedList 25 | Default: 'Replace with the security group id(s) for your VPC' 26 | KMSKeyId: 27 | Description: The ARN of the KMS Key to use for encrypting storage attached to notebook 28 | Type: String 29 | Default: 'Replace with your KMS Key ARN' 30 | Resources: 31 | StudioDomain: 32 | Type: AWS::SageMaker::Domain 33 | Properties: 34 | AppNetworkAccessType: VpcOnly 35 | AuthMode: IAM 36 | DefaultUserSettings: 37 | ExecutionRole: !GetAtt SageMakerRole.Arn 38 | SecurityGroups: !Ref VPCSecurityGroupIds 39 | DomainName: !Ref DomainName 40 | KmsKeyId: !Ref KMSKeyId 41 | SubnetIds: !Ref VPCSubnetIds 42 | VpcId: !Ref VPCId 43 | SageMakerRole: 44 | Type: AWS::IAM::Role 45 | Properties: 46 | AssumeRolePolicyDocument: 47 | Version: 2012-10-17 48 | Statement: 49 | - Effect: Allow 50 | Principal: 51 | Service: 52 | - "sagemaker.amazonaws.com" 53 | Action: 54 | - "sts:AssumeRole" 55 | ManagedPolicyArns: 56 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 57 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 58 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 59 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 60 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 61 | - "arn:aws:iam::aws:policy/AmazonCognitoPowerUser" 62 | Outputs: 63 | SageMakerDomainID: 64 | Description: "ID for the Studio Domain" 65 | Value: !Ref StudioDomain 66 | -------------------------------------------------------------------------------- /Chapter02/studio-environment.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Metadata: 3 | License: Apache-2.0 4 | Description: 'Example data science environment creating a new SageMaker Studio User in an existing Studio Domain using an existing VPC. This template also includes the creation of an Amazon S3 Bucket and IAM Role.' 5 | Parameters: 6 | StudioDomainID: 7 | AllowedPattern: '[A-Za-z0-9-]{1,63}' 8 | Description: ID of the Studio Domain where user should be created (ex. 
d-xxxnxxnxxnxn) 9 | Default: d-xxxnxxnxxnxn 10 | Type: String 11 | Team: 12 | AllowedValues: 13 | - weatherproduct 14 | - weatherresearch 15 | Description: Team name for user working in associated environment 16 | Default: weatherproduct 17 | Type: String 18 | UserProfileName: 19 | Description: User profile name 20 | AllowedPattern: '^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}' 21 | Type: String 22 | Default: 'UserName' 23 | VPCSecurityGroupIds: 24 | Description: The VPC security group IDs, in the form sg-xxxxxxxx. 25 | Type: CommaDelimitedList 26 | Default: 'Replace with the security group id(s) for your VPC' 27 | Resources: 28 | StudioUser: 29 | Type: AWS::SageMaker::UserProfile 30 | Properties: 31 | DomainId: !Ref StudioDomainID 32 | Tags: 33 | - Key: "Environment" 34 | Value: "Development" 35 | - Key: "Team" 36 | Value: !Ref Team 37 | UserProfileName: !Ref UserProfileName 38 | UserSettings: 39 | ExecutionRole: !GetAtt SageMakerRole.Arn 40 | SecurityGroups: !Ref VPCSecurityGroupIds 41 | 42 | SageMakerRole: 43 | Type: AWS::IAM::Role 44 | Properties: 45 | AssumeRolePolicyDocument: 46 | Version: 2012-10-17 47 | Statement: 48 | - Effect: Allow 49 | Principal: 50 | Service: 51 | - "sagemaker.amazonaws.com" 52 | Action: 53 | - "sts:AssumeRole" 54 | ManagedPolicyArns: 55 | - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess" 56 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 57 | - "arn:aws:iam::aws:policy/IAMReadOnlyAccess" 58 | - "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess" 59 | - "arn:aws:iam::aws:policy/AWSLambda_FullAccess" 60 | 61 | S3Bucket: 62 | Type: AWS::S3::Bucket 63 | Properties: 64 | BucketName: 65 | Fn::Join: 66 | - '-' 67 | - - datascience-environment-studio 68 | - !Ref UserProfileName 69 | - Fn::Select: 70 | - 4 71 | - Fn::Split: 72 | - '-' 73 | - Fn::Select: 74 | - 2 75 | - Fn::Split: 76 | - / 77 | - Ref: AWS::StackId 78 | 79 | Outputs: 80 | S3BucketARN: 81 | Description: "ARN for the S3 Bucket" 82 | Value: !Ref S3Bucket 83 | 84 | -------------------------------------------------------------------------------- /Chapter03/LabelData.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "# Chapter 2: Data Labeling with SageMaker Ground Truth: Custom Labeling\n", 6 | "\n", 7 | "In this notebook we'll perform the following steps:\n", 8 | "\n", 9 | "* Create a private workforce backed by a Cognito user pool.\n", 10 | "* Create a manifest file that lists the items we want to label\n", 11 | "* Define a custom Ground Truth labeling workflow, consisting of two Lambda functions and a UI template, and launch a labeling job\n", 12 | "* Add a second worker to our private workforce\n", 13 | "* Adjust the post-processing part of the workflow to handle input from multiple workers, and launch another labeling job\n", 14 | "\n" 15 | ], 16 | "cell_type": "markdown", 17 | "metadata": {} 18 | }, 19 | { 20 | "source": [ 21 | "## Create a private workforce\n", 22 | "\n", 23 | "Before executing the code in this section, review and set the following variables:\n", 24 | "\n", 25 | "* `PoolName`: The name for the user pool in Cognito\n", 26 | "* `ClientName`: The name for the Cognito user pool client\n", 27 | "* `IdentityPoolName`: The name for the Cognito identity pool\n", 28 | "* `Region`: The name of the AWS region you're working in\n", 29 | "* `IamRolePrefix`: A prefix to use when naming new IAM roles\n", 30 | "* `GroupName`: Name for the Cognito user group\n", 31 | "* `DomainName`: Domain name for the Cognito 
authentication page\n", 32 | "* `WorkteamName`: Name for the private work team\n", 33 | "* `UserEmail`: User name to use (use a fake email address)\n", 34 | "* `Password`: Use a password with at least one upper case character, one symbol, and one number" 35 | ], 36 | "cell_type": "markdown", 37 | "metadata": {} 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "# Constants\n", 46 | "\n", 47 | "PoolName = 'MyUserPool'\n", 48 | "ClientName = 'MyUserPoolClient'\n", 49 | "IdentityPoolName = 'MyIdentityPool'\n", 50 | "Region = 'us-east-1'\n", 51 | "IamRolePrefix = 'MyRole'\n", 52 | "GroupName = 'MyGroup'\n", 53 | "DomainName = 'MyDomain'\n", 54 | "WorkteamName = 'MyTeam'\n", 55 | "UserEmail = \"me@foo.com\"\n", 56 | "Password = 'PwTest123!'" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 88, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "from utils.cognito-helper import CognitoHelper\n", 66 | "cognito_helper = CognitoHelper(Region, IamRolePrefix)\n", 67 | "cognito_helper.create_user_pool(PoolName)\n", 68 | "cognito_helper.create_user_pool_client(ClientName)\n", 69 | "cognito_helper.create_identity_pool(IdentityPoolName)\n", 70 | "cognito_helper.create_group(GroupName)\n", 71 | "cognito_helper.create_user_pool_domain(DomainName)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 48, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "arn:aws:sagemaker:us-east-1:102165494304:workteam/private-crowd/rdtest\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "from util.sagemaker-helper import SagemakerHelper\n", 89 | "sagemaker_helper = SagemakerHelper(Region, IamRolePrefix)\n", 90 | "sagemaker_helper.create_workteam(WorkteamName, \n", 91 | " cognito_helper.user_pool_id, \n", 92 | " cognito_helper.group_name, \n", 93 | " cognito_helper.user_pool_client_id)\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "cognito_helper.update_client(sagemaker_helper.get_workforce_domain())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 76, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "{'UserConfirmed': False,\n", 114 | " 'UserSub': 'dac2455f-692c-4e38-b83b-6afbcd6a57ef',\n", 115 | " 'ResponseMetadata': {'RequestId': '057a69f9-609b-4979-ad47-83475ec05eac',\n", 116 | " 'HTTPStatusCode': 200,\n", 117 | " 'HTTPHeaders': {'date': 'Fri, 26 Mar 2021 00:02:48 GMT',\n", 118 | " 'content-type': 'application/x-amz-json-1.1',\n", 119 | " 'content-length': '72',\n", 120 | " 'connection': 'keep-alive',\n", 121 | " 'x-amzn-requestid': '057a69f9-609b-4979-ad47-83475ec05eac'},\n", 122 | " 'RetryAttempts': 0}}" 123 | ] 124 | }, 125 | "execution_count": 76, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "cognito_helper.add_user(UserEmail, Password)" 132 | ] 133 | }, 134 | { 135 | "source": [ 136 | "## Create a manifest file\n", 137 | "\n", 138 | "In this section, you'll need to define:\n", 139 | "\n", 140 | "* The name of your S3 bucket\n", 141 | "* The folder (prefix) where you stored the _OpenAQ_ data set.\n", 142 | "* The folder (prefix) where you want to store the manifest." 
143 | ], 144 | "cell_type": "markdown", 145 | "metadata": {} 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 49, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "s3_bucket = 'MyS3Bucket'\n", 154 | "s3_prefix = 'openaq/realtime/'\n", 155 | "s3_prefix_manifest = 'inventory'" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 105, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Processing openaq/realtime/2013-11-27/2013-11-27.ndjson\n", 168 | "Got 40 manifest entries\n" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "sagemaker_helper.create_manifest(s3_bucket, s3_prefix, s3_prefix_manifest)" 174 | ] 175 | }, 176 | { 177 | "source": [ 178 | "## Create custom workflow\n", 179 | "\n", 180 | "In this section, you must define:\n", 181 | "\n", 182 | "* The folder (prefix) where you want to store the workflow files.\n", 183 | "* The name prefix for your Lambda functions.\n", 184 | "* The folder (prefix) where you want to store the labeling output." 185 | ], 186 | "cell_type": "markdown", 187 | "metadata": {} 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "s3_prefix_workflow = 'workflow'\n", 196 | "fn_prefix = 'MyFn'\n", 197 | "s3_prefix_labels = 'labels'" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 114, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "sagemaker_helper.create_workflow(s3_bucket, s3_prefix_workflow, fn_prefix, s3_prefix_labels)" 207 | ] 208 | }, 209 | { 210 | "source": [ 211 | "## Add another worker\n", 212 | "\n", 213 | "In this section you'll need to define:\n", 214 | "\n", 215 | "* `UserEmail2`: User name to use for second worker (use a fake email address)\n", 216 | "* `Password2`: Use a password with at least one upper case character, one symbol, and one number" 217 | ], 218 | "cell_type": "markdown", 219 | "metadata": {} 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "cognito_helper.add_user(UserEmail2, Password2)" 228 | ] 229 | }, 230 | { 231 | "source": [ 232 | "## Launch labeling job for multiple workers" 233 | ], 234 | "cell_type": "markdown", 235 | "metadata": {} 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "sagemaker_helper.create_workflow_multiple_workers(s3_bucket, s3_prefix_workflow, fn_prefix, \n", 244 | " s3_prefix_labels, s3_prefix_labels)" 245 | ] 246 | } 247 | ], 248 | "metadata": { 249 | "instance_type": "ml.m5.large", 250 | "kernelspec": { 251 | "name": "python3", 252 | "display_name": "Python 3.8.5 64-bit ('.venv': venv)", 253 | "metadata": { 254 | "interpreter": { 255 | "hash": "cb064b2fb9a88e905f0b275d2c13a9378a141d23a43c9eb5ab74845797a0d104" 256 | } 257 | } 258 | }, 259 | "language_info": { 260 | "codemirror_mode": { 261 | "name": "ipython", 262 | "version": 3 263 | }, 264 | "file_extension": ".py", 265 | "mimetype": "text/x-python", 266 | "name": "python", 267 | "nbconvert_exporter": "python", 268 | "pygments_lexer": "ipython3", 269 | "version": "3.8.5-final" 270 | } 271 | }, 272 | "nbformat": 4, 273 | "nbformat_minor": 4 274 | } -------------------------------------------------------------------------------- /Chapter03/utils/cognito-helper.py: 
-------------------------------------------------------------------------------- 1 | import boto3 2 | import logging 3 | import sys 4 | import hmac 5 | import hashlib 6 | import base64 7 | 8 | class CognitoHelper: 9 | def __init__(self, region, iam_role_prefix): 10 | logging.basicConfig(level=logging.INFO) 11 | self.logger = logging.getLogger('CognitoHelper') 12 | self.cognito = boto3.client('cognito-idp') 13 | self.cognitoid = boto3.client('cognito-identity') 14 | self.iam = boto3.client('iam') 15 | self.region = region 16 | self.iam_role_prefix = iam_role_prefix 17 | 18 | def create_user_pool(self, PoolName): 19 | response = self.cognito.create_user_pool(PoolName=PoolName) 20 | self.user_pool_id = response['UserPool']['Id'] 21 | self.user_pool_arn = response['UserPool']['Arn'] 22 | self.logger.info(f"Created user pool with ID: {self.user_pool_id}; ARN: {self.user_pool_arn}") 23 | 24 | def create_user_pool_client(self, ClientName): 25 | response = self.cognito.create_user_pool_client( 26 | UserPoolId=self.user_pool_id, 27 | ClientName=ClientName, 28 | GenerateSecret=True, 29 | SupportedIdentityProviders = ['COGNITO'], 30 | ExplicitAuthFlows=[ 31 | 'ADMIN_NO_SRP_AUTH' 32 | ] 33 | ) 34 | self.user_pool_client_id = response['UserPoolClient']['ClientId'] 35 | self.logger.info(f"Created user pool client with ID: {self.user_pool_client_id}") 36 | 37 | def create_identity_pool(self, IdentityPoolName): 38 | response = self.cognitoid.create_identity_pool( 39 | IdentityPoolName=IdentityPoolName, 40 | AllowUnauthenticatedIdentities=False, 41 | CognitoIdentityProviders=[ 42 | { 43 | 'ProviderName': f"cognito-idp.{self.region}.amazonaws.com/{self.user_pool_id}", 44 | 'ClientId': self.user_pool_client_id 45 | }, 46 | ] 47 | ) 48 | self.id_pool_id = response['IdentityPoolId'] 49 | self.logger.info(f"Created identity pool {self.id_pool_id}") 50 | 51 | def create_group(self, GroupName): 52 | assume_role_doc = """{ 53 | "Version": "2012-10-17", 54 | "Statement": [ 55 | { 56 | "Effect": "Allow", 57 | "Principal": { 58 | "Federated": "cognito-identity.amazonaws.com" 59 | }, 60 | "Action": "sts:AssumeRoleWithWebIdentity", 61 | "Condition": { 62 | "StringEquals": { 63 | "cognito-identity.amazonaws.com:aud": """ + '"' + self.id_pool_id + '"' + """ 64 | } 65 | } 66 | }, 67 | { 68 | "Effect": "Allow", 69 | "Principal": { 70 | "Federated": "cognito-identity.amazonaws.com" 71 | }, 72 | "Action": "sts:AssumeRoleWithWebIdentity", 73 | "Condition": { 74 | "ForAnyValue:StringLike": { 75 | "cognito-identity.amazonaws.com:amr": "authenticated" 76 | } 77 | } 78 | } 79 | ] 80 | } 81 | """ 82 | response = self.iam.create_role( 83 | RoleName=f"{self.iam_role_prefix}-worker-group-role", 84 | AssumeRolePolicyDocument=assume_role_doc 85 | ) 86 | self.group_role_id = response['Role']['RoleId'] 87 | self.group_role_arn = response['Role']['Arn'] 88 | 89 | response = self.cognito.create_group( 90 | GroupName=GroupName, 91 | UserPoolId=self.user_pool_id, 92 | RoleArn=self.group_role_arn, 93 | Precedence=1 94 | ) 95 | self.group_name = response['Group']['GroupName'] 96 | print(f"Created worker group {self.group_name}") 97 | 98 | def create_user_pool_domain(self, DomainName): 99 | self.cognito.create_user_pool_domain( 100 | Domain=DomainName, 101 | UserPoolId=self.user_pool_id 102 | ) 103 | 104 | def get_client_secret(self): 105 | response = self.cognito.describe_user_pool_client( 106 | UserPoolId=self.user_pool_id, 107 | ClientId=self.user_pool_client_id 108 | ) 109 | self.client_secret = response['UserPoolClient']['ClientSecret'] 
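    # The app client is created with GenerateSecret=True, so Cognito requires a
    # SECRET_HASH when signing up users; add_user() below uses this client secret
    # to compute it (HMAC-SHA256 of username + client id, base64-encoded).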
110 | 111 | def update_client(self, labeling_domain): 112 | self.cognito.update_user_pool_client( 113 | UserPoolId=self.user_pool_id, 114 | ClientId=self.user_pool_client_id, 115 | CallbackURLs=['https://{labeling_domain}/oauth2/idpresponse'], 116 | LogoutURLs=['https://{labeling_domain}/logout'], 117 | AllowedOAuthFlows=['code','implicit'], 118 | AllowedOAuthScopes=['email','profile','openid'] 119 | ) 120 | 121 | def add_user(self, UserEmail, Password): 122 | self.get_client_secret() 123 | dig = hmac.new(bytearray(self.client_secret, encoding='utf-8'), 124 | msg=f"{UserEmail}{self.user_pool_client_id}".encode('utf-8'), 125 | digestmod=hashlib.sha256).digest() 126 | secret_hash = base64.b64encode(dig).decode() 127 | self.cognito.sign_up( 128 | ClientId=self.user_pool_client_id, 129 | Username=UserEmail, 130 | Password=Password, 131 | SecretHash=secret_hash, 132 | UserAttributes=[ 133 | { 134 | 'Name': 'email', 135 | 'Value': UserEmail 136 | }, 137 | { 138 | 'Name': 'phone_number', 139 | 'Value': '+12485551212' 140 | } 141 | ] 142 | ) 143 | self.cognito.admin_confirm_sign_up( 144 | UserPoolId=self.user_pool_id, 145 | Username=UserEmail 146 | ) 147 | self.cognito.admin_add_user_to_group( 148 | UserPoolId=self.user_pool_id, 149 | Username=UserEmail, 150 | GroupName=self.group_name 151 | ) 152 | 153 | if __name__ == "__main__": 154 | logging.warn("No main function defined") 155 | sys.exit(0) -------------------------------------------------------------------------------- /Chapter03/utils/sagemaker-helper.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from zipfile import ZipFile 3 | import logging 4 | import sys 5 | import json 6 | 7 | class SagemakerHelper: 8 | def __init__(self, region, iam_role_prefix): 9 | logging.basicConfig(level=logging.INFO) 10 | self.logger = logging.getLogger('SagemakerHelper') 11 | self.sagemaker = boto3.client('sagemaker') 12 | self.iam = boto3.client('iam') 13 | self.s3 = boto3.client('s3') 14 | self.lambdac = boto3.client('lambda') 15 | self.region = region 16 | self.iam_role_prefix = iam_role_prefix 17 | 18 | def create_workteam(self, WorkteamName, user_pool_id, group_name, user_pool_client_id): 19 | response = self.sagemaker.create_workteam( 20 | WorkteamName=WorkteamName, 21 | MemberDefinitions=[ 22 | { 23 | 'CognitoMemberDefinition': { 24 | 'UserPool': user_pool_id, 25 | 'UserGroup': group_name, 26 | 'ClientId': user_pool_client_id 27 | } 28 | } 29 | ], 30 | Description = WorkteamName 31 | ) 32 | self.workteam_arn = response['WorkteamArn'] 33 | self.workteam_name = WorkteamName 34 | self.logger.info(f"Created workteam {self.workteam_arn}") 35 | 36 | def get_workforce_domain(self): 37 | response = self.sagemaker.describe_workteam( 38 | WorkteamName=self.workteam_name 39 | ) 40 | self.workforce_domain = response['Workteam']['SubDomain'] 41 | return self.workforce_domain 42 | 43 | def create_manifest(self, s3_bucket, s3_prefix, s3_prefix_manifest, max_entries = 20): 44 | self.s3_prefix_manifest = s3_prefix_manifest 45 | manifest_file_local = 'manifest.txt' 46 | manifests = [] 47 | response = self.s3.list_objects_v2( 48 | Bucket=s3_bucket, 49 | Prefix=s3_prefix 50 | ) 51 | r = response['Contents'][0] 52 | self.s3.download_file(s3_bucket, r['Key'], 'temp.json') 53 | self.logger.debug("Processing " + r['Key']) 54 | with open('temp.json', 'r') as F: 55 | for l in F.readlines(): 56 | if len(manifests) > max_entries: 57 | break 58 | j = json.loads(l) 59 | 
manifests.append(f"{j['parameter']},{j['value']},{j['unit']},{j['coordinates']['latitude']},{j['coordinates']['longitude']}") 60 | self.logger.debug(f"Got {len(manifests)} manifest entries") 61 | with open(manifest_file_local, 'wt') as F: 62 | for m in manifests: 63 | F.write('{"source": "' + m + '"}' + "\n") 64 | self.s3.upload_file(manifest_file_local, s3_bucket, f"{self.s3_prefix_manifest}/openaq.manifest") 65 | 66 | label_file_local = 'label.txt' 67 | with open(label_file_local, 'wt') as F: 68 | F.write('{' + "\n") 69 | F.write('"document-version": "2018-11-28",' + "\n") 70 | F.write('"labels": [{"label": "good"},{"label": "bad"}]' + "\n") 71 | F.write('}' + "\n") 72 | self.s3.upload_file(label_file_local, s3_bucket, f"{self.s3_prefix_manifest}/openaq.labels") 73 | 74 | def create_role(self, service_for, name, policies = []): 75 | role_doc = """{ 76 | "Version": "2012-10-17", 77 | "Statement": [ 78 | { 79 | "Effect": "Allow", 80 | "Principal": { 81 | "Service": f"{service_for}.amazonaws.com" 82 | }, 83 | "Action": "sts:AssumeRole" 84 | } 85 | ] 86 | } 87 | """ 88 | role =f"{self.iam_role_prefix}-{name}-role", 89 | response = self.iam.create_role( 90 | RoleName=role, 91 | AssumeRolePolicyDocument=role_doc 92 | ) 93 | role_arn, role_name = response['Role']['Arn'],response['Role']['RoleName'] 94 | for p in policies: 95 | self.iam.attach_role_policy( 96 | RoleName=role_name, 97 | PolicyArn=p 98 | ) 99 | 100 | return role_arn, role_name 101 | 102 | def create_fn(self, fn_prefix, fname, l_prefix, role_arn, handler): 103 | fzip = f"{fname}.zip" 104 | with ZipFile(fzip,'w') as zip: 105 | zip.write(f"workflow/{fname}") 106 | with open(fzip, 'rb') as file_data: 107 | f_bytes = file_data.read() 108 | 109 | response = self.lambdac.create_function( 110 | FunctionName=f"{fn_prefix}-{l_prefix}-LabelingFunction", 111 | Runtime='python3.7', 112 | Role=role_arn, 113 | Handler=handler, 114 | Code={ 115 | 'ZipFile': f_bytes 116 | }, 117 | Description=f"{fn_prefix}-{l_prefix}-LabelingFunction", 118 | Timeout=300, 119 | MemorySize=1024, 120 | Publish=True, 121 | PackageType='Zip' 122 | ) 123 | return response['FunctionArn'] 124 | 125 | 126 | def create_workflow(self, s3_bucket, s3_prefix_workflow, fn_prefix, label_prefix, s3_prefix_labels): 127 | self.workflow_role_arn, self.workflow_role_name = self.create_role("sagemaker", "workflow", 128 | policies = ['arn:aws:iam::aws:policy/AmazonS3FullAccess', 129 | 'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'] 130 | ) 131 | self.lambda_role_arn, self.lambda_role_name = self.create_role("lambda", "lambda", 132 | policies = ['arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole', 133 | 'arn:aws:iam::aws:policy/AmazonS3FullAccess'] 134 | ) 135 | 136 | self.s3.upload_file('workflow/ui.liquid.html', s3_bucket, f"{s3_prefix_workflow}/openaq.liquid.html") 137 | self.pre_arn = self.create_fn(fn_prefix, 'pre.py', "pre", self.lambda_role_arn, "pre.handler") 138 | self.post_arn = self.create_fn(fn_prefix, 'post.py', "post", self.lambda_role_arn, "post.handler") 139 | 140 | self.sagemaker.create_labeling_job( 141 | LabelingJobName=fn_prefix, 142 | LabelAttributeName='badair', 143 | InputConfig={ 144 | 'DataSource': { 145 | 'S3DataSource': { 146 | 'ManifestS3Uri': f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.manifest" 147 | } 148 | } 149 | }, 150 | OutputConfig={ 151 | 'S3OutputPath': f"s3://{s3_bucket}/{s3_prefix_labels}/openaq" 152 | }, 153 | RoleArn=self.workflow_role_arn, 154 | 
LabelCategoryConfigS3Uri=f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.labels", 155 | StoppingConditions={ 156 | 'MaxHumanLabeledObjectCount': 10, 157 | 'MaxPercentageOfInputDatasetLabeled': 5 158 | }, 159 | HumanTaskConfig={ 160 | 'WorkteamArn': self.workteam_arn, 161 | 'UiConfig': { 162 | 'UiTemplateS3Uri': f"s3://{s3_bucket}/{s3_prefix_workflow}/openaq.liquid.html" 163 | }, 164 | 'PreHumanTaskLambdaArn': self.pre_arn, 165 | 'TaskTitle': 'Label Air Quality', 166 | 'TaskDescription': 'Was it a good air day?', 167 | 'NumberOfHumanWorkersPerDataObject': 1, 168 | 'TaskTimeLimitInSeconds': 3600, 169 | 'AnnotationConsolidationConfig': { 170 | 'AnnotationConsolidationLambdaArn': self.post_arn 171 | } 172 | } 173 | ) 174 | 175 | def create_workflow_multiple_workers(self, s3_bucket, s3_prefix_workflow, fn_prefix, label_prefix, s3_prefix_labels): 176 | self.mpost_arn = self.create_fn(fn_prefix, 'post-multiple.py', "mpost", self.lambda_role_arn, "post.handler") 177 | self.sagemaker.create_labeling_job( 178 | LabelingJobName=fn_prefix, 179 | LabelAttributeName='badair', 180 | InputConfig={ 181 | 'DataSource': { 182 | 'S3DataSource': { 183 | 'ManifestS3Uri': f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.manifest" 184 | } 185 | } 186 | }, 187 | OutputConfig={ 188 | 'S3OutputPath': f"s3://{s3_bucket}/{s3_prefix_labels}/openaq" 189 | }, 190 | RoleArn=self.workflow_role_arn, 191 | LabelCategoryConfigS3Uri=f"s3://{s3_bucket}/{self.s3_prefix_manifest}/openaq.labels", 192 | StoppingConditions={ 193 | 'MaxHumanLabeledObjectCount': 10, 194 | 'MaxPercentageOfInputDatasetLabeled': 5 195 | }, 196 | HumanTaskConfig={ 197 | 'WorkteamArn': self.workteam_arn, 198 | 'UiConfig': { 199 | 'UiTemplateS3Uri': f"s3://{s3_bucket}/{s3_prefix_workflow}/openaq.liquid.html" 200 | }, 201 | 'PreHumanTaskLambdaArn': self.pre_arn, 202 | 'TaskTitle': 'Label Air Quality', 203 | 'TaskDescription': 'Was it a good air day?', 204 | 'NumberOfHumanWorkersPerDataObject': 2, 205 | 'TaskTimeLimitInSeconds': 3600, 206 | 'AnnotationConsolidationConfig': { 207 | 'AnnotationConsolidationLambdaArn': self.mpost_arn 208 | } 209 | } 210 | ) 211 | 212 | if __name__ == "__main__": 213 | logging.warn("No main function defined") 214 | sys.exit(0) -------------------------------------------------------------------------------- /Chapter03/workflow/post-multiple.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | 4 | """ 5 | Input: 6 | 7 | { 8 | "version": "2018-10-16", 9 | "labelingJobArn": , 10 | "labelCategories": [], 11 | "labelAttributeName": , 12 | "roleArn" : "string", 13 | "payload": { 14 | "s3Uri": 15 | } 16 | } 17 | 18 | Contents of payload: 19 | 20 | [ 21 | { 22 | "datasetObjectId": , 23 | "dataObject": { 24 | "s3Uri": , 25 | "content": 26 | }, 27 | "annotations": [{ 28 | "workerId": , 29 | "annotationData": { 30 | "content": , 31 | "s3Uri": 32 | } 33 | }] 34 | } 35 | ] 36 | 37 | Output: 38 | 39 | [ 40 | { 41 | "datasetObjectId": , 42 | "consolidatedAnnotation": { 43 | "content": { 44 | "": { 45 | # ... label content 46 | } 47 | } 48 | } 49 | }, 50 | { 51 | "datasetObjectId": , 52 | "consolidatedAnnotation": { 53 | "content": { 54 | "": { 55 | # ... 
label content 56 | } 57 | } 58 | } 59 | } 60 | ] 61 | 62 | """ 63 | def handler(event, context): 64 | input_uri = event["payload"]['s3Uri'] 65 | parts = input_uri.split('/') 66 | s3 = boto3.client('s3') 67 | s3.download_file(parts[2], "/".join(parts[3:]), '/tmp/input.json') 68 | 69 | with open('/tmp/input.json', 'r') as F: 70 | input_data = json.load(F) 71 | 72 | output_data = [] 73 | for p in range(len(input_data)): 74 | d_id = input_data[p]['datasetObjectId'] 75 | 76 | annotations = input_data[p]['annotations'] 77 | annotation = annotations[len(annotations)-1]['annotationData']['content'] 78 | 79 | response = { 80 | "datasetObjectId": d_id, 81 | "consolidatedAnnotation": { 82 | "content": annotation 83 | } 84 | } 85 | 86 | output_data.append(response) 87 | 88 | # Perform consolidation 89 | return output_data -------------------------------------------------------------------------------- /Chapter03/workflow/post.py: -------------------------------------------------------------------------------- 1 | import json 2 | import boto3 3 | 4 | """ 5 | Input: 6 | 7 | { 8 | "version": "2018-10-16", 9 | "labelingJobArn": , 10 | "labelCategories": [], 11 | "labelAttributeName": , 12 | "roleArn" : "string", 13 | "payload": { 14 | "s3Uri": 15 | } 16 | } 17 | 18 | Contents of payload: 19 | 20 | [ 21 | { 22 | "datasetObjectId": , 23 | "dataObject": { 24 | "s3Uri": , 25 | "content": 26 | }, 27 | "annotations": [{ 28 | "workerId": , 29 | "annotationData": { 30 | "content": , 31 | "s3Uri": 32 | } 33 | }] 34 | } 35 | ] 36 | 37 | Output: 38 | 39 | [ 40 | { 41 | "datasetObjectId": , 42 | "consolidatedAnnotation": { 43 | "content": { 44 | "": { 45 | # ... label content 46 | } 47 | } 48 | } 49 | }, 50 | { 51 | "datasetObjectId": , 52 | "consolidatedAnnotation": { 53 | "content": { 54 | "": { 55 | # ... 
label content 56 | } 57 | } 58 | } 59 | } 60 | ] 61 | 62 | """ 63 | def handler(event, context): 64 | input_uri = event["payload"]['s3Uri'] 65 | parts = input_uri.split('/') 66 | s3 = boto3.client('s3') 67 | s3.download_file(parts[2], "/".join(parts[3:]), '/tmp/input.json') 68 | 69 | with open('/tmp/input.json', 'r') as F: 70 | input_data = json.load(F) 71 | 72 | output_data = [] 73 | for p in range(len(input_data)): 74 | d_id = input_data[p]['datasetObjectId'] 75 | 76 | annotation = input_data[p]['annotations'][0]['annotationData']['content'] 77 | 78 | response = { 79 | "datasetObjectId": d_id, 80 | "consolidatedAnnotation": { 81 | "content": annotation 82 | } 83 | } 84 | 85 | output_data.append(response) 86 | 87 | # Perform consolidation 88 | return output_data -------------------------------------------------------------------------------- /Chapter03/workflow/pre.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | """ 4 | The input event looks like this: 5 | 6 | { 7 | "version":"2018-10-16", 8 | "labelingJobArn":"", 9 | "dataObject":{ 10 | "source":"metric type,metric value,metric unit,lat,lon" 11 | } 12 | } 13 | 14 | The output should look like this: 15 | 16 | { 17 | "taskInput":{ 18 | "metric": "PM2.5 = 30.0", 19 | "lat": 0.0, 20 | "lon": 0.0 21 | }, 22 | "isHumanAnnotationRequired":"true" 23 | } 24 | """ 25 | def handler(event, context): 26 | sourceText = event['dataObject']['source'] 27 | parts = sourceText.split(',') 28 | 29 | output = { 30 | "taskInput": { 31 | "metric": f"{parts[0]}={parts[1]} {parts[2]}", 32 | "lat": parts[3], 33 | "lon": parts[4] 34 | }, 35 | "isHumanAnnotationRequired": "true" 36 | } 37 | 38 | return output -------------------------------------------------------------------------------- /Chapter03/workflow/ui.liquid.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | 9 | 10 |
{{ task.input.metric }}
Choose the most relevant label for this air measurement.
Good air quality, safe for outdoor exercise for those with asthma
Bad air quality, not safe for outdoor exercise for those with asthma
(Only these text fragments of the worker UI template survived extraction; the surrounding HTML markup is not recoverable.)
44 | 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /Chapter04/PrepareData.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "source": [ 5 | "# Chapter 3: Data preparation at scale using Amazon SageMaker Data Wrangler and Amazon SageMaker Processing\n", 6 | "\n", 7 | "In this notebook we'll perform the following steps:\n", 8 | "\n", 9 | "* Create a table in the Glue catalog for our data steps\n", 10 | "* Run a SageMaker Processing job to prepare the full data set\n", 11 | "\n", 12 | "You need to define the following variables:\n", 13 | "\n", 14 | "* `s3_bucket`: Bucket with the data set\n", 15 | "* `glue_db_name`: Glue database name\n", 16 | "* `glue_tbl_name`: Glue table name\n", 17 | "* `s3_prefix_parquet`: Location of the Parquet tables in the S3 bucket\n", 18 | "* `s3_output_prefix`: Location to store the prepared data in the S3 bucket\n", 19 | "* `s3_prefix`: Location of the JSON data in the S3 bucket\n" 20 | ], 21 | "cell_type": "markdown", 22 | "metadata": {} 23 | }, 24 | { 25 | "source": [ 26 | "## Glue Catalog" 27 | ], 28 | "cell_type": "markdown", 29 | "metadata": {} 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "s3_bucket = 'MyBucket'\n", 38 | "glue_db_name = 'MyDatabase'\n", 39 | "glue_tbl_name = 'openaq'\n", 40 | "s3_prefix = 'openaq/realtime'\n", 41 | "s3_prefix_parquet = 'openaq/realtime-parquet-gzipped/tables'\n", 42 | "s3_output_prefix = 'prepared'\n", 43 | "\n", 44 | "import boto3\n", 45 | "s3 = boto3.client('s3')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "glue = boto3.client('glue')\n", 55 | "response = glue.create_database(\n", 56 | " DatabaseInput={\n", 57 | " 'Name': glue_db_name,\n", 58 | " }\n", 59 | ")" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "response = glue.create_table(\n", 69 | " DatabaseName=glue_db_name,\n", 70 | " TableInput={\n", 71 | " 'Name': glue_tbl_name,\n", 72 | " 'StorageDescriptor': {\n", 73 | " 'Columns': [\n", 74 | " {\n", 75 | " \"Name\": \"date\",\n", 76 | " \"Type\": \"struct\"\n", 77 | " },\n", 78 | " {\n", 79 | " \"Name\": \"parameter\",\n", 80 | " \"Type\": \"string\"\n", 81 | " },\n", 82 | " {\n", 83 | " \"Name\": \"location\",\n", 84 | " \"Type\": \"string\"\n", 85 | " },\n", 86 | " {\n", 87 | " \"Name\": \"value\",\n", 88 | " \"Type\": \"double\"\n", 89 | " },\n", 90 | " {\n", 91 | " \"Name\": \"unit\",\n", 92 | " \"Type\": \"string\"\n", 93 | " },\n", 94 | " {\n", 95 | " \"Name\": \"city\",\n", 96 | " \"Type\": \"string\"\n", 97 | " },\n", 98 | " {\n", 99 | " \"Name\": \"attribution\",\n", 100 | " \"Type\": \"array>\"\n", 101 | " },\n", 102 | " {\n", 103 | " \"Name\": \"averagingperiod\",\n", 104 | " \"Type\": \"struct\"\n", 105 | " },\n", 106 | " {\n", 107 | " \"Name\": \"coordinates\",\n", 108 | " \"Type\": \"struct\"\n", 109 | " },\n", 110 | " {\n", 111 | " \"Name\": \"country\",\n", 112 | " \"Type\": \"string\"\n", 113 | " },\n", 114 | " {\n", 115 | " \"Name\": \"sourcename\",\n", 116 | " \"Type\": \"string\"\n", 117 | " },\n", 118 | " {\n", 119 | " \"Name\": \"sourcetype\",\n", 120 | " \"Type\": \"string\"\n", 121 | " },\n", 122 | " {\n", 123 | " \"Name\": \"mobile\",\n", 124 | " \"Type\": \"boolean\"\n", 125 | " }\n", 126 
| " ],\n", 127 | " 'Location': 's3://' + s3_bucket + '/' + s3_prefix + '/',\n", 128 | " 'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',\n", 129 | " 'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',\n", 130 | " 'Compressed': False,\n", 131 | " 'SerdeInfo': {\n", 132 | " 'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',\n", 133 | " \"Parameters\": {\n", 134 | " \"paths\": \"attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value\"\n", 135 | " }\n", 136 | " },\n", 137 | " 'Parameters': {\n", 138 | " \"classification\": \"json\",\n", 139 | " \"compressionType\": \"none\",\n", 140 | " },\n", 141 | " 'StoredAsSubDirectories': False,\n", 142 | " },\n", 143 | " 'PartitionKeys': [\n", 144 | " {\n", 145 | " \"Name\": \"aggdate\",\n", 146 | " \"Type\": \"string\"\n", 147 | " },\n", 148 | " ],\n", 149 | " 'TableType': 'EXTERNAL_TABLE',\n", 150 | " 'Parameters': {\n", 151 | " \"classification\": \"json\",\n", 152 | " \"compressionType\": \"none\",\n", 153 | " }\n", 154 | " \n", 155 | " }\n", 156 | ")" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "partitions_to_add = []\n", 166 | "response = s3.list_objects_v2(\n", 167 | " Bucket=s3_bucket,\n", 168 | " Prefix=s3_prefix + '/'\n", 169 | ")\n", 170 | "for r in response['Contents']:\n", 171 | " partitions_to_add.append(r['Key'])\n", 172 | "while response['IsTruncated']:\n", 173 | " token = response['NextContinuationToken']\n", 174 | " response = s3.list_objects_v2(\n", 175 | " Bucket=s3_bucket,\n", 176 | " Prefix=s3_prefix,\n", 177 | " ContinuationToken=token\n", 178 | " ) \n", 179 | " for r in response['Contents']:\n", 180 | " partitions_to_add.append(r['Key'])\n", 181 | " if response['IsTruncated']:\n", 182 | " oken = response['NextContinuationToken']\n", 183 | " print(\"Getting next batch\")" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "print(f\"Need to add {len(partitions_to_add)} partitions\")" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "def chunks(lst, n):\n", 202 | " \"\"\"Yield successive n-sized chunks from lst.\"\"\"\n", 203 | " for i in range(0, len(lst), n):\n", 204 | " yield lst[i:i + n]" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "def get_part_def(p):\n", 214 | " part_value = p.split('/')[-2]\n", 215 | " return {\n", 216 | " 'Values': [\n", 217 | " part_value\n", 218 | " ],\n", 219 | " 'StorageDescriptor': {\n", 220 | " 'Columns': [\n", 221 | " {\n", 222 | " \"Name\": \"date\",\n", 223 | " \"Type\": \"struct\"\n", 224 | " },\n", 225 | " {\n", 226 | " \"Name\": \"parameter\",\n", 227 | " \"Type\": \"string\"\n", 228 | " },\n", 229 | " {\n", 230 | " \"Name\": \"location\",\n", 231 | " \"Type\": \"string\"\n", 232 | " },\n", 233 | " {\n", 234 | " \"Name\": \"value\",\n", 235 | " \"Type\": \"double\"\n", 236 | " },\n", 237 | " {\n", 238 | " \"Name\": \"unit\",\n", 239 | " \"Type\": \"string\"\n", 240 | " },\n", 241 | " {\n", 242 | " \"Name\": \"city\",\n", 243 | " \"Type\": \"string\"\n", 244 | " },\n", 245 | " {\n", 246 | " \"Name\": \"attribution\",\n", 247 | " \"Type\": \"array>\"\n", 248 | " },\n", 249 | " {\n", 250 
| " \"Name\": \"averagingperiod\",\n", 251 | " \"Type\": \"struct\"\n", 252 | " },\n", 253 | " {\n", 254 | " \"Name\": \"coordinates\",\n", 255 | " \"Type\": \"struct\"\n", 256 | " },\n", 257 | " {\n", 258 | " \"Name\": \"country\",\n", 259 | " \"Type\": \"string\"\n", 260 | " },\n", 261 | " {\n", 262 | " \"Name\": \"sourcename\",\n", 263 | " \"Type\": \"string\"\n", 264 | " },\n", 265 | " {\n", 266 | " \"Name\": \"sourcetype\",\n", 267 | " \"Type\": \"string\"\n", 268 | " },\n", 269 | " {\n", 270 | " \"Name\": \"mobile\",\n", 271 | " \"Type\": \"boolean\"\n", 272 | " }\n", 273 | " ],\n", 274 | " 'Location': f\"s3://{s3_bucket}/{s3_prefix}/{part_value}/\",\n", 275 | " 'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',\n", 276 | " 'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',\n", 277 | " 'Compressed': False,\n", 278 | " 'SerdeInfo': {\n", 279 | " 'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',\n", 280 | " \"Parameters\": {\n", 281 | " \"paths\": \"attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value\"\n", 282 | " }\n", 283 | " },\n", 284 | " 'StoredAsSubDirectories': False\n", 285 | " },\n", 286 | " 'Parameters': {\n", 287 | " \"classification\": \"json\",\n", 288 | " \"compressionType\": \"none\",\n", 289 | " },\n", 290 | "\n", 291 | "\n", 292 | " }" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "for batch in chunks(partitions_to_add, 100):\n", 302 | " response = glue.batch_create_partition(\n", 303 | " DatabaseName=glue_db_name,\n", 304 | " TableName=glue_tbl_name,\n", 305 | " PartitionInputList=[get_part_def(p) for p in batch]\n", 306 | " )" 307 | ] 308 | }, 309 | { 310 | "source": [ 311 | "## Processing Job" 312 | ], 313 | "cell_type": "markdown", 314 | "metadata": {} 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "import logging\n", 323 | "import sagemaker\n", 324 | "from time import gmtime, strftime\n", 325 | "\n", 326 | "sagemaker_logger = logging.getLogger(\"sagemaker\")\n", 327 | "sagemaker_logger.setLevel(logging.INFO)\n", 328 | "sagemaker_logger.addHandler(logging.StreamHandler())\n", 329 | "\n", 330 | "sagemaker_session = sagemaker.Session()\n", 331 | "role = sagemaker.get_execution_role()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "from sagemaker.spark.processing import PySparkProcessor\n", 341 | "\n", 342 | "spark_processor = PySparkProcessor(\n", 343 | " base_job_name=\"spark-preprocessor\",\n", 344 | " framework_version=\"3.0\",\n", 345 | " role=role,\n", 346 | " instance_count=15,\n", 347 | " instance_type=\"ml.m5.4xlarge\",\n", 348 | " max_runtime_in_seconds=7200,\n", 349 | ")\n", 350 | "\n", 351 | "configuration = [\n", 352 | " {\n", 353 | " \"Classification\": \"spark-defaults\",\n", 354 | " \"Properties\": {\"spark.executor.memory\": \"18g\", \n", 355 | " \"spark.yarn.executor.memoryOverhead\": \"3g\",\n", 356 | " \"spark.driver.memory\": \"18g\",\n", 357 | " \"spark.yarn.driver.memoryOverhead\": \"3g\",\n", 358 | " \"spark.executor.cores\": \"5\", \n", 359 | " \"spark.driver.cores\": \"5\",\n", 360 | " \"spark.executor.instances\": \"44\",\n", 361 | " \"spark.default.parallelism\": \"440\",\n", 362 | " \"spark.dynamicAllocation.enabled\": \"false\"\n", 
363 | " },\n", 364 | " },\n", 365 | " {\n", 366 | " \"Classification\": \"yarn-site\",\n", 367 | " \"Properties\": {\"yarn.nodemanager.vmem-check-enabled\": \"false\", \n", 368 | " \"yarn.nodemanager.mmem-check-enabled\": \"false\"},\n", 369 | " }\n", 370 | "]\n", 371 | "\n", 372 | "spark_processor.run(\n", 373 | " submit_app=\"scripts/preprocess.py\",\n", 374 | " submit_jars=[\"s3://crawler-public/json/serde/json-serde.jar\"],\n", 375 | " arguments=['--s3_input_bucket', s3_bucket,\n", 376 | " '--s3_input_key_prefix', s3_prefix_parquet,\n", 377 | " '--s3_output_bucket', s3_bucket,\n", 378 | " '--s3_output_key_prefix', s3_output_prefix],\n", 379 | " spark_event_logs_s3_uri=\"s3://{}/{}/spark_event_logs\".format(s3_bucket, 'sparklogs'),\n", 380 | " logs=True,\n", 381 | " configuration=configuration\n", 382 | ")" 383 | ] 384 | } 385 | ], 386 | "metadata": { 387 | "instance_type": "ml.t3.medium", 388 | "kernelspec": { 389 | "display_name": "Python 3 (Data Science)", 390 | "language": "python", 391 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 392 | }, 393 | "language_info": { 394 | "codemirror_mode": { 395 | "name": "ipython", 396 | "version": 3 397 | }, 398 | "file_extension": ".py", 399 | "mimetype": "text/x-python", 400 | "name": "python", 401 | "nbconvert_exporter": "python", 402 | "pygments_lexer": "ipython3", 403 | "version": "3.7.10" 404 | } 405 | }, 406 | "nbformat": 4, 407 | "nbformat_minor": 4 408 | } -------------------------------------------------------------------------------- /Chapter04/scripts/preprocess.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import unicode_literals 3 | 4 | import argparse 5 | import csv 6 | import os 7 | import shutil 8 | import sys 9 | import time 10 | import logging 11 | import boto3 12 | 13 | import pyspark 14 | from pyspark.sql import SparkSession 15 | from pyspark import SparkContext 16 | from pyspark.ml import Pipeline 17 | from pyspark.ml.linalg import Vectors 18 | from pyspark.ml.feature import ( 19 | StringIndexer, 20 | VectorAssembler, 21 | VectorIndexer, 22 | StandardScaler, 23 | OneHotEncoder 24 | ) 25 | from pyspark.sql.functions import * 26 | from pyspark.sql.functions import round as round_ 27 | from pyspark.sql.types import ( 28 | DoubleType, 29 | StringType, 30 | StructField, 31 | StructType, 32 | BooleanType, 33 | IntegerType 34 | ) 35 | 36 | def get_tables(): 37 | 38 | tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2'] 39 | #tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2', 40 | # '1645e09b-0919-439a-b07f-cd7532069f10', 41 | # '80d785aa-6da8-4b37-8632-7386b1d535f3', 42 | # '8375c185-ea3e-44b7-b61c-68534e33ddf7', 43 | # '9a6e24db-cffc-42de-a77f-7ab96c487022', 44 | # 'bc981da3-bc9d-435a-8bf2-4107a8fb2676', 45 | # 'cf5cf814-c9e3-4a2b-8811-e4fc6481a1fe', 46 | # 'd3b8f1ab-f3e5-4fc9-84ab-8568edd8a03d'] 47 | 48 | 49 | return tables 50 | 51 | 52 | def isBadAir(v, p): 53 | if p == 'pm10': 54 | if v > 50: 55 | return 1 56 | else: 57 | return 0 58 | elif p == 'pm25': 59 | if v > 25: 60 | return 1 61 | else: 62 | return 0 63 | elif p == 'so2': 64 | if v > 20: 65 | return 1 66 | else: 67 | return 0 68 | elif p == 'no2': 69 | if v > 200: 70 | return 1 71 | else: 72 | return 0 73 | elif p == 'o3': 74 | if v > 100: 75 | return 1 76 | else: 77 | return 0 78 | else: 79 | return 0 80 | 81 | def extract(row): 82 | return (row.value, row.ismobile, row.year, row.month, row.quarter, row.day, 
row.isBadAir, 83 | row.indexed_location, row.indexed_city, row.indexed_country, row.indexed_sourcename, 84 | row.indexed_sourcetype) + tuple(row.vec_parameter.toArray().tolist()) 85 | 86 | """ 87 | Schema on disk: 88 | 89 | |-- date_utc: string (nullable = true) 90 | |-- date_local: string (nullable = true) 91 | |-- location: string (nullable = true) 92 | |-- country: string (nullable = true) 93 | |-- value: float (nullable = true) 94 | |-- unit: string (nullable = true) 95 | |-- city: string (nullable = true) 96 | |-- attribution: array (nullable = true) 97 | | |-- element: struct (containsNull = true) 98 | | | |-- name: string (nullable = true) 99 | | | |-- url: string (nullable = true) 100 | |-- averagingperiod: struct (nullable = true) 101 | | |-- unit: string (nullable = true) 102 | | |-- value: float (nullable = true) 103 | |-- coordinates: struct (nullable = true) 104 | | |-- latitude: float (nullable = true) 105 | | |-- longitude: float (nullable = true) 106 | |-- sourcename: string (nullable = true) 107 | |-- sourcetype: string (nullable = true) 108 | |-- mobile: string (nullable = true) 109 | |-- parameter: string (nullable = true) 110 | 111 | Example output: 112 | 113 | date_utc='2015-10-31T07:00:00.000Z' 114 | date_local='2015-10-31T04:00:00-03:00' 115 | location='Quintero Centro' 116 | country='CL' 117 | value=19.81999969482422 118 | unit='µg/m³' 119 | city='Quintero' 120 | attribution=[Row(name='SINCA', url='http://sinca.mma.gob.cl/'), Row(name='CENTRO QUINTERO', url=None)] 121 | averagingperiod=None 122 | coordinates=Row(latitude=-32.786170959472656, longitude=-71.53143310546875) 123 | sourcename='Chile - SINCA' 124 | sourcetype=None 125 | mobile=None 126 | parameter='o3' 127 | 128 | Transformations: 129 | 130 | * Featurize date_utc 131 | * Drop date_local 132 | * Encode location 133 | * Encode country 134 | * Scale value 135 | * Drop unit 136 | * Encode city 137 | * Drop attribution 138 | * Drop averaging period 139 | * Drop coordinates 140 | * Encode source name 141 | * Encode source type 142 | * Convert mobile to integer 143 | * Encode parameter 144 | 145 | * Add label for good/bad air quality 146 | 147 | """ 148 | def main(): 149 | parser = argparse.ArgumentParser(description="Preprocessing configuration") 150 | parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket") 151 | parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix") 152 | parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket") 153 | parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix") 154 | args = parser.parse_args() 155 | 156 | logging.basicConfig(level=logging.INFO) 157 | logger = logging.getLogger('Preprocess') 158 | 159 | spark = SparkSession.builder.appName("Preprocessor").getOrCreate() 160 | 161 | logger.info("Reading data set") 162 | tables = get_tables() 163 | df = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{tables[0]}/") 164 | for t in tables[1:]: 165 | df_new = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{t}/") 166 | df = df.union(df_new) 167 | 168 | # Drop columns 169 | logger.info("Dropping columns") 170 | df = df.drop('date_local').drop('unit').drop('attribution').drop('averagingperiod').drop('coordinates') 171 | 172 | # Mobile field to int 173 | logger.info("Casting mobile field to int") 174 | df = df.withColumn("ismobile",col("mobile").cast(IntegerType())).drop('mobile') 175 | 176 | # scale value 177 | logger.info("Scaling value") 
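    # The next block standardizes the raw sensor reading. Spark ML's StandardScaler
    # operates on Vector columns, so the scalar "value" column is first wrapped into a
    # one-element vector with VectorAssembler; by default StandardScaler divides by the
    # column's standard deviation (withStd=True) without centering it (withMean=False).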
178 | value_assembler = VectorAssembler(inputCols=["value"], outputCol="value_vec") 179 | value_scaler = StandardScaler(inputCol="value_vec", outputCol="value_scaled") 180 | value_pipeline = Pipeline(stages=[value_assembler, value_scaler]) 181 | value_model = value_pipeline.fit(df) 182 | xform_df = value_model.transform(df) 183 | 184 | # featurize date 185 | logger.info("Featurizing date") 186 | xform_df = xform_df.withColumn('aggdt', 187 | to_date(unix_timestamp(col('date_utc'), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp"))) 188 | xform_df = xform_df.withColumn('year',year(xform_df.aggdt)) \ 189 | .withColumn('month',month(xform_df.aggdt)) \ 190 | .withColumn('quarter',quarter(xform_df.aggdt)) 191 | xform_df = xform_df.withColumn("day", date_format(col("aggdt"), "d")) 192 | 193 | # Automatically assign good/bad labels 194 | logger.info("Simulating good/bad air labels") 195 | isBadAirUdf = udf(isBadAir, IntegerType()) 196 | xform_df = xform_df.withColumn('isBadAir', isBadAirUdf('value', 'parameter')) 197 | 198 | # Categorical encodings. 199 | logger.info("Categorical encoding") 200 | parameter_indexer = StringIndexer(inputCol="parameter", outputCol="indexed_parameter", handleInvalid='keep') 201 | location_indexer = StringIndexer(inputCol="location", outputCol="indexed_location", handleInvalid='keep') 202 | city_indexer = StringIndexer(inputCol="city", outputCol="indexed_city", handleInvalid='keep') 203 | country_indexer = StringIndexer(inputCol="country", outputCol="indexed_country", handleInvalid='keep') 204 | sourcename_indexer = StringIndexer(inputCol="sourcename", outputCol="indexed_sourcename", handleInvalid='keep') 205 | sourcetype_indexer = StringIndexer(inputCol="sourcetype", outputCol="indexed_sourcetype", handleInvalid='keep') 206 | enc_est = OneHotEncoder(inputCols=["indexed_parameter"], outputCols=["vec_parameter"]) 207 | enc_pipeline = Pipeline(stages=[parameter_indexer, location_indexer, 208 | city_indexer, country_indexer, sourcename_indexer, 209 | sourcetype_indexer, enc_est]) 210 | enc_model = enc_pipeline.fit(xform_df) 211 | enc_df = enc_model.transform(xform_df) 212 | param_cols = enc_df.schema.fields[17].metadata['ml_attr']['vals'] 213 | 214 | # Clean up data set 215 | logger.info("Final cleanup") 216 | final_df = enc_df.drop('parameter').drop('location') \ 217 | .drop('city').drop('country').drop('sourcename') \ 218 | .drop('sourcetype').drop('date_utc') \ 219 | .drop('value_vec').drop('aggdt').drop('indexed_parameter') 220 | firstelement=udf(lambda v:str(v[0]),StringType()) 221 | final_df = final_df.withColumn('value_str', firstelement('value_scaled')) 222 | final_df = final_df.withColumn("value",final_df.value_str.cast(DoubleType())).drop('value_str').drop('value_scaled') 223 | schema = StructType([ 224 | StructField("value", DoubleType(), True), 225 | StructField("ismobile", StringType(), True), 226 | StructField("year", StringType(), True), 227 | StructField("month", StringType(), True), 228 | StructField("quarter", StringType(), True), 229 | StructField("day", StringType(), True), 230 | StructField("isBadAir", StringType(), True), 231 | StructField("location", StringType(), True), 232 | StructField("city", StringType(), True), 233 | StructField("country", StringType(), True), 234 | StructField("sourcename", StringType(), True), 235 | StructField("sourcetype", StringType(), True), 236 | StructField("o3", StringType(), True), 237 | StructField("no2", StringType(), True), 238 | StructField("so2", StringType(), True), 239 | StructField("co", StringType(), True), 
240 | StructField("pm10", StringType(), True), 241 | StructField("pm25", StringType(), True), 242 | StructField("bc", StringType(), True), 243 | ]) 244 | final_df = final_df.rdd.map(extract).toDF(schema=schema) 245 | 246 | # Replace missing values 247 | final_df = final_df.na.fill("0") 248 | 249 | # Round the value 250 | final_df = final_df.withColumn("value", round_(final_df["value"], 4)) 251 | 252 | # Split sets 253 | logger.info("Splitting data set") 254 | (train_df, validation_df, test_df) = final_df.randomSplit([0.7, 0.2, 0.1]) 255 | 256 | # Drop value from test set 257 | test_df = test_df.drop('value') 258 | 259 | # Save to S3 260 | logger.info("Saving to S3") 261 | train_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 262 | args.s3_output_key_prefix, 'train/')) 263 | validation_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 264 | args.s3_output_key_prefix, 'validation/')) 265 | test_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 266 | args.s3_output_key_prefix, 'test/')) 267 | 268 | if __name__ == "__main__": 269 | main() 270 | -------------------------------------------------------------------------------- /Chapter06/Experiments/Experiments.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6a503428", 6 | "metadata": {}, 7 | "source": [ 8 | "### Tracking and organizing training and tuning jobs with Amazon SageMaker Experiments\n", 9 | "\n", 10 | "This notebook demonstrates using SageMaker Experiment capability to organize, track, compare, and evaluate your machine learning (ML) model training experiments.\n", 11 | "\n", 12 | "\n", 13 | "### Overview\n", 14 | "\n", 15 | "1. Set up\n", 16 | "2. Create a SageMaker Experiment\n", 17 | "3. Train XGBoost regression model as part of the Experiment\n", 18 | "4. Visualize results from the Experiment." 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "776e3d5b", 24 | "metadata": {}, 25 | "source": [ 26 | "### 1. 
Set up" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 18, 32 | "id": "ba1209de", 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "Requirement already satisfied: sagemaker-experiments in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (0.1.34)\n", 40 | "Requirement already satisfied: boto3>=1.16.27 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker-experiments) (1.17.99)\n", 41 | "Requirement already satisfied: s3transfer<0.5.0,>=0.4.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (0.4.2)\n", 42 | "Requirement already satisfied: botocore<1.21.0,>=1.20.99 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (1.20.99)\n", 43 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.16.27->sagemaker-experiments) (0.10.0)\n", 44 | "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (1.26.5)\n", 45 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (2.8.1)\n", 46 | "Requirement already satisfied: six>=1.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.21.0,>=1.20.99->boto3>=1.16.27->sagemaker-experiments) (1.15.0)\n", 47 | "\u001b[33mWARNING: You are using pip version 21.1.2; however, version 21.2.3 is available.\n", 48 | "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "#Install the sagemaker experiments SDK\n", 54 | "!pip install sagemaker-experiments" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "51f53efa", 60 | "metadata": {}, 61 | "source": [ 62 | "#### 1.1 Import libraries" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 19, 68 | "id": "cb5c2578", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "import time\n", 73 | "\n", 74 | "import boto3\n", 75 | "import numpy as np\n", 76 | "import pandas as pd\n", 77 | "from IPython.display import set_matplotlib_formats\n", 78 | "from matplotlib import pyplot as plt\n", 79 | "import datetime\n", 80 | "\n", 81 | "import sagemaker\n", 82 | "from sagemaker import get_execution_role\n", 83 | "from sagemaker.session import Session\n", 84 | "from sagemaker.analytics import ExperimentAnalytics\n", 85 | "from sagemaker.inputs import TrainingInput\n", 86 | "\n", 87 | "from smexperiments.experiment import Experiment\n", 88 | "from smexperiments.trial import Trial\n", 89 | "from smexperiments.trial_component import TrialComponent\n", 90 | "from smexperiments.tracker import Tracker\n", 91 | "\n", 92 | "region = 'us-west-2'\n", 93 | "\n", 94 | "set_matplotlib_formats('retina')" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 20, 100 | "id": "8f981167", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "sess = boto3.Session()\n", 105 | "sm = sess.client('sagemaker')\n", 106 | "role = get_execution_role()" 107 | ] 108 | }, 109 
| { 110 | "cell_type": "markdown", 111 | "id": "f97dd9a8", 112 | "metadata": {}, 113 | "source": [ 114 | "#### 1.2 S3 paths to training and validation data and output paths" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 21, 120 | "id": "ffc6e083", 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "# define the data type and paths to the training and validation datasets\n", 125 | "content_type = \"csv\"\n", 126 | "\n", 127 | "#s3_bucket = 'bestpractices-bucket-sm'\n", 128 | "#s3_prefix = 'prepared_parquet4'\n", 129 | "\n", 130 | "#Set the s3_bucket to the correct bucket name created in your datascience environment\n", 131 | "s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'\n", 132 | "s3_prefix = 'prepared'\n", 133 | "\n", 134 | "train_input = TrainingInput(\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')\n", 135 | "validation_input = TrainingInput(\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "d68d8e57", 141 | "metadata": {}, 142 | "source": [ 143 | "Now lets track the parameters from the training step. " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 22, 149 | "id": "908dfbc8", 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "with Tracker.create(display_name=\"Training\", sagemaker_boto_client=sm) as tracker:\n", 154 | " tracker.log_parameters({\"learning_rate\": 1.0, \"dropout\": 0.5})\n", 155 | " \n", 156 | " # we can log the location of the training dataset\n", 157 | " tracker.log_input(name=\"weather-training-dataset\", media_type=\"s3/uri\", value=\"s3://{}/{}/{}/\".format(s3_bucket, s3_prefix, 'train'))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "id": "bdef5991", 163 | "metadata": {}, 164 | "source": [ 165 | "### 2. Set up the Experiment" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "8cea3276", 171 | "metadata": {}, 172 | "source": [ 173 | "Create an experiment to track all the model training iterations. Use Experiments to organize your data science work." 
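For reference, the sagemaker-experiments SDK organizes work as Experiment → Trial → TrialComponent. The sketch below is illustrative rather than a cell from this notebook; the experiment and trial names are placeholders, and it assumes the `sm` client and the `tracker` from the cell above are in scope.

```python
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Reload an experiment created earlier (the name here is a placeholder).
existing = Experiment.load(experiment_name="weather-experiment-1234567890",
                           sagemaker_boto_client=sm)

# Attach the tracker's trial component to a trial so its logged parameters and
# inputs appear alongside the training runs for that trial.
trial = Trial.create(trial_name="example-trial",
                     experiment_name=existing.experiment_name,
                     sagemaker_boto_client=sm)
trial.add_trial_component(tracker.trial_component)
```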
174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "id": "1901f35b", 179 | "metadata": {}, 180 | "source": [ 181 | "#### 2.1 Create an Experiment" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 23, 187 | "id": "37e37739", 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "Experiment(sagemaker_boto_client=,experiment_name='weather-experiment-1628392649',description='Prediction of weather quality',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment/weather-experiment-1628392649',response_metadata={'RequestId': '3ae94205-d193-460a-acab-4a89d547bc2e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3ae94205-d193-460a-acab-4a89d547bc2e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '101', 'date': 'Sun, 08 Aug 2021 03:17:29 GMT'}, 'RetryAttempts': 0})\n" 195 | ] 196 | } 197 | ], 198 | "source": [ 199 | "weather_experiment = Experiment.create(\n", 200 | " experiment_name=f\"weather-experiment-{int(time.time())}\",\n", 201 | " description=\"Prediction of weather quality\",\n", 202 | " sagemaker_boto_client=sm)\n", 203 | "print(weather_experiment)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "3d1b785d", 209 | "metadata": {}, 210 | "source": [ 211 | "#### 2.2 Track Experiment" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "7aeb9892", 217 | "metadata": {}, 218 | "source": [ 219 | "Now create a Trial for each training run to track the it's inputs, parameters, and metrics." 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "fdd69d6e", 225 | "metadata": {}, 226 | "source": [ 227 | "While training the XGBoost model on SageMaker, we will experiment with several values for the number of hidden channel in the model. We will create a Trial to track each training job run. We will also create a TrialComponent from the tracker we created before, and add to the Trial.\n", 228 | "\n", 229 | "Note the execution of the following code takes a while." 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 24, 235 | "id": "e547917c", 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "##Keep track of the trails\n", 240 | "max_depth_trial_name_map = {}\n", 241 | "##Keep track of the training jobs launched to check if they are complete before analyzing the experiment results.\n", 242 | "training_jobs =[]\n" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "91a9d011", 248 | "metadata": {}, 249 | "source": [ 250 | "### 3. Train XGBoost regression model as part of the Experiment" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 25, 256 | "id": "a5af3e86", 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stderr", 261 | "output_type": "stream", 262 | "text": [ 263 | "INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.\n", 264 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 265 | "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.\n", 266 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 267 | "INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392650\n", 268 | "INFO:sagemaker.image_uris:Same images used for training and inference. 
Defaulting to image scope: inference.\n", 269 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 270 | "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.\n", 271 | "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", 272 | "INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392652\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "training_instance_type='ml.m5.12xlarge'\n", 278 | "#Explore two different values for the max_depth hyerparameter for XGBoost model\n", 279 | "for i, max_depth in enumerate([2, 5]):\n", 280 | " # create trial\n", 281 | " trial_name = f\"xgboost-training-job-trial-{max_depth}-max-depth-{int(time.time())}\"\n", 282 | " xgboost_trial = Trial.create(\n", 283 | " trial_name=trial_name, \n", 284 | " experiment_name=weather_experiment.experiment_name,\n", 285 | " sagemaker_boto_client=sm,\n", 286 | " )\n", 287 | " max_depth_trial_name_map[max_depth] = trial_name\n", 288 | " \n", 289 | " # initialize hyperparameters\n", 290 | " hyperparameters = {\n", 291 | " \"max_depth\": max_depth,\n", 292 | " \"eta\":\"0.2\",\n", 293 | " \"gamma\":\"4\",\n", 294 | " \"min_child_weight\":\"6\",\n", 295 | " \"subsample\":\"0.7\",\n", 296 | " \"objective\":\"reg:squarederror\",\n", 297 | " \"num_round\":\"5\"}\n", 298 | "\n", 299 | " #set an output path where the trained model will be saved\n", 300 | " output_prefix = 'weather-experiments'\n", 301 | " output_path = 's3://{}/{}/{}/output'.format(s3_bucket, output_prefix, 'xgboost')\n", 302 | "\n", 303 | " # This line automatically looks for the XGBoost image URI and builds an XGBoost container.\n", 304 | " # specify the repo_version depending on your preference.\n", 305 | " xgboost_container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.2-1\")\n", 306 | "\n", 307 | " # construct a SageMaker estimator that calls the xgboost-container\n", 308 | " estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, \n", 309 | " hyperparameters=hyperparameters,\n", 310 | " role=sagemaker.get_execution_role(),\n", 311 | " instance_count=1, \n", 312 | " instance_type=training_instance_type, \n", 313 | " volume_size=200, # 5 GB \n", 314 | " output_path=output_path)\n", 315 | "\n", 316 | " xgboost_training_job_name = \"xgboost-training-job-{}\".format(int(time.time()))\n", 317 | " \n", 318 | " training_jobs.append(xgboost_training_job_name)\n", 319 | " \n", 320 | " # Now associate the estimator with the Experiment and Trial\n", 321 | " estimator.fit(\n", 322 | " inputs={'train': train_input}, \n", 323 | " job_name=xgboost_training_job_name,\n", 324 | " experiment_config={\n", 325 | " \"TrialName\": xgboost_trial.trial_name,\n", 326 | " \"TrialComponentDisplayName\": \"Training\"\n", 327 | " },\n", 328 | " wait=False, #Don't wait for the training job to be completed\n", 329 | " )\n", 330 | " \n", 331 | " # Wait before launching the next training job\n", 332 | " time.sleep(2)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 26, 338 | "id": "1e192d08", 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "{2: 'xgboost-training-job-trial-2-max-depth-1628392649',\n", 345 | " 5: 'xgboost-training-job-trial-5-max-depth-1628392652'}" 346 | ] 347 | }, 348 | "execution_count": 26, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "max_depth_trial_name_map" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | 
"execution_count": 27, 360 | "id": "7eda9ed7", 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "TrialSummary(trial_name='xgboost-training-job-trial-5-max-depth-1628392652',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-5-max-depth-1628392652',display_name='xgboost-training-job-trial-5-max-depth-1628392652',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()))\n", 368 | "TrialSummary(trial_name='xgboost-training-job-trial-2-max-depth-1628392649',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-2-max-depth-1628392649',display_name='xgboost-training-job-trial-2-max-depth-1628392649',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()))\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "##Quick check of the trails of the experiment\n", 374 | "trails = weather_experiment.list_trials()\n", 375 | "type(trails)\n", 376 | "for trial in trails:\n", 377 | " print(trial)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "id": "fea9947d", 384 | "metadata": {}, 385 | "outputs": [ 386 | { 387 | "name": "stdout", 388 | "output_type": "stream", 389 | "text": [ 390 | "Training job name: xgboost-training-job-1628392650\n", 391 | "Status : InProgress\n", 392 | "Status InProgress\n", 393 | "Status InProgress\n", 394 | "Status InProgress\n", 395 | "Status InProgress\n", 396 | "Status InProgress\n" 397 | ] 398 | } 399 | ], 400 | "source": [ 401 | "##Wait till the training jobs are complete.\n", 402 | "for training_job in training_jobs:\n", 403 | " print(\"Training job name: \" + training_job)\n", 404 | " description = sm.describe_training_job(TrainingJobName=training_job)\n", 405 | " print(\"Status : \" + description[\"TrainingJobStatus\"])\n", 406 | " \n", 407 | " while description[\"TrainingJobStatus\"] != \"Completed\" and description[\"TrainingJobStatus\"] != \"Failed\":\n", 408 | " description = sm.describe_training_job(TrainingJobName=training_job)\n", 409 | " primary_status = description[\"TrainingJobStatus\"]\n", 410 | " print(\"Status {}\".format(primary_status))\n", 411 | " time.sleep(15)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "id": "1f43971e", 417 | "metadata": {}, 418 | "source": [ 419 | "### 4. Visualize results from the Experiment.\n", 420 | "Compare the model training runs of an experiment using the analytics capabilities of Python SDK to query and compare the training runs for identifying the best model produced by our experiment. You can retrieve trial components by using a search expression." 
421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "id": "53d4348e", 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "experiment_name = weather_experiment.experiment_name\n", 431 | "experiment_name" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "id": "d2d57858", 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "from sagemaker.analytics import ExperimentAnalytics\n", 442 | "sess = boto3.Session()\n", 443 | "sm = sess.client(\"sagemaker\")\n", 444 | "sagemaker_session = Session(sess)\n", 445 | "\n", 446 | "trial_component_analytics = ExperimentAnalytics(\n", 447 | " sagemaker_session=sagemaker_session, experiment_name=experiment_name\n", 448 | ")\n", 449 | "trial_comp_ds_jobs = trial_component_analytics.dataframe()\n", 450 | "trial_comp_ds_jobs" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "id": "26b7d585", 456 | "metadata": {}, 457 | "source": [ 458 | "Results show the RMSE metrics for the various hyperparameters tried as part of the Experiment" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "id": "1d36c6a5", 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [] 468 | } 469 | ], 470 | "metadata": { 471 | "kernelspec": { 472 | "display_name": "conda_python3", 473 | "language": "python", 474 | "name": "conda_python3" 475 | }, 476 | "language_info": { 477 | "codemirror_mode": { 478 | "name": "ipython", 479 | "version": 3 480 | }, 481 | "file_extension": ".py", 482 | "mimetype": "text/x-python", 483 | "name": "python", 484 | "nbconvert_exporter": "python", 485 | "pygments_lexer": "ipython3", 486 | "version": "3.6.13" 487 | } 488 | }, 489 | "nbformat": 4, 490 | "nbformat_minor": 5 491 | } 492 | -------------------------------------------------------------------------------- /Chapter06/code/csv_loader.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.INFO) 12 | logger.addHandler(logging.StreamHandler(sys.stdout)) 13 | 14 | class CsvDataset(Dataset): 15 | 16 | def __init__(self, csv_path): 17 | 18 | self.csv_path = csv_path 19 | if os.path.isfile(csv_path): 20 | self.folder = False 21 | self.count, fmap, self.line_offset = self.get_line_count(csv_path) 22 | logger.debug(f"For {csv_path}, count = {self.count}") 23 | else: 24 | self.folder = True 25 | self.count, fmap, self.line_offset = self.get_folder_line_count(csv_path) 26 | 27 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 28 | 29 | def get_folder_line_count(self, d): 30 | cnt = 0 31 | all_map = {} 32 | all_lc = {} 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, _, line_offset = self.get_line_count(f) 35 | cnt = cnt + fcnt 36 | all_map[cnt] = f 37 | all_lc.update(line_offset) 38 | return cnt, all_map, all_lc 39 | 40 | def get_line_count(self, f): 41 | with open(f) as F: 42 | line_offset = [] 43 | offset = 0 44 | count = 0 45 | for line in F: 46 | line_offset.append(offset) 47 | offset += len(line) 48 | count = count + 1 49 | 50 | return count, {count: f}, {f: line_offset} 51 | 52 | def __len__(self): 53 | return self.count 54 | 55 | def __getitem__(self, idx): 56 | 57 | if torch.is_tensor(idx): 58 | idx = idx.tolist() 59 | logger.debug(f"Indices: {idx}") 60 | 61 | # 
This gives us the index in the line counts greater than or equal to the desired index. 62 | # The map value for this line count is the file name containing that row. 63 | klist = list(self.fmap.keys()) 64 | idx_m = bisect.bisect_left(klist, idx+1) 65 | 66 | # Grab the ending count of thisl file 67 | cur_idx = klist[idx_m] 68 | 69 | # grab the ending count of the previous file 70 | if idx_m > 0: 71 | prev_idx = klist[idx_m-1] 72 | else: 73 | prev_idx = 0 74 | 75 | # grab the file name for the desired row count 76 | fname = self.fmap[cur_idx] 77 | 78 | loff = self.line_offset[fname] 79 | with open(fname) as F: 80 | F.seek(loff[idx - prev_idx]) 81 | idx_line = F.readline() 82 | 83 | idx_parts = idx_line.split(',') 84 | 85 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) -------------------------------------------------------------------------------- /Chapter06/code/csv_loader_pd.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.DEBUG) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | def _make_gen(reader): 19 | b = reader(1024 * 1024) 20 | while b: 21 | yield b 22 | b = reader(1024*1024) 23 | 24 | def rawgencount(filename): 25 | f = open(filename, 'rb') 26 | f_gen = _make_gen(f.raw.read) 27 | return sum( buf.count(b'\n') for buf in f_gen ) 28 | 29 | 30 | class CsvDatasetPd(Dataset): 31 | 32 | def __init__(self, csv_path, max_files=5): 33 | 34 | self.csv_path = csv_path 35 | if os.path.isfile(csv_path): 36 | self.folder = False 37 | self.count, fmap = self.get_line_count(csv_path) 38 | logger.debug(f"For {csv_path}, count = {self.count}") 39 | else: 40 | self.folder = True 41 | self.count, fmap = self.get_folder_line_count(csv_path, max_files) 42 | 43 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 44 | self.max_files = max_files 45 | 46 | def get_folder_line_count(self, d, max_files): 47 | cnt = 0 48 | all_map = {} 49 | file_cnt = 0 50 | for f in glob.glob(os.path.join(d, '*.csv')): 51 | fcnt, _ = self.get_line_count(f) 52 | cnt = cnt + fcnt 53 | all_map[cnt] = f 54 | file_cnt = file_cnt + 1 55 | if file_cnt > max_files: 56 | break 57 | 58 | return cnt, all_map 59 | 60 | def get_line_count(self, f): 61 | count = rawgencount(f) 62 | return count, {count: f} 63 | 64 | def __len__(self): 65 | return self.count 66 | 67 | def __getitem__(self, idx): 68 | 69 | if torch.is_tensor(idx): 70 | idx = idx.tolist() 71 | logger.debug(f"Indices: {idx}") 72 | 73 | # This gives us the index in the line counts greater than or equal to the desired index. 74 | # The map value for this line count is the file name containing that row. 
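        # Worked example (hypothetical numbers): if self.fmap == {100: 'a.csv', 250: 'b.csv'}
        # (cumulative row counts), then for idx == 170, bisect_left([100, 250], 171) returns 1,
        # so the row lives in 'b.csv'; prev_idx == 100 and the local row offset is 170 - 100 = 70.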
75 | klist = list(self.fmap.keys()) 76 | idx_m = bisect.bisect_left(klist, idx+1) 77 | 78 | # Grab the ending count of thisl file 79 | cur_idx = klist[idx_m] 80 | 81 | # grab the ending count of the previous file 82 | if idx_m > 0: 83 | prev_idx = klist[idx_m-1] 84 | else: 85 | prev_idx = 0 86 | 87 | # grab the file name for the desired row count 88 | fname = self.fmap[cur_idx] 89 | 90 | #with open(fname) as F: 91 | # lines = list(islice(F, idx-prev_idx, idx-prev_idx+1)) 92 | idx_line = getline(fname, idx - prev_idx +1) 93 | 94 | idx_parts = idx_line.split(',') 95 | 96 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) 97 | 98 | -------------------------------------------------------------------------------- /Chapter06/code/csv_loader_simple.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.INFO) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | class CsvDatasetSimple(Dataset): 19 | 20 | def __init__(self, csv_path, max_files=5): 21 | 22 | self.csv_path = csv_path 23 | if os.path.isfile(csv_path): 24 | self.count, self.tensors = self.get_line_data(csv_path) 25 | logger.debug(f"For {csv_path}, count = {self.count}") 26 | else: 27 | self.count, self.tensors = self.get_folder_line_data(csv_path, max_files) 28 | 29 | def get_folder_line_data(self, d, max_files): 30 | cnt = 0 31 | file_cnt = 0 32 | tensors = [] 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, ftensors = self.get_line_data(f) 35 | cnt = cnt + fcnt 36 | tensors = tensors + ftensors 37 | file_cnt = file_cnt + 1 38 | if file_cnt > max_files: 39 | break 40 | 41 | return cnt, tensors 42 | 43 | def get_line_data(self, f): 44 | cnt = 0 45 | tensors = [] 46 | with open(f) as F: 47 | f_lines = F.readlines() 48 | for l in f_lines: 49 | cnt = cnt + 1 50 | parts = l.split(',') 51 | tensors.append(tuple([torch.tensor( [float(f) for f in parts[1:]] ), torch.tensor(float(parts[0]))])) 52 | 53 | return cnt, tensors 54 | 55 | def __len__(self): 56 | return self.count 57 | 58 | def __getitem__(self, idx): 59 | 60 | if torch.is_tensor(idx): 61 | idx = idx.tolist() 62 | logger.debug(f"Indices: {idx}") 63 | 64 | return self.tensors[idx] 65 | 66 | -------------------------------------------------------------------------------- /Chapter06/code/model_pytorch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import logging 5 | import sys 6 | 7 | logger = logging.getLogger(__name__) 8 | logger.setLevel(logging.INFO) 9 | logger.addHandler(logging.StreamHandler(sys.stdout)) 10 | 11 | class TabularNet(nn.Module): 12 | 13 | def __init__(self, n_cont, n_cat, emb_sz = 100, dropout_p = 0.1, layers=[200,100], cat_mask=[], cat_dim=[], y_min = 0., y_max = 1.): 14 | 15 | super(TabularNet, self).__init__() 16 | 17 | self.cat_mask = cat_mask 18 | self.cat_dim = cat_dim 19 | self.y_min = y_min 20 | self.y_max = y_max 21 | self.n_cat = n_cat 22 | self.n_cont = n_cont 23 | 24 | emb_dim = [] 25 | for ii in range(len(cat_mask)): 26 | if cat_mask[ii]: 27 | c_dim = cat_dim[ii] 28 | 
emb_dim.append(c_dim) 29 | #emb = nn.Embedding(c_dim, emb_sz) 30 | #self.embeddings.append(emb) 31 | #setattr(self, 'emb_{}'.format(ii), emb) 32 | 33 | self.embeddings = nn.ModuleList([nn.Embedding(c_dim, emb_sz) for c_dim in emb_dim]) 34 | 35 | modules = [] 36 | prev_size = n_cont + n_cat * emb_sz 37 | for l in layers: 38 | modules.append(nn.Linear(prev_size, l)) 39 | modules.append(nn.BatchNorm1d(l)) 40 | modules.append(nn.Dropout(dropout_p)) 41 | modules.append(nn.ReLU(inplace=True)) 42 | prev_size = l 43 | modules.append(nn.BatchNorm1d(prev_size)) 44 | modules.append(nn.Dropout(dropout_p)) 45 | modules.append(nn.Linear(prev_size, 1)) 46 | modules.append(nn.Sigmoid()) 47 | 48 | self.m_seq = nn.Sequential(*modules) 49 | self.emb_drop = nn.Dropout(dropout_p) 50 | self.bn_cont = nn.BatchNorm1d(n_cont) 51 | 52 | def forward(self, x_in): 53 | 54 | logger.debug(f"Forward pass on {x_in.shape}") 55 | x = None 56 | ee = 0 57 | for ii in range(len(self.cat_mask)): 58 | 59 | if self.cat_mask[ii]: 60 | logger.debug(f"Embedding: {self.embeddings[ee]}, input: {x_in[:,ii]}") 61 | logger.debug(f"cat Device for x_in: {x_in.get_device()}") 62 | logger.debug(f"cat Device for x_in slice: {x_in[:,ii].get_device()}") 63 | logger.debug(f"cat Device for embed: {next(self.embeddings[ee].parameters()).get_device()}") 64 | x_e = self.embeddings[ee](x_in[:,ii].to(device = x_in.get_device(), dtype= torch.long)) 65 | logger.debug(f"cat Device for x_e: {x_e.get_device()}") 66 | logger.debug(f"cat x_e = {x_e.shape}") 67 | if x is None: 68 | x = x_e 69 | else: 70 | x = torch.cat([x, x_e], 1) 71 | logger.debug(f"cat Device for x: {x.get_device()}") 72 | x = self.emb_drop(x) 73 | logger.debug(f"cat Device for x: {x.get_device()}") 74 | logger.debug(f"cat x = {x.shape}") 75 | ee = ee + 1 76 | else: 77 | logger.debug(f"cont Device for x_in: {x_in.get_device()}") 78 | x_cont = x_in[:, ii] # self.bn_cont(x_in[:, ii]) 79 | logger.debug(f"cont Device for x_cont: {x_cont.get_device()}") 80 | logger.debug(f"cont x_cont = {x_cont.shape}") 81 | if x is None: 82 | x = torch.unsqueeze(x_cont, 1) 83 | else: 84 | x = torch.cat([x, torch.unsqueeze(x_cont, 1)], 1) 85 | logger.debug(f"cont Device for x: {x.get_device()}") 86 | logger.debug(f"cont x = {x.shape}") 87 | 88 | return self.m_seq(x) * (self.y_max - self.y_min) + self.y_min -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch-dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 14 | 15 | 16 | logger = logging.getLogger(__name__) 17 | logger.setLevel(logging.INFO) 18 | logger.addHandler(logging.StreamHandler(sys.stdout)) 19 | 20 | class CUDANotFoundException(Exception): 21 | pass 22 | 23 | 24 | dist.init_process_group() 25 | 26 | 27 | def parse_args(): 28 | """ 29 | Parse arguments passed from the SageMaker API 30 | to the container 31 | """ 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | # Hyperparameters sent by the client are passed as command-line arguments to the script 36 | parser.add_argument("--epochs", type=int, default=1) 37 | 
parser.add_argument("--batch_size", type=int, default=64) 38 | parser.add_argument("--learning_rate", type=float, default=0.01) 39 | 40 | # Data directories 41 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 42 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 43 | 44 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 45 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 46 | 47 | return parser.parse_known_args() 48 | 49 | def train(model, device): 50 | """ 51 | Train the PyTorch model 52 | """ 53 | 54 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 55 | train_ds = CsvDatasetSimple(args.train) 56 | test_ds = CsvDatasetSimple(args.test) 57 | 58 | batch_size = args.batch_size 59 | epochs = args.epochs 60 | learning_rate = args.learning_rate 61 | logger.info( 62 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 63 | ) 64 | 65 | train_sampler = torch.utils.data.distributed.DistributedSampler( 66 | train_ds, num_replicas=args.world_size, rank=args.rank 67 | ) 68 | 69 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, drop_last=True, sampler=train_sampler) 70 | 71 | model = TabularNet(n_cont=9, n_cat=9, 72 | cat_mask = cat_mask, 73 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 74 | y_min = 0., y_max = 1., device=device) 75 | logger.debug(model) 76 | model = DDP(model).to(device) 77 | torch.cuda.set_device(args.local_rank) 78 | model.cuda(args.local_rank) 79 | 80 | criterion = nn.MSELoss() 81 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 82 | 83 | model.train() 84 | for epoch in range(epochs): 85 | batch_no = 0 86 | for x_train_batch, y_train_batch in train_dl: 87 | logger.debug(f"Training on shape {x_train_batch.shape}") 88 | y = model(x_train_batch.float()) 89 | loss = criterion(y.flatten(), y_train_batch.float().to(device)) 90 | if batch_no % 50 == 0: 91 | logger.info(f"batch {batch_no} -> loss: {loss}") 92 | optimizer.zero_grad() 93 | loss.backward() 94 | optimizer.step() 95 | batch_no +=1 96 | epoch += 1 97 | logger.info(f"epoch: {epoch} -> loss: {loss}") 98 | 99 | # evalutate on test set 100 | if args.rank == 0: 101 | model.eval() 102 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 103 | with torch.no_grad(): 104 | mse = 0. 105 | for x_test_batch, y_test_batch in test_dl: 106 | y = model(x_test_batch.float()) 107 | mse = mse + ((y - y_test_batch.to(device)) ** 2).sum() / x_test_batch.shape[0] 108 | 109 | mse = mse / len(test_dl.dataset) 110 | logger.info(f"Test MSE: {mse}") 111 | 112 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 113 | # PyTorch requires that the inference script must 114 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 
115 | inference_code_path = args.model_dir + "/code/" 116 | 117 | if not os.path.exists(inference_code_path): 118 | os.mkdir(inference_code_path) 119 | logger.info("Created a folder at {}!".format(inference_code_path)) 120 | 121 | shutil.copy("train_pytorch.py", inference_code_path) 122 | shutil.copy("model_pytorch.py", inference_code_path) 123 | shutil.copy("csv_loader.py", inference_code_path) 124 | logger.info("Saving models files to {}".format(inference_code_path)) 125 | 126 | 127 | if __name__ == "__main__": 128 | 129 | args, _ = parse_args() 130 | args.world_size = dist.get_world_size() 131 | args.rank = rank = dist.get_rank() 132 | args.local_rank = local_rank = dist.get_local_rank() 133 | args.batch_size //= args.world_size // 8 134 | args.batch_size = max(args.batch_size, 1) 135 | 136 | if not torch.cuda.is_available(): 137 | raise CUDANotFoundException( 138 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 139 | ) 140 | 141 | torch.manual_seed(1) 142 | 143 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 144 | 145 | train() 146 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.INFO) 17 | logger.addHandler(logging.StreamHandler(sys.stdout)) 18 | 19 | 20 | def parse_args(): 21 | """ 22 | Parse arguments passed from the SageMaker API 23 | to the container 24 | """ 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | # Hyperparameters sent by the client are passed as command-line arguments to the script 29 | parser.add_argument("--epochs", type=int, default=1) 30 | parser.add_argument("--batch_size", type=int, default=64) 31 | parser.add_argument("--learning_rate", type=float, default=0.01) 32 | 33 | # Data directories 34 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 35 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 36 | 37 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 38 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 39 | 40 | return parser.parse_known_args() 41 | 42 | def train(): 43 | """ 44 | Train the PyTorch model 45 | """ 46 | 47 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 48 | train_ds = CsvDatasetSimple(args.train) 49 | test_ds = CsvDatasetSimple(args.test) 50 | 51 | batch_size = args.batch_size 52 | epochs = args.epochs 53 | learning_rate = args.learning_rate 54 | logger.info( 55 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 56 | ) 57 | 58 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 59 | num_workers=0, 60 | pin_memory=True, 61 | drop_last=True) 62 | 63 | model = TabularNet(n_cont=9, n_cat=9, 64 | cat_mask = cat_mask, 65 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 66 | y_min = 0., y_max = 1.) 
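    # TabularNet arguments (see model_pytorch.py): cat_mask flags which of the 18 input
    # columns are categorical; cat_dim gives the embedding vocabulary size for each
    # categorical column (0 for continuous positions); n_cont/n_cat are the counts of
    # continuous and categorical columns; y_min/y_max rescale the sigmoid output range.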
67 | logger.debug(model) 68 | model = model.to(device) 69 | criterion = nn.MSELoss() 70 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 71 | 72 | model.train() 73 | for epoch in range(epochs): 74 | batch_no = 0 75 | for x_train_batch, y_train_batch in train_dl: 76 | x_train_batch_d = x_train_batch.to(device) 77 | y_train_batch_d = y_train_batch.to(device) 78 | optimizer.zero_grad() 79 | 80 | logger.debug(f"Training on shape {x_train_batch.shape}") 81 | y = model(x_train_batch_d.float()) 82 | loss = criterion(y.flatten(), y_train_batch_d.float()) 83 | if batch_no % 50 == 0: 84 | logger.info(f"batch {batch_no} -> loss: {loss}") 85 | 86 | loss.backward() 87 | optimizer.step() 88 | batch_no +=1 89 | epoch += 1 90 | logger.info(f"epoch: {epoch} -> loss: {loss}") 91 | 92 | # evalutate on test set 93 | model.eval() 94 | test_dl = DataLoader(test_ds, batch_size, shuffle=False, drop_last=True) 95 | with torch.no_grad(): 96 | mse = 0. 97 | for x_test_batch, y_test_batch in test_dl: 98 | x_test_batch_d = x_test_batch.to(device) 99 | y_test_batch_d = y_test_batch.to(device) 100 | 101 | y = model(x_test_batch_d.float()) 102 | mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 103 | 104 | mse = mse / len(test_dl.dataset) 105 | logger.info(f"Test MSE: {mse}") 106 | 107 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 108 | # PyTorch requires that the inference script must 109 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 110 | inference_code_path = args.model_dir + "/code/" 111 | 112 | if not os.path.exists(inference_code_path): 113 | os.mkdir(inference_code_path) 114 | logger.info("Created a folder at {}!".format(inference_code_path)) 115 | 116 | #shutil.copy("train_pytorch.py", inference_code_path) 117 | #shutil.copy("model_pytorch.py", inference_code_path) 118 | #shutil.copy("csv_loader.py", inference_code_path) 119 | logger.info("Saving models files to {}".format(inference_code_path)) 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | args, _ = parse_args() 125 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 126 | 127 | train() 128 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | import torch.nn.functional as F 12 | 13 | from model_pytorch import TabularNet 14 | from csv_loader_simple import CsvDatasetSimple 15 | 16 | ##Necessary to use SM distributed libraries 17 | import smdistributed.dataparallel.torch.distributed as dist 18 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 19 | 20 | 21 | logger = logging.getLogger(__name__) 22 | logger.setLevel(logging.INFO) 23 | logger.addHandler(logging.StreamHandler(sys.stdout)) 24 | 25 | class CUDANotFoundException(Exception): 26 | pass 27 | 28 | ##Initialize the distributed library 29 | dist.init_process_group() 30 | 31 | 32 | def parse_args(): 33 | """ 34 | Parse arguments passed from the SageMaker API 35 | to the container 36 | """ 37 | 38 | parser = argparse.ArgumentParser() 39 | 40 | # Hyperparameters sent by the client are passed as command-line arguments to the script 41 | parser.add_argument("--epochs", type=int, 
default=1) 42 | parser.add_argument("--batch_size", type=int, default=64) 43 | parser.add_argument("--learning_rate", type=float, default=0.01) 44 | 45 | # Data directories 46 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 47 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 48 | 49 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 50 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 51 | 52 | parser.add_argument( 53 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 54 | ) 55 | 56 | return parser.parse_known_args() 57 | 58 | def train(args): 59 | """ 60 | Train the PyTorch model 61 | """ 62 | 63 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 64 | train_ds = CsvDatasetSimple(args.train) 65 | test_ds = CsvDatasetSimple(args.test) 66 | 67 | batch_size = args.batch_size 68 | epochs = args.epochs 69 | learning_rate = args.learning_rate 70 | logger.info( 71 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 72 | ) 73 | logger.info(f"World size: {args.world_size}") 74 | logger.info(f"Rank: {args.rank}") 75 | logger.info(f"Local Rank: {args.local_rank}") 76 | 77 | train_sampler = torch.utils.data.distributed.DistributedSampler( 78 | train_ds, num_replicas=args.world_size, rank=args.rank 79 | ) 80 | logger.debug(f"Created distributed sampler") 81 | 82 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 83 | num_workers=0, 84 | pin_memory=True, 85 | drop_last=True, sampler=train_sampler) 86 | logger.debug(f"Created train data loader") 87 | 88 | model = TabularNet(n_cont=9, n_cat=9, 89 | cat_mask = cat_mask, 90 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 91 | y_min = 0., y_max = 1.) 
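    # The wrapping below follows the smdistributed data-parallel examples: the model is
    # moved to the GPU and wrapped in DDP; broadcast_buffers=False skips re-broadcasting
    # buffers (e.g. BatchNorm running statistics) from rank 0 on every forward pass, and
    # torch.cuda.set_device(args.local_rank) pins this process to its own GPU.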
92 | logger.debug("Created model") 93 | logger.debug(model) 94 | model = DDP(model.to(device), broadcast_buffers=False) 95 | logger.debug("created DDP") 96 | torch.cuda.set_device(args.local_rank) 97 | logger.debug("Set device on CUDA") 98 | model.cuda(args.local_rank) 99 | logger.debug(f"Set model CUDA to {args.local_rank}") 100 | 101 | criterion = nn.MSELoss() 102 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 103 | logger.debug("Created loss fn and optimizer") 104 | 105 | model.train() 106 | for epoch in range(epochs): 107 | batch_no = 0 108 | for x_train_batch, y_train_batch in train_dl: 109 | logger.debug(f"Working on batch {batch_no}") 110 | x_train_batch_d = x_train_batch.to(device) 111 | y_train_batch_d = y_train_batch.to(device) 112 | 113 | optimizer.zero_grad() 114 | logger.debug("Did optimizer grad") 115 | 116 | logger.debug(f"Training on shape {x_train_batch.shape}") 117 | y = model(x_train_batch_d.float()) 118 | logger.debug("Did forward pass") 119 | loss = criterion(y.flatten(), y_train_batch_d.float()) 120 | logger.debug("Got loss") 121 | if batch_no % 50 == 0: 122 | logger.info(f"batch {batch_no} -> loss: {loss}") 123 | 124 | loss.backward() 125 | logger.debug("Did backward step") 126 | optimizer.step() 127 | logger.debug(f"Did optimizer step, batch {batch_no}") 128 | batch_no +=1 129 | epoch += 1 130 | logger.info(f"epoch: {epoch} -> loss: {loss}") 131 | 132 | # evalutate on test set 133 | if args.rank == 0: 134 | logger.info(f"Starting test eval on rank 0") 135 | model.eval() 136 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 137 | logger.info(f"Loaded test data set") 138 | with torch.no_grad(): 139 | mse = 0. 140 | batch_no = 0 141 | for x_test_batch, y_test_batch in test_dl: 142 | x_test_batch_d = x_test_batch.to(device) 143 | y_test_batch_d = y_test_batch.to(device) 144 | 145 | y = model(x_test_batch_d.float()) 146 | mse += F.mse_loss(y, y_test_batch_d, reduction="sum") 147 | #mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 148 | 149 | if batch_no % 50 == 0: 150 | logger.info(f"batch {batch_no} -> MSE: {mse}") 151 | batch_no +=1 152 | 153 | mse = mse / len(test_dl.dataset) 154 | logger.info(f"Test MSE: {mse}") 155 | 156 | if args.rank == 0: 157 | logger.info("Saving model on rank 0") 158 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 159 | 160 | 161 | if __name__ == "__main__": 162 | 163 | args, _ = parse_args() 164 | args.world_size = dist.get_world_size() 165 | args.rank = rank = dist.get_rank() 166 | args.local_rank = local_rank = dist.get_local_rank() 167 | args.batch_size //= args.world_size // 8 168 | args.batch_size = max(args.batch_size, 1) 169 | 170 | if not torch.cuda.is_available(): 171 | raise CUDANotFoundException( 172 | "Must run smdistributed.dataparallel on CUDA-capable devices." 
173 | ) 174 | 175 | torch.manual_seed(1) 176 | 177 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 178 | 179 | train(args) 180 | -------------------------------------------------------------------------------- /Chapter06/code/train_pytorch_model_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | import torch.nn.functional as F 11 | import torch.optim as optim 12 | #from torchnet.dataset import SplitDataset 13 | from torchvision import datasets 14 | 15 | from torch.utils.data import DataLoader, TensorDataset 16 | from model_pytorch import TabularNet 17 | from csv_loader_simple import CsvDatasetSimple 18 | 19 | # SMP: Import and initialize SMP API. 20 | import smdistributed.modelparallel.torch as smp 21 | 22 | 23 | logger = logging.getLogger(__name__) 24 | logger.setLevel(logging.INFO) 25 | logger.addHandler(logging.StreamHandler(sys.stdout)) 26 | 27 | smp.init() 28 | 29 | 30 | def parse_args(): 31 | """ 32 | Parse arguments passed from the SageMaker API 33 | to the container 34 | """ 35 | 36 | parser = argparse.ArgumentParser() 37 | 38 | # Hyperparameters sent by the client are passed as command-line arguments to the script 39 | parser.add_argument("--epochs", type=int, default=1) 40 | parser.add_argument("--batch_size", type=int, default=64) 41 | parser.add_argument("--learning_rate", type=float, default=0.01) 42 | 43 | # Data directories 44 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 45 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 46 | 47 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 48 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 49 | 50 | return parser.parse_known_args() 51 | 52 | 53 | # smdistributed: Define smp.step. Return any tensors needed outside. 54 | @smp.step 55 | def train_step(model, data, target): 56 | #logger.info("**** TRAIN_STEP method target is {} ".format(target)) 57 | #print("target is : ", target) 58 | output = model(data) 59 | long_target = target.long() 60 | #loss = F.nll_loss(output, target, reduction="mean") 61 | loss = F.nll_loss(output, long_target, reduction="mean") 62 | model.backward(loss) 63 | return output, loss 64 | 65 | def train(model, device, train_loader, optimizer): 66 | print("Into init_train") 67 | model.train() 68 | for batch_idx, (data, target) in enumerate(train_loader): 69 | #logger.info("**** TRAIN method target is {}".format(target)) 70 | # smdistributed: Move input tensors to the GPU ID used by the current process, 71 | # based on the set_device call. 72 | data, target = data.to(device), target.to(device) 73 | optimizer.zero_grad() 74 | # Return value, loss_mb is a StepOutput object 75 | _, loss_mb = train_step(model, data, target) 76 | 77 | # smdistributed: Average the loss across microbatches. 
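        # Because train_step is decorated with @smp.step, the model-parallel runtime
        # splits each batch into microbatches and pipelines them across the model
        # partitions; its return values are StepOutput objects rather than plain
        # tensors. reduce_mean() below collapses the per-microbatch losses into a
        # single averaged loss tensor (the backward pass itself already ran inside
        # train_step via model.backward(loss)).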
78 | loss = loss_mb.reduce_mean() 79 | 80 | optimizer.step() 81 | 82 | 83 | def init_train(): 84 | """ 85 | Train the PyTorch model 86 | """ 87 | 88 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 89 | train_ds = CsvDatasetSimple(args.train) 90 | test_ds = CsvDatasetSimple(args.test) 91 | 92 | batch_size = args.batch_size 93 | epochs = args.epochs 94 | learning_rate = args.learning_rate 95 | 96 | logger.info( 97 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 98 | ) 99 | 100 | # smdistributed: initialize the backend 101 | smp.init() 102 | 103 | # smdistributed: Set the device to the GPU ID used by the current process. 104 | # Input tensors should be transferred to this device. 105 | torch.cuda.set_device(smp.local_rank()) 106 | device = torch.device("cuda") 107 | 108 | # smdistributed: Download only on a single process per instance. 109 | # When this is not present, the file is corrupted by multiple processes trying 110 | # to download and extract at the same time 111 | #dataset = datasets.MNIST("../data", train=True, download=False) 112 | dataset = train_ds 113 | 114 | # smdistributed: Shard the dataset based on data-parallel ranks 115 | if smp.dp_size() > 1: 116 | partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())} 117 | dataset = SplitDataset(dataset, partitions=partitions_dict) 118 | dataset.select(f"{smp.dp_rank()}") 119 | 120 | 121 | # smdistributed: Set drop_last=True to ensure that batch size is always divisible 122 | # by the number of microbatches 123 | train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True) 124 | 125 | model = TabularNet(n_cont=9, n_cat=9, 126 | cat_mask = cat_mask, 127 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 128 | y_min = 0., y_max = 1.) 129 | 130 | logger.debug(model) 131 | 132 | optimizer = optim.Adadelta(model.parameters(), lr=4.0) 133 | 134 | 135 | # SMP: Instantiate DistributedModel object using the model. 136 | # This handles distributing the model among multiple ranks 137 | # behind the scenes 138 | # If horovod is enabled this will do an overlapping_all_reduce by 139 | # default. 140 | 141 | # smdistributed: Use the DistributedModel container to provide the model 142 | # to be partitioned across different ranks. For the rest of the script, 143 | # the returned DistributedModel object should be used in place of 144 | # the model provided for DistributedModel class instantiation. 
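    # After this point the smp-wrapped model and optimizer are used in place of the
    # originals. Note that the data-parallel sharding branch above relies on
    # SplitDataset, but the `from torchnet.dataset import SplitDataset` import is
    # commented out at the top of this file, so running with smp.dp_size() > 1 would
    # raise a NameError until that import is restored.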
145 | model = smp.DistributedModel(model) 146 | 147 | optimizer = smp.DistributedOptimizer(optimizer) 148 | 149 | train(model, device, train_loader, optimizer) 150 | 151 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 152 | 153 | 154 | if __name__ == "__main__": 155 | 156 | args, _ = parse_args() 157 | 158 | init_train() 159 | -------------------------------------------------------------------------------- /Chapter07/code/csv_loader.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.INFO) 12 | logger.addHandler(logging.StreamHandler(sys.stdout)) 13 | 14 | class CsvDataset(Dataset): 15 | 16 | def __init__(self, csv_path): 17 | 18 | self.csv_path = csv_path 19 | if os.path.isfile(csv_path): 20 | self.folder = False 21 | self.count, fmap, self.line_offset = self.get_line_count(csv_path) 22 | logger.debug(f"For {csv_path}, count = {self.count}") 23 | else: 24 | self.folder = True 25 | self.count, fmap, self.line_offset = self.get_folder_line_count(csv_path) 26 | 27 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 28 | 29 | def get_folder_line_count(self, d): 30 | cnt = 0 31 | all_map = {} 32 | all_lc = {} 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, _, line_offset = self.get_line_count(f) 35 | cnt = cnt + fcnt 36 | all_map[cnt] = f 37 | all_lc.update(line_offset) 38 | return cnt, all_map, all_lc 39 | 40 | def get_line_count(self, f): 41 | with open(f) as F: 42 | line_offset = [] 43 | offset = 0 44 | count = 0 45 | for line in F: 46 | line_offset.append(offset) 47 | offset += len(line) 48 | count = count + 1 49 | 50 | return count, {count: f}, {f: line_offset} 51 | 52 | def __len__(self): 53 | return self.count 54 | 55 | def __getitem__(self, idx): 56 | 57 | if torch.is_tensor(idx): 58 | idx = idx.tolist() 59 | logger.debug(f"Indices: {idx}") 60 | 61 | # This gives us the index in the line counts greater than or equal to the desired index. 62 | # The map value for this line count is the file name containing that row. 
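        # Worked example with hypothetical numbers (not from the original data): if
        # self.fmap is {1000: 'a.csv', 2500: 'b.csv'} (cumulative line counts) and
        # idx is 1700, then bisect_left(klist, 1701) returns 1, so cur_idx = 2500,
        # fname = 'b.csv', and prev_idx = 1000; we then seek to the stored byte
        # offset of line 1700 - 1000 = 700 within b.csv and read that single row.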
63 | klist = list(self.fmap.keys()) 64 | idx_m = bisect.bisect_left(klist, idx+1) 65 | 66 | # Grab the ending count of thisl file 67 | cur_idx = klist[idx_m] 68 | 69 | # grab the ending count of the previous file 70 | if idx_m > 0: 71 | prev_idx = klist[idx_m-1] 72 | else: 73 | prev_idx = 0 74 | 75 | # grab the file name for the desired row count 76 | fname = self.fmap[cur_idx] 77 | 78 | loff = self.line_offset[fname] 79 | with open(fname) as F: 80 | F.seek(loff[idx - prev_idx]) 81 | idx_line = F.readline() 82 | 83 | idx_parts = idx_line.split(',') 84 | 85 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) -------------------------------------------------------------------------------- /Chapter07/code/csv_loader_pd.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.DEBUG) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | def _make_gen(reader): 19 | b = reader(1024 * 1024) 20 | while b: 21 | yield b 22 | b = reader(1024*1024) 23 | 24 | def rawgencount(filename): 25 | f = open(filename, 'rb') 26 | f_gen = _make_gen(f.raw.read) 27 | return sum( buf.count(b'\n') for buf in f_gen ) 28 | 29 | 30 | class CsvDatasetPd(Dataset): 31 | 32 | def __init__(self, csv_path, max_files=5): 33 | 34 | self.csv_path = csv_path 35 | if os.path.isfile(csv_path): 36 | self.folder = False 37 | self.count, fmap = self.get_line_count(csv_path) 38 | logger.debug(f"For {csv_path}, count = {self.count}") 39 | else: 40 | self.folder = True 41 | self.count, fmap = self.get_folder_line_count(csv_path, max_files) 42 | 43 | self.fmap = collections.OrderedDict(sorted(fmap.items())) 44 | self.max_files = max_files 45 | 46 | def get_folder_line_count(self, d, max_files): 47 | cnt = 0 48 | all_map = {} 49 | file_cnt = 0 50 | for f in glob.glob(os.path.join(d, '*.csv')): 51 | fcnt, _ = self.get_line_count(f) 52 | cnt = cnt + fcnt 53 | all_map[cnt] = f 54 | file_cnt = file_cnt + 1 55 | if file_cnt > max_files: 56 | break 57 | 58 | return cnt, all_map 59 | 60 | def get_line_count(self, f): 61 | count = rawgencount(f) 62 | return count, {count: f} 63 | 64 | def __len__(self): 65 | return self.count 66 | 67 | def __getitem__(self, idx): 68 | 69 | if torch.is_tensor(idx): 70 | idx = idx.tolist() 71 | logger.debug(f"Indices: {idx}") 72 | 73 | # This gives us the index in the line counts greater than or equal to the desired index. 74 | # The map value for this line count is the file name containing that row. 
75 | klist = list(self.fmap.keys()) 76 | idx_m = bisect.bisect_left(klist, idx+1) 77 | 78 | # Grab the ending count of thisl file 79 | cur_idx = klist[idx_m] 80 | 81 | # grab the ending count of the previous file 82 | if idx_m > 0: 83 | prev_idx = klist[idx_m-1] 84 | else: 85 | prev_idx = 0 86 | 87 | # grab the file name for the desired row count 88 | fname = self.fmap[cur_idx] 89 | 90 | #with open(fname) as F: 91 | # lines = list(islice(F, idx-prev_idx, idx-prev_idx+1)) 92 | idx_line = getline(fname, idx - prev_idx +1) 93 | 94 | idx_parts = idx_line.split(',') 95 | 96 | return tuple([torch.tensor( [float(f) for f in idx_parts[1:]] ), torch.tensor(float(idx_parts[0]))]) 97 | 98 | -------------------------------------------------------------------------------- /Chapter07/code/csv_loader_simple.py: -------------------------------------------------------------------------------- 1 | import os 2 | from torch.utils.data import Dataset 3 | import glob 4 | import torch 5 | import sys 6 | import logging 7 | import collections 8 | import bisect 9 | import time 10 | import pandas as pd 11 | from linecache import getline 12 | from itertools import islice 13 | 14 | logger = logging.getLogger(__name__) 15 | logger.setLevel(logging.INFO) 16 | logger.addHandler(logging.StreamHandler(sys.stdout)) 17 | 18 | class CsvDatasetSimple(Dataset): 19 | 20 | def __init__(self, csv_path, max_files=5): 21 | 22 | self.csv_path = csv_path 23 | if os.path.isfile(csv_path): 24 | self.count, self.tensors = self.get_line_data(csv_path) 25 | logger.debug(f"For {csv_path}, count = {self.count}") 26 | else: 27 | self.count, self.tensors = self.get_folder_line_data(csv_path, max_files) 28 | 29 | def get_folder_line_data(self, d, max_files): 30 | cnt = 0 31 | file_cnt = 0 32 | tensors = [] 33 | for f in glob.glob(os.path.join(d, '*.csv')): 34 | fcnt, ftensors = self.get_line_data(f) 35 | cnt = cnt + fcnt 36 | tensors = tensors + ftensors 37 | file_cnt = file_cnt + 1 38 | if file_cnt > max_files: 39 | break 40 | 41 | return cnt, tensors 42 | 43 | def get_line_data(self, f): 44 | cnt = 0 45 | tensors = [] 46 | with open(f) as F: 47 | f_lines = F.readlines() 48 | for l in f_lines: 49 | cnt = cnt + 1 50 | parts = l.split(',') 51 | tensors.append(tuple([torch.tensor( [float(f) for f in parts[1:]] ), torch.tensor(float(parts[0]))])) 52 | 53 | return cnt, tensors 54 | 55 | def __len__(self): 56 | return self.count 57 | 58 | def __getitem__(self, idx): 59 | 60 | if torch.is_tensor(idx): 61 | idx = idx.tolist() 62 | logger.debug(f"Indices: {idx}") 63 | 64 | return self.tensors[idx] 65 | 66 | -------------------------------------------------------------------------------- /Chapter07/code/model_pytorch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import logging 5 | import sys 6 | 7 | logger = logging.getLogger(__name__) 8 | logger.setLevel(logging.INFO) 9 | logger.addHandler(logging.StreamHandler(sys.stdout)) 10 | 11 | class TabularNet(nn.Module): 12 | 13 | def __init__(self, n_cont, n_cat, emb_sz = 100, dropout_p = 0.1, layers=[200,100], cat_mask=[], cat_dim=[], y_min = 0., y_max = 1.): 14 | 15 | super(TabularNet, self).__init__() 16 | 17 | self.cat_mask = cat_mask 18 | self.cat_dim = cat_dim 19 | self.y_min = y_min 20 | self.y_max = y_max 21 | self.n_cat = n_cat 22 | self.n_cont = n_cont 23 | 24 | emb_dim = [] 25 | for ii in range(len(cat_mask)): 26 | if cat_mask[ii]: 27 | c_dim = cat_dim[ii] 28 | 
emb_dim.append(c_dim) 29 | #emb = nn.Embedding(c_dim, emb_sz) 30 | #self.embeddings.append(emb) 31 | #setattr(self, 'emb_{}'.format(ii), emb) 32 | 33 | self.embeddings = nn.ModuleList([nn.Embedding(c_dim, emb_sz) for c_dim in emb_dim]) 34 | 35 | modules = [] 36 | prev_size = n_cont + n_cat * emb_sz 37 | for l in layers: 38 | modules.append(nn.Linear(prev_size, l)) 39 | modules.append(nn.BatchNorm1d(l)) 40 | modules.append(nn.Dropout(dropout_p)) 41 | modules.append(nn.ReLU(inplace=True)) 42 | prev_size = l 43 | modules.append(nn.BatchNorm1d(prev_size)) 44 | modules.append(nn.Dropout(dropout_p)) 45 | modules.append(nn.Linear(prev_size, 1)) 46 | modules.append(nn.Sigmoid()) 47 | 48 | self.m_seq = nn.Sequential(*modules) 49 | self.emb_drop = nn.Dropout(dropout_p) 50 | self.bn_cont = nn.BatchNorm1d(n_cont) 51 | 52 | def forward(self, x_in): 53 | 54 | logger.debug(f"Forward pass on {x_in.shape}") 55 | x = None 56 | ee = 0 57 | for ii in range(len(self.cat_mask)): 58 | 59 | if self.cat_mask[ii]: 60 | logger.debug(f"Embedding: {self.embeddings[ee]}, input: {x_in[:,ii]}") 61 | logger.debug(f"cat Device for x_in: {x_in.get_device()}") 62 | logger.debug(f"cat Device for x_in slice: {x_in[:,ii].get_device()}") 63 | logger.debug(f"cat Device for embed: {next(self.embeddings[ee].parameters()).get_device()}") 64 | x_e = self.embeddings[ee](x_in[:,ii].to(device = x_in.get_device(), dtype= torch.long)) 65 | logger.debug(f"cat Device for x_e: {x_e.get_device()}") 66 | logger.debug(f"cat x_e = {x_e.shape}") 67 | if x is None: 68 | x = x_e 69 | else: 70 | x = torch.cat([x, x_e], 1) 71 | logger.debug(f"cat Device for x: {x.get_device()}") 72 | x = self.emb_drop(x) 73 | logger.debug(f"cat Device for x: {x.get_device()}") 74 | logger.debug(f"cat x = {x.shape}") 75 | ee = ee + 1 76 | else: 77 | logger.debug(f"cont Device for x_in: {x_in.get_device()}") 78 | x_cont = x_in[:, ii] # self.bn_cont(x_in[:, ii]) 79 | logger.debug(f"cont Device for x_cont: {x_cont.get_device()}") 80 | logger.debug(f"cont x_cont = {x_cont.shape}") 81 | if x is None: 82 | x = torch.unsqueeze(x_cont, 1) 83 | else: 84 | x = torch.cat([x, torch.unsqueeze(x_cont, 1)], 1) 85 | logger.debug(f"cont Device for x: {x.get_device()}") 86 | logger.debug(f"cont x = {x.shape}") 87 | 88 | return self.m_seq(x) * (self.y_max - self.y_min) + self.y_min -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch-dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 14 | 15 | 16 | logger = logging.getLogger(__name__) 17 | logger.setLevel(logging.INFO) 18 | logger.addHandler(logging.StreamHandler(sys.stdout)) 19 | 20 | class CUDANotFoundException(Exception): 21 | pass 22 | 23 | 24 | dist.init_process_group() 25 | 26 | 27 | def parse_args(): 28 | """ 29 | Parse arguments passed from the SageMaker API 30 | to the container 31 | """ 32 | 33 | parser = argparse.ArgumentParser() 34 | 35 | # Hyperparameters sent by the client are passed as command-line arguments to the script 36 | parser.add_argument("--epochs", type=int, default=1) 37 | 
parser.add_argument("--batch_size", type=int, default=64) 38 | parser.add_argument("--learning_rate", type=float, default=0.01) 39 | 40 | # Data directories 41 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 42 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 43 | 44 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 45 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 46 | 47 | return parser.parse_known_args() 48 | 49 | def train(model, device): 50 | """ 51 | Train the PyTorch model 52 | """ 53 | 54 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 55 | train_ds = CsvDatasetSimple(args.train) 56 | test_ds = CsvDatasetSimple(args.test) 57 | 58 | batch_size = args.batch_size 59 | epochs = args.epochs 60 | learning_rate = args.learning_rate 61 | logger.info( 62 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 63 | ) 64 | 65 | train_sampler = torch.utils.data.distributed.DistributedSampler( 66 | train_ds, num_replicas=args.world_size, rank=args.rank 67 | ) 68 | 69 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, drop_last=True, sampler=train_sampler) 70 | 71 | model = TabularNet(n_cont=9, n_cat=9, 72 | cat_mask = cat_mask, 73 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 74 | y_min = 0., y_max = 1., device=device) 75 | logger.debug(model) 76 | model = DDP(model).to(device) 77 | torch.cuda.set_device(args.local_rank) 78 | model.cuda(args.local_rank) 79 | 80 | criterion = nn.MSELoss() 81 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 82 | 83 | model.train() 84 | for epoch in range(epochs): 85 | batch_no = 0 86 | for x_train_batch, y_train_batch in train_dl: 87 | logger.debug(f"Training on shape {x_train_batch.shape}") 88 | y = model(x_train_batch.float()) 89 | loss = criterion(y.flatten(), y_train_batch.float().to(device)) 90 | if batch_no % 50 == 0: 91 | logger.info(f"batch {batch_no} -> loss: {loss}") 92 | optimizer.zero_grad() 93 | loss.backward() 94 | optimizer.step() 95 | batch_no +=1 96 | epoch += 1 97 | logger.info(f"epoch: {epoch} -> loss: {loss}") 98 | 99 | # evalutate on test set 100 | if args.rank == 0: 101 | model.eval() 102 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 103 | with torch.no_grad(): 104 | mse = 0. 105 | for x_test_batch, y_test_batch in test_dl: 106 | y = model(x_test_batch.float()) 107 | mse = mse + ((y - y_test_batch.to(device)) ** 2).sum() / x_test_batch.shape[0] 108 | 109 | mse = mse / len(test_dl.dataset) 110 | logger.info(f"Test MSE: {mse}") 111 | 112 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 113 | # PyTorch requires that the inference script must 114 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 
115 | inference_code_path = args.model_dir + "/code/" 116 | 117 | if not os.path.exists(inference_code_path): 118 | os.mkdir(inference_code_path) 119 | logger.info("Created a folder at {}!".format(inference_code_path)) 120 | 121 | shutil.copy("train_pytorch.py", inference_code_path) 122 | shutil.copy("model_pytorch.py", inference_code_path) 123 | shutil.copy("csv_loader.py", inference_code_path) 124 | logger.info("Saving models files to {}".format(inference_code_path)) 125 | 126 | 127 | if __name__ == "__main__": 128 | 129 | args, _ = parse_args() 130 | args.world_size = dist.get_world_size() 131 | args.rank = rank = dist.get_rank() 132 | args.local_rank = local_rank = dist.get_local_rank() 133 | args.batch_size //= args.world_size // 8 134 | args.batch_size = max(args.batch_size, 1) 135 | 136 | if not torch.cuda.is_available(): 137 | raise CUDANotFoundException( 138 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 139 | ) 140 | 141 | torch.manual_seed(1) 142 | 143 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 144 | 145 | train() 146 | -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | from model_pytorch import TabularNet 12 | from csv_loader_simple import CsvDatasetSimple 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.INFO) 17 | logger.addHandler(logging.StreamHandler(sys.stdout)) 18 | 19 | 20 | def parse_args(): 21 | """ 22 | Parse arguments passed from the SageMaker API 23 | to the container 24 | """ 25 | 26 | parser = argparse.ArgumentParser() 27 | 28 | # Hyperparameters sent by the client are passed as command-line arguments to the script 29 | parser.add_argument("--epochs", type=int, default=1) 30 | parser.add_argument("--batch_size", type=int, default=64) 31 | parser.add_argument("--learning_rate", type=float, default=0.01) 32 | 33 | # Data directories 34 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 35 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 36 | 37 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 38 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 39 | 40 | return parser.parse_known_args() 41 | 42 | def train(): 43 | """ 44 | Train the PyTorch model 45 | """ 46 | 47 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 48 | train_ds = CsvDatasetSimple(args.train) 49 | test_ds = CsvDatasetSimple(args.test) 50 | 51 | batch_size = args.batch_size 52 | epochs = args.epochs 53 | learning_rate = args.learning_rate 54 | logger.info( 55 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 56 | ) 57 | 58 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 59 | num_workers=0, 60 | pin_memory=True, 61 | drop_last=True) 62 | 63 | model = TabularNet(n_cont=9, n_cat=9, 64 | cat_mask = cat_mask, 65 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 66 | y_min = 0., y_max = 1.) 
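    # Notes on the arguments above: n_cont/n_cat are the counts of continuous and
    # categorical input columns, cat_mask flags which of the 18 columns are
    # categorical, and cat_dim gives each categorical column's cardinality (its
    # embedding-table size in model_pytorch.py). y_min/y_max bound the regression
    # output, since TabularNet.forward rescales a sigmoid:
    #     output = sigmoid(head(x)) * (y_max - y_min) + y_min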
67 | logger.debug(model) 68 | model = model.to(device) 69 | criterion = nn.MSELoss() 70 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 71 | 72 | model.train() 73 | for epoch in range(epochs): 74 | batch_no = 0 75 | for x_train_batch, y_train_batch in train_dl: 76 | x_train_batch_d = x_train_batch.to(device) 77 | y_train_batch_d = y_train_batch.to(device) 78 | optimizer.zero_grad() 79 | 80 | logger.debug(f"Training on shape {x_train_batch.shape}") 81 | y = model(x_train_batch_d.float()) 82 | loss = criterion(y.flatten(), y_train_batch_d.float()) 83 | if batch_no % 50 == 0: 84 | logger.info(f"batch {batch_no} -> loss: {loss}") 85 | 86 | loss.backward() 87 | optimizer.step() 88 | batch_no +=1 89 | epoch += 1 90 | logger.info(f"epoch: {epoch} -> loss: {loss}") 91 | 92 | # evalutate on test set 93 | model.eval() 94 | test_dl = DataLoader(test_ds, batch_size, shuffle=False, drop_last=True) 95 | with torch.no_grad(): 96 | mse = 0. 97 | for x_test_batch, y_test_batch in test_dl: 98 | x_test_batch_d = x_test_batch.to(device) 99 | y_test_batch_d = y_test_batch.to(device) 100 | 101 | y = model(x_test_batch_d.float()) 102 | mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 103 | 104 | mse = mse / len(test_dl.dataset) 105 | logger.info(f"Test MSE: {mse}") 106 | 107 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 108 | # PyTorch requires that the inference script must 109 | # be in the .tar.gz model file and Step Functions SDK doesn't do this. 110 | inference_code_path = args.model_dir + "/code/" 111 | 112 | if not os.path.exists(inference_code_path): 113 | os.mkdir(inference_code_path) 114 | logger.info("Created a folder at {}!".format(inference_code_path)) 115 | 116 | #shutil.copy("train_pytorch.py", inference_code_path) 117 | #shutil.copy("model_pytorch.py", inference_code_path) 118 | #shutil.copy("csv_loader.py", inference_code_path) 119 | logger.info("Saving models files to {}".format(inference_code_path)) 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | args, _ = parse_args() 125 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 126 | 127 | train() 128 | -------------------------------------------------------------------------------- /Chapter07/code/train_pytorch_dist.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import sys 5 | import logging 6 | import json 7 | import shutil 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import DataLoader, TensorDataset 11 | import torch.nn.functional as F 12 | 13 | from model_pytorch import TabularNet 14 | from csv_loader_simple import CsvDatasetSimple 15 | 16 | import smdistributed.dataparallel.torch.distributed as dist 17 | from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP 18 | 19 | 20 | logger = logging.getLogger(__name__) 21 | logger.setLevel(logging.INFO) 22 | logger.addHandler(logging.StreamHandler(sys.stdout)) 23 | 24 | class CUDANotFoundException(Exception): 25 | pass 26 | 27 | 28 | dist.init_process_group() 29 | 30 | 31 | def parse_args(): 32 | """ 33 | Parse arguments passed from the SageMaker API 34 | to the container 35 | """ 36 | 37 | parser = argparse.ArgumentParser() 38 | 39 | # Hyperparameters sent by the client are passed as command-line arguments to the script 40 | parser.add_argument("--epochs", type=int, default=1) 41 | parser.add_argument("--batch_size", type=int, default=64) 42 | 
parser.add_argument("--learning_rate", type=float, default=0.01) 43 | 44 | # Data directories 45 | parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) 46 | parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) 47 | 48 | # Model directory: we will use the default set by SageMaker, /opt/ml/model 49 | parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR")) 50 | 51 | parser.add_argument( 52 | "--save-model", action="store_true", default=False, help="For Saving the current Model" 53 | ) 54 | 55 | return parser.parse_known_args() 56 | 57 | def train(args): 58 | """ 59 | Train the PyTorch model 60 | """ 61 | 62 | cat_mask=[False,True,True,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False] 63 | train_ds = CsvDatasetSimple(args.train) 64 | test_ds = CsvDatasetSimple(args.test) 65 | 66 | batch_size = args.batch_size 67 | epochs = args.epochs 68 | learning_rate = args.learning_rate 69 | logger.info( 70 | "batch_size = {}, epochs = {}, learning rate = {}".format(batch_size, epochs, learning_rate) 71 | ) 72 | logger.info(f"World size: {args.world_size}") 73 | logger.info(f"Rank: {args.rank}") 74 | logger.info(f"Local Rank: {args.local_rank}") 75 | 76 | train_sampler = torch.utils.data.distributed.DistributedSampler( 77 | train_ds, num_replicas=args.world_size, rank=args.rank 78 | ) 79 | logger.debug(f"Created distributed sampler") 80 | 81 | train_dl = DataLoader(train_ds, batch_size, shuffle=False, 82 | num_workers=0, 83 | pin_memory=True, 84 | drop_last=True, sampler=train_sampler) 85 | logger.debug(f"Created train data loader") 86 | 87 | model = TabularNet(n_cont=9, n_cat=9, 88 | cat_mask = cat_mask, 89 | cat_dim=[0,2050,13,5,366,0,50000,50000,50000,50000,50,0,0,0,0,0,0,0], 90 | y_min = 0., y_max = 1.) 91 | logger.debug("Created model") 92 | logger.debug(model) 93 | model = DDP(model.to(device), broadcast_buffers=False) 94 | logger.debug("created DDP") 95 | torch.cuda.set_device(args.local_rank) 96 | logger.debug("Set device on CUDA") 97 | model.cuda(args.local_rank) 98 | logger.debug(f"Set model CUDA to {args.local_rank}") 99 | 100 | criterion = nn.MSELoss() 101 | optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 102 | logger.debug("Created loss fn and optimizer") 103 | 104 | model.train() 105 | for epoch in range(epochs): 106 | batch_no = 0 107 | for x_train_batch, y_train_batch in train_dl: 108 | logger.debug(f"Working on batch {batch_no}") 109 | x_train_batch_d = x_train_batch.to(device) 110 | y_train_batch_d = y_train_batch.to(device) 111 | 112 | optimizer.zero_grad() 113 | logger.debug("Did optimizer grad") 114 | 115 | logger.debug(f"Training on shape {x_train_batch.shape}") 116 | y = model(x_train_batch_d.float()) 117 | logger.debug("Did forward pass") 118 | loss = criterion(y.flatten(), y_train_batch_d.float()) 119 | logger.debug("Got loss") 120 | if batch_no % 50 == 0: 121 | logger.info(f"batch {batch_no} -> loss: {loss}") 122 | 123 | loss.backward() 124 | logger.debug("Did backward step") 125 | optimizer.step() 126 | logger.debug(f"Did optimizer step, batch {batch_no}") 127 | batch_no +=1 128 | epoch += 1 129 | logger.info(f"epoch: {epoch} -> loss: {loss}") 130 | 131 | # evalutate on test set 132 | if args.rank == 0: 133 | logger.info(f"Starting test eval on rank 0") 134 | model.eval() 135 | test_dl = DataLoader(test_ds, batch_size, drop_last=True, shuffle=False) 136 | logger.info(f"Loaded test data set") 137 | with torch.no_grad(): 138 | mse = 0. 
139 | batch_no = 0 140 | for x_test_batch, y_test_batch in test_dl: 141 | x_test_batch_d = x_test_batch.to(device) 142 | y_test_batch_d = y_test_batch.to(device) 143 | 144 | y = model(x_test_batch_d.float()) 145 | mse += F.mse_loss(y, y_test_batch_d, reduction="sum") 146 | #mse = mse + ((y - y_test_batch_d) ** 2).sum() / x_test_batch.shape[0] 147 | 148 | if batch_no % 50 == 0: 149 | logger.info(f"batch {batch_no} -> MSE: {mse}") 150 | batch_no +=1 151 | 152 | mse = mse / len(test_dl.dataset) 153 | logger.info(f"Test MSE: {mse}") 154 | 155 | if args.rank == 0: 156 | logger.info("Saving model on rank 0") 157 | torch.save(model.state_dict(), args.model_dir + "/model.pth") 158 | 159 | 160 | if __name__ == "__main__": 161 | 162 | args, _ = parse_args() 163 | args.world_size = dist.get_world_size() 164 | args.rank = rank = dist.get_rank() 165 | args.local_rank = local_rank = dist.get_local_rank() 166 | args.batch_size //= args.world_size // 8 167 | args.batch_size = max(args.batch_size, 1) 168 | 169 | if not torch.cuda.is_available(): 170 | raise CUDANotFoundException( 171 | "Must run smdistributed.dataparallel MNIST example on CUDA-capable devices." 172 | ) 173 | 174 | torch.manual_seed(1) 175 | 176 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 177 | 178 | train(args) 179 | -------------------------------------------------------------------------------- /Chapter07/images/LossNotDecreasingRule.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter07/images/LossNotDecreasingRule.png -------------------------------------------------------------------------------- /Chapter07/images/RulesSummary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter07/images/RulesSummary.png -------------------------------------------------------------------------------- /Chapter07/rules/custom_rule.py: -------------------------------------------------------------------------------- 1 | # First Party 2 | from smdebug.rules.rule import Rule 3 | 4 | 5 | class CustomGradientRule(Rule): 6 | def __init__(self, base_trial, threshold=10.0): 7 | super().__init__(base_trial) 8 | self.threshold = float(threshold) 9 | 10 | def invoke_at_step(self, step): 11 | for tname in self.base_trial.tensor_names(collection="gradients"): 12 | t = self.base_trial.tensor(tname) 13 | abs_mean = t.reduction_value(step, "mean", abs=True) 14 | if abs_mean > self.threshold: 15 | return True 16 | return False -------------------------------------------------------------------------------- /Chapter08/register-model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Register Candidate Model to SageMaker Model Registry\n", 8 | "\n", 9 | "When you have a candidate model that is performing well according to your objective metric, you can register that version of your model to SageMaker Model Registry. From there, it can be used for deployment for either batch or real-time use cases. 
\n", 10 | "\n", 11 | "A model version can be registered using the Studio Console, Boto3, or as a step in SageMaker Pipelines (which will be covered in a later chapter).\n", 12 | "\n", 13 | "In this notebook, you'll perform the following tasks: \n", 14 | " \n", 15 | " 1. Create a Model Group which is group of versioned models\n", 16 | " 2. Register a model version into the Model Group. \n", 17 | " \n", 18 | "For this, exercise we'll register the XGBoost model previously created in Chapter 5. The same steps apply for the PyTorch model as well. \n", 19 | " \n", 20 | "First, you need to import the boto3 packages required. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import time\n", 30 | "import os\n", 31 | "from sagemaker import get_execution_role, session, image_uris\n", 32 | "import boto3\n", 33 | "\n", 34 | "region = boto3.Session().region_name\n", 35 | "\n", 36 | "role = get_execution_role()\n", 37 | "\n", 38 | "sm_client = boto3.client('sagemaker', region_name=region)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 1. Create a Model Group" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import time\n", 55 | "model_package_group_name = \"air-quality-\" + str(round(time.time()))\n", 56 | "model_package_group_input_dict = {\n", 57 | " \"ModelPackageGroupName\" : model_package_group_name,\n", 58 | " \"ModelPackageGroupDescription\" : \"model package group for air quality models\",\n", 59 | " \"Tags\": [\n", 60 | " {\n", 61 | " \"Key\": \"MLProject\",\n", 62 | " \"Value\": \"weather\"\n", 63 | " }\n", 64 | " ] \n", 65 | "}\n", 66 | "create_model_pacakge_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)\n", 67 | "print('ModelPackageGroup Arn : {}'.format(create_model_pacakge_group_response['ModelPackageGroupArn']))" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## 2. Register a Model Version\n", 75 | "\n", 76 | "First, we need to find the model_url that will be used as input to register the model version. 
Typically this is included as part of a pipeline; however, in this case we are registering the model outside of a pipeline so we need to pull the data from our previous training job that resulted in a candidate model that is performing well according to our objective metric.\n", 77 | "\n", 78 | "**Replace the variable below with the name of the training job from Chapter 5**" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "#Example: \n", 88 | "#training_job = 'sagemaker-xgboost-2021-07-28-02-43-50-684'\n", 89 | "training_job = 'REPLACE WITH NAME OF TRAINING JOB'" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "training_job_response = sm_client.describe_training_job(\n", 99 | " TrainingJobName=training_job\n", 100 | ")" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "model_url=training_job_response['ModelArtifacts']['S3ModelArtifacts']\n", 110 | "print('Model Data URL', model_url)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# this line automatically looks for the XGBoost image URI and builds an XGBoost container.\n", 120 | "# specify the repo_version depending on your preference.\n", 121 | "xgboost_container = image_uris.retrieve(region=boto3.Session().region_name,\n", 122 | " framework='xgboost', \n", 123 | " version='1.2-1')\n", 124 | "\n", 125 | "print('XGBoost Container for Inference:', xgboost_container)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "modelpackage_inference_specification = {\n", 135 | " \"InferenceSpecification\": {\n", 136 | " \"Containers\": [\n", 137 | " {\n", 138 | " \"Image\": xgboost_container,\n", 139 | " \"ModelDataUrl\": model_url\n", 140 | " }\n", 141 | " ],\n", 142 | " \"SupportedContentTypes\": [ \"text/csv\" ],\n", 143 | " \"SupportedResponseMIMETypes\": [ \"text/csv\" ],\n", 144 | " }\n", 145 | " }\n", 146 | "\n", 147 | "create_model_package_input_dict = {\n", 148 | " \"ModelPackageGroupName\" : model_package_group_name,\n", 149 | " \"ModelPackageDescription\" : \"Model to predict air quality ratings using XGBoost\",\n", 150 | " \"ModelApprovalStatus\" : \"PendingManualApproval\" \n", 151 | "}\n", 152 | "create_model_package_input_dict.update(modelpackage_inference_specification)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)\n", 162 | "model_package_arn = create_mode_package_response[\"ModelPackageArn\"]\n", 163 | "print('ModelPackage Version ARN : {}'.format(model_package_arn))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Let's view the detailed of our registered model..." 
171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "sm_client.list_model_packages(ModelPackageGroupName=model_package_group_name)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [] 188 | } 189 | ], 190 | "metadata": { 191 | "instance_type": "ml.t3.medium", 192 | "kernelspec": { 193 | "display_name": "Python 3 (Data Science)", 194 | "language": "python", 195 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 196 | }, 197 | "language_info": { 198 | "codemirror_mode": { 199 | "name": "ipython", 200 | "version": 3 201 | }, 202 | "file_extension": ".py", 203 | "mimetype": "text/x-python", 204 | "name": "python", 205 | "nbconvert_exporter": "python", 206 | "pygments_lexer": "ipython3", 207 | "version": "3.7.10" 208 | } 209 | }, 210 | "nbformat": 4, 211 | "nbformat_minor": 4 212 | } 213 | -------------------------------------------------------------------------------- /Chapter10/scripts/preprocess_param.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import unicode_literals 3 | 4 | import argparse 5 | import csv 6 | import os 7 | import shutil 8 | import sys 9 | import time 10 | import logging 11 | import boto3 12 | 13 | import pyspark 14 | from pyspark.sql import SparkSession 15 | from pyspark import SparkContext 16 | from pyspark.ml import Pipeline 17 | from pyspark.ml.linalg import Vectors 18 | from pyspark.ml.feature import ( 19 | StringIndexer, 20 | VectorAssembler, 21 | VectorIndexer, 22 | StandardScaler, 23 | OneHotEncoder 24 | ) 25 | from pyspark.sql.functions import * 26 | from pyspark.sql.functions import round as round_ 27 | from pyspark.sql.types import ( 28 | DoubleType, 29 | StringType, 30 | StructField, 31 | StructType, 32 | BooleanType, 33 | IntegerType 34 | ) 35 | 36 | def get_tables(): 37 | 38 | tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2'] 39 | #tables = ['0d18fb9e-857d-4380-bbac-ffbb60b07ae2', 40 | # '1645e09b-0919-439a-b07f-cd7532069f10', 41 | # '80d785aa-6da8-4b37-8632-7386b1d535f3', 42 | # '8375c185-ea3e-44b7-b61c-68534e33ddf7', 43 | # '9a6e24db-cffc-42de-a77f-7ab96c487022', 44 | # 'bc981da3-bc9d-435a-8bf2-4107a8fb2676', 45 | # 'cf5cf814-c9e3-4a2b-8811-e4fc6481a1fe', 46 | # 'd3b8f1ab-f3e5-4fc9-84ab-8568edd8a03d'] 47 | 48 | 49 | return tables 50 | 51 | 52 | def isBadAir(v, p): 53 | if p == 'pm10': 54 | if v > 50: 55 | return 1 56 | else: 57 | return 0 58 | elif p == 'pm25': 59 | if v > 25: 60 | return 1 61 | else: 62 | return 0 63 | elif p == 'so2': 64 | if v > 20: 65 | return 1 66 | else: 67 | return 0 68 | elif p == 'no2': 69 | if v > 200: 70 | return 1 71 | else: 72 | return 0 73 | elif p == 'o3': 74 | if v > 100: 75 | return 1 76 | else: 77 | return 0 78 | else: 79 | return 0 80 | 81 | def extract(row): 82 | return (row.value, row.ismobile, row.year, row.month, row.quarter, row.day, row.isBadAir, 83 | row.indexed_location, row.indexed_city, row.indexed_country, row.indexed_sourcename, 84 | row.indexed_sourcetype) 85 | 86 | """ 87 | Schema on disk: 88 | 89 | |-- date_utc: string (nullable = true) 90 | |-- date_local: string (nullable = true) 91 | |-- location: string (nullable = true) 92 | |-- country: string (nullable = true) 93 | |-- value: float (nullable = true) 94 | |-- unit: string (nullable = true) 95 | |-- city: string 
(nullable = true) 96 | |-- attribution: array (nullable = true) 97 | | |-- element: struct (containsNull = true) 98 | | | |-- name: string (nullable = true) 99 | | | |-- url: string (nullable = true) 100 | |-- averagingperiod: struct (nullable = true) 101 | | |-- unit: string (nullable = true) 102 | | |-- value: float (nullable = true) 103 | |-- coordinates: struct (nullable = true) 104 | | |-- latitude: float (nullable = true) 105 | | |-- longitude: float (nullable = true) 106 | |-- sourcename: string (nullable = true) 107 | |-- sourcetype: string (nullable = true) 108 | |-- mobile: string (nullable = true) 109 | |-- parameter: string (nullable = true) 110 | 111 | Example output: 112 | 113 | date_utc='2015-10-31T07:00:00.000Z' 114 | date_local='2015-10-31T04:00:00-03:00' 115 | location='Quintero Centro' 116 | country='CL' 117 | value=19.81999969482422 118 | unit='µg/m³' 119 | city='Quintero' 120 | attribution=[Row(name='SINCA', url='http://sinca.mma.gob.cl/'), Row(name='CENTRO QUINTERO', url=None)] 121 | averagingperiod=None 122 | coordinates=Row(latitude=-32.786170959472656, longitude=-71.53143310546875) 123 | sourcename='Chile - SINCA' 124 | sourcetype=None 125 | mobile=None 126 | parameter='o3' 127 | 128 | Transformations: 129 | 130 | * Featurize date_utc 131 | * Drop date_local 132 | * Encode location 133 | * Encode country 134 | * Scale value 135 | * Drop unit 136 | * Encode city 137 | * Drop attribution 138 | * Drop averaging period 139 | * Drop coordinates 140 | * Encode source name 141 | * Encode source type 142 | * Convert mobile to integer 143 | * Encode parameter 144 | 145 | * Add label for good/bad air quality 146 | 147 | """ 148 | def main(): 149 | parser = argparse.ArgumentParser(description="Preprocessing configuration") 150 | parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket") 151 | parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix") 152 | parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket") 153 | parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix") 154 | parser.add_argument("--parameter", type=str, help="parameter filter") 155 | args = parser.parse_args() 156 | 157 | logging.basicConfig(level=logging.INFO) 158 | logger = logging.getLogger('Preprocess') 159 | 160 | spark = SparkSession.builder.appName("Preprocessor").getOrCreate() 161 | 162 | logger.info("Reading data set") 163 | tables = get_tables() 164 | df = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{tables[0]}/") 165 | for t in tables[1:]: 166 | df_new = spark.read.parquet(f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/{t}/") 167 | df = df.union(df_new) 168 | 169 | # Filter on parameter 170 | df = df.filter(df.parameter == args.parameter) 171 | 172 | # Drop columns 173 | logger.info("Dropping columns") 174 | df = df.drop('date_local').drop('unit').drop('attribution').drop('averagingperiod').drop('coordinates') 175 | 176 | # Mobile field to int 177 | logger.info("Casting mobile field to int") 178 | df = df.withColumn("ismobile",col("mobile").cast(IntegerType())).drop('mobile') 179 | 180 | # scale value 181 | logger.info("Scaling value") 182 | value_assembler = VectorAssembler(inputCols=["value"], outputCol="value_vec") 183 | value_scaler = StandardScaler(inputCol="value_vec", outputCol="value_scaled") 184 | value_pipeline = Pipeline(stages=[value_assembler, value_scaler]) 185 | value_model = value_pipeline.fit(df) 186 | xform_df = value_model.transform(df) 187 | 188 | 
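    # Spark ML's StandardScaler operates on Vector columns, so the scalar `value`
    # column is first packed into `value_vec` by VectorAssembler and then
    # standardized into `value_scaled` by the small Pipeline above; the date
    # featurization that follows derives year/month/quarter/day columns from date_utc.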
# featurize date 189 | logger.info("Featurizing date") 190 | xform_df = xform_df.withColumn('aggdt', 191 | to_date(unix_timestamp(col('date_utc'), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp"))) 192 | xform_df = xform_df.withColumn('year',year(xform_df.aggdt)) \ 193 | .withColumn('month',month(xform_df.aggdt)) \ 194 | .withColumn('quarter',quarter(xform_df.aggdt)) 195 | xform_df = xform_df.withColumn("day", date_format(col("aggdt"), "d")) 196 | 197 | # Automatically assign good/bad labels 198 | logger.info("Simulating good/bad air labels") 199 | isBadAirUdf = udf(isBadAir, IntegerType()) 200 | xform_df = xform_df.withColumn('isBadAir', isBadAirUdf('value', 'parameter')) 201 | xform_df = xform_df.drop('parameter') 202 | 203 | # Categorical encodings. 204 | logger.info("Categorical encoding") 205 | #parameter_indexer = StringIndexer(inputCol="parameter", outputCol="indexed_parameter", handleInvalid='keep') 206 | location_indexer = StringIndexer(inputCol="location", outputCol="indexed_location", handleInvalid='keep') 207 | city_indexer = StringIndexer(inputCol="city", outputCol="indexed_city", handleInvalid='keep') 208 | country_indexer = StringIndexer(inputCol="country", outputCol="indexed_country", handleInvalid='keep') 209 | sourcename_indexer = StringIndexer(inputCol="sourcename", outputCol="indexed_sourcename", handleInvalid='keep') 210 | sourcetype_indexer = StringIndexer(inputCol="sourcetype", outputCol="indexed_sourcetype", handleInvalid='keep') 211 | #enc_est = OneHotEncoder(inputCols=["indexed_parameter"], outputCols=["vec_parameter"]) 212 | enc_pipeline = Pipeline(stages=[location_indexer, 213 | city_indexer, country_indexer, sourcename_indexer, 214 | sourcetype_indexer]) 215 | enc_model = enc_pipeline.fit(xform_df) 216 | enc_df = enc_model.transform(xform_df) 217 | #param_cols = enc_df.schema.fields[17].metadata['ml_attr']['vals'] 218 | 219 | # Clean up data set 220 | logger.info("Final cleanup") 221 | final_df = enc_df.drop('location') \ 222 | .drop('city').drop('country').drop('sourcename') \ 223 | .drop('sourcetype').drop('date_utc') \ 224 | .drop('value_vec').drop('aggdt') 225 | firstelement=udf(lambda v:str(v[0]),StringType()) 226 | final_df = final_df.withColumn('value_str', firstelement('value_scaled')) 227 | final_df = final_df.withColumn("value",final_df.value_str.cast(DoubleType())).drop('value_str').drop('value_scaled') 228 | schema = StructType([ 229 | StructField("value", DoubleType(), True), 230 | StructField("ismobile", StringType(), True), 231 | StructField("year", StringType(), True), 232 | StructField("month", StringType(), True), 233 | StructField("quarter", StringType(), True), 234 | StructField("day", StringType(), True), 235 | StructField("isBadAir", StringType(), True), 236 | StructField("location", StringType(), True), 237 | StructField("city", StringType(), True), 238 | StructField("country", StringType(), True), 239 | StructField("sourcename", StringType(), True), 240 | StructField("sourcetype", StringType(), True) 241 | ]) 242 | final_df = final_df.rdd.map(extract).toDF(schema=schema) 243 | 244 | # Replace missing values 245 | final_df = final_df.na.fill("0") 246 | 247 | # Round the value 248 | final_df = final_df.withColumn("value", round_(final_df["value"], 4)) 249 | 250 | # Split sets 251 | logger.info("Splitting data set") 252 | (train_df, validation_df, test_df) = final_df.randomSplit([0.7, 0.2, 0.1]) 253 | 254 | # Drop value from test set 255 | test_df = test_df.drop('value') 256 | 257 | # Save to S3 258 | logger.info("Saving to S3") 259 | 
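    # The random 70/20/10 split above produces the train/validation/test sets. The
    # CSVs below are written without a header row and with `value` (the regression
    # target) as the first column, matching the input layout the SageMaker built-in
    # XGBoost algorithm expects (no header, label first); the test split has `value`
    # dropped so it can be sent for inference.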
train_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 260 | args.s3_output_key_prefix, 'train/')) 261 | validation_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 262 | args.s3_output_key_prefix, 'validation/')) 263 | test_df.write.option("header",False).csv('s3://' + os.path.join(args.s3_output_bucket, 264 | args.s3_output_key_prefix, 'test/')) 265 | 266 | if __name__ == "__main__": 267 | main() -------------------------------------------------------------------------------- /Chapter11/README.md: -------------------------------------------------------------------------------- 1 | # Monitor Production Models using Amazon SageMaker Model Monitor and Amazon SageMaker Clarify 2 | 3 | Examples in this chapter demonstrate how to use SageMaker capabilities for monitoring weather prediction regression model for data drift, model quality dift, bias drift and feature attribution drift. 4 | 5 | * data_drift_monitoring/WeatherPredictionDataDriftModelMonitoring.ipynb 6 | 7 | Monitor and detect data drift using SageMaker Model Monitor 8 | 9 | * model_quality_monitoring/WeatherPredictionModelQualityMonitoring.ipynb 10 | 11 | Monitor and detect model quality drift using SageMaker Model Monitor 12 | 13 | * bias_drift_monitoring/WeatherPredictionBiasAndFeatureAttributionDrift.ipynb 14 | 15 | Monitor and detect bias and feature attribution drift using SageMaker Clarify 16 | 17 | * feature_attribute_drift_monitoring/WeatherPredictionFeatureAttributionDrift.ipynb 18 | 19 | Monitor and detect bias and feature attribution drift using SageMaker Clarify 20 | -------------------------------------------------------------------------------- /Chapter11/bias_drift_monitoring/model/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/bias_drift_monitoring/model/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/bias_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/bias_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/data_drift_monitoring/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/data_drift_monitoring/.DS_Store -------------------------------------------------------------------------------- /Chapter11/data_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/data_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/feature_attribution_drift_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/feature_attribution_drift_monitoring/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model/weather-prediction-model.tar.gz -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/data/model-quality-baseline-data.csv: -------------------------------------------------------------------------------- 1 | probability,prediction,label 2 | -4.902510643005371,-4.902510643005371,-7.535634515882377 3 | -4.902510643005371,-4.902510643005371,-7.535634515882377 4 | -4.902510643005371,-4.902510643005371,-7.535634515882377 5 | -4.902510643005371,-4.902510643005371,-7.535634515882377 6 | -4.902510643005371,-4.902510643005371,-7.535634515882377 7 | -4.902510643005371,-4.902510643005371,-7.535634515882377 8 | -4.902510643005371,-4.902510643005371,-7.535634515882377 9 | -4.902510643005371,-4.902510643005371,-7.535634515882377 10 | -4.902510643005371,-4.902510643005371,-7.535634515882377 11 | -4.902510643005371,-4.902510643005371,-7.535634515882377 12 | -4.902510643005371,-4.902510643005371,-7.535634515882377 13 | -4.902510643005371,-4.902510643005371,-7.535634515882377 14 | -4.902510643005371,-4.902510643005371,-7.535634515882377 15 | -4.902510643005371,-4.902510643005371,-7.535634515882377 16 | -4.902510643005371,-4.902510643005371,-7.535634515882377 17 | -4.902510643005371,-4.902510643005371,-7.535634515882377 18 | -4.902510643005371,-4.902510643005371,-7.535634515882377 19 | -4.902510643005371,-4.902510643005371,-7.535634515882377 20 | -4.902510643005371,-4.902510643005371,-7.535634515882377 21 | -4.902510643005371,-4.902510643005371,-7.535634515882377 22 | -4.902510643005371,-4.902510643005371,-7.535634515882377 23 | -4.902510643005371,-4.902510643005371,-7.535634515882377 24 | -4.902510643005371,-4.902510643005371,-7.535634515882377 25 | -4.902510643005371,-4.902510643005371,-7.535634515882377 26 | -4.902510643005371,-4.902510643005371,-7.535634515882377 27 | -4.902510643005371,-4.902510643005371,-7.535634515882377 28 | -4.902510643005371,-4.902510643005371,-7.535634515882377 29 | -4.902510643005371,-4.902510643005371,-7.535634515882377 30 | -4.902510643005371,-4.902510643005371,-7.535634515882377 31 | -4.902510643005371,-4.902510643005371,-7.535634515882377 32 | -4.902510643005371,-4.902510643005371,-7.535634515882377 33 | -4.902510643005371,-4.902510643005371,-7.535634515882377 34 | -4.902510643005371,-4.902510643005371,-7.535634515882377 35 | -4.902510643005371,-4.902510643005371,-7.535634515882377 36 | -4.902510643005371,-4.902510643005371,-7.535634515882377 37 | -4.902510643005371,-4.902510643005371,-7.535634515882377 38 | -4.902510643005371,-4.902510643005371,-7.535634515882377 39 | -4.902510643005371,-4.902510643005371,-7.535634515882377 40 | -4.902510643005371,-4.902510643005371,-7.535634515882377 41 | -4.902510643005371,-4.902510643005371,-7.535634515882377 42 | -4.902510643005371,-4.902510643005371,-7.535634515882377 43 | -4.902510643005371,-4.902510643005371,-7.535634515882377 44 | -4.902510643005371,-4.902510643005371,-7.535634515882377 45 | 
-4.902510643005371,-4.902510643005371,-7.535634515882377 46 | -4.902510643005371,-4.902510643005371,-7.535634515882377 47 | -4.902510643005371,-4.902510643005371,-7.535634515882377 48 | -4.902510643005371,-4.902510643005371,-7.535634515882377 49 | -4.902510643005371,-4.902510643005371,-7.535634515882377 50 | -4.902510643005371,-4.902510643005371,-7.535634515882377 51 | -4.902510643005371,-4.902510643005371,-7.535634515882377 52 | -4.902510643005371,-4.902510643005371,-7.535634515882377 53 | -4.902510643005371,-4.902510643005371,-7.535634515882377 54 | -4.902510643005371,-4.902510643005371,-7.535634515882377 55 | -4.902510643005371,-4.902510643005371,-7.535634515882377 56 | -4.902510643005371,-4.902510643005371,-7.535634515882377 57 | -4.902510643005371,-4.902510643005371,-7.535634515882377 58 | -4.902510643005371,-4.902510643005371,-7.535634515882377 59 | -4.902510643005371,-4.902510643005371,-7.535634515882377 60 | -4.902510643005371,-4.902510643005371,-7.535634515882377 61 | -4.902510643005371,-4.902510643005371,-7.535634515882377 62 | -4.902510643005371,-4.902510643005371,-7.535634515882377 63 | -4.902510643005371,-4.902510643005371,-7.535634515882377 64 | -4.902510643005371,-4.902510643005371,-7.535634515882377 65 | -4.902510643005371,-4.902510643005371,-7.535634515882377 66 | -4.902510643005371,-4.902510643005371,-7.535634515882377 67 | -4.902510643005371,-4.902510643005371,-7.535634515882377 68 | -4.902510643005371,-4.902510643005371,-7.535634515882377 69 | -4.902510643005371,-4.902510643005371,-7.535634515882377 70 | -4.902510643005371,-4.902510643005371,-7.535634515882377 71 | -4.902510643005371,-4.902510643005371,-7.535634515882377 72 | -4.902510643005371,-4.902510643005371,-7.535634515882377 73 | -4.902510643005371,-4.902510643005371,-7.535634515882377 74 | -4.902510643005371,-4.902510643005371,-7.535634515882377 75 | -4.902510643005371,-4.902510643005371,-7.535634515882377 76 | -4.902510643005371,-4.902510643005371,-7.535634515882377 77 | -4.902510643005371,-4.902510643005371,-7.535634515882377 78 | -4.902510643005371,-4.902510643005371,-7.535634515882377 79 | -4.902510643005371,-4.902510643005371,-7.535634515882377 80 | -4.902510643005371,-4.902510643005371,-7.535634515882377 81 | -4.902510643005371,-4.902510643005371,-7.535634515882377 82 | -4.902510643005371,-4.902510643005371,-7.535634515882377 83 | -4.902510643005371,-4.902510643005371,-7.535634515882377 84 | -4.902510643005371,-4.902510643005371,-7.535634515882377 85 | -4.902510643005371,-4.902510643005371,-7.535634515882377 86 | -4.902510643005371,-4.902510643005371,-7.535634515882377 87 | -4.902510643005371,-4.902510643005371,-7.535634515882377 88 | -4.902510643005371,-4.902510643005371,-7.535634515882377 89 | -4.902510643005371,-4.902510643005371,-7.535634515882377 90 | -4.902510643005371,-4.902510643005371,-7.535634515882377 91 | -4.902510643005371,-4.902510643005371,-7.535634515882377 92 | -4.902510643005371,-4.902510643005371,-7.535634515882377 93 | -4.902510643005371,-4.902510643005371,-7.535634515882377 94 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 95 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 96 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 97 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 98 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 99 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 100 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 101 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 102 | 
-2.0710370540618896,-2.0710370540618896,-7.535634515882377 103 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 104 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 105 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 106 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 107 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 108 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 109 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 110 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 111 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 112 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 113 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 114 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 115 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 116 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 117 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 118 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 119 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 120 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 121 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 122 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 123 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 124 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 125 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 126 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 127 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 128 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 129 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 130 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 131 | -3.218202590942383,-3.218202590942383,-7.535634515882377 132 | -3.218202590942383,-3.218202590942383,-7.535634515882377 133 | -3.218202590942383,-3.218202590942383,-7.535634515882377 134 | -3.218202590942383,-3.218202590942383,-7.535634515882377 135 | -3.218202590942383,-3.218202590942383,-7.535634515882377 136 | -3.218202590942383,-3.218202590942383,-7.535634515882377 137 | -3.218202590942383,-3.218202590942383,-7.535634515882377 138 | -3.218202590942383,-3.218202590942383,-7.535634515882377 139 | -3.218202590942383,-3.218202590942383,-7.535634515882377 140 | -3.218202590942383,-3.218202590942383,-7.535634515882377 141 | -3.218202590942383,-3.218202590942383,-7.535634515882377 142 | -3.218202590942383,-3.218202590942383,-7.535634515882377 143 | -3.218202590942383,-3.218202590942383,-7.535634515882377 144 | -3.218202590942383,-3.218202590942383,-7.535634515882377 145 | -3.218202590942383,-3.218202590942383,-7.535634515882377 146 | -3.218202590942383,-3.218202590942383,-7.535634515882377 147 | -3.218202590942383,-3.218202590942383,-7.535634515882377 148 | -3.218202590942383,-3.218202590942383,-7.535634515882377 149 | -3.218202590942383,-3.218202590942383,-7.535634515882377 150 | -3.218202590942383,-3.218202590942383,-7.535634515882377 151 | -3.218202590942383,-3.218202590942383,-7.535634515882377 152 | -3.218202590942383,-3.218202590942383,-7.535634515882377 153 | -3.218202590942383,-3.218202590942383,-7.535634515882377 154 | -3.218202590942383,-3.218202590942383,-7.535634515882377 155 | -3.218202590942383,-3.218202590942383,-7.535634515882377 156 | -3.218202590942383,-3.218202590942383,-7.535634515882377 157 | 
-0.9505436420440674,-0.9505436420440674,-7.535634515882377 158 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 159 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 160 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 161 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 162 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 163 | -0.9505436420440674,-0.9505436420440674,-7.535634515882377 164 | -3.218202590942383,-3.218202590942383,-7.535634515882377 165 | -3.218202590942383,-3.218202590942383,-7.535634515882377 166 | -3.218202590942383,-3.218202590942383,-7.535634515882377 167 | -3.218202590942383,-3.218202590942383,-7.535634515882377 168 | -3.218202590942383,-3.218202590942383,-7.535634515882377 169 | -3.218202590942383,-3.218202590942383,-7.535634515882377 170 | -3.218202590942383,-3.218202590942383,-7.535634515882377 171 | -3.218202590942383,-3.218202590942383,-7.535634515882377 172 | -3.218202590942383,-3.218202590942383,-7.535634515882377 173 | -3.218202590942383,-3.218202590942383,-7.535634515882377 174 | -3.218202590942383,-3.218202590942383,-7.535634515882377 175 | -3.218202590942383,-3.218202590942383,-7.535634515882377 176 | -3.218202590942383,-3.218202590942383,-7.535634515882377 177 | -3.218202590942383,-3.218202590942383,-7.535634515882377 178 | -3.218202590942383,-3.218202590942383,-7.535634515882377 179 | -3.218202590942383,-3.218202590942383,-7.535634515882377 180 | -3.218202590942383,-3.218202590942383,-7.535634515882377 181 | -3.218202590942383,-3.218202590942383,-7.535634515882377 182 | -3.218202590942383,-3.218202590942383,-7.535634515882377 183 | -3.218202590942383,-3.218202590942383,-7.535634515882377 184 | -3.218202590942383,-3.218202590942383,-7.535634515882377 185 | -3.218202590942383,-3.218202590942383,-7.535634515882377 186 | -3.218202590942383,-3.218202590942383,-7.535634515882377 187 | -3.218202590942383,-3.218202590942383,-7.535634515882377 188 | -3.218202590942383,-3.218202590942383,-7.535634515882377 189 | -3.218202590942383,-3.218202590942383,-7.535634515882377 190 | -3.218202590942383,-3.218202590942383,-7.535634515882377 191 | -3.218202590942383,-3.218202590942383,-7.535634515882377 192 | -3.218202590942383,-3.218202590942383,-7.535634515882377 193 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 194 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 195 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 196 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 197 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 198 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 199 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 200 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 201 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 202 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 203 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 204 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 205 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 206 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 207 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 208 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 209 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 210 | -3.218202590942383,-3.218202590942383,-7.535634515882377 211 | -3.218202590942383,-3.218202590942383,-7.535634515882377 212 | 
-3.218202590942383,-3.218202590942383,-7.535634515882377 213 | -3.218202590942383,-3.218202590942383,-7.535634515882377 214 | -3.218202590942383,-3.218202590942383,-7.535634515882377 215 | -3.218202590942383,-3.218202590942383,-7.535634515882377 216 | -3.218202590942383,-3.218202590942383,-7.535634515882377 217 | -3.218202590942383,-3.218202590942383,-7.535634515882377 218 | -3.218202590942383,-3.218202590942383,-7.535634515882377 219 | -3.218202590942383,-3.218202590942383,-7.535634515882377 220 | -3.218202590942383,-3.218202590942383,-7.535634515882377 221 | -3.218202590942383,-3.218202590942383,-7.535634515882377 222 | -3.218202590942383,-3.218202590942383,-7.535634515882377 223 | -3.218202590942383,-3.218202590942383,-7.535634515882377 224 | -3.218202590942383,-3.218202590942383,-7.535634515882377 225 | -3.218202590942383,-3.218202590942383,-7.535634515882377 226 | -3.218202590942383,-3.218202590942383,-7.535634515882377 227 | -3.218202590942383,-3.218202590942383,-7.535634515882377 228 | -3.218202590942383,-3.218202590942383,-7.535634515882377 229 | -3.218202590942383,-3.218202590942383,-7.535634515882377 230 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 231 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 232 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 233 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 234 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 235 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 236 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 237 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 238 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 239 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 240 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 241 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 242 | -0.4467509984970093,-0.4467509984970093,-7.535634515882377 243 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 244 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 245 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 246 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 247 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 248 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 249 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 250 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 251 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 252 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 253 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 254 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 255 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 256 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 257 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 258 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 259 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 260 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 261 | -4.902510643005371,-4.902510643005371,-7.535634515882377 262 | -4.902510643005371,-4.902510643005371,-7.535634515882377 263 | -4.902510643005371,-4.902510643005371,-7.535634515882377 264 | -4.902510643005371,-4.902510643005371,-7.535634515882377 265 | -4.902510643005371,-4.902510643005371,-7.535634515882377 266 | 
-4.902510643005371,-4.902510643005371,-7.535634515882377 267 | -4.902510643005371,-4.902510643005371,-7.535634515882377 268 | -4.902510643005371,-4.902510643005371,-7.535634515882377 269 | -4.902510643005371,-4.902510643005371,-7.535634515882377 270 | -4.902510643005371,-4.902510643005371,-7.535634515882377 271 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 272 | -2.0710370540618896,-2.0710370540618896,-7.535634515882377 273 | -1.1011035442352295,-1.1011035442352295,-7.535634515882377 274 | -3.218202590942383,-3.218202590942383,-7.535634515882377 275 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 276 | -4.3170342445373535,-4.3170342445373535,-7.535634515882377 277 | -3.218202590942383,-3.218202590942383,-7.535634515882377 278 | -3.218202590942383,-3.218202590942383,-7.535634515882377 279 | -3.218202590942383,-3.218202590942383,-7.535634515882377 280 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 281 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 282 | -0.38459378480911255,-0.38459378480911255,-7.535634515882377 283 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 284 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 285 | 0.17167794704437256,0.17167794704437256,-7.535634515882377 286 | 0.10094326734542847,0.10094326734542847,-7.535634515882377 287 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 288 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 289 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 290 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 291 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 292 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 293 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 294 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 295 | 0.10094326734542847,0.10094326734542847,-0.7528851766543149 296 | 0.17167794704437256,0.17167794704437256,-0.7528851766543149 297 | 0.10094326734542847,0.10094326734542847,-0.752131537838845 298 | 0.10094326734542847,0.10094326734542847,-0.752131537838845 299 | 0.19079375267028809,0.19079375267028809,-0.0041450134850838155 300 | 0.17314249277114868,0.17314249277114868,-0.003768194077348923 301 | -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/images/r2_Alarm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/images/r2_Alarm.png -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/images/r2_InsufficientData.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/images/r2_InsufficientData.png -------------------------------------------------------------------------------- /Chapter11/model_quality_monitoring/model/weather-prediction-model.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Amazon-SageMaker-Best-Practices/fedd532640075f27b21b2dec0028cf1820225f79/Chapter11/model_quality_monitoring/model/weather-prediction-model.tar.gz 
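The model quality baseline file above pairs each prediction with its ground truth label so that SageMaker Model Monitor can derive regression metrics (r2, RMSE, MAE) and constraints to compare against live endpoint traffic. As a minimal sketch only, assuming a SageMaker execution role and a placeholder S3 bucket (neither taken from the Chapter11 notebooks), a baselining job could be launched with the SageMaker Python SDK roughly as follows:

```
# Minimal sketch (placeholder bucket/role values, not from the Chapter11 notebooks):
# suggest a model quality baseline from a CSV with probability,prediction,label columns.
import sagemaker
from sagemaker.model_monitor import ModelQualityMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

session = sagemaker.Session()
role = sagemaker.get_execution_role()                           # assumes a SageMaker notebook/Studio role
baseline_output = "s3://<your-bucket>/model-quality-baseline/"  # placeholder output location

model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=session,
)

# Computes baseline regression metrics (r2, rmse, mae) and constraints from the CSV.
model_quality_monitor.suggest_baseline(
    baseline_dataset="data/model-quality-baseline-data.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="Regression",
    inference_attribute="prediction",
    ground_truth_attribute="label",
    output_s3_uri=baseline_output,
)
```

From there, the monitor's `create_monitoring_schedule` call attaches the suggested constraints to a deployed endpoint along with an S3 location for ground truth labels; the full walkthrough, including the r2 alarm captured in the images above, is in model_quality_monitoring/WeatherPredictionModelQualityMonitoring.ipynb.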
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Amazon SageMaker Best Practices 5 | 6 | Book Name 7 | 8 | This is the code repository for [Amazon SageMaker Best Practices](https://www.packtpub.com/in/data/amazon-sagemaker-best-practices), published by Packt. 9 | 10 | **Proven tips and tricks to build successful machine learning solutions on Amazon SageMaker** 11 | 12 | ## What is this book about? 13 | Amazon SageMaker is a fully managed AWS service that provides the ability to build, train, deploy, and monitor machine learning models. The book begins with a high-level overview of Amazon SageMaker capabilities that map to the various phases of the machine learning process to help set the right foundation. You'll learn efficient tactics to address data science challenges such as processing data at scale, data preparation, connecting to big data pipelines, identifying data bias, running A/B tests, and model explainability using Amazon SageMaker. 14 | 15 | This book covers the following exciting features: 16 | * Perform data bias detection with AWS Data Wrangler and SageMaker Clarify 17 | * Speed up data processing with SageMaker Feature Store 18 | * Overcome labeling bias with SageMaker Ground Truth 19 | * Improve training time with the monitoring and profiling capabilities of SageMaker Debugger 20 | * Address the challenge of model deployment automation with CI/CD using the SageMaker model registry 21 | * Explore SageMaker Neo for model optimization 22 | 23 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/1801070520) today! 24 | 25 | https://www.packtpub.com/ 26 | 27 | ## Instructions and Navigations 28 | All of the code is organized into folders. For example, Chapter04. 
29 | 30 | The code will look like the following: 31 | 32 | ``` 33 | for batch in chunks(partitions_to_add, 100): 34 | response = glue.batch_create_partition( 35 | DatabaseName=glue_db_name, 36 | TableName=glue_tbl_name, 37 | PartitionInputList=[get_part_def(p) for p in batch] 38 | ) 39 | 40 | ``` 41 | 42 | **Following is what you need for this book:** 43 | This book is for expert data scientists responsible for building machine learning applications using Amazon SageMaker. Working knowledge of Amazon SageMaker, machine learning, deep learning, and experience using Jupyter Notebooks and Python is expected. Basic knowledge of AWS related to data, security, and monitoring will help you make the most of the book. 44 | 45 | With the following software and hardware list, you can run all code files present in the book (Chapters 1-14). 46 | 47 | ### Software and Hardware List 48 | 49 | | Chapter | Software required | OS required | 50 | | -------- | -------------------------------------------------------------------- | ---------------------------------- | 51 | | 1-14 | AWS Account, Amazon SageMaker, Amazon SageMaker Studio, Amazon Athena | Windows, Mac OS X, and Linux (Any) | 52 | 53 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://static.packt-cdn.com/downloads/9781801070522_ColorImages.pdf). 54 | 55 | ### Related products 56 | * Learn Amazon SageMaker [[Packt]](https://www.packtpub.com/product/learn-amazon-sagemaker/9781800208919) [[Amazon]](https://www.amazon.com/Learn-Amazon-SageMaker-developers-scientists/dp/180020891X) 57 | 58 | ## Errata 59 | * Page 142 (Code snippet under the line "Now, we must configure the PyTorch estimator by using the infrastructure and profiler configuration as parameters:"): The object name **pt_estimator** _should be_ **estimator** 60 | 61 | ## Get to Know the Authors 62 | **Sireesha Muppala** 63 | She is a Principal Enterprise Solutions Architect, AI/ML at Amazon Web Services (AWS). Sireesha holds a PhD in computer science and a post-doctorate from the University of Colorado. She is a prolific content creator in the ML space with multiple journal articles, blogs, and public speaking engagements. Sireesha is a co-creator and instructor of the Practical Data Science specialization on Coursera. 64 | 65 | **Randy DeFauw** 66 | He is a Principal Solution Architect at AWS. He holds an MSEE from the University of Michigan, where his graduate thesis focused on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. 67 | 68 | **Shelbee Eigenbrode** 69 | She is a Principal AI and ML Specialist Solutions Architect at AWS. She holds six AWS certifications and has been in technology for 23 years, spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background to deliver and manage ML workloads at scale. 70 | ### Download a free PDF 71 | 72 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
73 |

https://packt.link/free-ebook/9781801070522

--------------------------------------------------------------------------------