3 | This repository contains the code, web pages, and config files accompanying the AWS Distributed Training Workshop.
4 |
5 |
6 | * [Workshop content](https://distributed-training-workshop.go-aws.com/)
7 |
8 | * [Presentation slides](static/tf-world-distributed-training-workshop.pdf)
9 |
--------------------------------------------------------------------------------
/archetypes/default.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "{{ replace .Name "-" " " | title }}"
3 | date: {{ .Date }}
4 | ---
5 |
--------------------------------------------------------------------------------
/config.toml:
--------------------------------------------------------------------------------
1 | baseURL = "https://distributed-training-workshop.go-aws.com"
2 | languageCode = "en-us"
3 | defaultContentLanguage = "en"
4 | title = "Distributed training with Amazon SageMaker / Amazon EKS Workshop"
5 | theme = "learn"
6 | uglyurls = true
7 | googleAnalytics = "UA-151135045-1"
8 | sectionPagesMenu = "main"
9 | pygmentsCodeFences = true
10 |
11 | [blackfriday]
12 | hrefTargetBlank = true
13 |
14 | [params]
15 | themeVariant = "mine"
16 | showVisitedLinks = false
17 | author = "Shashank Prasanna"
18 | description = "Distributed training workshop with Amazon SageMaker and Amazon EKS"
19 | disableSearch = false
20 | disableAssetsBusting = false
21 |
22 | disableInlineCopyToClipBoard = false
23 | disableShortcutsTitle = false
24 | disableLanguageSwitchingButton = false
25 | disableBreadcrumb = true
26 | disableNextPrev = true
27 | ordersectionsby = "weight"
28 |
29 | [[menu.shortcuts]]
30 | name = " @shshnkp"
31 | identifier = "tw"
32 | url = "https://twitter.com/shshnkp"
33 | weight = 1
34 |
35 | [outputs]
36 | home = [ "HTML", "AMP", "RSS", "JSON"]
37 | page = [ "HTML", "AMP"]
38 |
--------------------------------------------------------------------------------
/content/_index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Distributed Training Workshop"
3 | chapter: true
4 | weight: 1
5 | ---
6 |
7 | # Distributed Training Workshop
8 |
9 | ### Welcome to the distributed training workshop with TensorFlow on Amazon SageMaker and Amazon Elastic Kubernetes Service (EKS).
10 |
11 | #### **At the end of this workshop, you'll be able to:**
12 |
13 | #### - Identify when to consider distributed training
14 | #### - Describe different approaches to distributed training
15 | #### - Outline libraries and tools needed for distributing training workloads on large clusters
16 | #### - Demonstrate code changes required to go from single-GPU to multi-GPU distributed training
17 | #### - Demonstrate using Amazon SageMaker and Amazon EKS to run distributed training jobs
18 | #### - Apply these skills to your own deep learning problem
19 |
--------------------------------------------------------------------------------
/content/cleanup/_index.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Cleanup"
3 | date: 2019-10-27T15:25:09-07:00
4 | chapter: true
5 | weight: 6
6 | ---
7 |
8 | # Clean up resources
9 | In this section, we'll walk through the steps to clean up resources.
10 |
11 |
--------------------------------------------------------------------------------
/content/cleanup/clean_resources.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Clean up resources"
3 | date: 2019-10-31T23:12:17-07:00
4 | ---
5 |
6 | ## Amazon EKS resources
7 |
8 | #### Kill all distributed training jobs
9 | ```
10 | kubectl delete MPIJobs --all
11 | ```
12 |
13 | #### Delete StorageClass, PersistentVolumeClaim and FSx for Lustre CSI Driver
14 | {{% notice tip %}}
15 | Note: This will automatically delete the FSx for Lustre file system. Your files are safe in Amazon S3.
16 | {{% /notice %}}
17 | ```
18 | kubectl delete -f specs/storage-class-fsx-s3.yaml
19 | kubectl delete -f specs/claim-fsx-s3.yaml
20 | kubectl delete -f https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/deploy/kubernetes/manifest.yaml
21 | ```
22 | #### Delete security group
23 | ```
24 | aws ec2 delete-security-group --group-id ${SECURITY_GROUP_ID}
25 | ```
26 |
27 | #### Delete policies attached to the instance role
28 | These policies were automatically added to the node IAM roles, but we'll need to manually remove them.
29 |
30 | * Copy the role associated with the worker instances
31 | ```
32 | echo $INSTANCE_ROLE_NAME
33 | ```
34 | * Navigate to IAM console
35 | * Click on Roles on the left pane
36 | * Search for the output of `echo $INSTANCE_ROLE_NAME`
37 | * Delete the two inline policies.
38 | * `iam_alb_ingress_policy`
39 | * `iam_csi_fsx_policy`
40 |
41 | #### Finally, delete the cluster
42 | ```
43 | eksctl delete cluster aws-tf-cluster-cpu
44 | ```
45 |
46 | ## SageMaker resources
47 | SageMaker resources are easier to clean up.
48 | Log in to the SageMaker console and click **Dashboard**.
49 | Make sure that you don't have any resources shown in **green** as below. Click on any resource that is shown in green and either stop or delete it.
50 |
51 | 
52 |
53 | ## Other resources
54 | It's always a good idea to ensure that no other workshop resources are left running.
55 |
56 |
--------------------------------------------------------------------------------
/content/intro/_index.md:
--------------------------------------------------------------------------------
1 | +++
2 | title = "Introduction"
3 | date = 2019-10-27T15:22:24-07:00
4 | weight = 1
5 | chapter = true
6 | +++
7 |
8 | # Introduction
9 | In a typical machine learning development workflow, there are two main stages where you can benefit from scaling out.
10 |
11 | 
12 |
13 | 1. Running large-scale parallel experiments: In this scenario our goal is to find the best model/hyperparameters/network architecture by exploring a space of possibilities.
14 | 1. Running distributed training of a single model: In this scenario our goal is to train a single model faster, by distributing its computation across nodes in a cluster.
15 |
16 | ### The focus of this workshop is distributed training of a single model
17 |
--------------------------------------------------------------------------------
/content/intro/addressing_challenges-1.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Addressing scaling challenges - Infrastructure management"
3 | date: 2019-10-28T21:07:47-07:00
4 | weight: 4
5 | ---
6 |
7 | ### Infrastructure management
8 |
9 | 
10 |
--------------------------------------------------------------------------------
/content/intro/addressing_challenges.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Addressing scaling challenges - software dependencies"
3 | date: 2019-10-28T21:02:29-07:00
4 | weight: 3
5 | ---
6 |
7 | ### Software dependencies
8 |
9 | Containers provide a consistent, lightweight, and portable environment that includes not just the training code but also its dependencies and configuration.
10 | 
11 |
12 | Simply package up your code and push it to a container registry.
13 | The container image can then be pulled into a cluster and run at scale.
14 | 
15 |
--------------------------------------------------------------------------------
/content/intro/challenges_solution.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Challenges with scaling machine learning"
3 | date: 2019-10-28T20:56:39-07:00
4 | weight: 2
5 | ---
6 | 
7 |
8 | There are two key challenges associated with scaling machine learning computation.
9 |
10 | 1. A development setup on a single computer or instance doesn't translate well when deploying to a cluster
11 | 2. Managing infrastructure is challenging for machine learning researchers, data scientists, and developers without an IT/ops background
12 |
--------------------------------------------------------------------------------
/content/intro/horovod.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Distributed training approaches"
3 | date: 2019-10-28T21:11:22-07:00
4 | weight: 5
5 | ---
6 | 
7 |
8 | ## Horovod
9 | [horovod.ai](https://horovod.ai)
10 |
11 | Horovod is based on the MPI concepts:
12 | size, rank, local rank, allreduce, allgather, and broadcast.
13 |
14 | * Library for distributed deep learning with support for multiple frameworks, including TensorFlow
15 | * Abstracts the infrastructure details away from ML engineers
16 | * Uses ring-allreduce and the Message Passing Interface (MPI), popular in the HPC community
17 | * Infrastructure services such as Amazon SageMaker and Amazon EKS provide the container and MPI environment
18 |
19 | 
20 |
21 | 1. Forward pass on each device
22 | 1. Backward pass to compute gradients
23 | 1. ”All reduce” (average and broadcast) gradients across devices
24 | 1. Update local variables with “all reduced” gradients
25 |
26 | Horovod runs the same copy of the training script on all hosts/servers/nodes/instances. For example, the following command launches 16 worker processes across 4 servers, 4 per server:
27 |
28 | 
29 |
30 | `horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python training_script.py`
31 |
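To make these steps concrete, here is a minimal sketch of what a Horovod-enabled tf.keras training script might look like. The model, data, and hyperparameter values are illustrative placeholders rather than the workshop's actual training script (covered in a later section), and it assumes one GPU per worker process:

```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
import horovod.tensorflow.keras as hvd

hvd.init()  # each process learns its rank and the total number of workers

# Pin this process to a single GPU (one worker process per GPU)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Placeholder model and data; replace with your own
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))])
x = np.random.rand(512, 784).astype('float32')
y = tf.keras.utils.to_categorical(np.random.randint(10, size=512), 10)

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across all workers with ring-allreduce
opt = tf.keras.optimizers.SGD(lr=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Broadcast initial variables from rank 0 so all workers start from the same state
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# Save checkpoints only on rank 0 to prevent workers from corrupting them
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with `horovodrun` as shown above (or by SageMaker and Kubeflow in the following sections), the same script runs on every worker and Horovod handles the gradient exchange.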
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/_index.md:
--------------------------------------------------------------------------------
1 | +++
2 | title = "Distributed Training with Amazon EKS"
3 | date = 2019-10-21T13:21:28-07:00
4 | weight = 5
5 | chapter = true
6 | #pre = "2. "
7 | +++
8 |
9 | # Distributed Training with Amazon EKS
10 |
11 | In this section, we’ll run distributed training on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS. To run deep learning workloads on Amazon EKS, we'll install Kubeflow. The Kubeflow project includes capabilities that make deployments of machine learning (ML) workflows on Kubernetes easy. With EKS and Kubeflow, you'll still need to manage the underlying CPU and GPU instances that form your cluster. EKS and Kubeflow make it easy to manage and schedule machine learning workloads on your cluster.
12 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/build_container.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Build training container image and push it to ECR"
3 | date: 2019-10-28T16:51:02-07:00
4 | weight: 7
5 | ---
6 |
7 | #### Build a custom docker image with our training code
8 |
9 | In our Dockerfile we start with an AWS Deep Learning TensorFlow container and copy our training code into the container.
10 |
11 | ```
12 | cd ~/SageMaker/distributed-training-workshop/notebooks/part-3-kubernetes/
13 | cat Dockerfile.cpu
14 | ```
15 | `Dockerfile.cpu` Output:
16 | ```
17 | FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.14.0-cpu-py36-ubuntu16.04
18 | COPY code /opt/training/
19 | WORKDIR /opt/training
20 | ```
21 |
22 | {{% notice tip %}}
23 | Replace with `Dockerfile.gpu` if you're going to be running training on a GPU cluster.
24 | {{% /notice %}}
25 |
26 | #### Create a new Amazon ECR repository
27 |
28 | * Navigate to [ECR and create a new repository](https://console.aws.amazon.com/ecr/home)
29 | * Click create repository
30 | * Provide a repository name
31 | * Click create
32 |
33 | {{% notice tip %}}
34 | By clicking the **View push commands** button, you get the docker build and push commands, so you don't have to remember them.
35 | {{% /notice %}}
36 | 
37 | 
38 | #### Build and push the custom Docker container
39 |
40 | * Head over to the terminal on JupyterLab and log in to the AWS Deep Learning container registry
41 | ```
42 | $(aws ecr get-login --no-include-email --region us-west-2 --registry-ids 763104351884)
43 | ```
44 | * Run the `docker build` command in **Step 2** from the Docker push commands menu. Make sure to update it with the correct Dockerfile name for CPU or GPU:
45 | * For the CPU container: `docker build -t <repository-name> -f Dockerfile.cpu .`
46 | * For the GPU container: `docker build -t <repository-name> -f Dockerfile.gpu .`
47 | * Run the `docker tag` command in **Step 3** from the Docker push commands menu
48 |
49 | * Log in to your docker registry
50 | * `$(aws ecr get-login --no-include-email --region us-west-2)`
51 |
52 | * Run `docker push` command in **Step 4** from the Docker push commands menu
53 |
54 | {{% notice tip %}}
55 | What happened?
56 | (1) You first logged in to the AWS Deep Learning container registry in order to pull the deep learning container. (2) You then built your container. (3) After the container was built, you added the tag needed to push it to ECR. (4) You then logged in to your own registry. (5) Finally, you pushed the container to your registry.
57 |
58 | {{% /notice %}}
59 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/fsx_lustre.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Enable Amazon FSx for Lustre access"
3 | date: 2019-10-28T16:20:52-07:00
4 | weight: 6
5 | ---
6 |
7 | Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of workloads such as deep learning. An FSx for Lustre file system transparently presents S3 objects as files and allows you to write results back to S3.
8 |
9 | #### Install the FSx CSI Driver
10 | ```
11 | kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/deploy/kubernetes/manifest.yaml
12 | ```
13 |
14 | #### Get the VPC ID of your EKS cluster
15 | ```
16 | VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:Name,Values=eksctl-${CLUSTER_NAME}-cluster/VPC" --query "Vpcs[0].VpcId" --output text)
17 | ```
18 |
19 | #### Get subnet ID from the EC2 console
20 | Navigate to [AWS EC2 console](https://console.aws.amazon.com/ec2/v2/home) and click on **Instances**.
21 | Select one of the running instances whose name starts with the name of the EKS cluster. This instance is a node of the EKS cluster.
22 | Copy the subnet ID as shown in the image below. Click on the copy-to-clipboard icon shown next to the arrow.
23 |
24 | 
25 | Paste the subnet ID below:
26 | ```
27 | export SUBNET_ID=
28 | ```
29 |
30 | #### Create your security group for the FSx file system
31 | ```
32 | export SECURITY_GROUP_ID=$(aws ec2 create-security-group --group-name eks-fsx-security-group --vpc-id ${VPC_ID} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
33 | ```
34 |
35 | {{% notice warning %}}
36 | **Stop:** Make sure that the security group was created before proceeding.
37 | Confirm by running `echo $SECURITY_GROUP_ID`. Don't proceed if this is empty.
38 | {{% /notice %}}
39 |
40 | #### Add an ingress rule that opens up port 988 from the 192.168.0.0/16 CIDR range
41 | ```
42 | aws ec2 authorize-security-group-ingress --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr 192.168.0.0/16
43 | ```
44 |
45 | #### Update the environment variables in the storage class spec file
46 | Running `envsubst` will populate `SUBNET_ID`, `SECURITY_GROUP_ID`, and `BUCKET_NAME` in the spec:
47 | ```
48 | cd ~/SageMaker/distributed-training-workshop/notebooks/part-3-kubernetes/
49 |
50 | envsubst < specs/storage-class-fsx-s3-template.yaml > specs/storage-class-fsx-s3.yaml
51 | ```
52 |
53 | #### Deploy the StorageClass and PersistentVolumeClaim
54 | ```
55 | kubectl apply -f specs/storage-class-fsx-s3.yaml
56 | kubectl apply -f specs/claim-fsx-s3.yaml
57 | ```
58 |
59 | This will take several minutes. You can check the status by running the following command. Hit `Ctrl+C` if you don't want the terminal to be blocked. To check manually, run the command without the `-w` flag.
60 |
61 | ```
62 | kubectl get pvc fsx-claim -w
63 | ```
64 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/install_cli.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Install CLI tools"
3 | date: 2019-10-28T15:02:28-07:00
4 | weight: 2
5 | ---
6 |
7 | Navigate to the following directory for part 3 of the workshop
8 | ```
9 | cd ~/SageMaker/distributed-training-workshop/notebooks/part-3-kubernetes/
10 | ```
11 |
12 |
13 | #### Install `eksctl`
14 |
15 | To get started, we'll first install the eksctl CLI tool. [eksctl](https://eksctl.io) simplifies the process of creating EKS clusters.
16 |
17 | ```bash
18 | pip install awscli --upgrade --user
19 | curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
20 |
21 | ```
22 |
23 | Move eksctl to /usr/local/bin so that it's on the PATH
24 |
25 | ```
26 | sudo mv /tmp/eksctl /usr/local/bin
27 | eksctl version
28 |
29 | ```
30 |
31 | #### Install `kubectl`
32 | kubectl is a command line interface for running commands against Kubernetes clusters. Run the following to install kubectl:
33 |
34 | ```bash
35 | curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/kubectl
36 | chmod +x ./kubectl
37 | sudo mv ./kubectl /usr/local/bin
38 | kubectl version --short --client
39 |
40 | ```
41 |
42 | #### Install `aws-iam-authenticator`
43 |
44 | ```
45 | curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/aws-iam-authenticator
46 |
47 | chmod +x ./aws-iam-authenticator
48 |
49 | sudo mv aws-iam-authenticator /usr/local/bin
50 | ```
51 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/install_kubeflow.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Install Kubeflow"
3 | date: 2019-10-28T15:42:44-07:00
4 | weight: 5
5 | ---
6 |
7 | #### Download the kfctl CLI tool
8 |
9 | ```
10 | curl --silent --location https://github.com/kubeflow/kubeflow/releases/download/v0.7.0-rc.6/kfctl_v0.7.0-rc.5-7-gc66ebff3_linux.tar.gz | tar xz
11 |
12 | sudo mv kfctl /usr/local/bin
13 | ```
14 |
15 | #### Get the latest Kubeflow configuration file
16 |
17 | ```
18 | export CONFIG='https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.0.yaml'
19 | ```
20 |
21 | #### Create environment and local variables
22 |
23 | ```
24 | CLUSTER_NAME=$(eksctl get cluster --output=json | jq '.[0].name' --raw-output)
25 |
26 | INSTANCE_ROLE_NAME=$(eksctl get iamidentitymapping --name ${CLUSTER_NAME} --output=json | jq '.[0].rolearn' --raw-output | sed -e 's/.*\///')
27 | ```
28 |
29 | {{% notice warning %}}
30 | Make sure that both environment variables are set before proceeding.
31 | Confirm by running `echo $CLUSTER_NAME` and `echo $INSTANCE_ROLE_NAME`.
32 | Make sure that these are not empty.
33 | {{% /notice %}}
34 |
35 | Add your S3 bucket name below:
36 | ```
37 | export BUCKET_NAME=
38 | ```
39 |
40 | {{% notice warning %}}
41 | **Stop:** Verify that you have the correct bucket name before proceeding.
42 | {{% /notice %}}
43 |
44 | ```
45 | export KF_NAME=${CLUSTER_NAME}
46 | export KF_DIR=$PWD/${KF_NAME}
47 | ```
48 |
49 | #### Build your configuration files
50 | We'll edit the configuration with the right names for the cluster and node groups before deploying Kubeflow.
51 |
52 | ```
53 | mkdir -p ${KF_DIR}
54 | cd ${KF_DIR}
55 | kfctl build -V -f ${CONFIG}
56 | export CONFIG_FILE=${KF_DIR}/kfctl_aws.0.7.0.yaml
57 |
58 | ```
59 |
60 | #### Edit the configuration file to include the correct instance role name and cluster name
61 | ```
62 | sed -i "s@eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@$INSTANCE_ROLE_NAME@" ${CONFIG_FILE}
63 |
64 | sed -i "s@kubeflow-aws@$CLUSTER_NAME@" ${CONFIG_FILE}
65 |
66 | ```
67 |
68 | #### Apply the changes and deploy Kubeflow
69 | ```
70 | cd ${KF_DIR}
71 | rm -rf kustomize/
72 | kfctl apply -V -f ${CONFIG_FILE}
73 | ```
74 |
75 | #### Wait for resources to become available
76 |
77 | Monitor progress by running the following command, which lists all resources in the `kubeflow` namespace.
78 | ```
79 | kubectl -n kubeflow get all
80 | ```
81 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/setup_eks.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Setup an Amazon EKS cluster"
3 | date: 2019-10-28T15:14:12-07:00
4 | weight: 3
5 | ---
6 |
7 | Navigate to ***distributed-training-workshop > notebooks > part-3-kubernetes***
8 | 
9 | The `cpu_eks_cluster.sh` and `gpu_eks_cluster.sh` files include the necessary options to launch a CPU or GPU cluster. Take a look at the options by printing the contents of the script:
10 |
11 | ```bash
12 | cd ~/SageMaker/distributed-training-workshop/notebooks/part-3-kubernetes/
13 | cat cpu_eks_cluster.sh
14 | ```
15 | You should see the following output
16 | ```
17 | Output:
18 | eksctl create cluster \
19 | --name aws-tf-cluster-cpu \
20 | --version 1.14 \
21 | --region us-west-2 \
22 | --nodegroup-name cpu-nodes \
23 | --node-type c5.xlarge \
24 | --nodes 2 \
25 | --node-volume-size 50 \
26 | --node-zones us-west-2a \
27 | --timeout=40m \
28 | --zones=us-west-2a,us-west-2b,us-west-2c \
29 | --auto-kubeconfig
30 | ```
31 |
32 | {{% notice tip %}}
33 | To launch a cluster with GPUs, use the script `gpu_eks_cluster.sh` instead. If you wish to launch a cluster with more than 2 nodes, update the `--nodes` argument to the number of nodes you want in the cluster.
34 | {{% /notice %}}
35 |
36 | Now launch an EKS cluster:
37 | ```
38 | sh cpu_eks_cluster.sh
39 | ```
40 |
41 | You should see output similar to this.
42 |
43 | 
44 |
45 | Creating a cluster may take about 15 minutes. You can head over to the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation) to monitor the progress.
46 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/submit_job.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Submit distributed training job"
3 | date: 2019-10-28T17:14:05-07:00
4 | weight: 8
5 | ---
6 | #### Confirm that you are in the right directory
7 | ```
8 | cd ~/SageMaker/distributed-training-workshop/notebooks/part-3-kubernetes/
9 | ```
10 | #### Copy the container image name
11 |
12 | 
13 |
14 |
15 | #### Update the MPIJob spec file
16 |
17 | Open `specs/eks_tf_training_job-cpu.yaml` and update `image: ` with the name of your container.
18 |
19 | 
20 |
21 | #### Submit a job run:
22 | ```
23 | kubectl apply -f specs/eks_tf_training_job-cpu.yaml
24 | ```
25 | {{% notice tip %}}
26 | For GPU jobs use this instead: `eks_tf_training_job-gpu.yaml`
27 | {{% /notice %}}
28 |
29 | You should see an output something like this:
30 | ```
31 | mpijob.kubeflow.org/eks-tf-distributed-training created
32 | ```
33 | Running `kubectl get pods` should show you a number of pods equal to the number of workers + 1 (the launcher).
34 |
35 | ```bash
36 | $ kubectl get pods
37 | NAME READY STATUS RESTARTS AGE
38 | eks-tf-distributed-training-launcher-6lgzg 1/1 Running 0 63s
39 | eks-tf-distributed-training-worker-0 1/1 Running 0 66s
40 | eks-tf-distributed-training-worker-1 1/1 Running 0 66s
41 | ```
42 |
43 | To observe the training logs, run `kubectl logs ` followed by the name of the launcher pod from the list. You can use tab completion or copy the name of the pod from the output of `kubectl get pods`.
44 |
45 | ```
46 | kubectl logs eks-tf-distributed-training-launcher-
47 | ```
48 |
49 | output:
50 | ```
51 | ...
52 | Epoch 1/30
53 | Epoch 1/30
54 | 3/78 [>.............................] - ETA: 4:05 - loss: 3.6816 - acc: 0.1172 3/724/78 [========>.....................] - ETA: 1:29 - loss: 2.7493 - acc: 0.161024/778/78 [==============================] - 128s 2s/step - loss: 2.1984 - acc: 0.2268 - val_loss: 2.1794 - val_acc: 0.1699
55 | Epoch 2/30
56 | 78/78 [==============================] - 129s 2s/step - loss: 2.2108 - acc: 0.2268 - val_loss: 2.1794 - val_acc: 0.1699
57 | Epoch 2/30
58 | ```
59 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/verify_cluster.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Verify installation and test cluster"
3 | date: 2019-10-28T15:33:52-07:00
4 | weight: 4
5 | ---
6 |
7 | Once the cluster is up and running, you should see a message that your cluster is now ready.
8 | 
9 |
10 | Update the kubeconfig file to point to our new cluster.
11 | If you chose a different name for your cluster (other than aws-tf-cluster-cpu), be sure to use that name below.
12 |
13 | ```
14 | aws eks --region us-west-2 update-kubeconfig --name aws-tf-cluster-cpu
15 | ```
16 |
17 | Run the following to confirm that you can access the EKS cluster:
18 |
19 | You should see a list of kubernetes namespaces:
20 | ```
21 | kubectl get ns
22 | ```
23 | ```
24 | Output:
25 | NAME STATUS AGE
26 | default Active 12m
27 | kube-node-lease Active 13m
28 | kube-public Active 13m
29 | kube-system Active 13m
30 | ```
31 |
32 | You should see the total number of nodes in your cluster:
33 | ```
34 | kubectl get nodes
35 | ```
36 | ```
37 | Output:
38 | NAME STATUS ROLES AGE VERSION
39 | ip-192-168-10-211.us-west-2.compute.internal Ready 7m3s v1.14.7-eks-1861c5
40 | ip-192-168-10-229.us-west-2.compute.internal Ready 7m4s v1.14.7-eks-1861c5
41 | ```
42 |
--------------------------------------------------------------------------------
/content/kubernetes_dist_training/workflow.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Workflow"
3 | date: 2019-10-28T14:18:11-07:00
4 | weight: 1
5 | ---
6 |
7 | Navigate to
8 | ***distributed-training-workshop > notebooks > part-3-kubernetes***
9 | You should see the following files:
10 |
11 | ```bash
12 | part-3-kubernetes/
13 | ├── Dockerfile
14 | ├── cpu_eks_cluster.sh
15 | ├── gpu_eks_cluster.sh
16 | ├── code
17 | │   ├── cifar10-multi-gpu-horovod-k8s.py
18 | │   └── model_def.py
19 | └── specs
20 |     ├── claim-fsx-s3.yaml
21 |     ├── eks_tf_training_job.yaml
22 |     ├── fsx_lustre_policy.json
23 |     └── storage-class-fsx-s3-template.yaml
24 | ```
25 |
26 | |Files/directories|Description|
27 | |-----|-----|
28 | |Dockerfile | Use this to build a custom container image for training on Amazon EKS|
29 | |cpu_eks_cluster.sh, gpu_eks_cluster.sh |Shell scripts that use the `eksctl` CLI tool to launch an Amazon EKS cluster|
30 | |code|Contains the training script and other training script dependencies|
31 | |specs|List of spec files required to configure Kubeflow|
32 |
33 | 
34 |
35 | We'll first need to set up Amazon EKS and an Amazon FSx for Lustre file system, and install Kubeflow. This involves multiple steps, and we'll leverage various CLI tools to help install, configure, and interact with EKS. At a high level, we'll perform the following steps:
36 |
37 | 1. Install `eksctl` CLI and use it to launch an Amazon EKS cluster
38 | 1. Install `kubectl` CLI to interact with the Amazon EKS cluster
39 | 1. Install `kfctl` CLI and use it to configure and install Kubeflow
40 | 1. Allow Amazon EKS to access Amazon FSx for Lustre file system that's linked to an Amazon S3 bucket
41 | 1. Finally, launch a distributed training job
42 |
--------------------------------------------------------------------------------
/content/sagemaker_dist_training/_index.md:
--------------------------------------------------------------------------------
1 | +++
2 | title = "Distributed Training with Amazon Sagemaker"
3 | date = 2019-10-21T13:21:01-07:00
4 | weight = 4
5 | chapter = true
6 | # pre = "1. "
7 | +++
8 |
9 | # Distributed training with Amazon SageMaker
10 |
11 | In this section, we'll run distributed training on Amazon SageMaker. We'll provide SageMaker with our updated training script that uses the horovod API, and SageMaker will take care of the rest: spinning up the requested number of CPU or GPU instances, copying the training code and dependencies to the training cluster, copying the dataset from Amazon S3 to the training cluster, keeping track of training progress, and shutting down the instances once training is done. Amazon SageMaker is a fully managed service, so you don't have to worry about managing instances.
12 |
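As a preview, the core of the notebook covered in the next pages boils down to a few lines of the SageMaker Python SDK. This is a condensed sketch, not the full notebook: the bucket name is a placeholder, and the script name and instance settings mirror the ones used later.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# Horovod workers are launched via MPI: 2 instances x 1 process per instance
distributions = {'mpi': {'enabled': True, 'processes_per_host': 1}}

estimator = TensorFlow(entry_point='cifar10-multi-gpu-horovod-sagemaker.py',
                       source_dir='code',
                       role=role,
                       framework_version='1.14',
                       py_version='py3',
                       train_instance_count=2,
                       train_instance_type='ml.c5.xlarge',
                       distributions=distributions)

# Point each channel at the dataset in S3 and kick off the managed training job
estimator.fit({'train': 's3://<your-bucket>/cifar10-dataset/train',
               'validation': 's3://<your-bucket>/cifar10-dataset/validation',
               'eval': 's3://<your-bucket>/cifar10-dataset/eval'})
```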
--------------------------------------------------------------------------------
/content/sagemaker_dist_training/monitoring_results.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Monitoring training progress"
3 | date: 2019-10-28T13:54:27-07:00
4 | weight: 4
5 | ---
6 |
7 | ### Monitoring training progress using tensorboard
8 |
9 | The ***cifar10-sagemaker-distributed.ipynb*** notebook will automatically start a TensorBoard server for you when you run the following cell. TensorBoard runs locally on your Jupyter notebook instance, but reads the events from the Amazon S3 bucket where we saved them using the Keras callback.
10 |
11 | ```bash
12 | !S3_REGION=us-west-2 tensorboard --logdir s3://{bucket_name}/tensorboard_logs/
13 | ```
14 |
15 | Navigate to https://tfworld2019.notebook.us-west-2.sagemaker.aws/proxy/6006/
16 |
17 | Replace `tfworld2019` with the name of your Jupyter notebook instance.
18 | 
19 |
20 | ### Monitoring training job status on the AWS SageMaker console
21 |
22 | Navigate to ***AWS management console > SageMaker console*** to see a full list of training jobs and their status.
23 |
24 | 
25 |
26 | To view cloudwatch logs from the training instances, click on the ***training job name > Monitor > View logs***
27 |
--------------------------------------------------------------------------------
/content/sagemaker_dist_training/sagemaker_training.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "SageMaker distributed training"
3 | date: 2019-10-28T13:17:44-07:00
4 | weight: 3
5 | ---
6 |
7 | Open `cifar10-sagemaker-distributed.ipynb` and run through the cells. The following notebook is located at:
8 | ***distributed-training-workshop > notebooks > part-2-sagemaker > cifar10-sagemaker-distributed.ipynb***
9 |
10 | 
11 |
12 | {{% notice warning %}}
13 | **Stop:** Do this section on JupyterLab. Below is a copy of the jupyter notebook for reference.
14 | {{% /notice %}}
15 |
16 | ----
17 |
18 |
19 | ## Distributed training with Amazon SageMaker
20 |
21 | In this notebook we use the SageMaker Python SDK to setup and run a distributed training job.
22 | SageMaker makes it easy to train models across a cluster containing a large number of machines, without having to explicitly manage those resources.
23 |
24 | **Step 1:** Import essential packages, start a SageMaker session, and specify the bucket name you created in the prerequisites section of this workshop.
25 |
26 |
27 | ```python
28 | import os
29 | import time
30 | import numpy as np
31 | import sagemaker
32 |
33 | sagemaker_session = sagemaker.Session()
34 | role = sagemaker.get_execution_role()
35 | bucket_name = 'tfworld2019-'
36 | ```
37 |
38 | **Step 2:** Specify the hyperparameters, instance type, and number of instances to distribute training across. `hvd_processes_per_host` corresponds to the number of GPUs per instance.
39 | For example, if you choose:
40 | ```
41 | hvd_instance_type = 'ml.p3.8xlarge'
42 | hvd_instance_count = 2
43 | hvd_processes_per_host = 4
44 | ```
45 |
46 | Since a p3.8xlarge instance has 4 GPUs, we'll be distributing training across 8 workers, 1 per GPU.
47 | This is spread across 2 instances (or nodes). SageMaker automatically takes care of spinning up these instances and making sure they can communicate with each other.
48 |
49 |
50 | ```python
51 | hyperparameters = {'epochs': 100,
52 |                    'learning-rate': 0.001,
53 |                    'momentum': 0.9,
54 |                    'weight-decay': 2e-4,
55 |                    'optimizer': 'adam',
56 |                    'batch-size': 256}
57 |
58 | hvd_instance_type = 'ml.c5.xlarge'
59 | hvd_instance_count = 2
60 | hvd_processes_per_host = 1
61 |
62 | print('Distributed training with a total of {} workers'.format(hvd_processes_per_host*hvd_instance_count))
63 | print('{} x {} instances with {} processes per instance'.format(hvd_instance_count, hvd_instance_type, hvd_processes_per_host))
64 | ```
65 |
66 | **Step 3:** In this cell we create a SageMaker estimator by providing it with all the information it needs to launch instances and execute training on those instances.
67 | 
68 | Since we're using horovod for distributed training, we set `distributions` to use MPI, which horovod relies on.
69 | 
70 | In the TensorFlow estimator call, we specify the training script under `entry_point` and its dependencies under `source_dir` (the `code` directory). SageMaker automatically copies these files into a TensorFlow container behind the scenes and executes them on the training instances.
71 |
72 |
73 | ```python
74 | from sagemaker.tensorflow import TensorFlow
75 |
76 | output_path = 's3://{}/'.format(bucket_name)
77 | job_name = 'sm-dist-{}x{}-workers'.format(hvd_instance_count, hvd_processes_per_host) + time.strftime('%Y-%m-%d-%H-%M-%S-%j', time.gmtime())
78 | model_dir = output_path + 'tensorboard_logs/' + job_name
79 |
80 | distributions = {'mpi': {
81 |                     'enabled': True,
82 |                     'processes_per_host': hvd_processes_per_host,
83 |                     'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
84 |                     }
85 |                  }
86 | 
87 | estimator_hvd = TensorFlow(base_job_name='hvd-cifar10-tf',
88 |                            source_dir='code',
89 |                            entry_point='cifar10-multi-gpu-horovod-sagemaker.py',
90 |                            role=role,
91 |                            framework_version='1.14',
92 |                            py_version='py3',
93 |                            hyperparameters=hyperparameters,
94 |                            train_instance_count=hvd_instance_count,
95 |                            train_instance_type=hvd_instance_type,
96 |                            output_path=output_path,
97 |                            model_dir=model_dir,
98 |                            tags=[{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'dist'}],
99 |                            metric_definitions=[{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}],
100 |                            distributions=distributions)
101 | ```
102 |
103 | **Step 4:** Specify dataset locations in Amazon S3 and then call the fit function.
104 |
105 |
106 | ```python
107 | train_path = 's3://{}/cifar10-dataset/train'.format(bucket_name)
108 | val_path = 's3://{}/cifar10-dataset/validation'.format(bucket_name)
109 | eval_path = 's3://{}/cifar10-dataset/eval/'.format(bucket_name)
110 |
111 | estimator_hvd.fit({'train': train_path,'validation': val_path,'eval': eval_path},
112 | job_name=job_name, wait=False)
113 | ```
114 |
115 | **Step 5:** Monitor progress with TensorBoard. Run the following cell to launch TensorBoard, then open the link below in a new tab to visualize training progress.
116 |
117 |
118 | ```python
119 | !S3_REGION=us-west-2 tensorboard --logdir s3://{bucket_name}/tensorboard_logs/
120 | ```
121 |
122 | Open a new browser tab and navigate to the following link to access TensorBoard:
123 | https://tfworld2019.notebook.us-west-2.sagemaker.aws/proxy/6006/
124 | Make sure that the name of the notebook instance is correct in the link above.
125 | Don't forget the trailing slash at the end of the URL (`.../proxy/6006/`).
126 |
--------------------------------------------------------------------------------
/content/sagemaker_dist_training/training_scrip_updates.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Updates required to run on SageMaker"
3 | date: 2019-10-28T13:42:13-07:00
4 | weight: 2
5 | ---
6 |
7 | There are a few minor changes required to run a training script on Amazon SageMaker.
8 |
9 |
10 | ##### SageMaker hyperparameters
11 | * SageMaker passes hyperparameters to the training script as command-line arguments, so your script must be able to parse them. For example, a hyperparameter entry `'epochs': 30` arrives as `--epochs 30` on the command line.
12 |
13 | ##### SageMaker environment variables
14 | * SageMaker makes several environment variables available inside the container that a training script can use to find the location of the training dataset, the number of GPUs in the instance, the dataset channels, and more. A full list of environment variables can be found in the [SageMaker containers GitHub repository](https://github.com/aws/sagemaker-containers#important-environment-variables).
15 |
16 | ```
17 | parser = argparse.ArgumentParser()
18 |
19 | # Hyper-parameters
20 | parser.add_argument('--epochs', type=int, default=15)
21 | parser.add_argument('--learning-rate', type=float, default=0.001)
22 | parser.add_argument('--batch-size', type=int, default=256)
23 | parser.add_argument('--weight-decay', type=float, default=2e-4)
24 | parser.add_argument('--momentum', type=float, default=0.9)
25 | parser.add_argument('--optimizer', type=str, default='adam')
26 |
27 | # SageMaker parameters
28 | parser.add_argument('--model_dir', type=str)
29 | parser.add_argument('--model_output_dir', type=str, default=os.environ['SM_MODEL_DIR'])
30 | parser.add_argument('--output_data_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
31 |
32 | # Data directories and other options
33 | parser.add_argument('--gpu-count', type=int, default=os.environ['SM_NUM_GPUS'])
34 | parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
35 | parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])
36 | parser.add_argument('--eval', type=str, default=os.environ['SM_CHANNEL_EVAL'])
37 |
38 | args = parser.parse_args()
39 | ```
40 |
41 | ##### (Optional) TensorBoard callback for real-time monitoring of training
42 | * Using a Keras callback, we can upload TensorBoard event files to Amazon S3 so that we can monitor progress in real time.
43 | TensorBoard already comes installed on the SageMaker JupyterLab instance and supports reading event files from Amazon S3.
44 |
45 | `tensorboard --logdir s3://{bucket_name}/tensorboard_logs/`
46 |
47 | ```
48 | class Sync2S3(tf.keras.callbacks.Callback):
49 |     def __init__(self, logdir, s3logdir):
50 |         super(Sync2S3, self).__init__()
51 |         self.logdir = logdir
52 |         self.s3logdir = s3logdir
53 | 
54 |     def on_epoch_end(self, batch, logs={}):
55 |         os.system('aws s3 sync '+self.logdir+' '+self.s3logdir)
56 |         # append ' >/dev/null 2>&1' above to silence the sync output
57 | ```
58 |
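A sketch of how this callback might be attached during training; the local log directory is a placeholder, and `bucket_name`, `model`, and the training data are assumed to be defined earlier in the script:

```python
logdir = '/tmp/tb_logs'                                      # local TensorBoard event files
s3logdir = 's3://{}/tensorboard_logs/'.format(bucket_name)   # synced copy in S3

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir=logdir),  # writes event files locally
    Sync2S3(logdir=logdir, s3logdir=s3logdir),       # pushes them to S3 after each epoch
]

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=args.epochs,
          callbacks=callbacks)
```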
--------------------------------------------------------------------------------
/content/sagemaker_dist_training/workflow.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Workflow"
3 | date: 2019-10-28T12:59:15-07:00
4 | weight: 1
5 | ---
6 | Navigate to
7 | ***distributed-training-workshop > notebooks > part-2-sagemaker***
8 | You should see the following files:
9 |
10 | ```bash
11 | part-2-sagemaker/
12 | ├── cifar10-sagemaker-distributed.ipynb
13 | └── code
14 |     ├── cifar10-multi-gpu-horovod-sagemaker.py
15 |     └── model_def.py
16 | ```
17 |
18 | |Files/directories|Description|
19 | |-----|-----|
20 | |cifar10-sagemaker-distributed.ipynb |This jupyter notebook contains code to define and kick off a SageMaker training job|
21 | |code |This directory contains the training script and other training script dependencies|
22 |
23 | 
24 |
25 | SageMaker is a fully-managed service, which means that when you kick off a training job using the SageMaker SDK in the `cifar10-sagemaker-distributed.ipynb` notebook, a few different things happen behind the scenes:
26 |
27 | * SageMaker spins up the requested number of instances in a fully-managed SageMaker cluster
28 | * SageMaker pulls the latest (or specified) version of the TensorFlow container image, instantiates it on the new instances, and loads the contents of the `code` directory into the container
29 | * SageMaker runs the training script on each instance. Since we're running distributed training, SageMaker launches an MPI job with the right settings so that workers can communicate with each other.
30 | * SageMaker copies the dataset over from Amazon S3 and makes it available inside the container for training
31 | * SageMaker monitors the training and updates progress on the Amazon SageMaker console
32 | * SageMaker copies all the code and model artifacts to Amazon S3 after the training is finished
33 |
34 | In addition, SageMaker does a lot more to ensure that the jobs run optimally and you get the best performance out of the box. As a user, you don't have to worry about managing machine learning infrastructure.
35 |
--------------------------------------------------------------------------------
/content/setup/_index.md:
--------------------------------------------------------------------------------
1 | +++
2 | title = "Prerequisites"
3 | date = 2019-10-21T13:30:44-07:00
4 | weight = 2
5 | chapter = true
6 | #pre = "0. "
7 | +++
8 |
9 | # Getting Started
10 | In this section, we'll set up our development environment.
11 | We'll be using an Amazon SageMaker notebook instance which is a fully managed compute instance running the Jupyter Notebook server.
12 |
--------------------------------------------------------------------------------
/content/setup/add_admin_policy.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Update notebook IAM role"
3 | date: 2019-10-27T23:41:36-07:00
4 | weight: 3
5 | ---
6 |
7 | ### Give your notebook instances admin privileges
8 | {{% notice warning %}}
9 | **Note:** We're providing admin privileges to the SageMaker notebook instance only because we'll be using the same instance to launch an Amazon EKS cluster in the later part of the workshop. If you're only going to be using a SageMaker managed cluster for training, an S3 full access policy should suffice.
10 | {{% /notice %}}
11 |
12 | * Click on the **tfworld2019** instance and you'll see additional details about it. Click on the IAM role link; this should take you to the IAM Management Console. Once there, click the **Attach policies** button.
13 | 
14 | 
15 | * Select **AdministratorAccess** and click on **Attach policy**
16 | 
17 | * Close the IAM Management Console window and head back to the SageMaker console.
18 |
--------------------------------------------------------------------------------
/content/setup/download_workshop.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Download the workshop content"
3 | date: 2019-10-28T00:14:06-07:00
4 | weight: 4
5 | ---
6 | ### Launch JupyterLab client and clone the workshop repository
7 | * Your notebook instance should now be ready. Click *JupyterLab* to launch your client.
8 | 
9 |
10 | * Click *File > New > Terminal* to launch terminal in your JupyterLab instance.
11 | 
12 |
13 | * Download the workshop code and notebooks. Enter bash (optional), change directory to ~/SageMaker, and clone the repository:
14 | ```bash
15 | bash
16 | cd ~/SageMaker
17 | git clone https://github.com/shashankprasanna/distributed-training-workshop.git
18 | ```
19 |
20 | * Confirm that you're able to see the contents. You should see 3 parts:
21 | ```
22 | ls distributed-training-workshop/notebooks
23 | ```
24 |
--------------------------------------------------------------------------------
/content/setup/sm_jupyter_instance.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Launch a SageMaker notebook instance"
3 | date: 2019-10-27T22:39:43-07:00
4 | weight: 2
5 | ---
6 |
7 | {{% notice info %}}
8 | **Note:** In this workshop, we'll be using an Amazon SageMaker notebook instance for simplicity and convenience. You can use any local client to perform the steps detailed in this and subsequent sections. You'll just need to make sure you have the right privileges to access AWS services such as SageMaker, EKS, S3, ECR, and others from your client. You'll also need the AWS Command Line Interface (AWS CLI), Python, boto3, and the SageMaker SDK installed. The SageMaker Jupyter notebook, on the other hand, is preconfigured and ready to use.
9 | {{% /notice %}}
10 |
11 | ### Launch an Amazon SageMaker notebook instance
12 |
13 | * Open the [AWS Management Console](https://console.aws.amazon.com/console/home)
14 | {{% notice info %}}
15 | **Note:** This workshop has been tested on the US West (Oregon) (us-west-2) region. Make sure that you see **Oregon** on the top right hand corner of your AWS Management Console. If you see a different region, click the dropdown menu and select US West (Oregon)
16 | {{% /notice %}}
17 |
18 | * In the AWS Console search bar, type SageMaker and select Amazon SageMaker to open the service console.
19 | 
20 | * Click on Notebook Instances
21 | 
22 | * From the Amazon SageMaker > Notebook instances page, select Create notebook instance.
23 | 
24 | * In the Notebook instance name text box, enter a name for the notebook instance.
25 | * For this workshop select **"tfworld2019"** as the instance name
26 | * Choose ml.c5.xlarge. We'll only be using this instance to launch jobs. The training jobs themselves will run either on a SageMaker managed cluster or an Amazon EKS cluster
27 | * Set the volume size to 50 GB - this is only needed for building docker containers. During training, data is copied directly from Amazon S3 to the training cluster when using SageMaker. When using Amazon EKS, we'll set up an FSx for Lustre file system that worker nodes will use to access the training data.
28 | 
29 | * To create an IAM role, from the IAM role drop-down list, select Create a new role. In the Create an IAM role dialog box, select Any S3 bucket. Then select **Create role**. Amazon SageMaker creates a role named **AmazonSageMaker-ExecutionRole-\***.
30 | 
31 | * Keep the default settings for the other options and click Create notebook instance. In the **Notebook instances** section, you should see the status change from *Pending* to *InService*
32 | * While the notebook instance spins up, continue to work on the next section, and we'll come back and launch the instance when it's ready.
33 |
--------------------------------------------------------------------------------
/content/update_code_dist_training/_index.md:
--------------------------------------------------------------------------------
1 | +++
2 | title = "Prepare your training scripts"
3 | date = 2019-10-28T00:51:31-07:00
4 | weight = 3
5 | chapter = true
6 | +++
7 |
8 | # Prepare your training scripts
9 |
10 | In this section, we'll walk through the process of modifying an existing TensorFlow-Keras training script so that it can perform training in a distributed environment.
11 |
--------------------------------------------------------------------------------
/content/update_code_dist_training/distributed_training_script.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Using horovod API for distributed training"
3 | date: 2019-10-28T02:47:30-07:00
4 | weight: 4
5 | ---
6 |
7 | ## Exercise 1: Convert training script to use horovod
8 |
9 | In this section, you'll update the training script with the horovod API to run distributed training.
10 |
11 | Open `cifar10-distributed.ipynb` and run through the cells. The following notebook is located at:
12 | ***distributed-training-workshop > notebooks > part-1-horovod***
13 |
14 | 
15 |
16 | {{% notice warning %}}
17 | **Stop:** Do this section on JupyterLab. Below is a copy of the jupyter notebook for reference.
18 | Open `cifar10-distributed.ipynb` and run these cells.
19 | Look for cells that say **Change X** and fill in those cells with the modifications - where **X** is the change number. There are a total of 8 changes.
20 | Click on **> Solution** to see the answers
21 | {{% /notice %}}
22 |
23 | You'll need to make the following modifications to your training script to use horovod for distributed training.
24 |
25 | 1. Run hvd.init()
26 | 2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list.
27 | 3. Scale the learning rate by the number of workers.
28 | 4. Wrap the optimizer in hvd.DistributedOptimizer.
29 | 5. Add hvd.callbacks.BroadcastGlobalVariablesCallback(0) to broadcast initial variable states from rank 0 to all other processes.
30 | 6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.
31 |
32 |
33 |
34 | #### Change 1: Import horovod and keras backend
35 |
36 |
37 | ```python
38 | import tensorflow as tf
39 |
40 |
41 |
42 | ```
43 |
44 | Solution
45 |
46 | import horovod.tensorflow.keras as hvd
47 | import tensorflow.keras.backend as K
48 |
200 |
201 |
202 |
203 | ```python
204 | model = get_model(lr, weight_decay, optimizer, momentum)
205 | ```
206 |
207 | #### Change 4: How will you update the learning rate for distributed training? What changes should you make to the following command?
208 |
209 |
210 | ```python
211 | opt = SGD(lr=lr, decay=weight_decay, momentum=momentum)
212 | ```
213 |
214 | Solution
215 |
216 | opt = SGD(lr=lr * size, decay=weight_decay, momentum=momentum)
217 |
218 | You need to scale the learning rate by the size of the cluster (the total number of workers). For example, with 8 workers and a base learning rate of 0.001, each worker would use a learning rate of 0.008.
219 |
220 |
221 |
222 | #### Change 6: How will you convert the optimizer to a distributed optimizer?
223 |
224 |
225 | ```python
226 | model.compile(loss='categorical_crossentropy',
227 | optimizer=opt,
228 | metrics=['accuracy'])
229 | ```
230 |
231 | Solution
232 |
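Based on modification 4 in the list above, a sketch of the expected solution is to wrap the existing optimizer with Horovod's distributed optimizer before compiling the model (the notebook's solution cell may differ slightly):

```python
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
```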