├── .dockerignore ├── .github ├── ISSUE_TEMPLATE │ ├── bug-report.md │ ├── documentation.md │ ├── feature-request.md │ └── questions-help-support.md └── PULL_REQUEST_TEMPLATE │ └── pull_request_template.md ├── .gitignore ├── CHANGELOG.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Dockerfile ├── LICENSE ├── README.md ├── __init__.py ├── aws ├── README.md ├── __init__.py ├── auth │ ├── __init__.py │ └── session.py ├── autoscaling.py ├── cfn │ └── setup.yml ├── cloudformation.py ├── config │ ├── sample_specs.json │ ├── user_data_rdzv │ └── user_data_worker ├── petctl.py ├── requirements.txt ├── s3.py └── util.py ├── azure ├── README.md ├── config │ ├── Dockerfile │ ├── kubernetes.json │ └── sample_specs.yaml ├── petctl.py └── util.py ├── design ├── kubernetes │ └── torchelastic-operator-design.md └── torchelastic │ └── 0.2.0 │ ├── design_doc.md │ ├── torchelastic_agent_diagram.jpg │ └── torchelastic_diagram.jpg ├── docs ├── Makefile ├── doc_push.sh ├── requirements.txt ├── source │ ├── _static │ │ └── img │ │ │ ├── efs-setup.jpg │ │ │ ├── pytorch-logo-dark.svg │ │ │ └── pytorch-logo-flame.png │ ├── conf.py │ ├── index.rst │ └── scripts │ │ └── create_redirect_md.py └── src │ └── pip-delete-this-directory.txt ├── examples ├── Dockerfile ├── README.md ├── bin │ ├── fetch_and_run │ └── install_etcd ├── imagenet │ └── main.py └── multi_container │ ├── Dockerfile │ ├── README.md │ ├── docker-compose.yaml │ └── echo.py ├── kubernetes ├── DEVELOPMENT.md ├── Dockerfile ├── Makefile ├── PROJECT ├── README.md ├── TROUBLESHOOTING.md ├── api │ └── v1alpha1 │ │ ├── constants.go │ │ ├── elasticjob_types.go │ │ ├── groupversion_info.go │ │ └── zz_generated.deepcopy.go ├── config │ ├── crd │ │ ├── bases │ │ │ └── elastic.pytorch.org_elasticjobs.yaml │ │ ├── kustomization.yaml │ │ └── kustomizeconfig.yaml │ ├── default │ │ └── kustomization.yaml │ ├── manager │ │ ├── kustomization.yaml │ │ └── manager.yaml │ ├── rbac │ │ ├── elasticjob_editor_role.yaml │ │ ├── elasticjob_viewer_role.yaml │ │ ├── kustomization.yaml │ │ ├── leader_election_role.yaml │ │ ├── leader_election_role_binding.yaml │ │ ├── role.yaml │ │ └── role_binding.yaml │ └── samples │ │ ├── classy-vision.yaml │ │ ├── etcd.yaml │ │ └── imagenet.yaml ├── controllers │ ├── elasticjob_controller.go │ ├── expectation.go │ ├── job.go │ ├── pod.go │ ├── service.go │ ├── suite_test.go │ └── util.go ├── go.mod ├── go.sum ├── hack │ └── boilerplate.go.txt └── main.go ├── requirements.txt ├── scripts └── formatter_python.sh ├── setup.py └── torchelastic ├── __init__.py └── distributed ├── __init__.py └── launch.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .gitignore 2 | .git 3 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug-report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: "\U0001F41B Bug Report" 3 | about: Submit a bug report to help us improve PyTorch Elastic 4 | 5 | --- 6 | 7 | ## 🐛 Bug 8 | 9 | 10 | 11 | Component (check all that applies): 12 | * [ ] `state api` 13 | * [ ] `train_step api` 14 | * [ ] `train_loop` 15 | * [ ] `rendezvous` 16 | * [ ] `checkpoint` 17 | * [ ] `rollback` 18 | * [ ] `metrics` 19 | * [ ] `petctl` 20 | * [ ] `examples` 21 | * [ ] `docker` 22 | * [ ] other 23 | 24 | 25 | 26 | ## To Reproduce 27 | 28 | Steps to reproduce the behavior: 29 | 30 | 1. 31 | 1. 32 | 1. 
33 | 34 | 35 | 36 | ## Expected behavior 37 | 38 | 39 | 40 | ## Environment 41 | 42 | - torchelastic version (e.g. 0.1.0rc1): 43 | - OS (e.g., Linux): 44 | - How you installed torchelastic (`conda`, `pip`, source, `docker`): 45 | - Docker image and tag (if using docker): 46 | - Build command you used (if compiling from source): 47 | - Git commit (if installed from source): 48 | - Python version: 49 | - CUDA/cuDNN version: 50 | - GPU models and configuration: 51 | - Execution environment (on-prem, aws, etc): 52 | - Any other relevant information: 53 | 54 | ## Additional context 55 | 56 | 57 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/documentation.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: "\U0001F4DA Documentation" 3 | about: Report an issue related to PyTorch Elastic documentation 4 | 5 | --- 6 | 7 | ## 📚 Documentation 8 | 9 | ## Link 10 | 11 | 12 | ## What does it currently say? 13 | 14 | 15 | ## What should it say? 16 | 17 | 18 | ### Why? 19 | 20 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: "\U0001F680Feature Request" 3 | about: Submit a proposal/request for a new feature or enhancement 4 | 5 | --- 6 | 7 | ## Description 8 | 9 | 10 | ## Motivation/Background 11 | 12 | 13 | 14 | ## Detailed Proposal 15 | 16 | 17 | 18 | ## Alternatives 19 | 20 | 21 | 22 | ## Additional context/links 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/questions-help-support.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: "❓Questions/Help/Support" 3 | about: Do you need support? We have resources. 4 | 5 | --- 6 | 7 | ## ❓ Questions and Help 8 | 9 | 10 | ### Please note that this issue tracker is not a help form and this issue will be closed. 11 | 12 | Before submitting, please ensure you have gone through our documentation. 
Here
13 | are some links that may be helpful:
14 |
15 | * [What is torchelastic?](../../README.md)
16 | * [Quickstart on AWS](../../aws/README.md)
17 | * [Usage](../../USAGE.md)
18 | * [Examples](../../examples/README.md)
19 | * API documentation
20 |   * [Overview](../../USAGE.md)
21 |   * [Rendezvous documentation](../../torchelastic/rendezvous/README.md)
22 |   * [Checkpointing documentation](../../torchelastic/checkpoint/README.md)
23 |   * [Configuring](../../USAGE.md#configuring)
24 |
25 |
26 | ### Question
27 |
--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE/pull_request_template.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytorch/elastic/bc88e6982961d4117e53c4c8163ecf277f35c2c5/.github/PULL_REQUEST_TEMPLATE/pull_request_template.md
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | docs/src
2 | docs/build
3 | docs/torchelastic.docset
4 | **/__pycache__
5 |
6 | # Mac OS X files
7 | .DS_Store
8 |
9 | # Binaries for programs and plugins
10 | *.exe
11 | *.dll
12 | *.so
13 | *.dylib
14 |
15 | # Test binary, build with `go test -c`
16 | *.test
17 |
18 | # Output of the go coverage tool, specifically when used with LiteIDE
19 | *.out
20 |
21 | # IDE
22 | **/.idea/
23 |
24 | # Operator Binary
25 | kubernetes/bin/manager
26 |
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | # CHANGELOG
2 |
3 | ## 0.2.2 (Feb 18, 2021)
4 |
5 | > **_NOTE:_** This is the last release for torchelastic! We are upstreaming TorchElastic into
6 | > pytorch. See [pytorch issue-50621](https://github.com/pytorch/pytorch/issues/50621).
7 |
8 | ### PyTorch Elastic
9 |
10 | * (new) `torchelastic.multiprocessing`, a drop-in replacement for `torch.multiprocessing` that supports:
11 |   * both function and binary launches
12 |   * inter-process exception propagation
13 |   * piping worker stdout/stderr to separate log files
14 |   * tailing worker log files to the main console with a `{role}_{rank}:` prefix on each line
15 | * Improvements to `torchelastic.events`
16 | * `NCCL_ASYNC_ERROR_HANDLING` set by default in the torchelastic agent
17 | * Implemented a shutdown barrier on the agent to reduce exit time variance
18 | * Minor cosmetic improvements to rendezvous configuration
19 | * Non-functional refactoring of `EtcdRendezvous`
20 | * TSM API improvements
21 |
22 | ## 0.2.1 (October 05, 2020)
23 |
24 | ### PyTorch Elastic
25 |
26 | > **_NOTE:_** As of torch-1.7 and torchelastic-0.2.1, torchelastic will be bundled into the main [pytorch docker](https://hub.docker.com/r/pytorch/pytorch)
27 | image.
[torchelastic/examples](https://hub.docker.com/r/torchelastic/examples) will be available after the torch-1.7 release, since
28 | its base image will now be **pytorch/pytorch**.
29 |
30 | * Torchelastic agent:
31 |   * `run_id` available to workers as the `TORCHELASTIC_RUN_ID` environment variable
32 |   * Allow `max_restarts=0`
33 |   * Worker exit barrier added to the torchelastic agent to protect against variances in worker finish times
34 |   * Improvements to error handling and propagation from the torchelastic agent
35 |   * Enable fault handlers on worker processes to get torch C++ stack traces
36 |
37 | * `torchelastic.distributed.launch` CLI:
38 |   * New option `--role` to allow users to set the worker role name
39 |   * CLI options can now be set via environment variables (e.g. `PET_NNODES="1:2"`)
40 |
41 | * Project:
42 |   * Upgraded to Python 3.8
43 |   * Tests moved to the `test` directory within the respective modules
44 |   * Use Pyre
45 |
46 | * Deprecated:
47 |   * [pytorch/elastic](https://hub.docker.com/r/pytorch/elastic) Docker image
48 |
49 | * Experimental:
50 |   * [Training Session Manager (TSM)](http://pytorch.org/elastic/0.2.1/tsm_driver.html) with localhost scheduler
51 |   * [torchelastic.multiprocessing](http://pytorch.org/elastic/0.2.1/multiprocessing.html)
52 |
53 |
54 | ## 0.2.0 (April 29, 2020)
55 |
56 | ### PyTorch Elastic
57 |
58 | * Separated infrastructure-related work from the user script. [DesignDoc]
59 | * Events API
60 |
61 | [DesignDoc]: https://github.com/pytorch/elastic/blob/master/design/torchelastic/0.2.0/design_doc.md
62 |
63 | ## 0.1.0rc1 (December 06, 2019)
64 |
65 | ### PyTorch Elastic
66 |
67 | * First release of torchelastic, v0.1.0rc1 (experimental)
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to make participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies within all project spaces, and it also applies when 49 | an individual is representing the project or its community in public spaces. 50 | Examples of representing a project or community include using an official 51 | project e-mail address, posting via an official social media account, or acting 52 | as an appointed representative at an online or offline event. Representation of 53 | a project may be further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at . All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to torchelastic 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Our Development Process 6 | ... 
(in particular how this is synced with internal changes to the project) 7 | 8 | ## Pull Requests 9 | We actively welcome your pull requests. 10 | 11 | 1. Fork the repo and create your branch from `master`. 12 | 2. If you've added code that should be tested, add tests. 13 | 3. If you've changed APIs, update the documentation. 14 | 4. Ensure the test suite passes. 15 | 5. Make sure your code lints (run `scripts/formatter_python.sh`). 16 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 17 | 18 | ## Contributor License Agreement ("CLA") 19 | In order to accept your pull request, we need you to submit a CLA. You only need 20 | to do this once to work on any of Facebook's open source projects. 21 | 22 | Complete your CLA here: 23 | 24 | ## Issues 25 | We use GitHub issues to track public bugs. Please ensure your description is 26 | clear and has sufficient instructions to be able to reproduce the issue. 27 | 28 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 29 | disclosure of security bugs. In those cases, please go through the process 30 | outlined on that page and do not file a public issue. 31 | 32 | ## License 33 | By contributing to torchelastic, you agree that your contributions will be licensed 34 | under the LICENSE file in the root directory of this source tree. 35 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime 2 | 3 | # install torchelastic 4 | WORKDIR /opt/torchelastic 5 | COPY . . 6 | RUN pip install -v . 7 | 8 | WORKDIR /workspace 9 | RUN chmod -R a+w . 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2019-present, Facebook, Inc. 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TorchElastic 2 | 3 | **IMPORTANT:** This repository is deprecated. 4 | 1. TorchElastic has been upstreamed to PyTorch 1.9 under `torch.distributed.elastic`. 5 | Please refer to the PyTorch documentation [here](https://pytorch.org/docs/stable/distributed.elastic.html). 6 | 7 | 2. The TorchElastic Controller for Kubernetes is no longer being actively maintained in favor of [TorchX](https://pytorch.org/torchx). 8 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 8 | -------------------------------------------------------------------------------- /aws/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 8 | -------------------------------------------------------------------------------- /aws/auth/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 8 | 9 | from .session import AwsSessionProvider 10 | 11 | 12 | def get_session(region): 13 | return AwsSessionProvider().get_session(region) 14 | 15 | 16 | try: 17 | from .static_init import * # noqa: F401 F403 18 | except ModuleNotFoundError: 19 | pass 20 | -------------------------------------------------------------------------------- /aws/auth/session.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 8 | 9 | import abc 10 | 11 | import boto3 12 | 13 | 14 | class AwsSessionProvider: 15 | """ 16 | Provides AWS credentials in the form of boto3 Session. 17 | This class may be sub-classed to provide custom methods 18 | of getting aws_access_key_id and aws_secret_access_key. 19 | Child classes are expected to provide overriding implementations 20 | of the three `_get_*` methods below. 
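    For example, a subclass that sources static credentials from
    environment variables might look like the following sketch
    (illustrative only; it assumes `import os`, and the MY_APP_*
    variable names are hypothetical):

        class EnvVarSessionProvider(AwsSessionProvider):
            def _get_access_key(self):
                return os.environ.get("MY_APP_ACCESS_KEY_ID")

            def _get_secret_key(self):
                return os.environ.get("MY_APP_SECRET_ACCESS_KEY")

            def _get_session_token(self):
                # returning None is fine for long-lived (non-STS) credentials
                return os.environ.get("MY_APP_SESSION_TOKEN")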
21 |
22 |     When used directly, it follows the default credential
23 |     lookup chain as documented in:
24 |     https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
25 |     """
26 |
27 |     def get_session(self, region=None) -> boto3.Session:
28 |         access_key = self._get_access_key()
29 |         secret_key = self._get_secret_key()
30 |         session_token = self._get_session_token()
31 |
32 |         # either both access and secret keys are None
33 |         # or both are not None; just check one to assume
34 |         # the presence of the other
35 |         if access_key is None:
36 |             return boto3.session.Session(region_name=region)
37 |         else:
38 |             return boto3.session.Session(
39 |                 aws_access_key_id=access_key,
40 |                 aws_secret_access_key=secret_key,
41 |                 aws_session_token=session_token,
42 |                 region_name=region,
43 |             )
44 |
45 |     def _get_access_key(self):
46 |         """
47 |         Returns the aws_access_key_id. Override when sub-classing.
48 |         """
49 |         return None
50 |
51 |     def _get_secret_key(self):
52 |         """
53 |         Returns the aws_secret_access_key. Override when sub-classing.
54 |         """
55 |         return None
56 |
57 |     def _get_session_token(self):
58 |         """
59 |         Returns the aws_session_token. Override when sub-classing.
60 |         """
61 |         return None
62 |
--------------------------------------------------------------------------------
/aws/autoscaling.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | # Copyright (c) Facebook, Inc. and its affiliates.
4 | # All rights reserved.
5 | #
6 | # This source code is licensed under the BSD-style license found in the
7 | # LICENSE file in the root directory of this source tree.
8 |
9 | import logging
10 | import os
11 | from enum import Enum, unique
12 |
13 | from jinja2 import Template
14 | from util import wait_for
15 |
16 |
17 | log = logging.getLogger(__name__)
18 |
19 |
20 | @unique
21 | class Accelerator(Enum):
22 |     NONE = 0
23 |     GPU = 1
24 |
25 |     @classmethod
26 |     def get_accelerator(cls, instance_type):
27 |         """
28 |         get_accelerator("p3.2xlarge") returns Accelerator.GPU
29 |         get_accelerator("i3.xlarge") returns Accelerator.NONE
30 |         """
31 |
32 |         instance_accelerators = {
33 |             "g2": Accelerator.GPU,
34 |             "g3": Accelerator.GPU,
35 |             "g4": Accelerator.GPU,
36 |             "p2": Accelerator.GPU,
37 |             "p3": Accelerator.GPU,
38 |         }
39 |
40 |         instance_family = instance_type[0:2]
41 |         return instance_accelerators.get(instance_family, Accelerator.NONE)
42 |
43 |     @classmethod
44 |     def from_str(cls, accelerator_str):
45 |         """
46 |         returns the enum Accelerator value from a string representation
47 |         """
48 |         accelerators = {"none": Accelerator.NONE, "gpu": Accelerator.GPU}
49 |         return accelerators.get(accelerator_str.lower(), Accelerator.NONE)
50 |
51 |     def describe(self):
52 |         """
53 |         Returns a string representation of the enum.
54 |         This method is intended to be used to label certain AWS
55 |         resources in their descriptions/names for informative purposes.
56 |
57 |         e.g.
launch template created for GPUs can be named as: torchelastic_gpu
58 |         """
59 |
60 |         string_rep = {Accelerator.NONE: "cpu", Accelerator.GPU: "gpu"}
61 |         return string_rep.get(self, "unknown_accelerator")
62 |
63 |
64 | class AutoScalingGroup:
65 |     def __init__(self, session):
66 |         self._session = session
67 |         self._asg = session.client("autoscaling")
68 |         self._ec2 = session.client("ec2")
69 |
70 |     def get_user_data(self, user_data_template, **kwargs):
71 |         if os.path.isabs(user_data_template):
72 |             user_data_path = user_data_template
73 |         else:
74 |             user_data_path = os.path.join(os.path.dirname(__file__), user_data_template)
75 |
76 |         with open(user_data_path) as f:
77 |             user_data_template = Template(f.read())
78 |             user_data = user_data_template.render(**kwargs)
79 |             return user_data
80 |
81 |     def get_ami_id(self, accelerator):
82 |         """
83 |         Use the EKS-optimized AMI since it has everything we need pre-installed
84 |         """
85 |
86 |         eks_owner_id = "602401143452"
87 |         eks_amis = {
88 |             Accelerator.NONE: "amazon-eks-node-1.14-v20190927",
89 |             Accelerator.GPU: "amazon-eks-gpu-node-1.14-v20190927",
90 |         }
91 |
92 |         res = self._ec2.describe_images(
93 |             Filters=[
94 |                 {"Name": "owner-id", "Values": [eks_owner_id]},
95 |                 {
96 |                     "Name": "name",
97 |                     "Values": [eks_amis.get(accelerator, eks_amis[Accelerator.NONE])],
98 |                 },
99 |             ]
100 |         )
101 |         images = res["Images"]
102 |         assert (
103 |             len(images) == 1
104 |         ), f"Expected exactly one EKS AMI for {self._session.region_name}, found {len(images)}"
105 |         return images[0]["ImageId"]
106 |
107 |     def create_launch_config(
108 |         self,
109 |         name,
110 |         instance_type,
111 |         instance_role,
112 |         user_data_template,
113 |         security_groups=None,
114 |         accelerator="gpu",
115 |         max_spot_price=None,
116 |         ebs_volume_gb=128,
117 |         **user_data_kwargs,
118 |     ):
119 |
120 |         req = {
121 |             "LaunchConfigurationName": name,
122 |             "InstanceType": instance_type,
123 |             "IamInstanceProfile": instance_role,
124 |             "ImageId": self.get_ami_id(Accelerator.from_str(accelerator)),
125 |             "SecurityGroups": security_groups,
126 |             "AssociatePublicIpAddress": True,
127 |             "UserData": self.get_user_data(user_data_template, **user_data_kwargs),
128 |             "BlockDeviceMappings": [
129 |                 {
130 |                     "DeviceName": "/dev/xvda",
131 |                     "Ebs": {
132 |                         "VolumeSize": ebs_volume_gb,
133 |                         "VolumeType": "gp2",
134 |                         "DeleteOnTermination": True,
135 |                     },
136 |                 }
137 |             ],
138 |         }
139 |
140 |         if max_spot_price:
141 |             req["SpotMaxPrice"] = str(max_spot_price)
142 |
143 |         log.info(f"Creating launch config: {name}")
144 |         self._asg.create_launch_configuration(**req)
145 |
146 |     def describe_launch_config(self, name):
147 |         res = self._asg.describe_launch_configurations(LaunchConfigurationNames=[name])
148 |         lcs = res["LaunchConfigurations"]
149 |         return lcs[0] if len(lcs) == 1 else None
150 |
151 |     def delete_launch_config(self, name):
152 |         if self.describe_launch_config(name):
153 |             log.info(f"Deleting asg launch config: {name}")
154 |             self._asg.delete_launch_configuration(LaunchConfigurationName=name)
155 |
156 |     def create_asg(self, name, size, min_size=None, max_size=None, **kwargs):
157 |         """
158 |         Creates an asg.
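        If min_size/max_size are omitted they default to size (a fixed-size group).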
For specifications on kwargs see config/sample_specs.json
159 |         """
160 |
161 |         if not min_size:
162 |             min_size = size
163 |
164 |         if not max_size:
165 |             max_size = size
166 |
167 |         assert min_size <= size <= max_size
168 |
169 |         kwargs["size"] = size
170 |         kwargs["min_size"] = min_size
171 |         kwargs["max_size"] = max_size
172 |         self.create_launch_config(name, **kwargs)
173 |
174 |         log.info(f"Creating autoscaling group: {name}")
175 |         self._asg.create_auto_scaling_group(
176 |             AutoScalingGroupName=name,
177 |             LaunchConfigurationName=name,
178 |             VPCZoneIdentifier=",".join(kwargs["subnets"]),
179 |             MinSize=min_size,
180 |             MaxSize=max_size,
181 |             DesiredCapacity=size,
182 |         )
183 |
184 |     def create_asg_sync(self, name, size, min_size=None, max_size=None, **kwargs):
185 |         self.create_asg(name, size, min_size, max_size, **kwargs)
186 |         _, hostnames = self.get_hostnames(name, size)
187 |         return hostnames
188 |
189 |     def describe_asg(self, name):
190 |         res = self._asg.describe_auto_scaling_groups(AutoScalingGroupNames=[name])
191 |         asgs = res["AutoScalingGroups"]
192 |         num_asgs = len(asgs)
193 |
194 |         return asgs[0] if num_asgs == 1 else None
195 |
196 |     def delete_asg(self, name):
197 |         if self.describe_asg(name):
198 |             log.info(f"Deleting autoscaling group: {name}")
199 |             self._asg.delete_auto_scaling_group(
200 |                 AutoScalingGroupName=name, ForceDelete=True
201 |             )
202 |
203 |             for _ in wait_for(f"instances in {name} to terminate"):
204 |                 if not self.describe_asg(name):
205 |                     log.info(f"Deleted autoscaling group: {name}")
206 |                     break
207 |
208 |         # launch config needs to be deleted after the asg
209 |         self.delete_launch_config(name)
210 |
211 |     def list_hostnames(self, name):
212 |         return self.get_hostnames(name, 1)
213 |
214 |     def get_hostnames(self, name, size):
215 |         """
216 |         Waits until the asg has at least `size` instances in "InService"
217 |         state and returns their public dns names.
218 |         """
219 |         for _ in wait_for(f"autoscaling group: {name} to reach size >= {size}"):
220 |             asg_desc = self.describe_asg(name)
221 |             if not asg_desc:
222 |                 return []
223 |             else:
224 |                 instances = asg_desc["Instances"]
225 |                 ready_instance_ids = [
226 |                     e["InstanceId"]
227 |                     for e in instances
228 |                     if e["LifecycleState"] == "InService"
229 |                 ]
230 |                 if len(ready_instance_ids) >= size:
231 |                     paginator = self._ec2.get_paginator("describe_instances")
232 |
233 |                     hostnames = []
234 |                     instance_ids = []
235 |                     for e in paginator.paginate(InstanceIds=ready_instance_ids):
236 |                         for r in e["Reservations"]:
237 |                             for i in r["Instances"]:
238 |                                 hostnames.append(i["PublicDnsName"])
239 |                                 instance_ids.append(i["InstanceId"])
240 |                     return instance_ids, hostnames
241 |
--------------------------------------------------------------------------------
/aws/cfn/setup.yml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: "2010-09-09"
2 | Description: "cfn template that creates the minimum set of aws resources required for petctl"
3 | Parameters:
4 |   S3BucketName:
5 |     Description: "name of s3 bucket to create for torchelastic use-case"
6 |     Type: "String"
7 |     Default: ""
8 |
9 |   EFSFileSystemId:
10 |     Description: "efs file system id (e.g.
fs-d1234567)" 11 | Type: "String" 12 | Default: "" 13 | 14 | WorkerRoleName: 15 | Description: "name of the worker node iam role and ec2 instance profile" 16 | Type: "String" 17 | Default: "torchelastic_worker_role" 18 | 19 | RendezvousRoleName: 20 | Description: "name of the rendezvous node iam role and ec2 instance profile" 21 | Type: "String" 22 | Default: "torchelastic_rendezvous_role" 23 | 24 | Conditions: 25 | CreateEFSCondition: 26 | Fn::Equals: 27 | - Ref: "EFSFileSystemId" 28 | - "" 29 | 30 | CreateS3BucketCondition: 31 | Fn::Equals: 32 | - Ref: "S3BucketName" 33 | - "" 34 | 35 | Resources: 36 | InternetGateway: 37 | Type: AWS::EC2::InternetGateway 38 | 39 | VPC: 40 | Type: AWS::EC2::VPC 41 | Properties: 42 | CidrBlock: "172.31.0.0/16" 43 | EnableDnsHostnames: True 44 | EnableDnsSupport: True 45 | 46 | VPCGatewayAttachment: 47 | Type: AWS::EC2::VPCGatewayAttachment 48 | Properties: 49 | VpcId: 50 | Ref: "VPC" 51 | InternetGatewayId: 52 | Ref: "InternetGateway" 53 | 54 | RouteTable: 55 | Type: AWS::EC2::RouteTable 56 | Properties: 57 | VpcId: 58 | Ref: "VPC" 59 | 60 | InternetRoute: 61 | Type: AWS::EC2::Route 62 | DependsOn: VPCGatewayAttachment 63 | Properties: 64 | DestinationCidrBlock: 0.0.0.0/0 65 | GatewayId: 66 | Ref: InternetGateway 67 | RouteTableId: 68 | Ref: RouteTable 69 | 70 | Subnet0: 71 | Type: AWS::EC2::Subnet 72 | Properties: 73 | VpcId: 74 | Ref: "VPC" 75 | CidrBlock: "172.31.0.0/20" 76 | MapPublicIpOnLaunch: True 77 | AvailabilityZone: 78 | Fn::Select: 79 | - 0 80 | - Fn::GetAZs: 81 | Ref: AWS::Region 82 | 83 | Subnet1: 84 | Type: AWS::EC2::Subnet 85 | Properties: 86 | VpcId: 87 | Ref: "VPC" 88 | CidrBlock: "172.31.16.0/20" 89 | MapPublicIpOnLaunch: True 90 | AvailabilityZone: 91 | Fn::Select: 92 | - 1 93 | - Fn::GetAZs: 94 | Ref: AWS::Region 95 | 96 | SubnetRouteTableAssociation0: 97 | Type: AWS::EC2::SubnetRouteTableAssociation 98 | Properties: 99 | RouteTableId: 100 | Ref: "RouteTable" 101 | SubnetId: 102 | Ref: "Subnet0" 103 | 104 | SubnetRouteTableAssociation1: 105 | Type: AWS::EC2::SubnetRouteTableAssociation 106 | Properties: 107 | RouteTableId: 108 | Ref: "RouteTable" 109 | SubnetId: 110 | Ref: "Subnet1" 111 | 112 | InstanceSecurityGroup: 113 | Type: AWS::EC2::SecurityGroup 114 | Properties: 115 | GroupDescription: "security group for ec2 instances in the VPC" 116 | GroupName: "torchelastic instance security group" 117 | VpcId: 118 | Ref: "VPC" 119 | SecurityGroupIngress: 120 | - Description: "allow ssh" 121 | IpProtocol: "tcp" 122 | FromPort: "22" 123 | ToPort: "22" 124 | CidrIp: "0.0.0.0/0" 125 | - Description: "allow ssh" 126 | IpProtocol: "tcp" 127 | FromPort: "22" 128 | ToPort: "22" 129 | CidrIpv6: "::/0" 130 | SecurityGroupEgress: 131 | - Description: "* egress" 132 | IpProtocol: "-1" 133 | CidrIp: "0.0.0.0/0" 134 | 135 | SecurityGroupIngress: 136 | Type: AWS::EC2::SecurityGroupIngress 137 | Properties: 138 | Description: "* ingress within the same security group" 139 | GroupId: 140 | Ref: "InstanceSecurityGroup" 141 | IpProtocol: "-1" 142 | SourceSecurityGroupId: 143 | Ref: "InstanceSecurityGroup" 144 | 145 | EFSSecurityGroup: 146 | Type: AWS::EC2::SecurityGroup 147 | Properties: 148 | GroupDescription: "security group for efs mount targets in the VPC" 149 | GroupName: "torchelastic efs security group" 150 | VpcId: 151 | Ref: "VPC" 152 | SecurityGroupIngress: 153 | - Description: "allow NFS from ec2" 154 | IpProtocol: "tcp" 155 | FromPort: "2049" 156 | ToPort: "2049" 157 | SourceSecurityGroupId: 158 | Ref: "InstanceSecurityGroup" 159 | 160 | EFS: 
161 | Type: AWS::EFS::FileSystem 162 | Condition: "CreateEFSCondition" 163 | 164 | EFSMountTarget0: 165 | Type: AWS::EFS::MountTarget 166 | Properties: 167 | FileSystemId: 168 | Fn::If: 169 | - "CreateEFSCondition" 170 | - Ref: "EFS" 171 | - Ref: "EFSFileSystemId" 172 | SubnetId: 173 | Ref: "Subnet0" 174 | SecurityGroups: 175 | - Ref: "EFSSecurityGroup" 176 | 177 | EFSMountTarget1: 178 | Type: AWS::EFS::MountTarget 179 | Properties: 180 | FileSystemId: 181 | Fn::If: 182 | - "CreateEFSCondition" 183 | - Ref: "EFS" 184 | - Ref: "EFSFileSystemId" 185 | SubnetId: 186 | Ref: "Subnet1" 187 | SecurityGroups: 188 | - Ref: "EFSSecurityGroup" 189 | 190 | S3Bucket: 191 | Type: AWS::S3::Bucket 192 | Condition: "CreateS3BucketCondition" 193 | 194 | InstanceProfileWorker: 195 | Type: AWS::IAM::InstanceProfile 196 | Properties: 197 | InstanceProfileName: 198 | Ref: "WorkerRoleName" 199 | Roles: 200 | - Ref: "IAMRoleWorker" 201 | 202 | IAMRoleWorker: 203 | Type: AWS::IAM::Role 204 | Properties: 205 | RoleName: 206 | Ref: "WorkerRoleName" 207 | AssumeRolePolicyDocument: 208 | Version: "2012-10-17" 209 | Statement: 210 | - Effect: "Allow" 211 | Principal: 212 | Service: 213 | - "ec2.amazonaws.com" 214 | Action: 215 | - "sts:AssumeRole" 216 | Path: "/" 217 | ManagedPolicyArns: 218 | - "arn:aws:iam::aws:policy/AmazonS3FullAccess" 219 | - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly" 220 | - "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" 221 | - Ref: "ContainerCloudWatchLogsPolicy" 222 | 223 | InstanceProfileRendezvous: 224 | Type: AWS::IAM::InstanceProfile 225 | Properties: 226 | InstanceProfileName: 227 | Ref: "RendezvousRoleName" 228 | Roles: 229 | - Ref: "IAMRoleRendezvous" 230 | 231 | IAMRoleRendezvous: 232 | Type: AWS::IAM::Role 233 | Properties: 234 | RoleName: 235 | Ref: "RendezvousRoleName" 236 | AssumeRolePolicyDocument: 237 | Version: "2012-10-17" 238 | Statement: 239 | - Effect: "Allow" 240 | Principal: 241 | Service: 242 | - "ec2.amazonaws.com" 243 | Action: 244 | - "sts:AssumeRole" 245 | Path: "/" 246 | ManagedPolicyArns: 247 | - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly" 248 | - "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" 249 | - Ref: "ContainerCloudWatchLogsPolicy" 250 | 251 | ContainerCloudWatchLogsPolicy: 252 | Type: AWS::IAM::ManagedPolicy 253 | Properties: 254 | Description: "Allows container instances to use CloudWatch APIs" 255 | Path: "/" 256 | PolicyDocument: 257 | Version: "2012-10-17" 258 | Statement: 259 | - Effect: Allow 260 | Action: 261 | - "logs:CreateLogGroup" 262 | - "logs:CreateLogStream" 263 | - "logs:PutLogEvents" 264 | - "logs:DescribeLogStreams" 265 | Resource: 266 | - "arn:aws:logs:*:*:*" 267 | Outputs: 268 | VPCId: 269 | Value: 270 | Ref: "VPC" 271 | 272 | SubnetId0: 273 | Value: 274 | Ref: "Subnet0" 275 | 276 | SubnetId1: 277 | Value: 278 | Ref: "Subnet1" 279 | 280 | SecurityGroupId: 281 | Value: 282 | Ref: "InstanceSecurityGroup" 283 | 284 | EFSId: 285 | Value: 286 | Fn::If: 287 | - "CreateEFSCondition" 288 | - Ref: "EFS" 289 | - Ref: "EFSFileSystemId" 290 | 291 | S3Bucket: 292 | Value: 293 | Fn::If: 294 | - "CreateS3BucketCondition" 295 | - Ref: "S3Bucket" 296 | - Ref: "S3BucketName" 297 | 298 | WorkerInstanceProfile: 299 | Value: 300 | Ref: "InstanceProfileWorker" 301 | 302 | RendezvousInstanceProfile: 303 | Value: 304 | Ref: "InstanceProfileRendezvous" 305 | 306 | 307 | -------------------------------------------------------------------------------- /aws/cloudformation.py: 
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | # Copyright (c) Facebook, Inc. and its affiliates.
4 | # All rights reserved.
5 | #
6 | # This source code is licensed under the BSD-style license found in the
7 | # LICENSE file in the root directory of this source tree.
8 |
9 |
10 | import getpass
11 | import logging
12 | import os
13 | import random
14 | import string
15 |
16 | from jinja2 import Template
17 | from util import wait_for
18 |
19 |
20 | log = logging.getLogger(__name__)
21 |
22 |
23 | class CloudFormation:
24 |     def __init__(self, session):
25 |         self._session = session
26 |         self._cfn = session.client("cloudformation")
27 |
28 |     def create_specs_file(self, specs_file, s3_bucket_name, efs_id):
29 |         username = getpass.getuser()
30 |         rand = "".join(random.choices(string.ascii_uppercase + string.digits, k=5))
31 |         hash = f"{username}-{rand}"
32 |         stack_name = f"torchelastic-{hash}"
33 |         this_dir = os.path.dirname(__file__)
34 |         cfn_template = os.path.join(this_dir, "cfn/setup.yml")
35 |         sample_specs = os.path.join(this_dir, "config/sample_specs.json")
36 |
37 |         params = {
38 |             "WorkerRoleName": f"torchelastic_worker_role-{hash}",
39 |             "RendezvousRoleName": f"torchelastic_rendezvous_role-{hash}",
40 |         }
41 |
42 |         if s3_bucket_name:
43 |             params["S3BucketName"] = s3_bucket_name
44 |         if efs_id:
45 |             params["EFSFileSystemId"] = efs_id
46 |
47 |         self.create_stack(stack_name, cfn_template, **params)
48 |
49 |         for _ in wait_for(
50 |             f"cfn stack: {stack_name} to create", timeout=600, interval=2
51 |         ):
52 |             status, outputs = self.describe_stack(stack_name)
53 |             if status == "CREATE_COMPLETE":
54 |                 break
55 |             elif status == "CREATE_FAILED" or status.startswith("ROLLBACK_"):
56 |                 # when stack creation fails cfn starts rolling the stack back
57 |                 raise RuntimeError(
58 |                     f"Error creating stack {stack_name}, status = {status}"
59 |                 )
60 |
61 |         outputs["User"] = username
62 |
63 |         log.info(f"Writing specs file to: {specs_file}")
64 |         with open(sample_specs) as f:
65 |             specs_template = Template(f.read())
66 |             specs_template.stream(**outputs).dump(specs_file)
67 |
68 |     def describe_stack(self, stack_name):
69 |         describe_res = self._cfn.describe_stacks(StackName=stack_name)
70 |
71 |         stacks = describe_res["Stacks"]
72 |         if len(stacks) > 1:
73 |             raise RuntimeError(f"Found more than one stack with name {stack_name}")
74 |
75 |         stack_desc = stacks[0]
76 |         status = stack_desc["StackStatus"]
77 |
78 |         # cfn outputs an array of maps; each element in the array is
79 |         # a single output of the form {"OutputKey": <key>, "OutputValue": <value>};
80 |         # simplify it to a map of <key>: <value> pairs
81 |         outputs = {}
82 |         if "Outputs" in stack_desc:
83 |             for cfn_output in stack_desc["Outputs"]:
84 |                 key = cfn_output["OutputKey"]
85 |                 value = cfn_output["OutputValue"]
86 |                 outputs[key] = value
87 |         return status, outputs
88 |
89 |     def create_stack(self, stack_name, cfn_template, **params):
90 |         log.info(f"Creating cloudformation stack with template: {cfn_template}")
91 |
92 |         with open(cfn_template) as f:
93 |             template_body = f.read()
94 |
95 |         cfn_parameters = []
96 |         for key, value in params.items():
97 |             cfn_parameters.append({"ParameterKey": key, "ParameterValue": value})
98 |
99 |         res = self._cfn.create_stack(
100 |             StackName=stack_name,
101 |             TemplateBody=template_body,
102 |             Capabilities=["CAPABILITY_NAMED_IAM"],
103 |             Parameters=cfn_parameters,
104 |         )
105 |
106 |         return res["StackId"]
107 |
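Note: a minimal sketch of how the `CloudFormation` helper above might be driven end to end (the region value and output path are illustrative assumptions; the flat `from cloudformation import ...` style mirrors how `aws/autoscaling.py` imports `util`):

```python
# Hypothetical driver, run from within the aws/ directory with AWS
# credentials available via the default boto3 lookup chain.
import logging

from auth import get_session  # resolves to aws/auth/__init__.py
from cloudformation import CloudFormation

logging.basicConfig(level=logging.INFO)

session = get_session("us-west-2")  # region is an example value
cfn = CloudFormation(session)

# Creates the cfn/setup.yml stack and renders config/sample_specs.json
# into a concrete specs file for petctl. Passing None for both the
# bucket name and the EFS id lets the template create new resources.
cfn.create_specs_file("/tmp/petctl_specs.json", s3_bucket_name=None, efs_id=None)
```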
-------------------------------------------------------------------------------- /aws/config/sample_specs.json: -------------------------------------------------------------------------------- 1 | { 2 | "rdzv": { 3 | "instance_type" : "t2.small", 4 | "accelerator" : "none", 5 | "instance_role" : "{{ RendezvousInstanceProfile }}", 6 | "subnets" : [ 7 | "{{ SubnetId0 }}" 8 | ], 9 | "security_groups" : [ 10 | "{{ SecurityGroupId }}" 11 | ], 12 | "user_data_template": "config/user_data_rdzv", 13 | "ebs_volume_gb" : 64 14 | }, 15 | "worker": { 16 | "instance_type": "p3.2xlarge", 17 | "accelerator" : "gpu", 18 | "instance_role" : "{{ WorkerInstanceProfile }}", 19 | "efs_file_system_id" : "{{ EFSId }}", 20 | "subnets" : [ 21 | "{{ SubnetId0 }}", 22 | "{{ SubnetId1 }}" 23 | ], 24 | "security_groups" : [ 25 | "{{ SecurityGroupId }}" 26 | ], 27 | 28 | "user_data_template" : "config/user_data_worker", 29 | "ebs_volume_gb" : 64, 30 | "docker_image" : "torchelastic/examples:0.1.0rc1", 31 | "s3_bucket" : "{{ S3Bucket }}", 32 | "s3_prefix" : "petctl/{{ User }}" 33 | } 34 | } 35 | -------------------------------------------------------------------------------- /aws/config/user_data_rdzv: -------------------------------------------------------------------------------- 1 | Content-Type: multipart/mixed; boundary="//" 2 | MIME-Version: 1.0 3 | 4 | --// 5 | Content-Type: text/cloud-config; charset="us-ascii" 6 | MIME-Version: 1.0 7 | Content-Transfer-Encoding: 7bit 8 | Content-Disposition: attachment; filename="cloud-config.txt" 9 | 10 | #cloud-config 11 | repo_update: true 12 | repo_upgrade: all 13 | cloud_final_modules: 14 | - [scripts-user, always] 15 | runcmd: 16 | - yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm 17 | - systemctl restart amazon-ssm-agent 18 | 19 | --// 20 | Content-Type: text/x-shellscript; charset="us-ascii" 21 | MIME-Version: 1.0 22 | Content-Transfer-Encoding: 7bit 23 | Content-Disposition: attachment; filename="userdata.txt" 24 | 25 | #!/bin/bash 26 | #------------------------- 27 | # Install etcd 28 | #------------------------- 29 | ETCD_VER=v3.4.3 30 | BASE_DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download 31 | DOWNLOAD_URL="${BASE_DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz" 32 | 33 | TMP_DIR="/tmp/etcd-${ETCD_VER}" 34 | INSTALL_DIR="/opt/etcd" 35 | BIN_DIR="${INSTALL_DIR}/bin" 36 | 37 | mkdir -p "${TMP_DIR}" 38 | mkdir -p "${BIN_DIR}" 39 | 40 | echo "etcd-${ETCD_VER}" > "${INSTALL_DIR}/version.info" 41 | 42 | echo "Downloading pre-built etcd binary from ${DOWNLOAD_URL}" 43 | curl -L "${DOWNLOAD_URL}" -o "${TMP_DIR}/etcd-${ETCD_VER}-linux-amd64.tar.gz" 44 | tar xzf "${TMP_DIR}/etcd-${ETCD_VER}-linux-amd64.tar.gz" -C "${TMP_DIR}" --strip-components=1 45 | 46 | echo "Installing etcd into ${INSTALL_DIR}" 47 | cp -p "${TMP_DIR}/etcd" "${BIN_DIR}" 48 | cp -p "${TMP_DIR}/etcdctl" "${BIN_DIR}" 49 | 50 | rm -rf "${TMP_DIR}" 51 | 52 | #------------------------- 53 | # Add etcd to systemctl 54 | #------------------------- 55 | PUBLIC_HOSTNAME=$(curl http://169.254.169.254/latest/meta-data/public-hostname) 56 | 57 | cat > /etc/etcd.conf < /etc/systemd/system/etcd.service <> /etc/fstab 20 | - mount -a -t efs 21 | - chmod 777 ${efs_mount_point} 22 | - mkdir -p /var/torchelastic 23 | - yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm 24 | - systemctl restart amazon-ssm-agent 25 | --// 26 | Content-Type: text/x-shellscript; charset="us-ascii" 
27 | MIME-Version: 1.0 28 | Content-Transfer-Encoding: 7bit 29 | Content-Disposition: attachment; filename="userdata.txt" 30 | 31 | cat > /var/torchelastic/ecr_login <<\EOL 32 | #!/bin/bash 33 | region=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq .region -r) 34 | $(aws ecr get-login --no-include-email --region ${region}) 35 | EOL 36 | 37 | cat > /var/torchelastic/worker.env <<\EOL 38 | RDZV_ENDPOINT={{ rdzv_endpoint }} 39 | JOB_ID={{ job_name }} 40 | MIN_SIZE={{ min_size }} 41 | MAX_SIZE={{ max_size }} 42 | SIZE={{ size }} 43 | EOL 44 | 45 | cat > /var/torchelastic/run_worker <<\EOL 46 | #!/bin/bash 47 | container_name=$1 48 | shift 49 | 50 | region=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq .region -r) 51 | instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) 52 | 53 | docker run \ 54 | --init \ 55 | --net=host \ 56 | --restart=on-failure \ 57 | --shm-size=32g \ 58 | --env-file /var/torchelastic/worker.env \ 59 | -v /mnt/efs/fs1:/mnt/efs/fs1 \ 60 | --name ${container_name} \ 61 | --log-driver=awslogs \ 62 | --log-opt awslogs-region=${region} \ 63 | --log-opt awslogs-group=torchelastic/{{ user }} \ 64 | --log-opt awslogs-create-group=true \ 65 | --log-opt awslogs-stream=${container_name}/${instance_id} \ 66 | {{ docker_image }} $* 67 | EOL 68 | 69 | chmod 755 /var/torchelastic/ecr_login 70 | chmod 755 /var/torchelastic/run_worker 71 | 72 | cat > /etc/systemd/system/torchelastic_worker.service <<\EOL 73 | [Unit] 74 | Description=torchelastic worker 75 | Documentation=https://github.com/pytorch/torchelastic 76 | After=docker.service 77 | Requires=docker.service 78 | 79 | [Service] 80 | Type=exec 81 | ExecStartPre=-/var/torchelastic/ecr_login 82 | ExecStart=/var/torchelastic/run_worker {{ job_name }} {{ script }} {{ args }} 83 | ExecStop=-/usr/bin/docker kill {{ job_name }} 84 | ExecStopPost=-/usr/bin/docker rm -f {{ job_name }} 85 | Restart=no 86 | LimitNOFILE=40000 87 | KillMode=control-group 88 | 89 | [Install] 90 | WantedBy=multi-user.target 91 | EOL 92 | 93 | #------------------------- 94 | # Enable and start worker 95 | #------------------------- 96 | systemctl enable torchelastic_worker 97 | systemctl start torchelastic_worker 98 | --// 99 | -------------------------------------------------------------------------------- /aws/requirements.txt: -------------------------------------------------------------------------------- 1 | boto3>=1.9.148 2 | jinja2>=2.10 3 | -------------------------------------------------------------------------------- /aws/s3.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 8 | 9 | import logging 10 | import os 11 | import shutil 12 | import tarfile as tar 13 | import tempfile 14 | 15 | 16 | log = logging.getLogger(__name__) 17 | 18 | 19 | class S3: 20 | def __init__(self, session): 21 | self._session = session 22 | self._s3 = session.client("s3") 23 | 24 | def cp(self, target_path, bucket, key): 25 | """ 26 | Uploads target_path to s3://bucket/key. If the target_path is a file 27 | then uploads to s3://bucket/key/file_name, if the target_path is a 28 | directory, then a tarball is created with the contents of target_path 29 | and uploaded to s3://bucket/key/dir_name.tar.gz. 
The tar is created as
30 |         if by running the command:
31 |
32 |             cd target_path && tar czf /tmp/$(basename target_path).tar.gz *
33 |
34 |         Returns the destination S3 URL.
35 |         """
36 |
37 |         target_basename = os.path.basename(target_path)
38 |
39 |         if os.path.isdir(target_path):
40 |             tmpdir = tempfile.mkdtemp(prefix="petctl_")
41 |             tar_basename = f"{target_basename}.tar.gz"
42 |             tar_file = os.path.join(tmpdir, tar_basename)
43 |             log.info(f"Compressing {target_path} into {tar_basename}")
44 |             with tar.open(tar_file, "x:gz") as f:
45 |                 f.add(target_path, arcname="", recursive=True)
46 |
47 |             dest_key = f"{key}/{tar_basename}"
48 |             target_file = tar_file
49 |         else:
50 |             tmpdir = None
51 |             dest_key = f"{key}/{target_basename}"
52 |             target_file = target_path
53 |
54 |         log.info(f"Uploading {target_file} to s3://{bucket}/{dest_key}")
55 |         self._s3.upload_file(target_file, bucket, dest_key)
56 |
57 |         if tmpdir:
58 |             log.info(f"Deleting tmp dir: {tmpdir}")
59 |             shutil.rmtree(tmpdir)
60 |         return f"s3://{bucket}/{dest_key}"
61 |
--------------------------------------------------------------------------------
/aws/util.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | # Copyright (c) Facebook, Inc. and its affiliates.
4 | # All rights reserved.
5 | #
6 | # This source code is licensed under the BSD-style license found in the
7 | # LICENSE file in the root directory of this source tree.
8 |
9 | import sys
10 | import time
11 |
12 |
13 | def wait_for(msg, timeout: float = 300, interval: int = 1, print_spinner: bool = True):
14 |     """
15 |     for _ in wait_for("asg to provision", timeout_sec, interval_sec):
16 |         if check_condition():
17 |             break
18 |     """
19 |     spin = ["-", "/", "|", "\\", "-", "/", "|", "\\"]
20 |     idx = 0
21 |     start = time.time()
22 |     max_time = start + timeout
23 |     while True:
24 |         if print_spinner:
25 |             elapsed = time.time() - start
26 |             print(
27 |                 f"Waiting for {msg}"
28 |                 f" ({elapsed:03.0f}/{timeout:3.0f}s elapsed) {spin[idx]}\r",
29 |                 end="",
30 |             )
31 |             sys.stdout.flush()
32 |             idx = (idx + 1) % len(spin)
33 |
34 |         if time.time() >= max_time:
35 |             raise RuntimeError(f"Timed out while waiting for: {msg}")
36 |         else:
37 |             time.sleep(interval)
38 |             yield
39 |
--------------------------------------------------------------------------------
/azure/README.md:
--------------------------------------------------------------------------------
1 | # PyTorch Elastic on Azure
2 | This directory contains scripts and libraries that help users run PyTorch Elastic jobs on Azure.
3 |
4 | ## Prerequisites
5 | 1. Familiarity with [Azure](https://azure.microsoft.com/en-us/), [aks-engine](https://github.com/Azure/aks-engine), [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/)
6 | 2. Quota available for Standard_DS1_v2 instances and Standard_NC6s_v3 instances.
7 | 3. Access to an Azure subscription, Resource Group and Storage account. (Refer [here](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) for setup instructions.)
8 | 4. Run the Azure login command `az login`.
9 |
10 | # Sample Usage
11 |
12 | 1. #### Configure your job yaml and kubernetes.json
13 | ```
14 | python petctl.py configure --name "test_job" --min_size 1 --max_size 5
15 | ```
16 | This will create specs for the aks-engine instances and the training job. The aks-engine launch [spec](config/kubernetes.json) is a simple JSON file that specifies the count and type of Rendezvous and Worker instances.
A training job [spec](config/sample_specs.yaml) file is created with the specified job name and min, max worker count.
17 | By default the master node uses a Standard_DS1_v2 instance and the worker nodes use Standard_NC6s_v3 instances. Other Azure instance types can be specified using the --master_vm and --worker_vm options.
18 |
19 | 2. #### Setup your Kubernetes cluster
20 |
21 | This step requires a service principal to create the aks cluster.
22 | Instructions for generating a service principal can be found at [portal](https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal), [CLI](https://docs.microsoft.com/en-us/cli/azure/create-an-azure-service-principal-azure-cli?view=azure-cli-latest), [Powershell](https://docs.microsoft.com/en-us/powershell/azure/create-azure-service-principal-azureps).
23 | ```
24 | python petctl.py setup --dns_prefix azure-pytorch-elastic
25 |                        --rg "<resource-group>"
26 |                        --location "<location>"
27 |                        --subscription_id <subscription-id>
28 |                        --client_id <client-id>
29 |                        --client_secret <client-secret>
30 | ```
31 | This creates an Azure Kubernetes cluster with 1 Standard_DS1_v2 master instance and the specified number of Standard_NC6s_v3 worker instances in the resource group created in [Prerequisites](#Prerequisites) #3.
32 |
33 | 3. #### Upload to Azure Blob storage
34 |
35 | This is an optional step to upload code and data to Azure Blob storage. It can be skipped if the training script and data are already available in Azure Blob storage.
36 | ```
37 | python petctl.py upload_storage --source_path <source-path>
38 |                                 --account_name <account-name>
39 |                                 --container_name <container-name>
40 |                                 --sas_token <sas-token>
41 | ```
42 | Instructions to generate a SAS token are available [here](https://adamtheautomator.com/azure-sas-token/).
43 |
44 | 4. #### Generate Storage and Docker Image secrets
45 |
46 | This step requires the user's blob storage account and docker image details.
47 | Instructions for accessing storage account keys can be found at [portal](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage), [CLI](https://docs.microsoft.com/en-us/cli/azure/storage/account/keys)
48 |
49 | ##### Generate Storage secret
50 | ```
51 | python petctl.py storage_secret --account_name <account-name>
52 |                                 --account_key "<account-key>"
53 | ```
54 | ##### Generate Docker image secret
55 | ```
56 | python petctl.py docker_secret --server <server>
57 |                                --username <username>
58 |                                --password <password>
59 |                                --image_name <image-name>
60 | ```
61 |
62 | The training job file is updated to mount the user's storage account onto the worker instances and to apply the user-provided docker image.
63 | The base docker image for running PyTorch Elastic on Azure is at [Dockerfile](config/Dockerfile). Instructions on publishing a docker image to Azure Container Registry can be found at [ACR](https://docs.microsoft.com/en-us/azure/container-registry/container-registry-get-started-docker-cli).
64 | Docker image secret generation can be skipped when running the [imagenet](../../examples/imagenet/main.py) example, as the job specs yaml is already populated with a public AzureML image with PyTorch Elastic support.
65 |
66 | 5. #### Start your training job
67 |
68 | Submit the training job.
69 | ```
70 | python petctl.py run_job
71 | ```
72 | To run the provided imagenet example, the training script and data can be uploaded to Azure Blob storage by running
73 | ```
74 | python petctl.py upload_storage --source_path ../examples/imagenet/main.py
75 |                                 --account_name <account-name>
76 |                                 --container_name code
77 |                                 --sas_token <sas-token>
78 | ```
79 | ```
80 | python petctl.py upload_storage --source_path <data-path>
81 |                                 --account_name <account-name>
82 |                                 --container_name data
83 |                                 --sas_token <sas-token>
84 | ```
85 |
86 | 6.
#### Check status of your job
87 | ```
88 | python petctl.py check_status
89 | ```
90 | 7. #### Scale worker instances
91 | ```
92 | python petctl.py scale --rg "<resource-group>"
93 |                        --location "<location>"
94 |                        --subscription_id <subscription-id>
95 |                        --client_id <client-id>
96 |                        --client_secret <client-secret>
97 |                        --new_node_count <new-node-count>
98 | ```
99 | (Here the subscription id and resource group are the ones set up in [Prerequisites](#Prerequisites) #3.)
100 |
101 | 8. #### Delete resources
102 | ```
103 | python petctl.py delete_resources
104 | ```
105 | This deletes the aks-engine cluster and all associated namespaces and secrets.
106 |
--------------------------------------------------------------------------------
/azure/config/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
2 |
3 | RUN apt-get update -y && \
4 |     apt-get install -y curl
5 |
6 | RUN ldconfig /usr/local/cuda/lib64/stubs && \
7 |     conda install -y conda=4.6.14 python=3.6.2 && conda clean -ay && \
8 |     # Install AzureML SDK
9 |     pip install --no-cache-dir azureml-defaults && \
10 |     pip install pyyaml && \
11 |     # Install PyTorch
12 |     pip install torch==1.5.0 && \
13 |     pip install torchvision==0.6.0 && \
14 |     pip install --no-cache-dir mkl==2018.0.3 && \
15 |     ldconfig && \
16 |     pip install tensorboard==1.14.0 && \
17 |     pip install future==0.17.1 && \
18 |     pip install python-etcd==0.4.5 && \
19 |     pip install torchelastic
20 |
--------------------------------------------------------------------------------
/azure/config/kubernetes.json:
--------------------------------------------------------------------------------
1 | {
2 |   "apiVersion": "vlabs",
3 |   "properties": {
4 |     "orchestratorProfile": {
5 |       "orchestratorType": "Kubernetes",
6 |       "kubernetesConfig": {
7 |         "apiServerConfig": {
8 |           "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256"
9 |         }
10 |       }
11 |     },
12 |     "masterProfile": {
13 |       "count": 1,
14 |       "dnsPrefix": "",
15 |       "vmSize": "Standard_DS1_v2"
16 |     },
17 |     "agentPoolProfiles": [
18 |       {
19 |         "name": "agentpool1",
20 |         "count": 1,
21 |         "vmSize": "Standard_NC6s_v3"
22 |       }
23 |     ],
24 |     "linuxProfile": {
25 |       "adminUsername": "azureuser",
26 |       "ssh": {
27 |         "publicKeys": [
28 |           {}
29 |         ]
30 |       }
31 |     },
32 |     "servicePrincipalProfile": {
33 |       "clientId": "",
34 |       "secret": ""
35 |     }
36 |   }
37 | }
38 |
--------------------------------------------------------------------------------
/azure/config/sample_specs.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: batch/v1
2 | kind: Job
3 | metadata:
4 |   labels:
5 |     app: azure-pytorch-elastic
6 |   name: azure-pytorch-elastic
7 | spec:
8 |   template:
9 |     metadata:
10 |       labels:
11 |         app: azure-pytorch-elastic
12 |     spec:
13 |       containers:
14 |       - name: petimage
15 |         image: mcr.microsoft.com/azureml/elastic:pytorch-elastic-openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
16 |         command: ["/bin/bash", "-c"]
17 |         args: ["python /mnt/azure/pet/code/imagenet/main.py --input_path /mnt/azure/pet/data/train/"]
18 |         imagePullPolicy: Always
19 |         env:
20 |         - name: RDZV_ENDPOINT
21 |           value: 10.255.255.5:2379
22 |         - name: ETCD_PROTOCOL
23 |           value: https
24 |         - name: ETCD_CACERT
25 |           value: /etc/kubernetes/certs/ca.crt
26 |         - name: ETCD_CERT
27 |           value: /etc/kubernetes/certs/client.crt
28 |         - name: ETCD_KEY
29 |           value: /etc/kubernetes/certs/client.key
30 |
30 |           resources:
31 |             limits:
32 |               nvidia.com/gpu: 1
33 |           volumeMounts:
34 |             - name: pet
35 |               mountPath: /mnt/azure/pet
36 |             - name: etc
37 |               mountPath: /etc/kubernetes/certs
38 |       imagePullSecrets:
39 |         - name: pet-docker-secret
40 |       volumes:
41 |         - name: pet
42 |           flexVolume:
43 |             driver: "azure/blobfuse"
44 |             readOnly: true
45 |             secretRef:
46 |               name: pet-blob-secret
47 |             options:
48 |               container: petimagenet
49 |         - name: etc
50 |           hostPath:
51 |             path: /etc/kubernetes/certs
52 |       restartPolicy: Never
53 | 
--------------------------------------------------------------------------------
/azure/petctl.py:
--------------------------------------------------------------------------------
  1 | from __future__ import absolute_import, division, print_function, unicode_literals
  2 | 
  3 | import util
  4 | 
  5 | 
  6 | # Create Kubernetes specs and a YAML job file based on user inputs
  7 | def configure(args):
  8 |     util.configure_yaml(args)
  9 |     util.configure_json(args)
 10 | 
 11 | 
 12 | # Deploys a Kubernetes cluster
 13 | def setup(args):
 14 |     # Install AKS Engine
 15 |     util.install_aks_engine()
 16 |     # Deploy an AKS cluster using kubernetes.json
 17 |     util.deploy_aks_cluster(args)
 18 | 
 19 | 
 20 | # Upload code/data to Azure blob storage
 21 | def upload_storage(args):
 22 |     util.upload_to_azure_blob(args)
 23 | 
 24 | 
 25 | # Create Azure blob storage secret
 26 | def storage_secret(args):
 27 |     util.create_storage_secrets(args)
 28 | 
 29 | 
 30 | # Create docker image secrets
 31 | def docker_secret(args):
 32 |     util.create_docker_image_secret(args)
 33 | 
 34 | 
 35 | # Scale the cluster
 36 | def scale_cluster(args):
 37 |     util.scale_cluster(args)
 38 | 
 39 | 
 40 | # Submits your training job
 41 | def run_job(args):
 42 |     util.install_blobfuse_drivers()
 43 |     commands = [
 44 |         "kubectl delete -f config/azure-pytorch-elastic.yaml",
 45 |         "kubectl apply -f config/azure-pytorch-elastic.yaml",
 46 |         "kubectl describe pods",
 47 |         "kubectl get pods --selector app=azure-pytorch-elastic",
 48 |     ]
 49 | 
 50 |     util.run_commands(commands)
 51 | 
 52 | 
 53 | # Check current status of your pods
 54 | def check_status():
 55 |     commands = [
 56 |         "kubectl describe pods",
 57 |         "kubectl get pods --selector app=azure-pytorch-elastic",
 58 |     ]
 59 | 
 60 |     util.run_commands(commands)
 61 | 
 62 | 
 63 | # Get logs of your job from each pod
 64 | def get_logs():
 65 |     util.run_commands(["kubectl logs --selector app=azure-pytorch-elastic "])
 66 | 
 67 | 
 68 | # Deletes secrets and cluster
 69 | def delete_resources():
 70 |     util.delete_resources_util()
 71 | 
 72 | 
 73 | if __name__ == "__main__":
 74 |     parser = util.argparse.ArgumentParser()
 75 | 
 76 |     subparser = parser.add_subparsers(
 77 |         title="actions", description="setup | configure | run job", dest="command"
 78 |     )
 79 | 
 80 |     # ---------------------------------- #
 81 |     #               SETUP                #
 82 |     # ---------------------------------- #
 83 | 
 84 |     parser_setup = subparser.add_parser(
 85 |         "setup", help="set up aks-engine, cluster and other dependencies"
 86 |     )
 87 | 
 88 |     parser_setup.add_argument(
 89 |         "--dns_prefix",
 90 |         type=str,
 91 |         required=False,
 92 |         default="azure-pytorch-elastic",
 93 |         help="DNS prefix of the app",
 94 |     )
 95 | 
 96 |     parser_setup.add_argument(
 97 |         "--subscription_id",
 98 |         type=str,
 99 |         required=True,
100 |         help="Subscription id of the cluster",
101 |     )
102 | 
103 |     parser_setup.add_argument(
104 |         "--rg", type=str, required=True, help="Resource group of the cluster"
105 |     )
106 | 
107 |     parser_setup.add_argument(
108 |         "--location", type=str, required=True, help="Location of the cluster"
109 |     )
110 | 
111 |     parser_setup.add_argument(
112 |         "--client_id", type=str, required=True, help="Service principal client id"
113 |     )
114 | 
115 |     parser_setup.add_argument(
116 |         "--client_secret",
117 |         type=str,
118 |         required=True,
119 |         help="Service Principal client secret",
120 |     )
121 | 
122 |     parser_setup.set_defaults(func=setup)
123 | 
124 |     # ---------------------------------- #
125 |     #        CONFIGURE JOB YAML          #
126 |     # ---------------------------------- #
127 | 
128 |     parser_configure = subparser.add_parser("configure", help="Generate yaml job file")
129 | 
130 |     parser_configure.add_argument("--name", required=True, help="job name")
131 |     parser_configure.add_argument(
132 |         "--min_size",
133 |         type=int,
134 |         required=False,
135 |         help="minimum number of worker hosts to continue training",
136 |     )
137 |     parser_configure.add_argument(
138 |         "--max_size",
139 |         type=int,
140 |         required=False,
141 |         help="maximum number of worker hosts to allow scaling out",
142 |     )
143 |     parser_configure.add_argument(
144 |         "--size",
145 |         type=int,
146 |         required=False,
147 |         help="set size to automatically set min_size = max_size = size",
148 |     )
149 |     parser_configure.add_argument(
150 |         "--master_vm",
151 |         type=str,
152 |         required=False,
153 |         default="Standard_DS1_v2",
154 |         help="Azure VM instance for master node",
155 |     )
156 |     parser_configure.add_argument(
157 |         "--worker_vm",
158 |         type=str,
159 |         required=False,
160 |         default="Standard_NC6s_v3",
161 |         help="Azure VM instance for worker nodes",
162 |     )
163 |     parser_configure.set_defaults(func=configure)
164 | 
165 |     # ---------------------------------- #
166 |     #          UPLOAD STORAGE            #
167 |     # ---------------------------------- #
168 | 
169 |     parser_upload_storage = subparser.add_parser(
170 |         "upload_storage", help="Upload to Azure Blob storage"
171 |     )
172 | 
173 |     parser_upload_storage.add_argument(
174 |         "--account_name",
175 |         type=str,
176 |         required=True,
177 |         help="Azure Blob storage Account name",
178 |     )
179 | 
180 |     parser_upload_storage.add_argument(
181 |         "--container_name",
182 |         type=str,
183 |         required=True,
184 |         help="Azure Blob storage container name",
185 |     )
186 | 
187 |     parser_upload_storage.add_argument(
188 |         "--sas_token", type=str, required=True, help="Azure Blob storage SAS token"
189 |     )
190 | 
191 |     parser_upload_storage.add_argument(
192 |         "--source_path", type=str, required=True, help="Path to local files"
193 |     )
194 | 
195 |     parser_upload_storage.set_defaults(func=upload_storage)
196 | 
197 |     # ---------------------------------- #
198 |     #          SETUP SECRETS             #
199 |     # ---------------------------------- #
200 | 
201 |     parser_storage_secret = subparser.add_parser(
202 |         "storage_secret", help="Generate secret for Azure Blob storage"
203 |     )
204 | 
205 |     parser_storage_secret.add_argument(
206 |         "--account_name",
207 |         type=str,
208 |         required=True,
209 |         help="Azure Blob storage account name",
210 |     )
211 | 
212 |     parser_storage_secret.add_argument(
213 |         "--account_key", type=str, required=True, help="Azure Blob storage account key"
214 |     )
215 | 
216 |     parser_storage_secret.set_defaults(func=storage_secret)
217 | 
218 |     parser_docker_secret = subparser.add_parser(
219 |         "docker_secret", help="Generate secret for Docker Image"
220 |     )
221 | 
222 |     parser_docker_secret.add_argument(
223 |         "--server", type=str, required=True, help="Docker server"
224 |     )
225 | 
226 |     parser_docker_secret.add_argument(
227 |         "--username", type=str, required=True, help="Docker username"
228 |     )
229 | 
230 |     parser_docker_secret.add_argument(
231 |         "--password", type=str, required=True, help="Docker password"
232 |     )
233 | 
234 |     parser_docker_secret.add_argument(
235 |         "--image_name", type=str, required=True, help="Docker image name"
236 |     )
237 | 
238 |     parser_docker_secret.set_defaults(func=docker_secret)
239 | 
240 |     # ---------------------------------- #
241 |     #             RUN JOB                #
242 |     # ---------------------------------- #
243 | 
244 |     parser_run_job = subparser.add_parser("run_job", help="Run your training job")
245 | 
246 |     parser_run_job.set_defaults(func=run_job)
247 | 
248 |     # ---------------------------------- #
249 |     #           CHECK STATUS             #
250 |     # ---------------------------------- #
251 | 
252 |     parser_check_status = subparser.add_parser(
253 |         "check_status", help="Check status of your jobs"
254 |     )
255 |     parser_check_status.set_defaults(func=check_status)
256 | 
257 |     # ---------------------------------- #
258 |     #         DELETE RESOURCES           #
259 |     # ---------------------------------- #
260 |     parser_delete_resources = subparser.add_parser(
261 |         "delete_resources",
262 |         help="Deletes the kubernetes cluster and all namespaces and secrets",
263 |     )
264 |     parser_delete_resources.set_defaults(func=delete_resources)
265 | 
266 |     # ---------------------------------- #
267 |     #            GET LOGS                #
268 |     # ---------------------------------- #
269 | 
270 |     parser_get_logs = subparser.add_parser(
271 |         "get_logs", help="Get logs from all your pods"
272 |     )
273 | 
274 |     parser_get_logs.set_defaults(func=get_logs)
275 | 
276 |     # ---------------------------------- #
277 |     #          SCALE CLUSTER             #
278 |     # ---------------------------------- #
279 |     parser_scale = subparser.add_parser("scale", help="Scale up/down your cluster")
280 | 
281 |     parser_scale.add_argument(
282 |         "--subscription_id",
283 |         type=str,
284 |         required=True,
285 |         help="Subscription id of the cluster",
286 |     )
287 | 
288 |     parser_scale.add_argument(
289 |         "--rg", type=str, required=True, help="Resource group of the cluster"
290 |     )
291 | 
292 |     parser_scale.add_argument(
293 |         "--location", type=str, required=True, help="Location of the cluster"
294 |     )
295 | 
296 |     parser_scale.add_argument(
297 |         "--client_id", type=str, required=True, help="Service principal client id"
298 |     )
299 | 
300 |     parser_scale.add_argument(
301 |         "--client_secret",
302 |         type=str,
303 |         required=True,
304 |         help="Service Principal client secret",
305 |     )
306 | 
307 |     parser_scale.add_argument(
308 |         "--new_node_count",
309 |         type=int,
310 |         required=True,
311 |         help="New node count to scale cluster to",
312 |     )
313 | 
314 |     parser_scale.set_defaults(func=util.scale_cluster)
315 | 
316 |     args = parser.parse_args()
317 | 
318 |     # -----
319 |     # Execution order: Configure --> Setup --> Run
320 |     # -----
321 |     if args.command == "configure":
322 |         configure(args)
323 |     elif args.command == "setup":
324 |         setup(args)
325 |     elif args.command == "upload_storage":
326 |         upload_storage(args)
327 |     elif args.command == "storage_secret":
328 |         storage_secret(args)
329 |     elif args.command == "docker_secret":
330 |         docker_secret(args)
331 |     elif args.command == "run_job":
332 |         run_job(args)
333 |     elif args.command == "check_status":
334 |         check_status()
335 |     elif args.command == "delete_resources":
336 |         delete_resources()
337 |     elif args.command == "get_logs":
338 |         get_logs()
339 |     elif args.command == "scale":
340 |         scale_cluster(args)
341 | 
--------------------------------------------------------------------------------
/design/kubernetes/torchelastic-operator-design.md:
--------------------------------------------------------------------------------
 1 | # TorchElastic Controller for Kubernetes
 2 | 
 3 | ## Background
 4 | 
 5 | PyTorch continues to be used for the latest state-of-the-art research, making up nearly 70% of [papers](https://chillee.github.io/pytorch-vs-tensorflow/) that cite a framework.
 6 | 
 7 | The current PyTorch Distributed Data Parallel (DDP) module enables data parallel training where each process trains the same model but on different shards of data. It enables bulk synchronous, multi-host, multi-GPU/CPU execution of ML training. However, DDP has several shortcomings: e.g., jobs cannot start without acquiring all the requested nodes; jobs cannot continue after a node fails due to an error or transient issue; jobs cannot incorporate a node that joins later; and lastly, progress cannot be made in the presence of a slow/stuck node.
 8 | 
 9 | The focus of [PyTorch Elastic](https://github.com/pytorch/elastic), which uses Elastic Distributed Data Parallelism, is to address these issues and build a generic framework/APIs for PyTorch to enable reliable and elastic execution of these data parallel training workloads. It will provide better programmability, higher resilience to failures of all kinds, higher efficiency and larger-scale training compared with pure DDP.
10 | 
11 | ## Motivation
12 | 
13 | With job fault tolerance and elastic training, we can unlock a lot of features.
14 | 
15 | Users can enable job priority and preemption in the cluster. Losing a task becomes acceptable and the user won't lose the entire job progress. More importantly, it will help guarantee SLAs of critical jobs, even in a cluster under resource pressure.
16 | 
17 | Cost and GPU utilization will be further optimized with this feature, since users can launch jobs with partial resources, and spot GPU instances can be used as well without worrying about losing all progress.
18 | 
19 | ## User Experience
20 | 
21 | * Users should define the `minReplicas` and `maxReplicas` numbers of tasks for a job instead of a fixed number. The TorchElastic controller will launch jobs in Kubernetes, set up the needed network topology and manage the job lifecycle.
22 | * Users need to specify the etcd endpoint used as the RDZV service for task coordination.
23 | * The desired `spec.replicaSpecs[Worker].replicas`, being the number of tasks, has to be within the range from `minReplicas` to `maxReplicas`.
24 | * Users can easily create/delete a torch elastic job with `kubectl` using a job manifest.
25 | * Users are able to describe custom resources to monitor the job status.
26 | 
27 | ## High Level Design
28 | 
29 | Workers in a torch elastic job are equivalent and their communication is peer to peer. In this case, every pod should be able to talk with every other pod, and we need to create a `headless` service for every pod. Once the job is done, the controller won't terminate any pods, so the user can check logs for any worker. Manual job deletion will delete all pods belonging to it.
30 | 
31 | A config with kind `ElasticJob` defines the job spec and the controller will reconcile against this definition. It will create/update/delete pods and services if there is any change in the job or in the Kubernetes resources (pods, services) owned by the `ElasticJob`.
32 | 
33 | ```
34 | apiVersion: "elastic.pytorch.org/v1alpha1"
35 | kind: "ElasticJob"
36 | metadata:
37 |   name: "classy-vision-job"
38 | spec:
39 |   rdzvEndpoint: "etcd-service:2379"
40 |   minReplicas: 2
41 |   maxReplicas: 5
42 |   replicaSpecs:
43 |     Worker:
44 |       replicas: 3
45 |       restartPolicy: ExitCode
46 |       template:
47 |         apiVersion: v1
48 |         kind: Pod
49 |         spec:
50 |           containers:
51 |             - name: torchelasticworker
52 |               image: torchelastic/examples:0.1.0rc1
53 |               imagePullPolicy: Always
54 |               args:
55 |                 - "s3://code_path/petctl/user/my_job/main.py"
56 |                 - "--config_file"
57 |                 - "/data/classy_vision/resnet50_synthetic_image_classy_config.json"
58 |                 - "--checkpoint_folder"
59 |                 - "/data/classy_vision/checkpoint"
60 | 
61 | ```
62 | 
63 | *Network Communication*
64 | 
65 | In this case, every pod should be able to talk with every other pod, and we need to create a headless service for every pod, since workers use the hostnames registered in the rdzv endpoint to find their peers.
66 | 
67 | *Failure condition*
68 | 
69 | The TorchElastic controller will only fail a job if the number of active workers drops below the user-specified `minReplicas`. Otherwise, it will try to reschedule failed pods and maintain the desired task size.
70 | 
71 | *rdzvEndpoint*
72 | 
73 | `rdzvEndpoint` needs to be specified by the user. It could be a highly available etcd quorum or a single etcd pod on the Kubernetes cluster.
74 | 
75 | *Replicas*
76 | 
77 | `replicas` represents the desired task size. A torch elastic job doesn't need all the workers to be ready to start training. We can set this field to `job.spec.maxReplicas` and try to allocate more resources. If the cluster doesn't have enough resources, some tasks may be pending and the job can still start.
78 | 
79 | 
80 | These are the resources the controller creates from an `ElasticJob`:
81 | 
82 | **Pod**
83 | 
84 | ```
85 | apiVersion: v1
86 | kind: Pod
87 | metadata:
88 |   name: classy-vision-job-worker-${index}
89 |   labels:
90 |     job-name: classy-vision-job
91 |     group-name: elastic.pytorch.org
92 |     replica-index: 0
93 |     replica-type: worker
94 | spec:
95 |   containers:
96 |     - image: torchelastic/examples:0.1.0rc1
97 |       imagePullPolicy: Always
98 |       name: torchelasticworker
99 |       env:
100 |         - name: RDZV_ENDPOINT
101 |           value: "etcd-service:2379"
102 |         - name: JOB_ID
103 |           value: "classy-vision-job"
104 |         - name: SIZE
105 |           value: "3"
106 |         - name: MIN_SIZE
107 |           value: "2"
108 |         - name: MAX_SIZE
109 |           value: "5"
110 |   restartPolicy: OnFailure
111 | ```
112 | 
113 | **Service**
114 | 
115 | ```
116 | apiVersion: v1
117 | kind: Service
118 | metadata:
119 |   name: classy-vision-job-worker-${index}
120 | spec:
121 |   selector:
122 |     job-name: classy-vision-job
123 |     group-name: elastic.pytorch.org
124 |     replica-index: 0
125 |     replica-type: worker
126 |   clusterIP: None
127 | ```
128 | 
129 | **Job Status**
130 | 
131 | ``` yaml
132 | kubectl describe elasticjob classy-vision-job
133 | Name:         classy-vision-job
134 | Namespace:    default
135 | API Version:  elastic.pytorch.org/v1alpha1
136 | Kind:         ElasticJob
137 | Spec:
138 |   ...
139 | Status:
140 |   Conditions:
141 |     Last Transition Time:  2020-01-22T23:10:44Z
142 |     Last Update Time:      2020-01-22T23:10:44Z
143 |     Message:               job classy-vision-job is created.
144 |     Reason:                ElasticJobCreated
145 |     Status:                True
146 |     Type:                  Created
147 |     Last Transition Time:  2020-01-22T23:10:49Z
148 |     Last Update Time:      2020-01-22T23:10:49Z
149 |     Message:               ElasticJob classy-vision-job is running.
150 |     Reason:                ElasticJobRunning
151 |     Status:                False
152 |     Type:                  Running
153 |     Last Transition Time:  2020-01-22T23:10:49Z
154 |     Last Update Time:      2020-01-22T23:10:49Z
155 |     Message:               ElasticJob classy-vision-job is failed because 2 workers replica(s) failed.
156 |     Reason:                ElasticJobFailed
157 |     Status:                True
158 |     Type:                  Failed
159 |   Replica Statuses:
160 |     Worker:
161 |       Active:  1
162 |       Failed:  2
163 | Events:
164 |   Type     Reason                   Age                From                    Message
165 |   ----     ------                   ----               ----                    -------
166 |   Normal   SuccessfulCreatePod      39m                elastic-job-controller  Created pod: classy-vision-job-worker-0
167 |   Normal   SuccessfulCreatePod      39m                elastic-job-controller  Created pod: classy-vision-job-worker-1
168 |   Normal   SuccessfulCreateService  39m                elastic-job-controller  Created service: classy-vision-job-worker-0
169 |   Normal   SuccessfulCreateService  39m                elastic-job-controller  Created service: classy-vision-job-worker-1
170 |   Normal   ExitedWithCode           39m (x3 over 39m)  elastic-job-controller  Pod: default.classy-vision-job-worker-0 exited with code 1
171 |   Warning  ElasticJobRestarting     39m (x3 over 39m)  elastic-job-controller  ElasticJob classy-vision-job is restarting because 1 Worker replica(s) failed.
172 |   Normal   ElasticJobFailed         39m                elastic-job-controller  ElasticJob classy-vision-job is failed because 2 Worker replica(s) failed.
173 | ```
174 | 
175 | ## Not in scope
176 | 
177 | TorchElastic Controller for Kubernetes can simplify the setup required to run torch elastic jobs and manage the entire job lifecycle. It is hard for the controller to monitor cluster resources and dynamically adjust the task size. Instead, having a separate component like a batch scheduler make that decision is a better option at this stage, to limit the scope of this project.
178 | 
179 | Currently, each `ElasticJob` has to accept an etcd service as its `rdzvEndpoint`. We may consider making this field optional and having the controller provide an etcd service if it's not set.
--------------------------------------------------------------------------------
/design/torchelastic/0.2.0/design_doc.md:
--------------------------------------------------------------------------------
 1 | # Introduction
 2 | PyTorch Elastic Trainer (PET) provides a framework for conveniently training
 3 | models across a compute cluster in a _fault tolerant_ and _elastic_ manner.
 4 | PET provides these features in two ways:
 5 | 
 6 | 1. When a PyTorch worker process throws a certain class of retriable errors, it is caught by PET and the training process is retried.
 7 | 2. A new worker can leave or join the process pool for an existing training job at any point as long as the number of workers stays within the bounds specified when starting the job. When a membership change happens, all the workers re-rendezvous to establish a new process group and training resumes from the previous well-known good state.
 8 | 
 9 | In order to integrate with PET, a PyTorch user needs to make the following
10 | changes to their training logic:
11 | 
12 | 1. They need to enable PET to control their training loop.
13 |    Essentially, they provide an "inner training" loop that is wrapped in a
14 |    retryable loop by PET. All aspects of establishing or re-establishing the
15 |    process group as well as restoring the user's trainer to a known good state
16 |    are handled by the retryable PET loop.
17 | 2. They need to specify _what_ the state is that needs to be restored in case
18 |    a new worker joins the pool and _how_ the state is applied to a new worker.
19 |    The API for specifying these is described by the `State` object (see the sketch below).
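To make that contract concrete, here is a minimal sketch of such an integration, assuming a hypothetical `TrainState` class and `train_step` function; the names and method signatures are illustrative assumptions, not the exact v0.1 API:

```py
# Illustrative sketch only -- class/method names are assumptions,
# not the exact torchelastic v0.1 signatures.

class TrainState:
    """Bundles everything that must be restored when a new worker joins."""

    def __init__(self, model, optimizer, data_loader):
        self.model = model
        self.optimizer = optimizer
        self.data_loader = data_loader
        self.epoch = 0

    def capture_snapshot(self):
        # _What_ needs to be restored: explicit state that can be
        # shipped to a newly joined worker.
        return {
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            "epoch": self.epoch,
        }

    def apply_snapshot(self, snapshot):
        # _How_ the state is applied to a new worker.
        self.model.load_state_dict(snapshot["model"])
        self.optimizer.load_state_dict(snapshot["optimizer"])
        self.epoch = snapshot["epoch"]


def train_step(state, batch):
    # The "inner training" loop body; PET wraps this in its retryable
    # loop and re-establishes the process group before retrying it.
    loss = state.model(batch).mean()  # stand-in for a real loss computation
    loss.backward()
    state.optimizer.step()
    state.optimizer.zero_grad()
```

Under these assumptions, on a membership change PET would call `capture_snapshot` on a surviving worker and `apply_snapshot` on the newcomers before resuming the retryable loop.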
20 | 
21 | PET v0.1 was released on GitHub, PyPI and Docker Hub in November 2019 and since
22 | then the community has contributed integrations with Amazon Web Services
23 | (via Elastic Kubernetes Service) and Microsoft Azure (via Azure Kubernetes Service).
24 | 
25 | # Lessons learned from PET v0.1
26 | In porting existing PyTorch-based projects such as
27 | [ClassyVision](https://github.com/facebookresearch/ClassyVision) and
28 | [PyText](https://github.com/facebookresearch/pytext) to use PET, we encountered
29 | a few areas for refinement in the v0.1 design.
30 | 
31 | **First**, adapting a mature training library such as ClassyVision to use the
32 | elastic training APIs often requires a significant amount of restructuring, often
33 | causing a bifurcation of code paths between the elastic and non-elastic implementations.
34 | 
35 | **Second**, it is non-trivial to correctly implement the state restore logic for
36 | each application during in-process recovery. While explicit state such as weight
37 | tensors is easy to save and restore, there is often "hidden" or implicit state
38 | in the application that is hard for the developer to reason about. For example,
39 | after a rendezvous round, a worker process might be expected to restore the state
40 | of C++ objects either in CPU or GPU memory, which is extremely error-prone,
41 | especially after failures or exceptions. To compound this issue, several
42 | applications such as PyText already implement some form of checkpoint/restart and
43 | this logic often needs to be taken into account when implementing the elastic state.
44 | 
45 | 
46 | **Finally**, one of the goals of PET v0.1 was to detect and restart straggler workers.
47 | This was not possible when running the training loop in process and necessitated
48 | writing an additional watchdog process to monitor the main training process.
49 | 
50 | For the next iteration of PET, we would like to propose a design that makes it
51 | significantly simpler to port existing training workflows to an elastic
52 | infrastructure and results in applications that can recover more reliably
53 | from workflow failures.
54 | 
55 | # Overview of the new design
56 | In PET v0.2, _we no longer attempt to recover errors in the training function_.
57 | Instead, PET attempts to maintain the number of worker processes such that they
58 | stay within the \[_min_, _max_\] bounds required for the job.
59 | The application writer is responsible for loading and restarting from an existing
60 | checkpoint file if one is available. Unlike v0.1, PET v0.2 does not mandate how
61 | checkpoints are managed. An application writer is free to use just `torch.save`
62 | and `torch.load` from PyTorch or a higher-level framework such as
63 | [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning).
64 | 
65 | PET v0.2 is implemented using a new process named `elastic-agent`.
66 | There is a single `elastic-agent` per job, per node. Each agent process is only
67 | responsible for managing a set of worker processes local to that node and coordinating
68 | process group membership changes with the elastic agents on other nodes allocated to
69 | that job. This is illustrated in the diagram below:
70 | 
71 | ![image](torchelastic_diagram.jpg)
72 | 
73 | Membership changes are handled as follows: When a worker process fails,
74 | the corresponding elastic agent managing it kills all the workers on that node,
75 | establishes rendezvous with the other agents and restarts workers with the new
76 | rendezvous information. However, when an agent exits with a non-zero error code,
77 | it is up to a higher-level orchestrator such as Kubernetes to restart the agent
78 | (which in turn will restart all the workers it is responsible for).
79 | The same recovery mechanism holds for node-level failures.
80 | An orchestrator such as Kubernetes will schedule a job such that a minimum number
81 | of replicas of the elastic agent are running and each agent will in turn orchestrate the
82 | user's training script.
83 | 
84 | ![image](torchelastic_agent_diagram.jpg)
85 | 
86 | To adopt PET v0.2, an application simply needs its entry-point or `main` function
87 | to be compatible with the
88 | [PyTorch distributed launcher](https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py).
89 | We expect distributed training jobs that are started via the distributed launcher
90 | to be seamlessly started via the elastic agent with no or minimal code changes.
91 | The only difference is that in the latter case, the application will be able to
92 | make progress in the presence of certain failures.
93 | 
94 | # Overview of the API
95 | As mentioned above, with PET v0.2, there is no separate library for a training
96 | application to integrate with. Instead, the user simply launches a training job
97 | via the elastic agent monitor process. For example, if a user starts their job
98 | using the PyTorch distributed launcher using:
99 | ```sh
100 | python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_ON_NODE \
101 |        TRAINING_SCRIPT.py (... train script args ...)
102 | ```
103 | they would instead use:
104 | 
105 | ```sh
106 | python -m torchelastic.distributed.launch --nproc_per_node=NUM_GPUS_ON_NODE \
107 |        --nnodes=1:4 \
108 |        --rdzv_id=JOB_ID \
109 |        --rdzv_backend=etcd \
110 |        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
111 |        TRAINING_SCRIPT.py (... train script args ...)
112 | ```
113 | Notice that it adds a few additional parameters:
114 | 1. The min and max number of nodes. During a rendezvous, if the number of nodes
115 |    drops below the specified threshold, the job is aborted.
116 | 2. A rendezvous type and its configuration.
117 | 
118 | Inside the training script, the only potential change the user needs to make is
119 | to ensure that they use environment variables to initialize the process group,
120 | i.e., create the process group as follows:
121 | ```py
122 | import torch.distributed as dist
123 | 
124 | dist.init_process_group(init_method="env://", backend="gloo")
125 | # or
126 | dist.init_process_group(init_method="env://", backend="nccl")
127 | ```
128 | 
129 | All the parameters for initializing the group (the world size, the numerical
130 | rank, the master address and port) are passed in as environment variables
131 | by the parent elastic agent.
132 | 
133 | The new PET design is intentionally "bare-bones": it trades off the granularity
134 | with which an application can recover for simplicity and robustness.
135 | In the future, we hope to provide more APIs for convenient checkpointing that a
136 | developer can optionally use for more efficient restart semantics.
137 | 
138 | # Implementation details and next steps
139 | An implementation of the above ideas is available in [PR #65](https://github.com/pytorch/elastic/pull/65).
140 | We encourage the community to evaluate the new functionality and
141 | give us feedback on the trade-offs we have made in the design either in the PR
142 | or in this issue. We look forward to hearing from you!
143 | 
--------------------------------------------------------------------------------
/design/torchelastic/0.2.0/torchelastic_agent_diagram.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytorch/elastic/bc88e6982961d4117e53c4c8163ecf277f35c2c5/design/torchelastic/0.2.0/torchelastic_agent_diagram.jpg
--------------------------------------------------------------------------------
/design/torchelastic/0.2.0/torchelastic_diagram.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytorch/elastic/bc88e6982961d4117e53c4c8163ecf277f35c2c5/design/torchelastic/0.2.0/torchelastic_diagram.jpg
--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
 1 | 
 2 | # Minimal makefile for Sphinx documentation
 3 | # Usage:
 4 | #   make html
 5 | #
 6 | 
 7 | # You can set these variables from the command line.
 8 | SPHINXOPTS  =
 9 | SPHINXBUILD = sphinx-build
10 | SPHINXPROJ  = torchelastic
11 | SOURCEDIR   = source
12 | BUILDDIR    = build
13 | VERSION     := "0.2.3.dev0"
14 | 
15 | # Put it first so that "make" without argument is like "make help".
16 | help:
17 | 	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
18 | 
19 | clean:
20 | 	@echo "Deleting build directory"
21 | 	rm -rf "$(BUILDDIR)"
22 | 
23 | .PHONY: help Makefile clean
24 | 
25 | # Catch-all target: route all unknown targets to Sphinx using the new
26 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
27 | %: Makefile
28 | 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)/$(VERSION)" $(SPHINXOPTS) $(O)
--------------------------------------------------------------------------------
/docs/doc_push.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright (c) Facebook, Inc. and its affiliates.
 4 | # All rights reserved.
 5 | #
 6 | # This source code is licensed under the BSD-style license found in the
 7 | # LICENSE file in the root directory of this source tree.
 8 | 
 9 | #
10 | # Builds docs from the checked-out HEAD
11 | # and pushes the artifacts to the gh-pages branch in github.com/pytorch/elastic
12 | #
13 | # 1. sphinx generated docs are copied to <ver>/
14 | # 2. if a release tag is found on HEAD then redirects are copied to /latest
15 | # 3. if no release tag is found on HEAD then redirects are copied to /master
16 | #
17 | # gh-pages branch should look as follows:
18 | #
19 | # |- 0.1.0rc2
20 | # |- 0.1.0rc3
21 | # |- <ver>
22 | # |- master (redirects to the most recent ver in trunk, including release)
23 | # |- latest (redirects to the most recent release)
24 | # If the most recent release is 0.1.0 and master is at 0.1.1rc1 then,
25 | # https://pytorch.org/elastic/master -> https://pytorch.org/elastic/0.1.1rc1
26 | # https://pytorch.org/elastic/latest -> https://pytorch.org/elastic/0.1.0
27 | #
28 | # Redirects are done via Jekyll redirect-from plugin.
See: 29 | # sources/scripts/create_redirect_md.py 30 | # Makefile (redirect target) 31 | # (on gh-pages branch) _layouts/docs_redirect.html 32 | 33 | dry_run=0 34 | for arg in "$@"; do 35 | shift 36 | case "$arg" in 37 | "--dry-run") dry_run=1 ;; 38 | "--help") echo "Usage $0 [--dry-run]"; exit 0 ;; 39 | esac 40 | done 41 | 42 | repo_root=$(git rev-parse --show-toplevel) 43 | branch=$(git rev-parse --abbrev-ref HEAD) 44 | commit_id=$(git rev-parse --short HEAD) 45 | 46 | if ! release_tag=$(git describe --tags --exact-match HEAD 2>/dev/null); then 47 | echo "No release tag found, building docs for master..." 48 | redirects=(master) 49 | release_tag="master" 50 | else 51 | echo "Release tag $release_tag found, building docs for release..." 52 | redirects=(latest master) 53 | fi 54 | 55 | echo "Installing torchelastic from $repo_root..." 56 | cd "$repo_root" || exit 57 | pip uninstall -y torchelastic 58 | python setup.py install 59 | 60 | torchelastic_ver=$(python -c "import torchelastic; print(torchelastic.__version__)") 61 | 62 | echo "Building PyTorch Elastic v$torchelastic_ver docs..." 63 | docs_dir=$repo_root/docs 64 | build_dir=$docs_dir/build 65 | cd "$docs_dir" || exit 66 | pip install -r requirements.txt 67 | make clean html 68 | echo "Doc build complete" 69 | 70 | if [ $dry_run -eq 1 ]; then 71 | echo "*** dry-run mode, building only. See build artifacts in: $build_dir" 72 | exit 73 | fi 74 | 75 | tmp_dir=/tmp/torchelastic_docs_tmp 76 | rm -rf "${tmp_dir:?}" 77 | 78 | echo "Checking out gh-pages branch..." 79 | gh_pages_dir="$tmp_dir/elastic_gh_pages" 80 | git clone -b gh-pages --single-branch git@github.com:pytorch/elastic.git $gh_pages_dir 81 | 82 | echo "Copying doc pages for $torchelastic_ver into $gh_pages_dir..." 83 | rm -rf "${gh_pages_dir:?}/${torchelastic_ver:?}" 84 | cp -R "$build_dir/$torchelastic_ver/html" "$gh_pages_dir/$torchelastic_ver" 85 | 86 | for redirect in "${redirects[@]}"; do 87 | echo "Copying redirects for $redirect -> $torchelastic_ver..." 88 | rm -rf "${gh_pages_dir:?}/${redirect:?}" 89 | cp -R "$build_dir/redirects" "$gh_pages_dir/$redirect" 90 | done 91 | 92 | if [ "$release_tag" != "master" ]; then 93 | echo "Copying redirects for default(latest) -> $torchelastic_ver..." 94 | cp -R "$build_dir/redirects/." "$gh_pages_dir" 95 | fi 96 | 97 | cd $gh_pages_dir || exit 98 | git add . 99 | git commit --quiet -m "[doc_push][$release_tag] built from $commit_id ($branch). Redirects: ${redirects[*]} -> $torchelastic_ver." 
100 | 
--------------------------------------------------------------------------------
/docs/requirements.txt:
--------------------------------------------------------------------------------
1 | sphinx
2 | -e git+http://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
3 | sphinxcontrib.katex
4 | matplotlib
--------------------------------------------------------------------------------
/docs/source/_static/img/efs-setup.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytorch/elastic/bc88e6982961d4117e53c4c8163ecf277f35c2c5/docs/source/_static/img/efs-setup.jpg
--------------------------------------------------------------------------------
/docs/source/_static/img/pytorch-logo-dark.svg:
--------------------------------------------------------------------------------
(SVG markup omitted)
--------------------------------------------------------------------------------
/docs/source/_static/img/pytorch-logo-flame.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytorch/elastic/bc88e6982961d4117e53c4c8163ecf277f35c2c5/docs/source/_static/img/pytorch-logo-flame.png
--------------------------------------------------------------------------------
/docs/source/conf.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | # -*- coding: utf-8 -*-
 3 | #
 4 | # PyTorch documentation build configuration file, created by
 5 | # sphinx-quickstart on Fri Dec 23 13:31:47 2016.
 6 | #
 7 | # This file is execfile()d with the current directory set to its
 8 | # containing dir.
 9 | #
10 | # Note that not all possible configuration values are present in this
11 | # autogenerated file.
12 | #
13 | # All configuration values have a default; values that are commented out
14 | # serve to show the default.
15 | 
16 | import pytorch_sphinx_theme
17 | 
18 | # If extensions (or modules to document with autodoc) are in another directory,
19 | # add these directories to sys.path here. If the directory is relative to the
20 | # documentation root, use os.path.abspath to make it absolute, like shown here.
21 | #
22 | # import os
23 | # import sys
24 | # sys.path.insert(0, os.path.abspath('.'))
25 | from docutils import nodes
26 | from sphinx import addnodes
27 | from sphinx.util.docfields import TypedField
28 | 
29 | 
30 | # -- General configuration ------------------------------------------------
31 | 
32 | # If your documentation needs a minimal Sphinx version, state it here.
33 | #
34 | needs_sphinx = "1.6"
35 | 
36 | # Add any Sphinx extension module names here, as strings. They can be
37 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
38 | # ones.
39 | extensions = [ 40 | "sphinx.ext.autodoc", 41 | "sphinx.ext.autosummary", 42 | "sphinx.ext.doctest", 43 | "sphinx.ext.intersphinx", 44 | "sphinx.ext.todo", 45 | "sphinx.ext.coverage", 46 | "sphinx.ext.napoleon", 47 | "sphinx.ext.viewcode", 48 | "sphinxcontrib.katex", 49 | "sphinx.ext.autosectionlabel", 50 | ] 51 | 52 | # katex options 53 | # 54 | # 55 | 56 | katex_options = r""" 57 | delimiters : [ 58 | {left: "$$", right: "$$", display: true}, 59 | {left: "\\(", right: "\\)", display: false}, 60 | {left: "\\[", right: "\\]", display: true} 61 | ] 62 | """ 63 | 64 | napoleon_use_ivar = True 65 | 66 | # Add any paths that contain templates here, relative to this directory. 67 | templates_path = ["_templates"] 68 | 69 | # The suffix(es) of source filenames. 70 | # You can specify multiple suffix as a list of string: 71 | # 72 | # source_suffix = ['.rst', '.md'] 73 | source_suffix = [".rst", ".md"] 74 | 75 | # The master toctree document. 76 | master_doc = "index" 77 | 78 | # General information about the project. 79 | project = "PyTorch/Elastic" 80 | copyright = "2020, PyTorch Elastic Contributors" 81 | author = "PyTorch Elastic Contributors" 82 | 83 | # The version info for the project you're documenting, acts as replacement for 84 | # |version| and |release|, also used in various other places throughout the 85 | # built documents. 86 | # 87 | # The short X.Y version. 88 | # TODO: change to [:2] at v1.0 89 | version = "v0.2.3.dev0" 90 | # The full version, including alpha/beta/rc tags. 91 | # TODO: verify this works as expected 92 | release = "master" 93 | 94 | # The language for content autogenerated by Sphinx. Refer to documentation 95 | # for a list of supported languages. 96 | # 97 | # This is also used if you do content translation via gettext catalogs. 98 | # Usually you set "language" from the command line for these cases. 99 | language = None 100 | 101 | # List of patterns, relative to source directory, that match files and 102 | # directories to ignore when looking for source files. 103 | # This patterns also effect to html_static_path and html_extra_path 104 | exclude_patterns = [] 105 | 106 | # The name of the Pygments (syntax highlighting) style to use. 107 | pygments_style = "sphinx" 108 | 109 | # If true, `todo` and `todoList` produce output, else they produce nothing. 110 | todo_include_todos = True 111 | 112 | 113 | # -- Options for HTML output ---------------------------------------------- 114 | 115 | # The theme to use for HTML and HTML Help pages. See the documentation for 116 | # a list of builtin themes. 117 | # 118 | html_theme = "pytorch_sphinx_theme" 119 | html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()] 120 | 121 | # Theme options are theme-specific and customize the look and feel of a theme 122 | # further. For a list of options available for each theme, see the 123 | # documentation. 124 | # 125 | html_theme_options = { 126 | "pytorch_project": "elastic", 127 | "collapse_navigation": False, 128 | "display_version": True, 129 | "logo_only": True, 130 | } 131 | 132 | html_logo = "_static/img/pytorch-logo-dark.svg" 133 | 134 | # Add any paths that contain custom static files (such as style sheets) here, 135 | # relative to this directory. They are copied after the builtin static files, 136 | # so a file named "default.css" will overwrite the builtin "default.css". 
137 | html_static_path = ["_static"] 138 | 139 | 140 | def setup(app): 141 | # NOTE: in Sphinx 1.8+ `html_css_files` is an official configuration value 142 | # and can be moved outside of this function (and the setup(app) function 143 | # can be deleted). 144 | html_css_files = [ 145 | "https://cdn.jsdelivr.net/npm/katex@0.10.0-beta/dist/katex.min.css" 146 | ] 147 | 148 | # In Sphinx 1.8 it was renamed to `add_css_file`, 1.7 and prior it is 149 | # `add_stylesheet` (deprecated in 1.8). 150 | add_css = getattr( 151 | app, "add_css_file", getattr(app, "add_stylesheet", None) 152 | ) # noqa B009 153 | for css_file in html_css_files: 154 | add_css(css_file) 155 | 156 | 157 | # -- Options for HTMLHelp output ------------------------------------------ 158 | 159 | # Output file base name for HTML help builder. 160 | htmlhelp_basename = "TorchElasticdoc" 161 | 162 | 163 | # -- Options for LaTeX output --------------------------------------------- 164 | 165 | latex_elements = { 166 | # The paper size ('letterpaper' or 'a4paper'). 167 | # 168 | # 'papersize': 'letterpaper', 169 | # The font size ('10pt', '11pt' or '12pt'). 170 | # 171 | # 'pointsize': '10pt', 172 | # Additional stuff for the LaTeX preamble. 173 | # 174 | # 'preamble': '', 175 | # Latex figure (float) alignment 176 | # 177 | # 'figure_align': 'htbp', 178 | } 179 | 180 | # Grouping the document tree into LaTeX files. List of tuples 181 | # (source start file, target name, title, 182 | # author, documentclass [howto, manual, or own class]). 183 | latex_documents = [ 184 | ( 185 | master_doc, 186 | "pytorch.tex", 187 | "Torchelastic Documentation", 188 | "Torch Contributors", 189 | "manual", 190 | ) 191 | ] 192 | 193 | 194 | # -- Options for manual page output --------------------------------------- 195 | 196 | # One entry per manual page. List of tuples 197 | # (source start file, name, description, authors, manual section). 198 | man_pages = [(master_doc, "Torchelastic", "Torchelastic Documentation", [author], 1)] 199 | 200 | 201 | # -- Options for Texinfo output ------------------------------------------- 202 | 203 | # Grouping the document tree into Texinfo files. List of tuples 204 | # (source start file, target name, title, author, 205 | # dir menu entry, description, category) 206 | texinfo_documents = [ 207 | ( 208 | master_doc, 209 | "Torchelastic", 210 | "Torchelastic Documentation", 211 | author, 212 | "Torchelastic", 213 | "PyTorch Elastic Training", 214 | "Miscellaneous", 215 | ) 216 | ] 217 | 218 | 219 | # Example configuration for intersphinx: refer to the Python standard library. 220 | intersphinx_mapping = { 221 | "python": ("https://docs.python.org/", None), 222 | "numpy": ("https://docs.scipy.org/doc/numpy/", None), 223 | "torch": ("https://pytorch.org/docs/stable/", None), 224 | } 225 | 226 | # -- A patch that prevents Sphinx from cross-referencing ivar tags ------- 227 | # See http://stackoverflow.com/a/41184353/3343043 228 | 229 | 230 | def patched_make_field(self, types, domain, items, **kw): 231 | # `kw` catches `env=None` needed for newer sphinx while maintaining 232 | # backwards compatibility when passed along further down! 
233 | 234 | def handle_item(fieldarg, content): 235 | par = nodes.paragraph() 236 | par += addnodes.literal_strong("", fieldarg) # Patch: this line added 237 | # par.extend(self.make_xrefs(self.rolename, domain, fieldarg, 238 | # addnodes.literal_strong)) 239 | if fieldarg in types: 240 | par += nodes.Text(" (") 241 | # NOTE: using .pop() here to prevent a single type node to be 242 | # inserted twice into the doctree, which leads to 243 | # inconsistencies later when references are resolved 244 | fieldtype = types.pop(fieldarg) 245 | if len(fieldtype) == 1 and isinstance(fieldtype[0], nodes.Text): 246 | typename = "".join(n.astext() for n in fieldtype) 247 | typename = typename.replace("int", "python:int") 248 | typename = typename.replace("long", "python:long") 249 | typename = typename.replace("float", "python:float") 250 | typename = typename.replace("type", "python:type") 251 | par.extend( 252 | self.make_xrefs( 253 | self.typerolename, 254 | domain, 255 | typename, 256 | addnodes.literal_emphasis, 257 | **kw, 258 | ) 259 | ) 260 | else: 261 | par += fieldtype 262 | par += nodes.Text(")") 263 | par += nodes.Text(" -- ") 264 | par += content 265 | return par 266 | 267 | fieldname = nodes.field_name("", self.label) 268 | if len(items) == 1 and self.can_collapse: 269 | fieldarg, content = items[0] 270 | bodynode = handle_item(fieldarg, content) 271 | else: 272 | bodynode = self.list_type() 273 | for fieldarg, content in items: 274 | bodynode += nodes.list_item("", handle_item(fieldarg, content)) 275 | fieldbody = nodes.field_body("", bodynode) 276 | return nodes.field("", fieldname, fieldbody) 277 | 278 | 279 | TypedField.make_field = patched_make_field 280 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | :github_url: https://github.com/pytorch/elastic 2 | 3 | TorchElastic 4 | ================== 5 | 6 | .. important:: TorchElastic has been upstreamed to `PyTorch 1.9 `_. 7 | TSM has been upstreamed to `TorchX `_. 8 | -------------------------------------------------------------------------------- /docs/source/scripts/create_redirect_md.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (c) Facebook, Inc. and its affiliates. 4 | # All rights reserved. 5 | # 6 | # This source code is licensed under the BSD-style license found in the 7 | # LICENSE file in the root directory of this source tree. 
 8 | 
 9 | """
10 | For each rst file, generates a corresponding md file
11 | that redirects http://pytorch.org/elastic/<ver>/<file_path>.html
12 | to http://pytorch.org/elastic/latest/<file_path>.html
13 | """
14 | 
15 | import argparse
16 | import glob
17 | import os
18 | import sys
19 | 
20 | import torchelastic
21 | 
22 | 
23 | def parse_args(args):
24 |     parser = argparse.ArgumentParser()
25 |     parser.add_argument(
26 |         "--source_dir", required=True, help="directory where rst files are"
27 |     )
28 |     parser.add_argument("--build_dir", required=True, help="directory to drop md files")
29 | 
30 |     return parser.parse_args(args[1:])
31 | 
32 | 
33 | if __name__ == "__main__":
34 |     args = parse_args(sys.argv)
35 |     build_ver = torchelastic.__version__
36 |     source_dir = args.source_dir
37 |     build_dir = args.build_dir
38 |     print(f"Creating redirect files from source_dir: {source_dir} into {build_dir}")
39 |     for rst_file in glob.glob(os.path.join(source_dir, "**/*.rst"), recursive=True):
40 |         rst_relative_path = os.path.relpath(rst_file, source_dir)
41 |         md_relative_path = os.path.splitext(rst_relative_path)[0] + ".md"
42 |         html_relative_path = os.path.splitext(rst_relative_path)[0] + ".html"
43 |         md_file = os.path.join(build_dir, md_relative_path)
44 |         os.makedirs(os.path.dirname(md_file), exist_ok=True)
45 | 
46 |         print(f"Creating redirect md for {rst_relative_path} --> {md_file}")
47 |         with open(md_file, "w") as f:
48 |             f.write("---\n")
49 |             f.write("layout: docs_redirect\n")
50 |             f.write("title: PyTorch | Redirect\n")
51 |             f.write(f'redirect_url: "/elastic/{build_ver}/{html_relative_path}"\n')
52 |             f.write("---\n")
53 | 
--------------------------------------------------------------------------------
/docs/src/pip-delete-this-directory.txt:
--------------------------------------------------------------------------------
1 | This file is placed here by pip to indicate the source was put
2 | here by pip.
3 | 
4 | Once this package is successfully installed this source code will be
5 | deleted (unless you remove this file).
6 | 
--------------------------------------------------------------------------------
/examples/Dockerfile:
--------------------------------------------------------------------------------
 1 | ARG BASE_IMAGE=pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
 2 | FROM $BASE_IMAGE
 3 | 
 4 | # install utilities and dependencies
 5 | RUN pip install awscli --upgrade
 6 | RUN pip install classy-vision
 7 | 
 8 | RUN pip uninstall -y torch
 9 | # TODO remove and make the BASE_IMAGE pytorch:1.9.0-cuda11.1-cudnn8-runtime when torch-1.9 releases
10 | RUN pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
11 | 
12 | WORKDIR /workspace
13 | 
14 | # download imagenet tiny for data
15 | RUN apt-get -q update && apt-get -q install -y wget unzip
16 | RUN wget -q http://cs231n.stanford.edu/tiny-imagenet-200.zip && unzip -q tiny-imagenet-200.zip -d data && rm tiny-imagenet-200.zip
17 | 
18 | COPY . ./examples
19 | RUN chmod -R u+x ./examples/bin
20 | RUN examples/bin/install_etcd -d examples/bin
21 | ENV PATH=/workspace/examples/bin:${PATH}
22 | 
23 | # create a template classy project in /workspace/classy_vision
24 | # (see https://classyvision.ai/#quickstart)
25 | RUN classy-project classy_vision
26 | 
27 | USER root
28 | ENTRYPOINT ["python", "-m", "torch.distributed.run"]
29 | CMD ["--help"]
30 | 
--------------------------------------------------------------------------------
/examples/README.md:
--------------------------------------------------------------------------------
 1 | Examples
 2 | =============
 3 | 
 4 | The examples below run on the [torchelastic/examples](https://hub.docker.com/r/torchelastic/examples)
 5 | Docker image, built from the [examples/Dockerfile](https://github.com/pytorch/elastic/blob/master/examples/Dockerfile).
 6 | 
 7 | > **NOTE:** The ``$VERSION`` (e.g. ``0.2.0``) variable is used throughout this page;
 8 | > substitute it with the version of torchelastic you are using.
 9 | > The examples below only work on torchelastic ``>=0.2.0``.
10 | 
11 | Prerequisite
12 | --------------
13 | 
14 | 1. (recommended) Instance with GPU(s)
15 | 2. [Docker](https://docs.docker.com/install/)
16 | 3. [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
17 | 4. ``export VERSION=<torchelastic version>``
18 | 
19 | > **NOTE:** PyTorch data loaders use ``shm``. The default docker ``shm-size``
20 | > is not large enough and will OOM when using multiple data loader workers.
21 | > You must pass ``--shm-size`` to the ``docker run`` command or set the
22 | > number of data loader workers to ``0`` (run on the same process)
23 | > by passing the appropriate option to the script (use the ``--help`` flag
24 | > to see all script options). In the examples below we set ``--shm-size``.
25 | 
26 | Classy Vision
27 | --------------
28 | [Classy Vision](https://classyvision.ai/) is an end-to-end framework
29 | for image and video classification built on PyTorch. It works out-of-the-box
30 | with torchelastic's launcher.
31 | 
32 | Launch two trainers on a single node:
33 | 
34 | ```
35 | >>> docker run --shm-size=2g torchelastic/examples:$VERSION \
36 |        --standalone \
37 |        --nnodes=1 \
38 |        --nproc_per_node=2 \
39 |        /workspace/classy_vision/classy_train.py \
40 |        --config_file /workspace/classy_vision/configs/template_config.json
41 | ```
42 | 
43 | If you have an instance with GPUs, run a worker on each GPU:
44 | 
45 | ```
46 | >>> docker run --shm-size=2g \
47 |        --gpus=all \
48 |        torchelastic/examples:$VERSION \
49 |        --standalone \
50 |        --nnodes=1 \
51 |        --nproc_per_node=$NUM_CUDA_DEVICES \
52 |        /workspace/classy_vision/classy_train.py \
53 |        --device=gpu \
54 |        --config_file /workspace/classy_vision/configs/template_config.json
55 | ```
56 | 
57 | Imagenet
58 | ----------
59 | 
60 | > **NOTE:** an instance with at least one GPU is required for this example
61 | 
62 | Launch ``$NUM_CUDA_DEVICES`` number of workers on a single node:
63 | 
64 | ```
65 | >>> docker run --shm-size=2g --gpus=all torchelastic/examples:$VERSION \
66 |        --standalone \
67 |        --nnodes=1 \
68 |        --nproc_per_node=$NUM_CUDA_DEVICES \
69 |        /workspace/examples/imagenet/main.py \
70 |        --arch resnet18 \
71 |        --epochs 20 \
72 |        --batch-size 32 \
73 |        /workspace/data/tiny-imagenet-200
74 | ```
75 | 
76 | Multi-container
77 | ----------------
78 | We now show how to use the PyTorch Elastic Trainer launcher
79 | to start a distributed application spanning more than one container. The
80 | application is intentionally kept "bare bones" since the
81 | objective is to show how to create a ``torch.distributed.ProcessGroup``
82 | instance. Once a ``ProcessGroup`` is created, you can use any
83 | functionality needed from the ``torch.distributed`` package.
84 | 
85 | The ``docker-compose.yml`` file is based on the example provided with
86 | the [Bitnami ETCD container image](https://hub.docker.com/r/bitnami/etcd/).
87 | 
88 | 
89 | 
90 | ### Obtaining the example repo
91 | 
92 | Clone the PyTorch Elastic Trainer Git repo using
93 | 
94 | ```
95 | git clone https://github.com/pytorch/elastic.git
96 | ```
97 | 
98 | and set an environment variable that points to the elastic repo, e.g.
99 | 
100 | ```
101 | export TORCHELASTIC_HOME=~/elastic
102 | ```
103 | 
104 | ### Building the samples Docker container
105 | 
106 | While you can run the rest of this example using a pre-built Docker
107 | image, you can also build one for yourself. This is especially useful if
108 | you would like to customize the image. To build the image, run:
109 | 
110 | ```
111 | cd $TORCHELASTIC_HOME && docker build -t hello_elastic:dev .
112 | ```
113 | 
114 | ### Running an existing sample
115 | 
116 | This example uses ``docker-compose`` to run two containers: one for the
117 | ETCD service and one for the sample application itself. Docker compose
118 | takes care of all aspects of establishing the network interfaces so the
119 | application container can communicate with the ETCD container.
120 | 
121 | To start the example, run
122 | 
123 | ```
124 | cd $TORCHELASTIC_HOME/examples/multi_container && docker-compose up
125 | ```
126 | 
127 | You should see two sets of outputs, one from ETCD starting up and one
128 | from the application itself. The output from the application looks
129 | something like this:
130 | 
131 | ```
132 | example_1 | INFO 2020-04-03 17:36:31,582 Etcd machines: ['http://etcd-server:2379']
133 | example_1 | *****************************************
134 | example_1 | Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
135 | example_1 | *****************************************
136 | example_1 | INFO 2020-04-03 17:36:31,922 Attempting to join next rendezvous
137 | example_1 | INFO 2020-04-03 17:36:31,929 New rendezvous state created: {'status': 'joinable', 'version': '1', 'participants': []}
138 | example_1 | INFO 2020-04-03 17:36:32,032 Joined rendezvous version 1
139 | ```
140 | 
141 | The high-level differences between single-container and multi-container
142 | launches are:
143 | 
144 | 1. Specify ``--nnodes=$MIN_NODE:$MAX_NODE`` instead of ``--nnodes=1``.
145 | 2. An etcd server must be set up before starting the worker containers.
146 | 3. Remove ``--standalone`` and specify ``--rdzv_backend``, ``--rdzv_endpoint`` and ``--rdzv_id``.
147 | 
148 | For more information see [torch.distributed.run](https://github.com/pytorch/pytorch/blob/master/torch/distributed/run.py).
149 | 
150 | 
151 | 
152 | Multi-node
153 | -----------
154 | 
155 | The multi-node, multi-worker case is similar to running multi-container, multi-worker.
156 | Simply run each container on a separate node, occupying the entire node.
157 | Alternatively, you can use our kubernetes
158 | [elastic job controller](https://github.com/pytorch/elastic/tree/master/kubernetes) to launch a multi-node job.
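Putting the three differences listed above together, a multi-node launch of the imagenet example might look like the following, run on every node (the node bounds, etcd endpoint, and job id are placeholder values; an etcd server must already be reachable at that endpoint):

```
>>> docker run --shm-size=2g --gpus=all torchelastic/examples:$VERSION \
       --nnodes=$MIN_NODE:$MAX_NODE \
       --nproc_per_node=$NUM_CUDA_DEVICES \
       --rdzv_backend=etcd \
       --rdzv_endpoint=$ETCD_HOST:$ETCD_PORT \
       --rdzv_id=$JOB_ID \
       /workspace/examples/imagenet/main.py \
       --arch resnet18 \
       --epochs 20 \
       --batch-size 32 \
       /workspace/data/tiny-imagenet-200
```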
159 | 
160 | > **WARNING**: We recommend you set up a highly available etcd server when
161 | > deploying multi-node jobs in production as this is the single
162 | > point of failure for your jobs. Depending on your use case
163 | > you can either sidecar an etcd server with each job or set up
164 | > a shared etcd server. If etcd does not meet your requirements
165 | > you can implement your own rendezvous handler and use our
166 | > APIs to create a custom launcher.
--------------------------------------------------------------------------------
/examples/bin/fetch_and_run:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright (c) Facebook, Inc. and its affiliates.
 4 | # All rights reserved.
 5 | #
 6 | # This source code is licensed under the BSD-style license found in the
 7 | # LICENSE file in the root directory of this source tree.
 8 | 
 9 | import os
10 | import sys
11 | import tarfile as tar
12 | import tempfile
13 | from urllib.parse import urlparse
14 | 
15 | 
16 | """
17 | Fetches a script or tar.gz from s3 and runs it.
18 | 
19 | Usage:
20 | 
21 | fetch_and_run $HOME/my_script [