├── RaNNC ├── README.md └── main.py ├── .gitignore ├── torchgpipe_ ├── requirements.txt ├── run_transformer.sh ├── README.md ├── tbd_resnet50.py └── transformer_.py ├── .gitattributes ├── Image ├── P2 │ └── 1.png ├── GSPMD │ └── 1.png ├── Automap │ ├── 1.png │ └── group.png ├── overall │ ├── gdp.png │ ├── REGAL.png │ ├── gpipe.png │ ├── placeto.png │ ├── reinforce.png │ └── spotlight.png └── models │ └── wenxin.png ├── Piper ├── runtime.png ├── costmodel.png └── README.md ├── Chimera └── README.md ├── huawei ├── run_tensoropt.sh ├── run_double_recursive.sh ├── README.md ├── tensor_opt.py ├── double_recursive.py └── resnet.py ├── utils ├── dataset.py └── bert_cls.py ├── DeepMind_Automap └── README.md ├── DistIR_ └── README.md ├── PaSE_ └── README.md ├── FlexFlow └── README.md ├── GSPMD └── README.md ├── P^2 └── README.md ├── Models.md └── README.md /RaNNC/README.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /RaNNC/main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This 3 | ''' -------------------------------------------------------------------------------- /torchgpipe_/requirements.txt: -------------------------------------------------------------------------------- 1 | torch>=1.8 2 | torchtext 3 | torchgpipe -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /Image/P2/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/P2/1.png -------------------------------------------------------------------------------- /Image/GSPMD/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/GSPMD/1.png -------------------------------------------------------------------------------- /Piper/runtime.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Piper/runtime.png -------------------------------------------------------------------------------- /Image/Automap/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/Automap/1.png -------------------------------------------------------------------------------- /Piper/costmodel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Piper/costmodel.png -------------------------------------------------------------------------------- /Chimera/README.md: -------------------------------------------------------------------------------- 1 | ## Prepare the code 2 | ```bash 3 | git clone https://github.com/Shigangli/Chimera.git 4 | ``` 5 | 
-------------------------------------------------------------------------------- /Image/overall/gdp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/gdp.png -------------------------------------------------------------------------------- /Image/Automap/group.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/Automap/group.png -------------------------------------------------------------------------------- /Image/models/wenxin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/models/wenxin.png -------------------------------------------------------------------------------- /Image/overall/REGAL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/REGAL.png -------------------------------------------------------------------------------- /Image/overall/gpipe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/gpipe.png -------------------------------------------------------------------------------- /Image/overall/placeto.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/placeto.png -------------------------------------------------------------------------------- /Image/overall/reinforce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/reinforce.png -------------------------------------------------------------------------------- /Image/overall/spotlight.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/spotlight.png -------------------------------------------------------------------------------- /torchgpipe_/run_transformer.sh: -------------------------------------------------------------------------------- 1 | echo "This example requires pytorch >= 1.8" 2 | echo "and at least two CUDA devices." 3 | python transformer_.py -------------------------------------------------------------------------------- /torchgpipe_/README.md: -------------------------------------------------------------------------------- 1 | # Torchgpipe 2 | 3 | torchgpipe is a library that support only Pipeline parallelism. 4 | It can partition a model according to "memory of each layer" or 5 | "elapsed time of each layer". 6 | 7 | It is still a coarse-grained method that has many bubbles in 8 | the pipeline. 9 | 10 | In this directory, we use `torch.distributed.pipeline.sync.Pipe` Instead of torchgpipe, 11 | since its implementation is based on torchgpipe paper. 12 | 13 | Note: 14 | 1. It only supports `nn.Sequential` model, which means we may need to rewrite 15 | our model if we want to use it. 16 | 17 | 2. It is coarse-grained. It partitions stages into parts of the Sequential model. 18 | 19 | 3. Easy to use. Can run with only one process. 
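Below is a minimal sketch (not taken from this repository's own scripts) of how `torch.distributed.pipeline.sync.Pipe`
is typically driven; it assumes PyTorch >= 1.8, at least two CUDA devices, and a toy two-stage model:

```python
import os

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on the RPC framework, so RPC must be initialized even for a single process.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
rpc.init_rpc('worker', rank=0, world_size=1)

# The wrapped model must be an nn.Sequential whose stages already sit on their devices.
stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).cuda(0)
stage2 = nn.Sequential(nn.Linear(4096, 1024)).cuda(1)
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)  # 8 micro-batches per mini-batch

x = torch.randn(32, 1024).cuda(0)   # input goes to the first stage's device
out = model(x).local_value()        # forward() returns an RRef; fetch the tensor
print(out.shape)                    # torch.Size([32, 1024]), located on cuda:1
```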
-------------------------------------------------------------------------------- /huawei/run_tensoropt.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # applicable to GPU 3 | 4 | echo "==============================================================================================================" 5 | echo "Please run the script as: " 6 | echo "bash run_gpu.sh DATA_PATH" 7 | echo "For example: bash run_gpu.sh /path/dataset" 8 | echo "It is better to use the absolute path." 9 | echo "==============================================================================================================" 10 | set -e 11 | DATA_PATH=$1 12 | export DATA_PATH=${DATA_PATH} 13 | 14 | rm -rf tensor_opt 15 | mkdir tensor_opt 16 | cp ./tensor_opt.py ./resnet.py ./tensor_opt 17 | cd ./tensor_opt 18 | echo "start training" 19 | mpirun -n 2 pytest -s -v ./tensor_opt.py > train.log 2>&1 & 20 | -------------------------------------------------------------------------------- /huawei/run_double_recursive.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # applicable to GPU 3 | 4 | echo "==============================================================================================================" 5 | echo "Please run the script as: " 6 | echo "bash run_gpu.sh DATA_PATH" 7 | echo "For example: bash run_gpu.sh /path/dataset" 8 | echo "It is better to use the absolute path." 9 | echo "==============================================================================================================" 10 | set -e 11 | DATA_PATH=$1 12 | export DATA_PATH=${DATA_PATH} 13 | 14 | rm -rf double_recursive 15 | mkdir double_recursive 16 | cp ./double_recursive.py ./resnet.py ./double_recursive 17 | cd ./double_recursive 18 | echo "start training" 19 | mpirun -n 2 pytest -s -v ./double_recursive.py > train.log 2>&1 & 20 | -------------------------------------------------------------------------------- /utils/dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.utils.data import Dataset 4 | import numpy as np 5 | 6 | 7 | class RandomDataset(Dataset): 8 | def __init__(self): 9 | pass 10 | 11 | def __len__(self): 12 | return 1024 13 | 14 | def __getitem__(self, idx): 15 | pt_data = dict() 16 | pt_data['input_ids'] = torch.tensor(np.random.randint(1, 30000, (1, 512)), dtype=torch.long) 17 | pt_data['token_type_ids'] = torch.tensor(np.random.randint(0, 1, (1, 512)), dtype=torch.long) 18 | pt_data['attention_mask'] = torch.tensor(np.random.random((1, 512))) 19 | label = torch.tensor(np.random.randint(0, 1, (1, 1))) 20 | masked_lm_ids = torch.tensor(np.random.randint(0, 30000, (1, 80))) 21 | masked_lm_positions = torch.tensor(np.random.randint(0, 511, (1, 80))) 22 | masked_lm_weights = torch.tensor(np.random.random((1, 80))) 23 | 24 | return pt_data, label, masked_lm_ids, masked_lm_positions, masked_lm_weights 25 | -------------------------------------------------------------------------------- /huawei/README.md: -------------------------------------------------------------------------------- 1 | # Huawei MindSpore 2 | In this directory, we have examples of **TensorOpt** and **Double Recursive Algorithm**, 3 | which are implemented with Mindspore. 4 | 5 | ## Prerequisite 6 | 1. GCC 7.3.0 7 | 2. CUDA 10.1 with cuDNN 7.6.x or CUDA11.1 with cuDNN 8.0.x. 
8 | 9 | Installation Instructions: [link](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions) 10 | 11 | Make sure we have CUDA in our environment `PATH` and `LD_LIBRARY_PATH` like `export PATH=/usr/local/cuda-${version}/bin:$PATH` 12 | and `export LD_LIBRARY_PATH=/usr/local/cuda-${version}/lib64:$LD_LIBRARY_PATH` 13 | 14 | 3. To run experiment on multiple devices, we need NCCL 2.7.6 with CUDA 10.1 or 15 | NCCL 2.7.8 with CUDA 11.1 16 | 17 | 18 | Since I have RTX 3090, which only supported by CUDA11.1, I use CUDA11.1 here. 19 | 20 | Install Mindspore with conda: 21 | ```bash 22 | conda install mindspore-gpu={version} cudatoolkit=11.1 -c mindspore -c conda-forge 23 | ``` 24 | 25 | Check Installation: 26 | ```bash 27 | python -c "import mindspore;mindspore.run_check()" 28 | ``` 29 | 30 | Expected Output: 31 | ```text 32 | mindspore version: 1.5 33 | The result of multiplication calculation is correct, MindSpore has been installed successfully! 34 | ``` 35 | 36 | ## Dataset 37 | We use CIFAR-10 to train Resnet-50 model in this experiment. 38 | 39 | ### Dataset Preparation 40 | > `CIFAR-10` Download Link:。 41 | 42 | On linux machine, we can execute below bash codes to download dataset to directory `cifar-10-batches-bin`。 43 | 44 | ```bash 45 | wget http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz 46 | tar -zxvf cifar-10-binary.tar.gz 47 | ``` 48 | 49 | ## Run Experiment 50 | ### TensorOpt 51 | ```bash 52 | bash run_tensoropt.sh {DATA_PATH} 53 | ``` 54 | 55 | ### Double Recursive 56 | ```bash 57 | bash run_double_recursive.sh {DATA_PATH} 58 | ``` -------------------------------------------------------------------------------- /DeepMind_Automap/README.md: -------------------------------------------------------------------------------- 1 | # AutoMap 2 | 3 | This method is proposed by **DeepMind**. 4 | 5 | This rewrite engine is implemented in MLIR with an XLA backend, and a python API in JAX. Rewrite sequences are evaluated 6 | through compiler-internal cost models (estimating peak memory, runtime and communication). 7 | 8 | ![1](../Image/Automap/1.png) 9 | 10 | AutoMap search strategies based on **Search and Learn** algorithm. 11 | 12 | The Search is based on Monte Carlo Tree Search (MCTS) with upper confidence bound for trees (UCT). 13 | 14 | The Learning uses a DeepMind model called [**Interaction Network**](https://arxiv.org/pdf/1612.00222.pdf). This model 15 | was trained on a dataset of 20k transformer variants, and used to predict the importance of a node to be partitioned. 16 | The top-k most relevant nodes are then passed to MCTS for searching strategies. 17 | 18 | This method costs few minutes to find a Megatron-LM performance-like strategy with **search and learn**. 19 | 20 | ### More work is needed to support that, a learned system in interactive compiler workflows can 21 | 22 | handle a variety of generally unpredictable user models. 23 | 24 | ## Scalability 25 | 26 | For large models, relies on propagation sharding information through subtly shared constants and other computations 27 | **across layers**. Sharing information is brittle. Model with replicated blocks (like transformers, resnet), search 28 | techniques scale unfavourably when having to explicitly rewrite for each layer. So Automap allows users to group 29 | repeated layers together and exposes only a single set of decisions per group. This hint setting helps to search 30 | megatron-lm like strategies within a short time. 
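For reference, the UCT rule driving the MCTS search described above (written here in its standard form, not in the Automap paper's own notation) scores a candidate rewrite action $a$ at search-tree node $s$ as

$$\mathrm{UCT}(s, a) = \frac{Q(s, a)}{N(s, a)} + c \sqrt{\frac{\ln N(s)}{N(s, a)}},$$

where $Q(s, a)$ accumulates the cost-model reward of rollouts through $(s, a)$, $N(\cdot)$ are visit counts, and $c$ balances exploration against exploitation.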
31 | 32 | ![img.png](../Image/Automap/group.png) 33 | 34 | ## Discussion 35 | 36 | ### from paper 37 | 38 | - The results presented in paper is restricted to sharding within the devices of a single host, while assuming data 39 | parallelism across hosts, which simplifies the communication cost model. 40 | - More advanced cost models are required to model multi-host communication, as well as device-specific lowering (like 41 | fusion) 42 | - Pipeline and ZeRO offloading. -------------------------------------------------------------------------------- /DistIR_/README.md: -------------------------------------------------------------------------------- 1 | #DistIR 2 | 3 | # Summary 4 | Users don't need to write dist-ir code. Users can write forward-only code 5 | like PyTorch, and then export it to ONNX or XLA. Then import the ONNX or XLA 6 | model to DistIR. DistIR can then use grid-search to find the best parallelism 7 | strategies. 8 | 9 | There are two ways to measure a strategy in DistIR now, which are Simulator and 10 | PyTorch running time. Simulator use a CostModel to simulate the computation and 11 | communication time. PyTorch running time uses `multiprocessing` library to profile 12 | its executing time. 13 | 14 | Currently, DistIR only support grid search on data-parallelism, tensor-parallleism 15 | and 1F1B PipeDream pipeline-parallelism, as well as the num of micro-batches. 16 | 17 | In the future, DistIR may support ZeRO and overlapping communication and computation. 18 | 19 | 20 | # Experiment 21 | I ran this experiment on a Ubuntu single node with 8 V100. 22 | ## Prerequisite 23 | To run this example, Make sure you have Anaconda installed, and 24 | `git lfs` is runnable. 25 | 26 | To install git lfs on Ubuntu, you can run: 27 | ```bash 28 | curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash 29 | sudo apt-get install git-lfs 30 | ``` 31 | 32 | ### Step1: Environment Settings 33 | ```bash 34 | conda create -n distir python=3.8 35 | conda activate distir 36 | ``` 37 | 38 | ### Step 2: Clone repository 39 | ```bash 40 | git clone https://github.com/microsoft/dist-ir.git 41 | cd dist-ir 42 | ``` 43 | 44 | ### Step 3: Install dependencies 45 | ```bash 46 | pip install pylint, pytest, black 47 | 48 | # DistIR Dependencies 49 | pip install -r requirements.txt 50 | ``` 51 | 52 | ### Step 4: Prepare GPT2-10 ONNX model 53 | ```bash 54 | pushd /tmp 55 | git clone https://github.com/onnx/models.git 56 | pushd models 57 | git checkout bb0d4cf3d4e2a5f7376c13a08d337e86296edbe8h 58 | git lfsm pull --include="text/machine_comprehension/gpt-2/model/gpt2-10.onnx" --exclude "" 59 | popd 60 | popd 61 | mv /tmp/models/text/machine_comprehension/gpt-2/model/gpt2-10.onnx ./ 62 | ``` 63 | 64 | ### Step 5: Check formatting (black) 65 | ```bash 66 | balck --diff --check . 67 | ``` 68 | 69 | ### Step 6: Install dist-ir 70 | ```bash 71 | python setep.py install 72 | ``` 73 | 74 | ### Step 7: Run pytest 75 | ```bash 76 | python -m pytest 77 | ``` 78 | -------------------------------------------------------------------------------- /PaSE_/README.md: -------------------------------------------------------------------------------- 1 | # PaSE 2 | 3 | PaSE is proposed by Baidu-Research. 4 | 5 | Currently, it is implemented by python. So the efficiency may not be high enough. 6 | 7 | ## Advantages: 8 | 9 | 1. PaSE support brute-force search and dynamic-programming based search. 10 | 2. 
Its complexity is O(|V|^2 K^(M+1)), where V is the num of vertices, K is the configurable strategies of a single 11 | layer, M is the maximum num of dependent set. 12 | 13 | ## Limitations: 14 | 15 | 1. Do not support inter-layer pipeline parallelism. Prevent us from overlapping computation and communications between 16 | different layers. 17 | 18 | 2. Not beneficial when the graph is dense, like `DenseNet`, making the `M` size cannot be reduced to an ideal value, and 19 | leading to high runtime overhead. 20 | 21 | 3. Though this method is applicable to `heterogeneous` architecture, it does not explicitly include heterogeneity into 22 | the cost model. 23 | 24 | 4. Ignores several low level details such as `cache effects` in cost model. 25 | 26 | ## How to Use 27 | 28 | ### git clone 29 | 30 | ```bash 31 | git clone https://github.com/baidu-research/PaSE.git 32 | ``` 33 | 34 | ### create virtual environment and install required libraries. 35 | 36 | ```bash 37 | > python3 -m venv ~/env/pase 38 | > source ~/env/pase/bin/activate 39 | > pip install -r requirements.txt 40 | ``` 41 | 42 | ### Execute 43 | 44 | ```bash 45 | > python3 ./scheduler.py --help 46 | usage: scheduler.py [-h] [-p PROCS] [-b BATCH] [-m MODEL] [-g {alexnet,resnet101,inception3,rnnlm,transformer}] [-a {0,1}] [--profile] [--measure] [-d] [--algo {0,1}] 47 | ``` 48 | 49 | ```bash 50 | optional arguments: 51 | -h, --help show this help message and exit 52 | -p PROCS, --procs PROCS 53 | No. of processors. (Default: 32) 54 | -b BATCH, --batch BATCH 55 | Batch size. (Default: 128) 56 | -m MODEL, --model MODEL 57 | Model size. (Default: 128) 58 | -g {alexnet,resnet101,inception3,rnnlm,transformer}, --graph {alexnet,resnet101,inception3,rnnlm,transformer} 59 | Neural net graph. (Default: 'alexnet') 60 | --flops FLOPS Peak FLOPS of each device in TFLOPS. (default: 10.0) 61 | --bw BW Peak inter-connection bandwidth in GBytes/sec (default: 16.0) 62 | --profile Turn on/off profiling. 63 | --measure Turn on/off measurement. 64 | -d, --dump-graph Dump the graph in dot format to the file graph.dot in the working directory. 65 | --algo {0,1} Algorithm to be used to compute strategy (Default: 0). 66 | ``` -------------------------------------------------------------------------------- /FlexFlow/README.md: -------------------------------------------------------------------------------- 1 | # FlexFlow 2 | 3 | ## Summary 4 | FlexFlow uses MCMC algorithm to automatically search strategies for each 5 | operator. 6 | 7 | Currently FlexFlow has a lack of documents to instruct users how to use it well. 8 | 9 | ## Installation 10 | 11 | Follow instructions on [FlexFlow/INSTALL.md](https://github.com/flexflow/FlexFlow/blob/master/INSTALL.md) 12 | 13 | 1. Clone code from github 14 | ```bash 15 | git clone --recursive https://github.com/flexflow/FlexFlow.git 16 | ``` 17 | 2. edit `config/config.linux` to fit your demands. 18 | 19 | 3. build FlexFlow 20 | ```bash 21 | mkdir build 22 | cd build 23 | ../config/config.linux 24 | make 25 | ``` 26 | 27 | 4. python library dependencies 28 | ```bash 29 | pip install cffi 30 | pip install keras-preprocessing 31 | pip install pillow 32 | ``` 33 | 34 | ### Some problems I met while installing 35 | 36 | 1. could not find cmake 37 | 38 | Install cmake using this line of code on ubuntu 39 | ```bash 40 | sudo apt-get install cmake 41 | ``` 42 | 43 | 2. 
could not find hdf5 44 | 45 | Install using this line of code on ubuntu 46 | ```bash 47 | sudo apt-get install libhdf5-serial-dev 48 | ``` 49 | 50 | ## Experiments 51 | follow the [autotune tutorial](https://flexflow.ai/search/) 52 | 53 | execute cmd like 54 | ```bash 55 | ./dlrm -ll:gpu 4 -ll:fsize 12000 -ll:zsize 20000 --arch-sparse-feature-size 64 --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-mlp-bot 64-512-512-64 --arch-mlp-top 576-1024-1024-1024-1 --batch-size 1024 --budget 1000 56 | ``` 57 | to generate the stategies using MCMC. 58 | 59 | ### DLRM example 60 | using 4 GPU, one strategy of an Embedding layer looks like 61 | ```text 62 | [Dense_100] num_dims(2) dims[1,4] device_ids[0,1,2,3] 63 | [Dense_101] num_dims(2) dims[1,4] device_ids[0,1,2,3] 64 | [Dense_102] num_dims(2) dims[1,4] device_ids[0,1,2,3] 65 | [Embedding_103] num_dims(2) dims[1,1] device_ids[3] 66 | [Embedding_104] num_dims(2) dims[1,1] device_ids[2] 67 | [Embedding_105] num_dims(2) dims[1,1] device_ids[2] 68 | [Embedding_106] num_dims(2) dims[1,1] device_ids[0] 69 | [Embedding_107] num_dims(2) dims[1,1] device_ids[1] 70 | [Embedding_108] num_dims(2) dims[1,1] device_ids[1] 71 | [Embedding_109] num_dims(2) dims[1,1] device_ids[3] 72 | [Embedding_110] num_dims(2) dims[1,1] device_ids[0] 73 | [Concat-111] num_dims(2) dims[1,4] device_ids[0,1,2,3] 74 | [Dense_112] num_dims(2) dims[1,4] device_ids[0,1,2,3] 75 | [Dense_113] num_dims(2) dims[1,4] device_ids[0,1,2,3] 76 | [Dense_114] num_dims(2) dims[1,4] device_ids[0,1,2,3] 77 | [Dense_115] num_dims(2) dims[1,4] device_ids[0,1,2,3] 78 | ``` 79 | Each line describes the parallelization configuration for one operator: 80 | `dims` indicates the degree of parallelism for each dimension, and `device_ids` shows the device assignment for each task within an operator. 81 | 82 | It seems quite rational to distribute different embedding tables to every device and 83 | partition a Dense network's weights matrix to every device. 84 | -------------------------------------------------------------------------------- /GSPMD/README.md: -------------------------------------------------------------------------------- 1 | # Code 2 | 3 | The implementation code is in tensorflow/compiler/xla. 4 | An example can be found [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py) 5 | 6 | # Methods Summary 7 | 8 | ## GSPMD 9 | 10 | GSPMD is a system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified 11 | way. 12 | 13 | GSPMD is implemented as an extension to a production ML compiler, XLA. GSPMD natively supports **in-opeartor** 14 | parallelism, which includes data-parallelism, model-parallelism, pipeline parallelism, spatial partitioning as well as 15 | ZeRO. Though pipeline is not an in-operator parallelism, it **could be reduced to an operator partitioning** 16 | problem. 17 | 18 | GSPMD has two independent compiler transformation: sharding completion and per-operator partitioning. 19 | 20 | ### Sharding completion 21 | 22 | GSPMD defines three types of sharding: 23 | 24 | - Replicated: All devices have the same full data. 25 | - Tiled: Each device occupies the corresponding tile in the data tensor, without data duplication. 26 | - Partially tiled. Replicated in subgroups. Each subgroup has a different subset of data. 
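The three sharding types above correspond to the annotation helpers in the `xla_sharding.py` file linked at the top of this README. A rough sketch (the helper names `replicate`, `split` and `mesh_split` and their signatures are assumptions based on that file and may differ between TensorFlow versions; the annotations only take effect once the function is actually partitioned by the XLA/GSPMD backend, e.g. on a TPU slice):

```python
import numpy as np
import tensorflow as tf
from tensorflow.compiler.xla.experimental.xla_sharding import xla_sharding


@tf.function(jit_compile=True)
def annotate(x):  # x: a rank-3 tensor, e.g. shape [8, 1024, 1024]
    # Replicated: every device keeps the full tensor.
    x_rep = xla_sharding.replicate(x)

    # Tiled: dimension 0 is split across 4 devices, with no duplication.
    x_tiled = xla_sharding.split(x, split_dimension=0, num_devices=4)

    # Partially tiled: dim 0 is split along axis 0 of a 2x2 device mesh, while dims
    # mapped to -1 stay replicated inside each 2-device subgroup.
    mesh = np.arange(4).reshape(2, 2)
    x_part = xla_sharding.mesh_split(
        x, device_mesh=mesh, tensor_split_dims_mapping=[0, -1, -1])

    return x_rep, x_tiled, x_part
```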
27 | 28 | ![Examples](../Image/GSPMD/1.png) 29 | 30 | mesh_split(tensor, device_mesh, dims_mapping) is the primary API that GSPMD provides for users. 31 | 32 | #### Pipeline parallelism reduced to tensor sharding 33 | 34 | GSPMD reduce pipelining into a layer-wise sharding problem. Imagine that the layer computation is rewritten in a 35 | stage-parallel way, where a leading layer/stage dimension L has been added to each tensor, and the computation is 36 | entirely parallel on this dimension. This transformation can be done by existing frontends' vectorization support like 37 | JAX's vmap() 38 | and TensorFlow's vectorized_map(). 39 | 40 | User annotate the L dimension to be sharded, then GSPMD will turn the buffer shifting into cross-device communication 41 | via CollectivePermute to do the pipeline. 42 | 43 | This pipelining implementation runs naturally when combined with other types of parallelism in GSPMD. 44 | 45 | However, it could only be used in **homogeneous pipeline stages**(Like BertEncoder in transfromers). For heterogeneous 46 | pipelining, it is recommended to use GPipe, Tera-Pipe, PipeDream, etc. 47 | 48 | #### Intuitive sharding completion 49 | 50 | GSPMD can auto-complete the sharding on every tensor based on limited user annotations. This is implemented as a 51 | compiler pass in XLA. The design goal is to make the sharding decisions intuitive to users even if they only annotate a 52 | small subset of tensors. 53 | 54 | ### Per operator partitioning 55 | 56 | This step automatically partition operator with the result from Sharding completion. 57 | 58 | ### Comparison 59 | 60 | FlexFlow focuses on determining the partitioning policy, while GSPMD focuses on the mechanisms to transform an annotated 61 | graph. They are complementary to each other. Automated search combined with GSPMD could provide a fully automated 62 | system. 63 | 64 | DistIR is an MPMD-style intermediate representation for distributed ML computattions. It focuses on the representation 65 | itself but does not provide a transformation to automatically parallelize programs like GSPMD. 66 | -------------------------------------------------------------------------------- /Piper/README.md: -------------------------------------------------------------------------------- 1 | # Piper 2 | 3 | ## inputs 4 | 5 | Piper is given as input a DNN workload (model), which is represented as a DAG. Each Node corresponds to a single DNN 6 | layer. Each Edge (u, v) corresponds to a data transfer: 7 | node v requires the output of u as its input. The input graph is annotated with a number of attribtues related to 8 | compute times, communication costs and memory usage. These attribtues should be profiled or estimated. 9 | 10 | Furthermore, Piper is given: the number of devices (K),available memory per device (M), the network bandwidth (B), and 11 | the maximum allowed number of microbatches in a batch (N). 12 | 13 | In short: 14 | 15 | - a Directed Acyclic Graph G=(V,E): layer graph of a DNN, 16 | - for every edge (u,v), an associated communication cost c(u,v) (in bytes). 
17 | - number K of devices, memory M per device, network bandwidth B, [maximum] number N of microbatches in a batch, 18 | - for every node/layer v, and every degree t=1,...,K of tensor parallelism: a list T(v,t) of Tensor Model Parallelism 19 | Configurations(TMPC) 20 | 21 | each TMPC contains: 22 | 23 | - a Time-Per-Sample X.p that consists of compute time and communication time in tensor parallelism (data transfers 24 | between the *t* devices) 25 | - for every incoming edge (u,v), a communication cost X.c_forward(u,v), 26 | - an analogous quantity X.c_backward(v,w) for every outgoing edge(u,w), 27 | - the size X.w of parameters/weights on each device, 28 | - memory usage: X.a and X.b such that if layer v is located on a stage with data-parallel degree *d* and data-parallel 29 | degrees of this and later stages sum up to s, then the memory consumption on each device is X.a*[s/d] + X.b ( 30 | PipeDream) 31 | 32 | ## Outputs 33 | 34 | - a collection of contiguous subgraphs/stages S1, S2,...,S_l. and for each stage i: 35 | - the degree d_i of data parallelism, 36 | - the degree t_i of tensor paralleism, 37 | - for each node/layer v in S_i: the index of the TMPC selected in T(v,t_i). 38 | 39 | All nodes in a stage use the same *d* and *t*, but different nodes (even if they are the same type) can use different 40 | TMPCs. For example, some but not all layers in a stage might employ activation recomputation. 41 | 42 | ## Search Space 43 | 44 | Piper find partitions of G, also called stages. Each node/layer is assigned to only one stage. Each stage is executed 45 | using some number *d\*t* devices. `d` is the data parallelism degree, 46 | `t` is the tensor parallelism degree. Also Piper will specify how the tensor parallelism is carried out. 47 | 48 | To determine how to tensor-parallelize every operator in a DNN computation graph is very complex. So Piper reduce this 49 | task to the problem of coming up with good tensor parallelization techniques *for individual layers*. Solving this 50 | problem is not a goal of Piper. Piper directly use some good configurations like Megatron-LM after solving *d* and *t* 51 | out. 52 | 53 | ## Algorithm 54 | 55 | Based on dynamic programming on downsets, extending prior work `DNN partitioning` 56 | 57 | ## Running time 58 | 59 | The compution complexity is O(|V|^2 NK^2), V is the number of nodes, K is the number of devices, and N <= K is the 60 | maximum sum of data-parallel degrees. 61 | 62 | ![img.png](runtime.png) 63 | Bert-32, batch-size=512, devices=2048 : 13 minutes, Bert-64, batch-size=2048,devices=2048 : 2 hour. 64 | 65 | This algorithm can be parallelized, but has not been done yet. 66 | 67 | ## Cost Model Performance 68 | 69 | very close to real performance. 70 | ![img.png](costmodel.png) -------------------------------------------------------------------------------- /torchgpipe_/tbd_resnet50.py: -------------------------------------------------------------------------------- 1 | ''' 2 | An example of using torchgpipe. 3 | 4 | TODO: This file is currently not executing correctly due to multiple submodule exists. 5 | 6 | torchgpipe only support model implemented by nn.Sequential. 
7 | 8 | ''' 9 | from collections import OrderedDict 10 | 11 | import torch 12 | import torch.nn as nn 13 | import torchvision 14 | import torchvision.transforms as transforms 15 | from torchvision.models.resnet import ResNet, Bottleneck 16 | 17 | num_classes = 10 18 | 19 | from torchgpipe import GPipe 20 | from torchgpipe.balance import balance_by_time 21 | 22 | 23 | class ResNet50(ResNet): 24 | def __init__(self, *args, **kwargs): 25 | super(ResNet50, self).__init__( 26 | Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs) 27 | 28 | self.seq1 = nn.Sequential( 29 | self.conv1, 30 | self.bn1, 31 | self.relu, 32 | self.maxpool, 33 | 34 | self.layer1, 35 | self.layer2 36 | ) 37 | 38 | self.seq2 = nn.Sequential( 39 | self.layer3, 40 | self.layer4, 41 | self.avgpool, 42 | ) 43 | 44 | def forward(self, x): 45 | x = self.seq2(self.seq1(x)) 46 | return self.fc(x.view(x.size(0), -1)) 47 | 48 | 49 | def flatten_sequential(module): 50 | def _flatten(module): 51 | for name, child in module.named_children(): 52 | if isinstance(child, nn.Sequential): 53 | for sub_name, sub_child in _flatten(child): 54 | yield f'{name}_{sub_name}', sub_child 55 | else: 56 | yield name, child 57 | 58 | return nn.Sequential(OrderedDict(_flatten(module))) 59 | 60 | 61 | if __name__ == '__main__': 62 | # Data Preparation 63 | transform = transforms.Compose( 64 | [transforms.ToTensor(), 65 | transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) 66 | 67 | batch_size = 4 68 | 69 | trainset = torchvision.datasets.CIFAR10(root='./data', train=True, 70 | download=True, transform=transform) 71 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, 72 | shuffle=True, num_workers=2) 73 | 74 | classes = ('plane', 'car', 'bird', 'cat', 75 | 'deer', 'dog', 'frog', 'horse', 'ship', 'truck') 76 | 77 | # Model Definition 78 | model = ResNet50() 79 | 80 | model = flatten_sequential(model) 81 | 82 | optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9) 83 | 84 | # Get device_numbers 85 | partitions = torch.cuda.device_count() 86 | 87 | # Prepare sample data 88 | sample = torch.rand(4, 3, 224, 224) 89 | 90 | # balance the partitions by time. 
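    # balance_by_time profiles `sample` through the flattened model for a short time and
    # returns per-partition layer counts chosen so the partitions take roughly equal time;
    # balance_by_size would balance on memory footprint instead.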
91 | balance = balance_by_time(partitions, model, sample) 92 | print("The balance is: ", balance) 93 | model = GPipe(model, balance, chunks=8) 94 | 95 | # loss definition 96 | criterion = torch.nn.CrossEntropyLoss() 97 | 98 | for epoch in range(1): # loop over the dataset multiple times 99 | 100 | running_loss = 0.0 101 | for i, data in enumerate(trainloader, 0): 102 | # get the inputs; data is a list of [inputs, labels] 103 | inputs, labels = data 104 | 105 | # zero the parameter gradients 106 | optimizer.zero_grad() 107 | 108 | # forward + backward + optimize 109 | outputs = model(inputs) 110 | loss = criterion(outputs, labels) 111 | loss.backward() 112 | optimizer.step() 113 | 114 | # print statistics 115 | running_loss += loss.item() 116 | if i % 2000 == 1999: # print every 2000 mini-batches 117 | print('[%d, %5d] loss: %.3f' % 118 | (epoch + 1, i + 1, running_loss / 2000)) 119 | running_loss = 0.0 120 | 121 | print('Finished Training') 122 | -------------------------------------------------------------------------------- /P^2/README.md: -------------------------------------------------------------------------------- 1 | Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems For Deep Learning 2 | 3 | From University of Cambridge and DeepMind 4 | 5 | # Summary 6 | 7 | - A new method that reduce the space of software-to-hardware mapping. 8 | 9 | - Offer a synthesis framework %P^2% that can decompose reductions over one or more parallelism axes to sequences of 10 | collectives in a hierarchy- and mapping-aware way. 11 | 12 | - For 69 % of parallelism placements, proposed framework synthesized programs **outperform** default all-reduce 13 | implementation. 14 | 15 | - Use a simulator that exceeds 90% top-10 accuracy to help complement synthesis tool, which therefore reduces the need 16 | for massive evaluations of synthesis results to determine a small set of optimal programs and mappings. 17 | 18 | ## Some concepts explanations 19 | 20 | - **Parallelism axis**: each form of parallelism 21 | - **Parallelism placement**: How we map parallelism over devices. This decides the communication groups as well as 22 | communication cost. 23 | - **reduction axis**: Reduction along the axis of model parallelism or data parallelism. 24 | 25 | ## Contributions 26 | 27 | - Parallelism placement synthesis: Given parallelism axes, reduction axes and a hierarchical system topology, P^2 can 28 | automatically synthesizes hierarchical parallelism placements, where parallelism placement is modelled as a 29 | parallelism matrix mapping from parallelism axes to the system hierarchy. 30 | - Reduction strategy synthesis: For each parallelism placement, P^2 utilizes the system hierarchy to further synthesize 31 | a wide variety of reduction strategies to implement reductions using common collective operations. 32 | - Synthesis hierarchy: parallelism matrix determines a candidate parallelism placement, and it can be used by the 33 | synthesizer to massively reduce the space of programs. 34 | - Simulator: Use a simulator to help evaluate reduction strategies produced by P^2. 35 | 36 | ## Key design 37 | 38 | 1. Parallelism Placement 39 | 40 | Critical idea: To reduce search space, the critical idea of P^2 is to **partition parallelism axes over the system 41 | hierarchy to generate topology-aware parallelism placements**, while still being able to systematically generate a 42 | wide range of parallelism placements. 
43 | 44 | Results: result of parallelism placement synthesis is a parallelism matrix, where each element is a parallelism 45 | factor representing the number of a specific level in the hierarchy that a parallelism form splits the computation 46 | across. 47 | 48 | Parallelism matrices decide communication requirements. 49 | 50 | 2. Reduction Strategy 51 | 52 | For each parallelism matrix, P^2 further synthesizes topology-aware reduction strategies using common collective 53 | operations, which allows us to find the optimal reduction strategy for any given parallelism matrix. 54 | 55 | Below figure shows different reduction strategy on a produced placement. 56 | 57 | ![Figure3](../Image/P2/1.png) 58 | 59 | 3. Formalism of Collective Operations 60 | 61 | **Semantically Invalid reduction steps**:Reduction steps which result in device states that can never reach the final 62 | wanted state. 63 | 64 | P^2 provides a formalism of common collective operations that captures semantic correctness and rules out 65 | semantically invalid programs, massively reducing the synthesis space. 66 | 67 | Each device state is defined as a **state matrix** describing what kind of data a device has. 68 | 69 | 4. Reduction Communication Patterns 70 | 71 | To synthesize reduction strategies effectively, P^2 uses a domain-specific language(DSL) that explores the hierarchy 72 | to generate hierarchical communication patterns. 73 | 74 | 5. Synthesis Hierarchy 75 | 76 | Exploration with multiple reduction axes. 77 | 78 | ## Program Synthesis Algorithm 79 | 80 | ## Results 81 | 82 | - The performance of AllReduce differs significantly among parallelism matrices, up to 448.5x. (Intra-node communication 83 | is faster than inter-node communication) 84 | 85 | - Pruning techniques are effective for the synthesizer to achieve fast synthesis time. 86 | 87 | - If the reduction axes can be put within one node, then a **single step AllReduce** inside that node is the most 88 | effective reduction due to fast local bandwidth. 89 | 90 | - Synthesized programs can help alleviate the impact of parallelism placement. 91 | 92 | - For inter-node reduction, a **topology-aware reduction program** tends to outperform a single step AllReduce, with speedup 93 | on average 1.28x , upto 2.04x. 94 | -------------------------------------------------------------------------------- /utils/bert_cls.py: -------------------------------------------------------------------------------- 1 | ''' 2 | An implementation of Bert_CLS model 3 | ''' 4 | 5 | import torch 6 | from torch import nn 7 | from transformers import BertModel 8 | 9 | 10 | class BertPreTrainingHeads(nn.Module): 11 | def __init__(self, hidden_size, vocab_size, hidden_act=nn.GELU()): 12 | super().__init__() 13 | self.predictions = BertLMPredictionHead( 14 | hidden_size, vocab_size, hidden_act) 15 | self.seq_relationship = nn.Linear(hidden_size, 2) 16 | 17 | def forward(self, sequence_output, pooled_output): 18 | prediction_scores = self.predictions(sequence_output) 19 | seq_relationship_scores = self.seq_relationship(pooled_output) 20 | return prediction_scores, seq_relationship_scores 21 | 22 | 23 | class BertLMPredictionHead(nn.Module): 24 | def __init__(self, hidden_size, vocab_size, hidden_act=nn.GELU()): 25 | super().__init__() 26 | self.transform = BertPredictionHeadTransform(hidden_size, hidden_act) 27 | 28 | # The output weights are the same as the input embeddings, but there is 29 | # an output-only bias for each token. 
30 | self.decoder = nn.Linear(hidden_size, vocab_size, bias=False) 31 | 32 | self.output_bias = nn.Parameter(torch.zeros(vocab_size)) 33 | 34 | # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings` 35 | self.decoder.bias = self.output_bias 36 | 37 | 38 | def forward(self, sequence_output): 39 | sequence_output = self.transform(sequence_output) 40 | sequence_output = self.decoder(sequence_output) 41 | return sequence_output 42 | 43 | 44 | class BertPredictionHeadTransform(nn.Module): 45 | def __init__(self, hidden_size, hidden_act=nn.GELU()): 46 | super().__init__() 47 | self.dense = nn.Linear(hidden_size, hidden_size) 48 | self.transform_act_fn = hidden_act 49 | self.LayerNorm = nn.LayerNorm(hidden_size) 50 | 51 | def forward(self, sequence_output): 52 | sequence_output = self.dense(sequence_output) 53 | sequence_output = self.transform_act_fn(sequence_output) 54 | sequence_output = self.LayerNorm(sequence_output) 55 | return sequence_output 56 | 57 | 58 | class BertLargeCls(nn.Module): 59 | def __init__(self, config): 60 | super().__init__() 61 | 62 | mlm_criterion = nn.CrossEntropyLoss(reduction="none") 63 | self.max_predictions_per_seq = 80 64 | 65 | def get_masked_lm_loss( 66 | logit_blob, 67 | masked_lm_positions, 68 | masked_lm_labels, 69 | label_weights, 70 | max_predictions_per_seq, 71 | ): 72 | # gather valid position indices 73 | logit_blob = torch.gather( 74 | logit_blob, 75 | index=masked_lm_positions.unsqueeze(2).to( 76 | dtype=torch.int64).repeat(1, 1, 30522), 77 | dim=1, 78 | ) 79 | logit_blob = torch.reshape(logit_blob, [-1, 30522]) 80 | label_id_blob = torch.reshape(masked_lm_labels, [-1]) 81 | 82 | # The `positions` tensor might be zero-padded (if the sequence is too 83 | # short to have the maximum number of predictions). The `label_weights` 84 | # tensor has a value of 1.0 for every real prediction and 0.0 for the 85 | # padding predictions. 
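            # The weighted mean below divides the sum of per-token losses weighted by
            # label_weights by the sum of the weights, so padded prediction slots
            # (weight 0.0) contribute nothing to the final loss.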
86 | pre_example_loss = mlm_criterion(logit_blob, label_id_blob.long()) 87 | pre_example_loss = torch.reshape( 88 | pre_example_loss, [-1, max_predictions_per_seq]) 89 | sum_label_weight = torch.sum(label_weights, dim=-1) 90 | sum_label_weight = sum_label_weight / label_weights.shape[0] 91 | numerator = torch.sum(pre_example_loss * label_weights) 92 | denominator = torch.sum(label_weights) + 1e-5 93 | loss = numerator / denominator 94 | return loss 95 | 96 | self.bert = BertModel(config) 97 | self.cls = BertPreTrainingHeads(config.hidden_size, config.vocab_size) 98 | self.ns_criterion = nn.CrossEntropyLoss(reduction="mean") 99 | self.masked_lm_criterion = get_masked_lm_loss 100 | 101 | def forward(self, labels, id, pos, weight, inputs, return_outputs=False): 102 | outputs = self.bert(**inputs) 103 | prediction_scores, seq_relationship_scores = self.cls( 104 | outputs[0], outputs[1]) # last_hidden_state, pooler_output 105 | next_sentence_loss = self.ns_criterion( 106 | seq_relationship_scores.view(-1, 2), labels.long().view(-1) 107 | ) 108 | masked_lm_loss = self.masked_lm_criterion( 109 | prediction_scores, pos, id, weight, max_predictions_per_seq=self.max_predictions_per_seq 110 | ) 111 | 112 | total_loss = next_sentence_loss + masked_lm_loss 113 | return (total_loss, outputs) if return_outputs else total_loss -------------------------------------------------------------------------------- /huawei/tensor_opt.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================ 15 | """ 16 | Resnet50_distributed_training 17 | """ 18 | import os 19 | import mindspore.nn as nn 20 | from mindspore import dtype as mstype 21 | import mindspore.ops as ops 22 | import mindspore.dataset as ds 23 | import mindspore.dataset.vision.c_transforms as vision 24 | import mindspore.dataset.transforms.c_transforms as C 25 | from mindspore.communication import init, get_rank, get_group_size 26 | from mindspore import Tensor, Model 27 | from mindspore.nn import Momentum 28 | from mindspore.context import ParallelMode 29 | from mindspore import context 30 | from mindspore.train.callback import LossMonitor 31 | from resnet import resnet50 32 | 33 | context.set_context(mode=context.GRAPH_MODE, device_target="GPU") 34 | init("nccl") 35 | 36 | 37 | def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1): # pylint: disable=missing-docstring 38 | resize_height = 224 39 | resize_width = 224 40 | rescale = 1.0 / 255.0 41 | shift = 0.0 42 | 43 | # get rank_id and rank_size 44 | rank_id = get_rank() 45 | rank_size = get_group_size() 46 | data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id) 47 | 48 | # define map operations 49 | random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4)) 50 | random_horizontal_op = vision.RandomHorizontalFlip() 51 | resize_op = vision.Resize((resize_height, resize_width)) 52 | rescale_op = vision.Rescale(rescale, shift) 53 | normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023)) 54 | changeswap_op = vision.HWC2CHW() 55 | type_cast_op = C.TypeCast(mstype.int32) 56 | 57 | c_trans = [random_crop_op, random_horizontal_op] 58 | c_trans += [resize_op, rescale_op, normalize_op, changeswap_op] 59 | 60 | # apply map operations on images 61 | data_set = data_set.map(operations=type_cast_op, input_columns="label") 62 | data_set = data_set.map(operations=c_trans, input_columns="image") 63 | 64 | # apply shuffle operations 65 | data_set = data_set.shuffle(buffer_size=10) 66 | 67 | # apply batch operations 68 | data_set = data_set.batch(batch_size=batch_size, drop_remainder=True) 69 | 70 | # apply repeat operations 71 | data_set = data_set.repeat(repeat_num) 72 | 73 | return data_set 74 | 75 | 76 | class SoftmaxCrossEntropyExpand(nn.Cell): # pylint: disable=missing-docstring 77 | def __init__(self, sparse=False): 78 | super(SoftmaxCrossEntropyExpand, self).__init__() 79 | self.exp = ops.Exp() 80 | self.sum = ops.ReduceSum(keep_dims=True) 81 | self.onehot = ops.OneHot() 82 | self.on_value = Tensor(1.0, mstype.float32) 83 | self.off_value = Tensor(0.0, mstype.float32) 84 | self.div = ops.RealDiv() 85 | self.log = ops.Log() 86 | self.sum_cross_entropy = ops.ReduceSum(keep_dims=False) 87 | self.mul = ops.Mul() 88 | self.mul2 = ops.Mul() 89 | self.mean = ops.ReduceMean(keep_dims=False) 90 | self.sparse = sparse 91 | self.max = ops.ReduceMax(keep_dims=True) 92 | self.sub = ops.Sub() 93 | self.eps = Tensor(1e-24, mstype.float32) 94 | 95 | def construct(self, logit, label): # pylint: disable=missing-docstring 96 | logit_max = self.max(logit, -1) 97 | exp = self.exp(self.sub(logit, logit_max)) 98 | exp_sum = self.sum(exp, -1) 99 | softmax_result = self.div(exp, exp_sum) 100 | if self.sparse: 101 | label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value) 102 | 103 | softmax_result_log = self.log(softmax_result + self.eps) 104 | loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1) 
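        # Cross-entropy: negate the summed label * log(softmax) terms and average over the batch.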
105 | loss = self.mul2(ops.scalar_to_array(-1.0), loss) 106 | loss = self.mean(loss, -1) 107 | 108 | return loss 109 | 110 | 111 | def test_train_cifar(epoch_size=10): # pylint: disable=missing-docstring 112 | context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True, 113 | auto_parallel_search_mode='dynamic_programming') 114 | loss_cb = LossMonitor() 115 | data_path = os.getenv('DATA_PATH') 116 | dataset = create_dataset(data_path) 117 | batch_size = 32 118 | num_classes = 10 119 | net = resnet50(batch_size, num_classes) 120 | loss = SoftmaxCrossEntropyExpand(sparse=True) 121 | opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) 122 | model = Model(net, loss_fn=loss, optimizer=opt) 123 | model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=False) 124 | -------------------------------------------------------------------------------- /huawei/double_recursive.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================ 15 | """ 16 | Resnet50_distributed_training 17 | """ 18 | import os 19 | import mindspore.nn as nn 20 | from mindspore import dtype as mstype 21 | import mindspore.ops as ops 22 | import mindspore.dataset as ds 23 | import mindspore.dataset.vision.c_transforms as vision 24 | import mindspore.dataset.transforms.c_transforms as C 25 | from mindspore.communication import init, get_rank, get_group_size 26 | from mindspore import Tensor, Model 27 | from mindspore.nn import Momentum 28 | from mindspore.context import ParallelMode 29 | from mindspore import context 30 | from mindspore.train.callback import LossMonitor 31 | from resnet import resnet50 32 | 33 | context.set_context(mode=context.GRAPH_MODE, device_target="GPU") 34 | init("nccl") 35 | 36 | 37 | def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1): # pylint: disable=missing-docstring 38 | resize_height = 224 39 | resize_width = 224 40 | rescale = 1.0 / 255.0 41 | shift = 0.0 42 | 43 | # get rank_id and rank_size 44 | rank_id = get_rank() 45 | rank_size = get_group_size() 46 | data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id) 47 | 48 | # define map operations 49 | random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4)) 50 | random_horizontal_op = vision.RandomHorizontalFlip() 51 | resize_op = vision.Resize((resize_height, resize_width)) 52 | rescale_op = vision.Rescale(rescale, shift) 53 | normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023)) 54 | changeswap_op = vision.HWC2CHW() 55 | type_cast_op = C.TypeCast(mstype.int32) 56 | 57 | c_trans = [random_crop_op, random_horizontal_op] 58 | c_trans += [resize_op, rescale_op, normalize_op, changeswap_op] 59 | 60 | # apply map operations on images 61 | data_set = 
data_set.map(operations=type_cast_op, input_columns="label") 62 | data_set = data_set.map(operations=c_trans, input_columns="image") 63 | 64 | # apply shuffle operations 65 | data_set = data_set.shuffle(buffer_size=10) 66 | 67 | # apply batch operations 68 | data_set = data_set.batch(batch_size=batch_size, drop_remainder=True) 69 | 70 | # apply repeat operations 71 | data_set = data_set.repeat(repeat_num) 72 | 73 | return data_set 74 | 75 | 76 | class SoftmaxCrossEntropyExpand(nn.Cell): # pylint: disable=missing-docstring 77 | def __init__(self, sparse=False): 78 | super(SoftmaxCrossEntropyExpand, self).__init__() 79 | self.exp = ops.Exp() 80 | self.sum = ops.ReduceSum(keep_dims=True) 81 | self.onehot = ops.OneHot() 82 | self.on_value = Tensor(1.0, mstype.float32) 83 | self.off_value = Tensor(0.0, mstype.float32) 84 | self.div = ops.RealDiv() 85 | self.log = ops.Log() 86 | self.sum_cross_entropy = ops.ReduceSum(keep_dims=False) 87 | self.mul = ops.Mul() 88 | self.mul2 = ops.Mul() 89 | self.mean = ops.ReduceMean(keep_dims=False) 90 | self.sparse = sparse 91 | self.max = ops.ReduceMax(keep_dims=True) 92 | self.sub = ops.Sub() 93 | self.eps = Tensor(1e-24, mstype.float32) 94 | 95 | def construct(self, logit, label): # pylint: disable=missing-docstring 96 | logit_max = self.max(logit, -1) 97 | exp = self.exp(self.sub(logit, logit_max)) 98 | exp_sum = self.sum(exp, -1) 99 | softmax_result = self.div(exp, exp_sum) 100 | if self.sparse: 101 | label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value) 102 | 103 | softmax_result_log = self.log(softmax_result + self.eps) 104 | loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1) 105 | loss = self.mul2(ops.scalar_to_array(-1.0), loss) 106 | loss = self.mean(loss, -1) 107 | 108 | return loss 109 | 110 | 111 | def test_train_cifar(epoch_size=10): # pylint: disable=missing-docstring 112 | context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True, 113 | auto_parallel_search_mode='recursive_programming') 114 | loss_cb = LossMonitor() 115 | data_path = os.getenv('DATA_PATH') 116 | dataset = create_dataset(data_path) 117 | batch_size = 32 118 | num_classes = 10 119 | net = resnet50(batch_size, num_classes) 120 | loss = SoftmaxCrossEntropyExpand(sparse=True) 121 | opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) 122 | model = Model(net, loss_fn=loss, optimizer=opt) 123 | model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=False) 124 | -------------------------------------------------------------------------------- /Models.md: -------------------------------------------------------------------------------- 1 | |Time | Name | Organization | Model size | Dataset size | Devices| Backbone| paper | Parallel Methods | Optimizer | 2 | | ---|--- | --- | --- | --- | --- | --- | --- | --- |---| 3 | |2022.4.5 | PaLM | Google | 540 B | 780 B tokens | 2x TPU v4 pod(2*3072)| Singleton | [arxiv](https://arxiv.org/pdf/2204.02311.pdf) | Pathways, inter-pod: DP=2。intra-pod: DP=256,TP=12 | Adafactor 4 | |2022.2.4 | Megatron-Turing NLG | Microsoft & Nvidia | 530 B |339 B tokens | 560 80GB A100| Singleton| [arxiv](https://arxiv.org/pdf/2201.11990.pdf) |ZeRO-D=2, T=8, P=35| Adam 1.6e-4, beta1=0.9, beta2=0.95 5 | |2021.12.9 | GLaM | Google Brain | 1.162T | 1.6 T tokens | [1024 TPUv3?](\textsc{https://ai.googleblog.com/2021/12/general-and-scalable-parallelization.html}) |GShard MoE Transformer | 
[blog](https://ai.googleblog.com/2021/12/more-efficient-in-context-learnning-with.html) | GShard, 64 experts per MoE layer with 32 MoE layers in total | Unknown 6 | |2021.12.8 |Gopher | Google DeepMind | 280 B | 5GB?? | 4096 16 GB TPUv3 | Singleton | [deepmind](https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf) | T,D,P ZeRO-stage1. details ungiven. Maybe given by [Automap](https://arxiv.org/pdf/2112.02958.pdf)| Adam in pre-traininig, adafactor in fine-tuning 7 | |2021.12.8| Wenxin | Baidu and PengCheng Lab | 260 B | 68 datasets | 1920 Ascend 910 NPU |ERNIE 3.0 Titan| [arxiv](https://arxiv.org/pdf/2112.02752.pdf) | Resource-aware acceleration. D=4, 4D parallelism (DP+MP+PP+ZeRO) | Adam 1e-4 beta1=0.9 beta2=0.95 8 | |2021.10.25| M6-10T | Alibaba | 10T | 16 GB | 512 32GB V100 | M6 | [arxiv](https://arxiv.org/pdf/2110.03888.pdf) | ZeRO-Stage3,Offload, T, P, E | possibly Adafactor (Not given) 9 | |2021.9.28 | Yuan 1.0 | Inspur | 245.7 B | 5 TB | 2128 GPUs (type unknown)| Singleton | [arxiv](https://arxiv.org/pdf/2110.04725.pdf) | T=8, P=38, D=7 | Adam 1.6e-4, beta1=0.9, beta2=0.95 10 | |2021.8 | Jurassic-1 | AI21 | 178B | 300B tokens | Thousands of GPU | Singleton | [tech paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) | Megatron and ZeRO | batch=3.2M tokens, lr: 0.6e-4 11 | |2021.5 |Wudao 2.0 | BAAI | 1.75 T | 4.9 TB | ? | Cogview, CPM | ? | Zero-Stage-2, expert ? | ? 12 | |2021.4.26 | Pangu-alpha | Huawei and PengCheng Lab | 207 B | 1.1 TB | 2048 Ascend 910 NPU| Singleton | [arxiv](https://arxiv.org/pdf/2104.12369.pdf) | T=8, P=16, ZeRO-Stage1-D=16 | Adam 2e-4, beta1=0.9, beta2=0.95 13 | |2020.5 | GPT3 | Open-AI | 175 B | 570 GB | 10000 V100 GPUs? | Singleton |[arxiv](https://arxiv.org/pdf/2005.14165.pdf) | Model and Data parallelism, details unknown| 0.6e10-4, beta1=0.9, beta2=0.95 14 | 15 | ## PaLM 16 | First model that uses Pathways to train. 17 | 18 | | layers | hidden_size | num_heads | FFN_size | seq_length 19 | | ------ | ----------- | --------- | --------- | -----| 20 | | 118 | 18432 | 48 | 73728 | 2048| 21 | 22 | ## Megatron-Turing NLG 23 | 24 | Transformer decoder. 25 | 26 | Use PipeDream pipeline. 27 | 28 | | layers | hidden_size | num_heads | FFN_size | seq_length 29 | | ------ | ----------- | --------- | --------- | -----| 30 | | 105 | 20480 | 128 | ? | 2048| 31 | 32 | ## GLaM 33 | 34 | It only activates a subnetwork of 97B (8%) parameter per token during inference. 35 | 36 | The power consumption is about 1/3 of GPT-3's. 37 | 38 | Use Zero-shot and one-shot setting where the tasks are never seen during training. 39 | 40 | In evaluation, outperform GPT-3 and use less training time to converge. 41 | 42 | ## Gopher 43 | 44 | Perspective: **It is still effective to enlarge model size.** 45 | 46 | ## Wenxin (Ernie 3.0 Titan) 47 | 48 | Universal Representation Module and Task-specific representation module. 49 | 50 | ![wenxin.png](Image/models/wenxin.png) 51 | 52 | Universal Representation Module setup: Scales up the FFN_size to increase model capacity. 53 | 54 | | layers | hidden_size | num_heads | FFN_size | 55 | | ------ | ----------- | --------- | --------- | 56 | | 48 | 12288 | 192 | 196608 | 57 | 58 | Task-specific representation module setup: 59 | 60 | | layers | hidden_size | num_heads | FFN_size | 61 | | ------ | ----------- | --------- | ---------- | 62 | | 12 | 768 | 12 | unknown | 63 | 64 | 65 | ## Yuan 1.0 66 | 67 | 76 layers, hidden size is 16384. 
Global batch size is 3360 and micro batch size is 1. Sequence length is 2048. 68 | 69 | The dataset consists of filtered data: crawled web pages (4200 GB), public dataset (268 GB), Encyclopedia(10.5 GB) and 70 | Books(655 GB). 71 | 72 | Trained using zero-shot and few-shot. 73 | 74 | ## M6-10T 75 | 76 | | layers | hidden_size | num_heads | FFN_size | expert_num | training_batch_size 77 | | ------ | ----------- | --------- | -------- | ---| --- | 78 | | 48 | 1024 | 16 | 9984 | 10240 experts, 80 prototypes | 8 per GPU 79 | 80 | Use Pseudo-to-Real training skill, and offloading to train M6-10T on 512 V100 in ten days. 81 | 82 | This model has no downstream test now. 83 | 84 | ## Jurassic-1 Jumbo 85 | 86 | has a large vocabulary up to 256K while GPT-3 has 50K. 87 | 88 | 89 | | layers | hidden_size | num_heads | FFN_size | 90 | | ------ | ----------- | --------- | -------- | 91 | | 76 | 13824 | 96 | 65536 | 92 | 93 | ## Pangu-alpha 94 | 95 | | layers | hidden_size | num_heads | FFN_size | 96 | | ------ | ----------- | --------- | -------- | 97 | | 64 | 16384 | 128 | 65536 | 98 | 99 | ## GPT-3 175B 100 | 101 | | layers | hidden_size | num_heads | batch_size | 102 | | ------ | ----------- | --------- | ---------- | 103 | | 96 | 12288 | 96 | 3.2 M tokens | 104 | -------------------------------------------------------------------------------- /torchgpipe_/transformer_.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This example is from https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html 3 | ''' 4 | 5 | import sys 6 | import math 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | import tempfile 11 | from torch.nn import TransformerEncoder, TransformerEncoderLayer 12 | 13 | import torch 14 | from torchtext.datasets import WikiText2 15 | from torchtext.data.utils import get_tokenizer 16 | from torchtext.vocab import build_vocab_from_iterator 17 | import time 18 | 19 | if torch.cuda.device_count() < 2: 20 | print('Need at least two GPU devices for this tutorial') 21 | sys.exit(0) 22 | 23 | 24 | class Encoder(nn.Module): 25 | def __init__(self, ntoken, ninp, dropout=0.5): 26 | super(Encoder, self).__init__() 27 | self.pos_encoder = PositionalEncoding(ninp, dropout) 28 | self.encoder = nn.Embedding(ntoken, ninp) 29 | self.ninp = ninp 30 | self.init_weights() 31 | 32 | def init_weights(self): 33 | initrange = 0.1 34 | self.encoder.weight.data.uniform_(-initrange, initrange) 35 | 36 | def forward(self, src): 37 | # Need (S, N) format for encoder. 38 | src = src.t() 39 | src = self.encoder(src) * math.sqrt(self.ninp) 40 | return self.pos_encoder(src) 41 | 42 | 43 | class Decoder(nn.Module): 44 | def __init__(self, ntoken, ninp): 45 | super(Decoder, self).__init__() 46 | self.decoder = nn.Linear(ninp, ntoken) 47 | self.init_weights() 48 | 49 | def init_weights(self): 50 | initrange = 0.1 51 | self.decoder.bias.data.zero_() 52 | self.decoder.weight.data.uniform_(-initrange, initrange) 53 | 54 | def forward(self, inp): 55 | # Need batch dimension first for output of pipeline. 
56 | return self.decoder(inp).permute(1, 0, 2) 57 | 58 | 59 | class PositionalEncoding(nn.Module): 60 | 61 | def __init__(self, d_model, dropout=0.1, max_len=5000): 62 | super(PositionalEncoding, self).__init__() 63 | self.dropout = nn.Dropout(p=dropout) 64 | 65 | pe = torch.zeros(max_len, d_model) 66 | position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) 67 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) 68 | pe[:, 0::2] = torch.sin(position * div_term) 69 | pe[:, 1::2] = torch.cos(position * div_term) 70 | pe = pe.unsqueeze(0).transpose(0, 1) 71 | self.register_buffer('pe', pe) 72 | 73 | def forward(self, x): 74 | x = x + self.pe[:x.size(0), :] 75 | return self.dropout(x) 76 | 77 | 78 | def data_process(raw_text_iter): 79 | data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter] 80 | return torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) 81 | 82 | 83 | def batchify(data, bsz): 84 | # Divide the dataset into bsz parts. 85 | nbatch = data.size(0) // bsz 86 | # Trim off any extra elements that wouldn't cleanly fit (remainders). 87 | data = data.narrow(0, 0, nbatch * bsz) 88 | # Evenly divide the data across the bsz batches. 89 | data = data.view(bsz, -1).t().contiguous() 90 | return data.to(device) 91 | 92 | 93 | def get_total_params(module: torch.nn.Module): 94 | total_params = 0 95 | for param in module.parameters(): 96 | total_params += param.numel() 97 | return total_params 98 | 99 | 100 | def train(): 101 | model.train() # Turn on the train mode 102 | total_loss = 0. 103 | start_time = time.time() 104 | ntokens = len(vocab) 105 | 106 | # Train only for 50 batches to keep script execution time low. 107 | nbatches = min(50 * bptt, train_data.size(0) - 1) 108 | 109 | for batch, i in enumerate(range(0, nbatches, bptt)): 110 | data, targets = get_batch(train_data, i) 111 | optimizer.zero_grad() 112 | # Since the Pipe is only within a single host and process the ``RRef`` 113 | # returned by forward method is local to this node and can simply 114 | # retrieved via ``RRef.local_value()``. 115 | output = model(data).local_value() 116 | # Need to move targets to the device where the output of the 117 | # pipeline resides. 118 | loss = criterion(output.view(-1, ntokens), targets.cuda(1)) 119 | loss.backward() 120 | torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) 121 | optimizer.step() 122 | 123 | total_loss += loss.item() 124 | log_interval = 10 125 | if batch % log_interval == 0 and batch > 0: 126 | cur_loss = total_loss / log_interval 127 | elapsed = time.time() - start_time 128 | print('| epoch {:3d} | {:5d}/{:5d} batches | ' 129 | 'lr {:02.2f} | ms/batch {:5.2f} | ' 130 | 'loss {:5.2f} | ppl {:8.2f}'.format( 131 | epoch, batch, nbatches // bptt, scheduler.get_lr()[0], 132 | elapsed * 1000 / log_interval, 133 | cur_loss, math.exp(cur_loss))) 134 | total_loss = 0 135 | start_time = time.time() 136 | 137 | 138 | def evaluate(eval_model, data_source): 139 | eval_model.eval() # Turn on the evaluation mode 140 | total_loss = 0. 141 | ntokens = len(vocab) 142 | # Evaluate only for 50 batches to keep script execution time low. 143 | nbatches = min(50 * bptt, data_source.size(0) - 1) 144 | with torch.no_grad(): 145 | for i in range(0, nbatches, bptt): 146 | data, targets = get_batch(data_source, i) 147 | output = eval_model(data).local_value() 148 | output_flat = output.view(-1, ntokens) 149 | # Need to move targets to the device where the output of the 150 | # pipeline resides. 
151 | total_loss += len(data) * criterion(output_flat, targets.cuda(1)).item() 152 | return total_loss / (len(data_source) - 1) 153 | 154 | 155 | if __name__ == "__main__": 156 | train_iter = WikiText2(split='train') 157 | tokenizer = get_tokenizer('basic_english') 158 | vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=[""]) 159 | vocab.set_default_index(vocab[""]) 160 | 161 | train_iter, val_iter, test_iter = WikiText2() 162 | train_data = data_process(train_iter) 163 | val_data = data_process(val_iter) 164 | test_data = data_process(test_iter) 165 | 166 | device = torch.device("cuda") 167 | batch_size = 20 168 | eval_batch_size = 10 169 | train_data = batchify(train_data, batch_size) 170 | val_data = batchify(val_data, eval_batch_size) 171 | test_data = batchify(test_data, eval_batch_size) 172 | 173 | bptt = 25 174 | 175 | 176 | def get_batch(source, i): 177 | seq_len = min(bptt, len(source) - 1 - i) 178 | data = source[i:i + seq_len] 179 | target = source[i + 1:i + 1 + seq_len].view(-1) 180 | # Need batch dimension first for pipeline parallelism. 181 | return data.t(), target 182 | 183 | 184 | ntokens = len(vocab) # the size of vocabulary 185 | emsize = 4096 # embedding dimension 186 | nhid = 4096 # the dimension of the feedforward network model in nn.TransformerEncoder 187 | nlayers = 12 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder 188 | nhead = 16 # the number of heads in the multiheadattention models 189 | dropout = 0.2 # the dropout value 190 | 191 | from torch.distributed import rpc 192 | 193 | tmpfile = tempfile.NamedTemporaryFile() 194 | rpc.init_rpc( 195 | name="worker", 196 | rank=0, 197 | world_size=1, 198 | rpc_backend_options=rpc.TensorPipeRpcBackendOptions( 199 | init_method="file://{}".format(tmpfile.name), 200 | # Specifying _transports and _channels is a workaround and we no longer 201 | # will have to specify _transports and _channels for PyTorch 202 | # versions >= 1.8.1 203 | _transports=["ibv", "uv"], 204 | _channels=["cuda_ipc", "cuda_basic"], 205 | ) 206 | ) 207 | 208 | num_gpus = 2 209 | partition_len = ((nlayers - 1) // num_gpus) + 1 210 | 211 | # Add encoder in the beginning. 212 | tmp_list = [Encoder(ntokens, emsize, dropout).cuda(0)] 213 | module_list = [] 214 | 215 | # Add all the necessary transformer blocks. 216 | for i in range(nlayers): 217 | transformer_block = TransformerEncoderLayer(emsize, nhead, nhid, dropout) 218 | if i != 0 and i % (partition_len) == 0: 219 | module_list.append(nn.Sequential(*tmp_list)) 220 | tmp_list = [] 221 | device = i // (partition_len) 222 | tmp_list.append(transformer_block.to(device)) 223 | 224 | # Add decoder in the end. 225 | tmp_list.append(Decoder(ntokens, emsize).cuda(num_gpus - 1)) 226 | module_list.append(nn.Sequential(*tmp_list)) 227 | 228 | from torch.distributed.pipeline.sync import Pipe 229 | 230 | # Build the pipeline. 
231 | chunks = 8 232 | model = Pipe(torch.nn.Sequential(*module_list), chunks=chunks) 233 | 234 | print('Total parameters in model: {:,}'.format(get_total_params(model))) 235 | 236 | criterion = nn.CrossEntropyLoss() 237 | lr = 5.0 # learning rate 238 | optimizer = torch.optim.SGD(model.parameters(), lr=lr) 239 | scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95) 240 | 241 | best_val_loss = float("inf") 242 | epochs = 3 # The number of epochs 243 | best_model = None 244 | 245 | for epoch in range(1, epochs + 1): 246 | epoch_start_time = time.time() 247 | train() 248 | val_loss = evaluate(model, val_data) 249 | print('-' * 89) 250 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | ' 251 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time), 252 | val_loss, math.exp(val_loss))) 253 | print('-' * 89) 254 | 255 | if val_loss < best_val_loss: 256 | best_val_loss = val_loss 257 | best_model = model 258 | 259 | scheduler.step() 260 | 261 | test_loss = evaluate(best_model, test_data) 262 | print('=' * 89) 263 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format( 264 | test_loss, math.exp(test_loss))) 265 | print('=' * 89) 266 | -------------------------------------------------------------------------------- /huawei/resnet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================ 15 | ''' 16 | Resnet 17 | ''' 18 | import numpy as np 19 | import mindspore.nn as nn 20 | from mindspore import Tensor 21 | import mindspore.ops as ops 22 | 23 | 24 | def weight_variable_0(shape): 25 | """weight_variable_0""" 26 | zeros = np.zeros(shape).astype(np.float32) 27 | return Tensor(zeros) 28 | 29 | 30 | def weight_variable_1(shape): 31 | """weight_variable_1""" 32 | ones = np.ones(shape).astype(np.float32) 33 | return Tensor(ones) 34 | 35 | 36 | def conv3x3(in_channels, out_channels, stride=1, padding=0): 37 | """3x3 convolution """ 38 | return nn.Conv2d(in_channels, out_channels, 39 | kernel_size=3, stride=stride, padding=padding, weight_init='XavierUniform', 40 | has_bias=False, pad_mode="same") 41 | 42 | 43 | def conv1x1(in_channels, out_channels, stride=1, padding=0): 44 | """1x1 convolution""" 45 | return nn.Conv2d(in_channels, out_channels, 46 | kernel_size=1, stride=stride, padding=padding, weight_init='XavierUniform', 47 | has_bias=False, pad_mode="same") 48 | 49 | 50 | def conv7x7(in_channels, out_channels, stride=1, padding=0): 51 | """1x1 convolution""" 52 | return nn.Conv2d(in_channels, out_channels, 53 | kernel_size=7, stride=stride, padding=padding, weight_init='XavierUniform', 54 | has_bias=False, pad_mode="same") 55 | 56 | 57 | def bn_with_initialize(out_channels): 58 | """bn_with_initialize""" 59 | shape = (out_channels) 60 | mean = weight_variable_0(shape) 61 | var = weight_variable_1(shape) 62 | beta = weight_variable_0(shape) 63 | bn = nn.BatchNorm2d(out_channels, momentum=0.99, eps=0.00001, gamma_init='Uniform', 64 | beta_init=beta, moving_mean_init=mean, moving_var_init=var) 65 | return bn 66 | 67 | 68 | def bn_with_initialize_last(out_channels): 69 | """bn_with_initialize_last""" 70 | shape = (out_channels) 71 | mean = weight_variable_0(shape) 72 | var = weight_variable_1(shape) 73 | beta = weight_variable_0(shape) 74 | bn = nn.BatchNorm2d(out_channels, momentum=0.99, eps=0.00001, gamma_init='Uniform', 75 | beta_init=beta, moving_mean_init=mean, moving_var_init=var) 76 | return bn 77 | 78 | 79 | def fc_with_initialize(input_channels, out_channels): 80 | """fc_with_initialize""" 81 | return nn.Dense(input_channels, out_channels, weight_init='XavierUniform', bias_init='Uniform') 82 | 83 | 84 | class ResidualBlock(nn.Cell): 85 | """ResidualBlock""" 86 | expansion = 4 87 | 88 | def __init__(self, 89 | in_channels, 90 | out_channels, 91 | stride=1): 92 | """init block""" 93 | super(ResidualBlock, self).__init__() 94 | 95 | out_chls = out_channels // self.expansion 96 | self.conv1 = conv1x1(in_channels, out_chls, stride=stride, padding=0) 97 | self.bn1 = bn_with_initialize(out_chls) 98 | 99 | self.conv2 = conv3x3(out_chls, out_chls, stride=1, padding=0) 100 | self.bn2 = bn_with_initialize(out_chls) 101 | 102 | self.conv3 = conv1x1(out_chls, out_channels, stride=1, padding=0) 103 | self.bn3 = bn_with_initialize_last(out_channels) 104 | 105 | self.relu = ops.ReLU() 106 | self.add = ops.Add() 107 | 108 | def construct(self, x): 109 | """construct""" 110 | identity = x 111 | 112 | out = self.conv1(x) 113 | out = self.bn1(out) 114 | out = self.relu(out) 115 | 116 | out = self.conv2(out) 117 | out = self.bn2(out) 118 | out = self.relu(out) 119 | 120 | out = self.conv3(out) 121 | out = self.bn3(out) 122 | 123 | out = self.add(out, identity) 124 | out = self.relu(out) 125 | 126 | return out 127 | 128 | 129 | class ResidualBlockWithDown(nn.Cell): 130 | """ResidualBlockWithDown""" 131 | 
expansion = 4 132 | 133 | def __init__(self, 134 | in_channels, 135 | out_channels, 136 | stride=1, 137 | down_sample=False): 138 | """init block with down""" 139 | super(ResidualBlockWithDown, self).__init__() 140 | 141 | out_chls = out_channels // self.expansion 142 | self.conv1 = conv1x1(in_channels, out_chls, stride=stride, padding=0) 143 | self.bn1 = bn_with_initialize(out_chls) 144 | 145 | self.conv2 = conv3x3(out_chls, out_chls, stride=1, padding=0) 146 | self.bn2 = bn_with_initialize(out_chls) 147 | 148 | self.conv3 = conv1x1(out_chls, out_channels, stride=1, padding=0) 149 | self.bn3 = bn_with_initialize_last(out_channels) 150 | 151 | self.relu = ops.ReLU() 152 | self.down_sample = down_sample 153 | 154 | self.conv_down_sample = conv1x1(in_channels, out_channels, stride=stride, padding=0) 155 | self.bn_down_sample = bn_with_initialize(out_channels) 156 | self.add = ops.Add() 157 | 158 | def construct(self, x): 159 | """construct""" 160 | identity = x 161 | 162 | out = self.conv1(x) 163 | out = self.bn1(out) 164 | out = self.relu(out) 165 | 166 | out = self.conv2(out) 167 | out = self.bn2(out) 168 | out = self.relu(out) 169 | 170 | out = self.conv3(out) 171 | out = self.bn3(out) 172 | 173 | identity = self.conv_down_sample(identity) 174 | identity = self.bn_down_sample(identity) 175 | 176 | out = self.add(out, identity) 177 | out = self.relu(out) 178 | 179 | return out 180 | 181 | 182 | class MakeLayer0(nn.Cell): 183 | """MakeLayer0""" 184 | 185 | def __init__(self, block, in_channels, out_channels, stride): 186 | """init""" 187 | super(MakeLayer0, self).__init__() 188 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=1, down_sample=True) 189 | self.b = block(out_channels, out_channels, stride=stride) 190 | self.c = block(out_channels, out_channels, stride=1) 191 | 192 | def construct(self, x): 193 | """construct""" 194 | x = self.a(x) 195 | x = self.b(x) 196 | x = self.c(x) 197 | 198 | return x 199 | 200 | 201 | class MakeLayer1(nn.Cell): 202 | """MakeLayer1""" 203 | 204 | def __init__(self, block, in_channels, out_channels, stride): 205 | """init""" 206 | super(MakeLayer1, self).__init__() 207 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 208 | self.b = block(out_channels, out_channels, stride=1) 209 | self.c = block(out_channels, out_channels, stride=1) 210 | self.d = block(out_channels, out_channels, stride=1) 211 | 212 | def construct(self, x): 213 | """construct""" 214 | x = self.a(x) 215 | x = self.b(x) 216 | x = self.c(x) 217 | x = self.d(x) 218 | 219 | return x 220 | 221 | 222 | class MakeLayer2(nn.Cell): 223 | """MakeLayer2""" 224 | 225 | def __init__(self, block, in_channels, out_channels, stride): 226 | """init""" 227 | super(MakeLayer2, self).__init__() 228 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 229 | self.b = block(out_channels, out_channels, stride=1) 230 | self.c = block(out_channels, out_channels, stride=1) 231 | self.d = block(out_channels, out_channels, stride=1) 232 | self.e = block(out_channels, out_channels, stride=1) 233 | self.f = block(out_channels, out_channels, stride=1) 234 | 235 | def construct(self, x): 236 | """construct""" 237 | x = self.a(x) 238 | x = self.b(x) 239 | x = self.c(x) 240 | x = self.d(x) 241 | x = self.e(x) 242 | x = self.f(x) 243 | 244 | return x 245 | 246 | 247 | class MakeLayer3(nn.Cell): 248 | """MakeLayer3""" 249 | 250 | def __init__(self, block, in_channels, out_channels, stride): 251 | """init""" 252 | 
super(MakeLayer3, self).__init__() 253 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 254 | self.b = block(out_channels, out_channels, stride=1) 255 | self.c = block(out_channels, out_channels, stride=1) 256 | 257 | def construct(self, x): 258 | """construct""" 259 | x = self.a(x) 260 | x = self.b(x) 261 | x = self.c(x) 262 | 263 | return x 264 | 265 | 266 | class Head(nn.Cell): 267 | """Head""" 268 | def __init__(self): 269 | super(Head, self).__init__() 270 | self.conv1 = conv7x7(3, 64, stride=2, padding=0) 271 | self.bn1 = bn_with_initialize(64) 272 | self.relu = ops.ReLU() 273 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same") 274 | 275 | def construct(self, x): 276 | x = self.conv1(x) 277 | x = self.bn1(x) 278 | x = self.relu(x) 279 | x = self.maxpool(x) 280 | return x 281 | 282 | 283 | class ResNet(nn.Cell): 284 | """ResNet""" 285 | 286 | def __init__(self, block, num_classes=100, batch_size=32): 287 | """init""" 288 | super(ResNet, self).__init__() 289 | self.batch_size = batch_size 290 | self.num_classes = num_classes 291 | self.head = Head() 292 | 293 | self.layer1 = MakeLayer0(block, in_channels=64, out_channels=256, stride=1) 294 | self.layer2 = MakeLayer1(block, in_channels=256, out_channels=512, stride=2) 295 | self.layer3 = MakeLayer2(block, in_channels=512, out_channels=1024, stride=2) 296 | self.layer4 = MakeLayer3(block, in_channels=1024, out_channels=2048, stride=2) 297 | 298 | self.pool = ops.ReduceMean(keep_dims=True) 299 | self.squeeze = ops.Squeeze(axis=(2, 3)) 300 | self.fc = fc_with_initialize(512 * block.expansion, num_classes) 301 | 302 | # pipeline parallel config 303 | self.head.pipeline_stage = 0 304 | self.layer1.pipeline_stage = 0 305 | self.layer2.pipeline_stage = 0 306 | self.layer3.pipeline_stage = 1 307 | self.layer4.pipeline_stage = 1 308 | self.fc.pipeline_stage = 1 309 | 310 | def construct(self, x): 311 | """construct""" 312 | x = self.head(x) 313 | x = self.layer1(x) 314 | x = self.layer2(x) 315 | x = self.layer3(x) 316 | x = self.layer4(x) 317 | 318 | x = self.pool(x, (2, 3)) 319 | x = self.squeeze(x) 320 | x = self.fc(x) 321 | return x 322 | 323 | 324 | def resnet50(batch_size, num_classes): 325 | """create resnet50""" 326 | return ResNet(ResidualBlock, num_classes, batch_size) 327 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Concept Explanation 2 | 3 | ## Data Parallelism (DP) 4 | 5 | ## Model Parallelism 6 | 7 | Model Parallelism has two types: Inter-layer and intra-layer. We note Inter-layer model parallelism as MP, and 8 | intra-layer model parallelism as TP (tensor parallelism). 9 | 10 | some researchers may call TP parameter parallelism or intra-layer model parallelism. 11 | 12 | Popular intra-model parallelism methods include 2D, 2.5D, 3D model-parallelism as well as Megatron(1D). There are only 13 | few work related to 2D, 2.5D and 3D now (only Colossal-AI). 14 | 15 | ## Pipeline Parallelism 16 | 17 | The partition of PP and MP are similar, but has different executing behaviors. Basically pipeline parallelism has two 18 | families: PipeDream family and GPipe family. 19 | 20 | # Published methods of auto-parallelism, including: 21 | 22 | I classify parallelism methods according to their partition ways. 
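Before the tables, a short illustrative sketch of what "intra-layer model parallelism (TP)" means in code may help, since the distinction from DP and PP is what the classification below hinges on. This is not code from any system listed here; the class name `ColumnParallelLinear` and the single-process simulation of "shards" are assumptions made purely for illustration (in a real Megatron-style setup each shard lives on its own device and the concatenation is an all-gather collective).

```python
# Minimal sketch of 1D (Megatron-style) tensor parallelism, simulated in one
# process: the Linear layer's output dimension is split into column blocks.
import torch
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, num_shards):
        super().__init__()
        assert out_features % num_shards == 0, "output dim must divide evenly"
        shard_out = out_features // num_shards
        # Each shard holds one column block of the full weight matrix.
        self.shards = nn.ModuleList(
            [nn.Linear(in_features, shard_out) for _ in range(num_shards)]
        )

    def forward(self, x):
        # Every shard consumes the full input; the partial outputs are
        # concatenated, which in a distributed run corresponds to an
        # all-gather across the tensor-parallel group.
        return torch.cat([shard(x) for shard in self.shards], dim=-1)


if __name__ == "__main__":
    layer = ColumnParallelLinear(in_features=512, out_features=2048, num_shards=4)
    y = layer(torch.randn(8, 512))
    print(y.shape)  # torch.Size([8, 2048])
```

By contrast, DP would replicate the whole layer and split the batch, and PP (inter-layer model parallelism) would place whole layers or blocks on different devices, which is exactly how the tables below are grouped.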
23 | 24 | ## Pipeline Parallelism or Inter-layer Model Parallelism only: 25 | 26 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods 27 | | --- | --- | --- | --- | --- | --- | --- | 28 | | ColocRL(REINFORCE) | Uses reinforcement learning to discover model partitions | Google Brain | [mlr.press](http://proceedings.mlr.press/v70/mirhoseini17a/mirhoseini17a.pdf) | Tensorflow | PMLR 70, 2017 | Reinforce 29 | | A hierarchical model for device placement (HDP)| Uses Scotch to do graph partitioning | Google |[link](https://openreview.net/pdf?id=Hkc-TeZ0W) | Tensorflow | ICLR 2018 | Reinforce LSTM 30 | | GPipe| No implementation, see torchgpipe | Google | [arxiv](https://arxiv.org/abs/1811.06965) | None| 2018 on arxiv, NIPS2019 | even partitioning or manual assignment 31 | |[torchgpipe](https://github.com/kakaobrain/torchgpipe)| A GPipe implementation in PyTorch | UNIST | [arxiv](https://arxiv.org/pdf/2004.09910.pdf) | PyTorch | 2020 on arxiv | balance stages by profiling 32 | | GDP | A general deep RL method for automating device placement on arbitrary graphs. Orthogonal to DP, MP, PP | Google| [arxiv](https://export.arxiv.org/pdf/1910.01578.pdf) | Unknown | 2019 on arxiv | Reinforce Transformer 33 | | Pesto | Partitions models based on inter-layer model parallelism | Stony Brook University | [acm](https://www3.cs.stonybrook.edu/~anshul/middleware21_pesto.pdf) | Tensorflow | Middleware '21 | integer linear program 34 | | [vPipe](https://github.com/hku-systems/vpipe) | A pipeline-only system designed for NAS networks. Complementary to hybrid parallelism| HKU | [ieee](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9472938) | PyTorch | TPDS vol.33 no.3 2022 |Swap, Recompute, Partition (SRP) planner. P: Kernighan-Lin algorithm 35 | 36 | ## Data Parallelism + Pipeline Parallelism (or Inter-layer Model Parallelism): 37 | 38 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods | 39 | | --- | --- | --- | --- | --- | --- | --- | 40 | | Spotlight| Models device placement as a Markov decision process (MDP). | University of Toronto | [mlr.press](http://proceedings.mlr.press/v80/gao18a/gao18a.pdf) | Unknown|PMLR 80, 2018 | Reinforce LSTM 41 | | Placeto | Similar to Spotlight's MDP formulation, but with a different policy. | MIT |[nips](https://proceedings.neurips.cc/paper/2019/file/71560ce98c8250ce57a6a970c9991a5f-Paper.pdf) | Tensorflow | NIPS 2019 | Reinforce 42 | |[REGAL](https://github.com/deepmind/deepmind-research/tree/master/regal)|A deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. |Google|[openreview](https://openreview.net/pdf?id=rkxDoJBYPB) | Unknown |ICLR 2020 |RL with Genetic Algorithm 43 | |[PipeDream](https://github.com/msr-fiddle/pipedream) |This repository contains the source code implementation of PipeDream and PipeDream-2BW | Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/1806.03377.pdf) | PyTorch | 2018 on arxiv, SOSP 2019 | Dynamic Programming with Profile 44 | |PipeDream-2BW | See the entry above | Microsoft |[arxiv](https://arxiv.org/pdf/2006.09503.pdf), [mlr.press](http://proceedings.mlr.press/v139/narayanan21a/narayanan21a.pdf) | PyTorch | PMLR 139, 2021 | Dynamic Programming with Profile 45 | |[DNN-partitioning](https://github.com/msr-fiddle/dnn-partitioning)| DNN partitioning algorithms published at NeurIPS 2020.
| Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/2006.16423.pdf) | proof-of-concept implementation | NIPS 2020 |Dynamic Programming and Integer Programming 46 | |HetPipe| Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism | UNIST | [usenix](https://www.usenix.org/system/files/atc20-park.pdf) | PyTorch (not open sourced) | USENIX 2020 | use CPLEX to solve linear programming problem 47 | |[DAPPLE](https://github.com/AlibabaPAI/DAPPLE) | An Efficient Pipelined Data Parallel Approach for Training Large Model. Succeed from GPipe | Alibaba | [arxiv](https://arxiv.org/pdf/2007.01045.pdf) | DAPPLE | 2020 on arxiv; PPoPP 21 | Dynamic Programming 48 | |[PipeTransformer](https://github.com/Distributed-AI/PipeTransformer) |Automated Elastic Pipelining for Distributed Training of Transformers | University of South California | [arxiv](https://arxiv.org/pdf/2102.03161.pdf) |PyTorch | ICML 21 | Dynamic Programming 49 | |[Chimera](https://github.com/Shigangli/Chimera) | Efficiently training large-scale neural networks with bidirectional pipelines | Department of Computer Science, ETH Zurich Switzerland | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3458817.3476145) | PyTorch | SC 2021 | Performance model with brute force 50 | | TAPP | Use a Seq2Seq based on attention mechanism to predict stage for layers. | Hohai University | [mdpi](https://www.mdpi.com/2076-3417/11/11/4785/pdf) | Unknown |Appl.sci. 2021, 11 | Reinforce Seq2Seq based on attention 51 | |[RaNNC](https://github.com/nict-wisdom/rannc/tree/main) | RaNNC is an automatic parallelization middleware used to train very large-scale neural networks. | DIRECT and University of Tokyo | [arxiv](http://arxiv.org/abs/2103.16063) | PyTorch | IPDPS 2021 | dynamic programming 52 | |[HeterPS](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/framework/fleet/heter_ps)| distributed deep learning with RL based scheduling in heterogeneous environment. | Baidu | [arxiv](https://arxiv.org/pdf/2111.10635.pdf) | Paddle | 2021 | Reinforce learning based 53 | |[FTPipe](https://github.com/saareliad/FTPipe) | FTPipe can automatically transform sequential implementation into a multi-GPU one. | Technion-Israel Institute of Technology | [usenix](https://usenix.org/system/files/atc21-eliad.pdf) | PyTorch | 2021 | multiprocessor scheduling problem with profiling. 
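Several planners in the table above (the "Dynamic Programming with Profile" entries in particular) share a common core: profile per-layer costs, then use dynamic programming to cut the layer sequence into pipeline stages. The sketch below is a toy, single-objective version of that idea; it is not the actual algorithm of any listed system, and the function name `partition_layers` and the example timings are made up for illustration.

```python
# Toy DP partitioner: split a profiled layer sequence into `num_stages`
# contiguous pipeline stages while minimizing the slowest (bottleneck) stage.
# Real planners also model inter-stage communication, memory limits, and the
# data-parallel width of each stage.
from functools import lru_cache


def partition_layers(layer_times, num_stages):
    """Return (bottleneck_time, stage_end_indices) for a min-max split."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # best(i, k): minimal bottleneck for layers[i:] using k stages.
        if k == 1:
            return prefix[n] - prefix[i], (n,)
        best_cost, best_cuts = float("inf"), None
        for j in range(i + 1, n - k + 2):  # first stage is layers[i:j]
            stage_cost = prefix[j] - prefix[i]
            rest_cost, rest_cuts = best(j, k - 1)
            cost = max(stage_cost, rest_cost)
            if cost < best_cost:
                best_cost, best_cuts = cost, (j,) + rest_cuts
        return best_cost, best_cuts

    return best(0, num_stages)


if __name__ == "__main__":
    # Hypothetical profiled forward+backward times (ms) for 8 layers.
    times = [4.0, 7.0, 3.0, 6.0, 5.0, 5.0, 2.0, 8.0]
    bottleneck, cuts = partition_layers(times, num_stages=3)
    print(bottleneck, cuts)  # 15.0 (2, 5, 8)
```

The min-max objective is only the simplest instance: systems such as PipeDream and DAPPLE extend the recurrence to decide, for each stage, how many data-parallel replicas it gets and what the communication between adjacent stages costs.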
54 | 55 | ## Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism): 56 | 57 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods | 58 | | --- | --- | --- | --- | --- | --- | --- | 59 | |[OptCNN](https://github.com/flexflow/FlexFlow) | auto parallelism method for CNN |Zhihao Jia | [mlr.press](http://proceedings.mlr.press/v80/jia18a/jia18a.pdf) | FlexFlow | PMLR 80, 2018 | Dynamic Programming based graph search algorithm 60 | |[FlexFlow](https://github.com/flexflow/FlexFlow) | a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies | Zhihao Jia | [stanford](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf) |FlexFlow, compatible with PyTorch, Keras | SysML 2019 | MCMC 61 | |Tofu| Supporting Very Large Models using Automatic Dataflow Graph Partitioning | New York University | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3302424.3303953) | Not OpenSourced | Euro-Sys 2019 | same as OptCNN 62 | |[AccPar](https://github.com/linghaosong/AccPar) |Tensor partitioning for heterogeneous deep learning accelerators. | Linghao Song from USC| [usc.edu](http://alchem.usc.edu/portal/static/download/accpar.pdf) | Need Manually Deploy | 2019 on arxiv, HPCA 2020 | Dynamic Programming 63 | |[TensorOpt](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel) | Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism | CUHK & Huawei | [arxiv](https://arxiv.org/pdf/2004.10856.pdf) | MindSpore | 2020 on arxiv | Dynamic Programming based graph search algorithm 64 | |[ROC](https://github.com/jiazhihao/ROC) | Another paper from Zhihao, Jia. Designed for GNN | Zhihao Jia | [mlsys](https://proceedings.mlsys.org/paper/2020/file/fe9fc289c3ff0af142b6d3bead98a923-Paper.pdf) | On top of Flexflow | MLSys 2020 | uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost. 65 | |[Double Recursive](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel/rec_core) | A Double recursive algorithm to search strategies | Huawei | [link](https://link.springer.com/chapter/10.1007/978-3-030-85665-6_13) | MindSpore | Euro-Par 2021 | Double Recursive 66 | |[PaSE](https://github.com/baidu-research/PaSE) |PaSE uses a dynamic programming based approach to find an efficient strategy within a reasonable time. 
| Baidu Research | [ieee](https://github.com/baidu-research/PaSE/raw/master/docs/PaSE_ipdps2021.pdf) | prototype | IPDPS 2021 | Dynamic Programming 67 | |P^2| Offers a novel syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way |University of Cambridge & DeepMind | [arxiv](https://arxiv.org/pdf/2110.10548.pdf) | Simulation Experiment |2021 on arxiv, MLSys 2022 | Synthesize tool with simulation 68 | |AutoMap| Uses search and learning to find Megatron-like strategies | DeepMind | [arxiv](https://arxiv.org/pdf/2112.02958.pdf) | JAX python API, XLA backend | 2021 on arxiv, NIPS 2021 | Search: Monte Carlo Tree Search; Learn: Interaction Network 69 | 70 | ## Data Parallelism + Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism: 71 | 72 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods| 73 | | --- | --- | --- | --- | --- | --- | --- | 74 | |Auto-MAP| Works on HLO IR. Uses linkage groups to prune the search space and DQN RL to search DP, MP, and PP strategies. | Alibaba | [arxiv](https://arxiv.org/pdf/2007.04069.pdf) | RAINBOW DQN | 2020 | Reinforcement Learning 75 | |[Piper](https://github.com/msr-fiddle/piper) | This code package contains algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021. An extension of DNN partitioning| Microsoft Fiddle| [link](https://www.microsoft.com/en-us/research/publication/piper-multidimensional-planner-for-dnn-parallelization/) | proof-of-concept implementation | NIPS 2021 | two-level dynamic programming 76 | |[GSPMD](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla) |A system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way | Google | [arxiv](https://arxiv.org/pdf/2105.04663.pdf) | Tensorflow XLA | 2021 | sharding propagation 77 | |[DistIR](https://github.com/microsoft/dist-ir) | Horizontal TP. An intermediate representation and simulator for efficient neural network distribution | Stanford University & Microsoft Fiddle| [arxiv](https://arxiv.org/abs/2111.05426) | PyTorch | MLSys 2021 | Grid-Search Simulator 78 | |Neo | A software-hardware co-designed system for high-performance distributed training of large-scale DLRM. | Facebook | [arxiv](https://export.arxiv.org/pdf/2104.05158.pdf) | PyTorch | 2021 | 1. Greedy 2. Karmarker-Karp Algorithm 79 | |Adaptive Paddle| Elastic training, fault tolerance, and cost-model based sharding propagation |Baidu | [arxiv](https://arxiv.org/pdf/2112.02752.pdf) | Paddle | 2021 | Cost model based. Details not given. 80 | |[Alpa](https://github.com/alpa-projects/alpa) | Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | UC Berkeley, Google, etc.
| [arxiv](https://arxiv.org/pdf/2201.12023.pdf) | Jax, XLA | 2022 | Integer Linear for Intra, Dynamic programming for inter 81 | 82 | ## Other Interesting automatic work 83 | 84 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods| 85 | | --- | --- | --- | --- | --- | --- | --- | 86 | |[TASO](https://github.com/jiazhihao/TASO) | automatically optimize DNN computation with graph substitution | Zhihao Jia | 87 | 88 | --- 89 | 90 | # Classify with Machine-Learning Based Methods and Classic Algorithm Based Methods 91 | 92 | ## Machine-Learning Based Methods 93 | 94 | | Name | Method Type | Parallelism | Year | 95 | | --- | --- | --- | --- | 96 | | ColocRL | Reinforcement | MP | 2017 | 97 | | HDP | Reinforcement | MP | 2018 | 98 | | GDP | Reinforcement | MP | 2019 | 99 | | REGAL | Reinforcement | MP | 2020 | 100 | | TAPP | Reinforcement | DP+PP | 2021 | 101 | | Spotlight | Reinforcement | DP+MP | 2018 | 102 | | Placeto | Reinforcement | DP+MP | 2019 | 103 | | HeterPS | Reinforcement | DP+PP | 2021 | 104 | | AutoMap | Deep Learning to predict rank | DP+TP | 2021 | 105 | | Auto-MAP | Reinforcement | DP or TP or PP | 2020 | 106 | | FlexFlow | MCMC | DP+TP | 2019 | 107 | | ROC | uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost. | DP+TP | 2020 | 108 | 109 | ## Classic Algorithm Based Methods 110 | 111 | | Name | Method Type | Parallelism | Year | 112 | | --- | --- | --- | --- | 113 | | Pesto | integer linear | MP | 2021 | 114 | | vpipe | SRP algorithm + KL (DP) | PP | 2022 | 115 | | PipeDream | dynamic programming | DP+PP | 2019 | 116 | | DNN-partitioning | dynamic programming + integer programming | DP+PP | 2020 | 117 | | PipeDream-2BW | dynamic programming | DP+PP | 2021 | 118 | | HetPipe | dynamic programming | DP+PP | 2020 | 119 | | DAPPLE | dynamic programming | DP+PP | 2021 | 120 | | PipeTransformer | dynamic programming | DP+PP | 2021 | 121 | | Chimera | Grid-Search| DP+PP | 2021 | 122 | | RaNNC | dynamic programming | DP+PP | 2021 | 123 | | FTPipe | Multiprocessor scheduling problem with profiling | DP+PP | 2021 | 124 | | OptCNN | dynamic programming | DP+TP | 2018 | 125 | | Tofu | dynamic programming | DP+TP | 2019 | 126 | | AccPar | dynamic programming | DP+TP | 2020 | 127 | | TensorOpt | dynamic programming | DP+TP | 2020 | 128 | | Double Recursive | Double recursive | DP+TP | 2021 | 129 | | PaSE | dynamic programming | DP+TP | 2021 | 130 | | P^2 | Synthesize tool with simulation | DP+TP | 2021 | 131 | | Piper | two-level dynamic programming | DP+TP+PP | 2021 | 132 | | GSPMD | heuristic-propagation | DP+TP+PP | 2021 | 133 | | DistIR | grid search | DP+TP+PP | 2021 | 134 | | Neo | Greedy + Karmarker-karp algorithm | DP+TP+PP | 2021 | 135 | | Alpa | Integer programming + Dynamic Programming | DP+TP+PP | 2022 | 136 | 137 | --- 138 | 139 | ## Pictures 140 | 141 | ### REINFORCE 142 | 143 | ![img.png](Image/overall/reinforce.png) 144 | 145 | ### Spotlight 146 | 147 | ![img.png](Image/overall/spotlight.png) 148 | 149 | ### GPipe 150 | 151 | ![img.png](Image/overall/gpipe.png) 152 | 153 | ### GDP 154 | 155 | ![img.png](Image/overall/gdp.png) 156 | 157 | ### Placeto 158 | 159 | ![img.png](Image/overall/placeto.png) 160 | 161 | ### REGAL 162 | 163 | ![img.png](Image/overall/REGAL.png) 164 | 165 | # News 166 | 167 | 2021.12.9 DeepMind proposes Gopher, a 280 billion parameter transformer language model. Trained by 4096 16GB 168 | TPUv3. 
[link](https://deepmind.com/blog/article/language-modelling-at-scale) 169 | 170 | 2021.12.8 Baidu and PengCheng Lab propose Wenxin (文心), a 260 billion parameter knowledge-aware pretrained model (a.k.a. 171 | ERNIE 3.0 Titan), trained with Adaptive Paddle (see the table above). 172 | 173 | 2021.10.26 Inspur announces Yuan 1.0, a 245.7 billion parameter model, at AICC 2021. --------------------------------------------------------------------------------