├── RaNNC ├── README.md └── main.py ├── .gitignore ├── torchgpipe_ ├── requirements.txt ├── run_transformer.sh ├── README.md ├── tbd_resnet50.py └── transformer_.py ├── .gitattributes ├── Image ├── P2 │ └── 1.png ├── GSPMD │ └── 1.png ├── Automap │ ├── 1.png │ └── group.png ├── overall │ ├── gdp.png │ ├── REGAL.png │ ├── gpipe.png │ ├── placeto.png │ ├── reinforce.png │ └── spotlight.png └── models │ └── wenxin.png ├── Piper ├── runtime.png ├── costmodel.png └── README.md ├── Chimera └── README.md ├── huawei ├── run_tensoropt.sh ├── run_double_recursive.sh ├── README.md ├── tensor_opt.py ├── double_recursive.py └── resnet.py ├── utils ├── dataset.py └── bert_cls.py ├── DeepMind_Automap └── README.md ├── DistIR_ └── README.md ├── PaSE_ └── README.md ├── FlexFlow └── README.md ├── GSPMD └── README.md ├── P^2 └── README.md ├── Models.md └── README.md /RaNNC/README.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /RaNNC/main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This 3 | ''' -------------------------------------------------------------------------------- /torchgpipe_/requirements.txt: -------------------------------------------------------------------------------- 1 | torch>=1.8 2 | torchtext 3 | torchgpipe -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /Image/P2/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/P2/1.png -------------------------------------------------------------------------------- /Image/GSPMD/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/GSPMD/1.png -------------------------------------------------------------------------------- /Piper/runtime.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Piper/runtime.png -------------------------------------------------------------------------------- /Image/Automap/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/Automap/1.png -------------------------------------------------------------------------------- /Piper/costmodel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Piper/costmodel.png -------------------------------------------------------------------------------- /Chimera/README.md: -------------------------------------------------------------------------------- 1 | ## Prepare the code 2 | ```bash 3 | git clone https://github.com/Shigangli/Chimera.git 4 | ``` 5 | 
-------------------------------------------------------------------------------- /Image/overall/gdp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/gdp.png -------------------------------------------------------------------------------- /Image/Automap/group.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/Automap/group.png -------------------------------------------------------------------------------- /Image/models/wenxin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/models/wenxin.png -------------------------------------------------------------------------------- /Image/overall/REGAL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/REGAL.png -------------------------------------------------------------------------------- /Image/overall/gpipe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/gpipe.png -------------------------------------------------------------------------------- /Image/overall/placeto.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/placeto.png -------------------------------------------------------------------------------- /Image/overall/reinforce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/reinforce.png -------------------------------------------------------------------------------- /Image/overall/spotlight.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ConnollyLeon/awesome-Auto-Parallelism/HEAD/Image/overall/spotlight.png -------------------------------------------------------------------------------- /torchgpipe_/run_transformer.sh: -------------------------------------------------------------------------------- 1 | echo "This example requires pytorch >= 1.8" 2 | echo "and at least two CUDA devices." 3 | python transformer_.py -------------------------------------------------------------------------------- /torchgpipe_/README.md: -------------------------------------------------------------------------------- 1 | # Torchgpipe 2 | 3 | torchgpipe is a library that support only Pipeline parallelism. 4 | It can partition a model according to "memory of each layer" or 5 | "elapsed time of each layer". 6 | 7 | It is still a coarse-grained method that has many bubbles in 8 | the pipeline. 9 | 10 | In this directory, we use `torch.distributed.pipeline.sync.Pipe` Instead of torchgpipe, 11 | since its implementation is based on torchgpipe paper. 12 | 13 | Note: 14 | 1. It only supports `nn.Sequential` model, which means we may need to rewrite 15 | our model if we want to use it. 16 | 17 | 2. It is coarse-grained. It partitions stages into parts of the Sequential model. 18 | 19 | 3. Easy to use. Can run with only one process. 
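Below is a minimal sketch (not taken from this repository's own scripts) of how `torch.distributed.pipeline.sync.Pipe`
is typically driven; it assumes PyTorch >= 1.8, at least two CUDA devices, and a toy two-stage model:

```python
import os

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on the RPC framework, so RPC must be initialized even for a single process.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
rpc.init_rpc('worker', rank=0, world_size=1)

# The wrapped model must be an nn.Sequential whose stages already sit on their devices.
stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).cuda(0)
stage2 = nn.Sequential(nn.Linear(4096, 1024)).cuda(1)
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)  # 8 micro-batches per mini-batch

x = torch.randn(32, 1024).cuda(0)   # input goes to the first stage's device
out = model(x).local_value()        # forward() returns an RRef; fetch the tensor
print(out.shape)                    # torch.Size([32, 1024]), located on cuda:1
```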
-------------------------------------------------------------------------------- /huawei/run_tensoropt.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # applicable to GPU 3 | 4 | echo "==============================================================================================================" 5 | echo "Please run the script as: " 6 | echo "bash run_gpu.sh DATA_PATH" 7 | echo "For example: bash run_gpu.sh /path/dataset" 8 | echo "It is better to use the absolute path." 9 | echo "==============================================================================================================" 10 | set -e 11 | DATA_PATH=$1 12 | export DATA_PATH=${DATA_PATH} 13 | 14 | rm -rf tensor_opt 15 | mkdir tensor_opt 16 | cp ./tensor_opt.py ./resnet.py ./tensor_opt 17 | cd ./tensor_opt 18 | echo "start training" 19 | mpirun -n 2 pytest -s -v ./tensor_opt.py > train.log 2>&1 & 20 | -------------------------------------------------------------------------------- /huawei/run_double_recursive.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # applicable to GPU 3 | 4 | echo "==============================================================================================================" 5 | echo "Please run the script as: " 6 | echo "bash run_gpu.sh DATA_PATH" 7 | echo "For example: bash run_gpu.sh /path/dataset" 8 | echo "It is better to use the absolute path." 9 | echo "==============================================================================================================" 10 | set -e 11 | DATA_PATH=$1 12 | export DATA_PATH=${DATA_PATH} 13 | 14 | rm -rf double_recursive 15 | mkdir double_recursive 16 | cp ./double_recursive.py ./resnet.py ./double_recursive 17 | cd ./double_recursive 18 | echo "start training" 19 | mpirun -n 2 pytest -s -v ./double_recursive.py > train.log 2>&1 & 20 | -------------------------------------------------------------------------------- /utils/dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.utils.data import Dataset 4 | import numpy as np 5 | 6 | 7 | class RandomDataset(Dataset): 8 | def __init__(self): 9 | pass 10 | 11 | def __len__(self): 12 | return 1024 13 | 14 | def __getitem__(self, idx): 15 | pt_data = dict() 16 | pt_data['input_ids'] = torch.tensor(np.random.randint(1, 30000, (1, 512)), dtype=torch.long) 17 | pt_data['token_type_ids'] = torch.tensor(np.random.randint(0, 1, (1, 512)), dtype=torch.long) 18 | pt_data['attention_mask'] = torch.tensor(np.random.random((1, 512))) 19 | label = torch.tensor(np.random.randint(0, 1, (1, 1))) 20 | masked_lm_ids = torch.tensor(np.random.randint(0, 30000, (1, 80))) 21 | masked_lm_positions = torch.tensor(np.random.randint(0, 511, (1, 80))) 22 | masked_lm_weights = torch.tensor(np.random.random((1, 80))) 23 | 24 | return pt_data, label, masked_lm_ids, masked_lm_positions, masked_lm_weights 25 | -------------------------------------------------------------------------------- /huawei/README.md: -------------------------------------------------------------------------------- 1 | # Huawei MindSpore 2 | In this directory, we have examples of **TensorOpt** and **Double Recursive Algorithm**, 3 | which are implemented with Mindspore. 4 | 5 | ## Prerequisite 6 | 1. GCC 7.3.0 7 | 2. CUDA 10.1 with cuDNN 7.6.x or CUDA11.1 with cuDNN 8.0.x. 
8 | 9 | Installation Instructions: [link](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions) 10 | 11 | Make sure we have CUDA in our environment `PATH` and `LD_LIBRARY_PATH` like `export PATH=/usr/local/cuda-${version}/bin:$PATH` 12 | and `export LD_LIBRARY_PATH=/usr/local/cuda-${version}/lib64:$LD_LIBRARY_PATH` 13 | 14 | 3. To run experiment on multiple devices, we need NCCL 2.7.6 with CUDA 10.1 or 15 | NCCL 2.7.8 with CUDA 11.1 16 | 17 | 18 | Since I have RTX 3090, which only supported by CUDA11.1, I use CUDA11.1 here. 19 | 20 | Install Mindspore with conda: 21 | ```bash 22 | conda install mindspore-gpu={version} cudatoolkit=11.1 -c mindspore -c conda-forge 23 | ``` 24 | 25 | Check Installation: 26 | ```bash 27 | python -c "import mindspore;mindspore.run_check()" 28 | ``` 29 | 30 | Expected Output: 31 | ```text 32 | mindspore version: 1.5 33 | The result of multiplication calculation is correct, MindSpore has been installed successfully! 34 | ``` 35 | 36 | ## Dataset 37 | We use CIFAR-10 to train Resnet-50 model in this experiment. 38 | 39 | ### Dataset Preparation 40 | > `CIFAR-10` Download Link:。 41 | 42 | On linux machine, we can execute below bash codes to download dataset to directory `cifar-10-batches-bin`。 43 | 44 | ```bash 45 | wget http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz 46 | tar -zxvf cifar-10-binary.tar.gz 47 | ``` 48 | 49 | ## Run Experiment 50 | ### TensorOpt 51 | ```bash 52 | bash run_tensoropt.sh {DATA_PATH} 53 | ``` 54 | 55 | ### Double Recursive 56 | ```bash 57 | bash run_double_recursive.sh {DATA_PATH} 58 | ``` -------------------------------------------------------------------------------- /DeepMind_Automap/README.md: -------------------------------------------------------------------------------- 1 | # AutoMap 2 | 3 | This method is proposed by **DeepMind**. 4 | 5 | This rewrite engine is implemented in MLIR with an XLA backend, and a python API in JAX. Rewrite sequences are evaluated 6 | through compiler-internal cost models (estimating peak memory, runtime and communication). 7 | 8 | ![1](../Image/Automap/1.png) 9 | 10 | AutoMap search strategies based on **Search and Learn** algorithm. 11 | 12 | The Search is based on Monte Carlo Tree Search (MCTS) with upper confidence bound for trees (UCT). 13 | 14 | The Learning uses a DeepMind model called [**Interaction Network**](https://arxiv.org/pdf/1612.00222.pdf). This model 15 | was trained on a dataset of 20k transformer variants, and used to predict the importance of a node to be partitioned. 16 | The top-k most relevant nodes are then passed to MCTS for searching strategies. 17 | 18 | This method costs few minutes to find a Megatron-LM performance-like strategy with **search and learn**. 19 | 20 | ### More work is needed to support that, a learned system in interactive compiler workflows can 21 | 22 | handle a variety of generally unpredictable user models. 23 | 24 | ## Scalability 25 | 26 | For large models, relies on propagation sharding information through subtly shared constants and other computations 27 | **across layers**. Sharing information is brittle. Model with replicated blocks (like transformers, resnet), search 28 | techniques scale unfavourably when having to explicitly rewrite for each layer. So Automap allows users to group 29 | repeated layers together and exposes only a single set of decisions per group. This hint setting helps to search 30 | megatron-lm like strategies within a short time. 
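For reference, the UCT rule driving the MCTS search described above (written here in its standard form, not in the Automap paper's own notation) scores a candidate rewrite action $a$ at search-tree node $s$ as

$$\mathrm{UCT}(s, a) = \frac{Q(s, a)}{N(s, a)} + c \sqrt{\frac{\ln N(s)}{N(s, a)}},$$

where $Q(s, a)$ accumulates the cost-model reward of rollouts through $(s, a)$, $N(\cdot)$ are visit counts, and $c$ balances exploration against exploitation.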
31 | 32 | ![img.png](../Image/Automap/group.png) 33 | 34 | ## Discussion 35 | 36 | ### from paper 37 | 38 | - The results presented in paper is restricted to sharding within the devices of a single host, while assuming data 39 | parallelism across hosts, which simplifies the communication cost model. 40 | - More advanced cost models are required to model multi-host communication, as well as device-specific lowering (like 41 | fusion) 42 | - Pipeline and ZeRO offloading. -------------------------------------------------------------------------------- /DistIR_/README.md: -------------------------------------------------------------------------------- 1 | #DistIR 2 | 3 | # Summary 4 | Users don't need to write dist-ir code. Users can write forward-only code 5 | like PyTorch, and then export it to ONNX or XLA. Then import the ONNX or XLA 6 | model to DistIR. DistIR can then use grid-search to find the best parallelism 7 | strategies. 8 | 9 | There are two ways to measure a strategy in DistIR now, which are Simulator and 10 | PyTorch running time. Simulator use a CostModel to simulate the computation and 11 | communication time. PyTorch running time uses `multiprocessing` library to profile 12 | its executing time. 13 | 14 | Currently, DistIR only support grid search on data-parallelism, tensor-parallleism 15 | and 1F1B PipeDream pipeline-parallelism, as well as the num of micro-batches. 16 | 17 | In the future, DistIR may support ZeRO and overlapping communication and computation. 18 | 19 | 20 | # Experiment 21 | I ran this experiment on a Ubuntu single node with 8 V100. 22 | ## Prerequisite 23 | To run this example, Make sure you have Anaconda installed, and 24 | `git lfs` is runnable. 25 | 26 | To install git lfs on Ubuntu, you can run: 27 | ```bash 28 | curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash 29 | sudo apt-get install git-lfs 30 | ``` 31 | 32 | ### Step1: Environment Settings 33 | ```bash 34 | conda create -n distir python=3.8 35 | conda activate distir 36 | ``` 37 | 38 | ### Step 2: Clone repository 39 | ```bash 40 | git clone https://github.com/microsoft/dist-ir.git 41 | cd dist-ir 42 | ``` 43 | 44 | ### Step 3: Install dependencies 45 | ```bash 46 | pip install pylint, pytest, black 47 | 48 | # DistIR Dependencies 49 | pip install -r requirements.txt 50 | ``` 51 | 52 | ### Step 4: Prepare GPT2-10 ONNX model 53 | ```bash 54 | pushd /tmp 55 | git clone https://github.com/onnx/models.git 56 | pushd models 57 | git checkout bb0d4cf3d4e2a5f7376c13a08d337e86296edbe8h 58 | git lfsm pull --include="text/machine_comprehension/gpt-2/model/gpt2-10.onnx" --exclude "" 59 | popd 60 | popd 61 | mv /tmp/models/text/machine_comprehension/gpt-2/model/gpt2-10.onnx ./ 62 | ``` 63 | 64 | ### Step 5: Check formatting (black) 65 | ```bash 66 | balck --diff --check . 67 | ``` 68 | 69 | ### Step 6: Install dist-ir 70 | ```bash 71 | python setep.py install 72 | ``` 73 | 74 | ### Step 7: Run pytest 75 | ```bash 76 | python -m pytest 77 | ``` 78 | -------------------------------------------------------------------------------- /PaSE_/README.md: -------------------------------------------------------------------------------- 1 | # PaSE 2 | 3 | PaSE is proposed by Baidu-Research. 4 | 5 | Currently, it is implemented by python. So the efficiency may not be high enough. 6 | 7 | ## Advantages: 8 | 9 | 1. PaSE support brute-force search and dynamic-programming based search. 10 | 2. 
Its complexity is O(|V|^2 K^(M+1)), where V is the num of vertices, K is the configurable strategies of a single 11 | layer, M is the maximum num of dependent set. 12 | 13 | ## Limitations: 14 | 15 | 1. Do not support inter-layer pipeline parallelism. Prevent us from overlapping computation and communications between 16 | different layers. 17 | 18 | 2. Not beneficial when the graph is dense, like `DenseNet`, making the `M` size cannot be reduced to an ideal value, and 19 | leading to high runtime overhead. 20 | 21 | 3. Though this method is applicable to `heterogeneous` architecture, it does not explicitly include heterogeneity into 22 | the cost model. 23 | 24 | 4. Ignores several low level details such as `cache effects` in cost model. 25 | 26 | ## How to Use 27 | 28 | ### git clone 29 | 30 | ```bash 31 | git clone https://github.com/baidu-research/PaSE.git 32 | ``` 33 | 34 | ### create virtual environment and install required libraries. 35 | 36 | ```bash 37 | > python3 -m venv ~/env/pase 38 | > source ~/env/pase/bin/activate 39 | > pip install -r requirements.txt 40 | ``` 41 | 42 | ### Execute 43 | 44 | ```bash 45 | > python3 ./scheduler.py --help 46 | usage: scheduler.py [-h] [-p PROCS] [-b BATCH] [-m MODEL] [-g {alexnet,resnet101,inception3,rnnlm,transformer}] [-a {0,1}] [--profile] [--measure] [-d] [--algo {0,1}] 47 | ``` 48 | 49 | ```bash 50 | optional arguments: 51 | -h, --help show this help message and exit 52 | -p PROCS, --procs PROCS 53 | No. of processors. (Default: 32) 54 | -b BATCH, --batch BATCH 55 | Batch size. (Default: 128) 56 | -m MODEL, --model MODEL 57 | Model size. (Default: 128) 58 | -g {alexnet,resnet101,inception3,rnnlm,transformer}, --graph {alexnet,resnet101,inception3,rnnlm,transformer} 59 | Neural net graph. (Default: 'alexnet') 60 | --flops FLOPS Peak FLOPS of each device in TFLOPS. (default: 10.0) 61 | --bw BW Peak inter-connection bandwidth in GBytes/sec (default: 16.0) 62 | --profile Turn on/off profiling. 63 | --measure Turn on/off measurement. 64 | -d, --dump-graph Dump the graph in dot format to the file graph.dot in the working directory. 65 | --algo {0,1} Algorithm to be used to compute strategy (Default: 0). 66 | ``` -------------------------------------------------------------------------------- /FlexFlow/README.md: -------------------------------------------------------------------------------- 1 | # FlexFlow 2 | 3 | ## Summary 4 | FlexFlow uses MCMC algorithm to automatically search strategies for each 5 | operator. 6 | 7 | Currently FlexFlow has a lack of documents to instruct users how to use it well. 8 | 9 | ## Installation 10 | 11 | Follow instructions on [FlexFlow/INSTALL.md](https://github.com/flexflow/FlexFlow/blob/master/INSTALL.md) 12 | 13 | 1. Clone code from github 14 | ```bash 15 | git clone --recursive https://github.com/flexflow/FlexFlow.git 16 | ``` 17 | 2. edit `config/config.linux` to fit your demands. 18 | 19 | 3. build FlexFlow 20 | ```bash 21 | mkdir build 22 | cd build 23 | ../config/config.linux 24 | make 25 | ``` 26 | 27 | 4. python library dependencies 28 | ```bash 29 | pip install cffi 30 | pip install keras-preprocessing 31 | pip install pillow 32 | ``` 33 | 34 | ### Some problems I met while installing 35 | 36 | 1. could not find cmake 37 | 38 | Install cmake using this line of code on ubuntu 39 | ```bash 40 | sudo apt-get install cmake 41 | ``` 42 | 43 | 2. 
could not find hdf5 44 | 45 | Install using this line of code on ubuntu 46 | ```bash 47 | sudo apt-get install libhdf5-serial-dev 48 | ``` 49 | 50 | ## Experiments 51 | follow the [autotune tutorial](https://flexflow.ai/search/) 52 | 53 | execute cmd like 54 | ```bash 55 | ./dlrm -ll:gpu 4 -ll:fsize 12000 -ll:zsize 20000 --arch-sparse-feature-size 64 --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --arch-mlp-bot 64-512-512-64 --arch-mlp-top 576-1024-1024-1024-1 --batch-size 1024 --budget 1000 56 | ``` 57 | to generate the stategies using MCMC. 58 | 59 | ### DLRM example 60 | using 4 GPU, one strategy of an Embedding layer looks like 61 | ```text 62 | [Dense_100] num_dims(2) dims[1,4] device_ids[0,1,2,3] 63 | [Dense_101] num_dims(2) dims[1,4] device_ids[0,1,2,3] 64 | [Dense_102] num_dims(2) dims[1,4] device_ids[0,1,2,3] 65 | [Embedding_103] num_dims(2) dims[1,1] device_ids[3] 66 | [Embedding_104] num_dims(2) dims[1,1] device_ids[2] 67 | [Embedding_105] num_dims(2) dims[1,1] device_ids[2] 68 | [Embedding_106] num_dims(2) dims[1,1] device_ids[0] 69 | [Embedding_107] num_dims(2) dims[1,1] device_ids[1] 70 | [Embedding_108] num_dims(2) dims[1,1] device_ids[1] 71 | [Embedding_109] num_dims(2) dims[1,1] device_ids[3] 72 | [Embedding_110] num_dims(2) dims[1,1] device_ids[0] 73 | [Concat-111] num_dims(2) dims[1,4] device_ids[0,1,2,3] 74 | [Dense_112] num_dims(2) dims[1,4] device_ids[0,1,2,3] 75 | [Dense_113] num_dims(2) dims[1,4] device_ids[0,1,2,3] 76 | [Dense_114] num_dims(2) dims[1,4] device_ids[0,1,2,3] 77 | [Dense_115] num_dims(2) dims[1,4] device_ids[0,1,2,3] 78 | ``` 79 | Each line describes the parallelization configuration for one operator: 80 | `dims` indicates the degree of parallelism for each dimension, and `device_ids` shows the device assignment for each task within an operator. 81 | 82 | It seems quite rational to distribute different embedding tables to every device and 83 | partition a Dense network's weights matrix to every device. 84 | -------------------------------------------------------------------------------- /GSPMD/README.md: -------------------------------------------------------------------------------- 1 | # Code 2 | 3 | The implementation code is in tensorflow/compiler/xla. 4 | An example can be found [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py) 5 | 6 | # Methods Summary 7 | 8 | ## GSPMD 9 | 10 | GSPMD is a system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified 11 | way. 12 | 13 | GSPMD is implemented as an extension to a production ML compiler, XLA. GSPMD natively supports **in-opeartor** 14 | parallelism, which includes data-parallelism, model-parallelism, pipeline parallelism, spatial partitioning as well as 15 | ZeRO. Though pipeline is not an in-operator parallelism, it **could be reduced to an operator partitioning** 16 | problem. 17 | 18 | GSPMD has two independent compiler transformation: sharding completion and per-operator partitioning. 19 | 20 | ### Sharding completion 21 | 22 | GSPMD defines three types of sharding: 23 | 24 | - Replicated: All devices have the same full data. 25 | - Tiled: Each device occupies the corresponding tile in the data tensor, without data duplication. 26 | - Partially tiled. Replicated in subgroups. Each subgroup has a different subset of data. 
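The three sharding types above correspond to the annotation helpers in the `xla_sharding.py` file linked at the top of this README. A rough sketch (the helper names `replicate`, `split` and `mesh_split` and their signatures are assumptions based on that file and may differ between TensorFlow versions; the annotations only take effect once the function is actually partitioned by the XLA/GSPMD backend, e.g. on a TPU slice):

```python
import numpy as np
import tensorflow as tf
from tensorflow.compiler.xla.experimental.xla_sharding import xla_sharding


@tf.function(jit_compile=True)
def annotate(x):  # x: a rank-3 tensor, e.g. shape [8, 1024, 1024]
    # Replicated: every device keeps the full tensor.
    x_rep = xla_sharding.replicate(x)

    # Tiled: dimension 0 is split across 4 devices, with no duplication.
    x_tiled = xla_sharding.split(x, split_dimension=0, num_devices=4)

    # Partially tiled: dim 0 is split along axis 0 of a 2x2 device mesh, while dims
    # mapped to -1 stay replicated inside each 2-device subgroup.
    mesh = np.arange(4).reshape(2, 2)
    x_part = xla_sharding.mesh_split(
        x, device_mesh=mesh, tensor_split_dims_mapping=[0, -1, -1])

    return x_rep, x_tiled, x_part
```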
27 | 28 | ![Examples](../Image/GSPMD/1.png) 29 | 30 | mesh_split(tensor, device_mesh, dims_mapping) is the primary API that GSPMD provides for users. 31 | 32 | #### Pipeline parallelism reduced to tensor sharding 33 | 34 | GSPMD reduce pipelining into a layer-wise sharding problem. Imagine that the layer computation is rewritten in a 35 | stage-parallel way, where a leading layer/stage dimension L has been added to each tensor, and the computation is 36 | entirely parallel on this dimension. This transformation can be done by existing frontends' vectorization support like 37 | JAX's vmap() 38 | and TensorFlow's vectorized_map(). 39 | 40 | User annotate the L dimension to be sharded, then GSPMD will turn the buffer shifting into cross-device communication 41 | via CollectivePermute to do the pipeline. 42 | 43 | This pipelining implementation runs naturally when combined with other types of parallelism in GSPMD. 44 | 45 | However, it could only be used in **homogeneous pipeline stages**(Like BertEncoder in transfromers). For heterogeneous 46 | pipelining, it is recommended to use GPipe, Tera-Pipe, PipeDream, etc. 47 | 48 | #### Intuitive sharding completion 49 | 50 | GSPMD can auto-complete the sharding on every tensor based on limited user annotations. This is implemented as a 51 | compiler pass in XLA. The design goal is to make the sharding decisions intuitive to users even if they only annotate a 52 | small subset of tensors. 53 | 54 | ### Per operator partitioning 55 | 56 | This step automatically partition operator with the result from Sharding completion. 57 | 58 | ### Comparison 59 | 60 | FlexFlow focuses on determining the partitioning policy, while GSPMD focuses on the mechanisms to transform an annotated 61 | graph. They are complementary to each other. Automated search combined with GSPMD could provide a fully automated 62 | system. 63 | 64 | DistIR is an MPMD-style intermediate representation for distributed ML computattions. It focuses on the representation 65 | itself but does not provide a transformation to automatically parallelize programs like GSPMD. 66 | -------------------------------------------------------------------------------- /Piper/README.md: -------------------------------------------------------------------------------- 1 | # Piper 2 | 3 | ## inputs 4 | 5 | Piper is given as input a DNN workload (model), which is represented as a DAG. Each Node corresponds to a single DNN 6 | layer. Each Edge (u, v) corresponds to a data transfer: 7 | node v requires the output of u as its input. The input graph is annotated with a number of attribtues related to 8 | compute times, communication costs and memory usage. These attribtues should be profiled or estimated. 9 | 10 | Furthermore, Piper is given: the number of devices (K),available memory per device (M), the network bandwidth (B), and 11 | the maximum allowed number of microbatches in a batch (N). 12 | 13 | In short: 14 | 15 | - a Directed Acyclic Graph G=(V,E): layer graph of a DNN, 16 | - for every edge (u,v), an associated communication cost c(u,v) (in bytes). 
17 | - number K of devices, memory M per device, network bandwidth B, [maximum] number N of microbatches in a batch, 18 | - for every node/layer v, and every degree t=1,...,K of tensor parallelism: a list T(v,t) of Tensor Model Parallelism 19 | Configurations(TMPC) 20 | 21 | each TMPC contains: 22 | 23 | - a Time-Per-Sample X.p that consists of compute time and communication time in tensor parallelism (data transfers 24 | between the *t* devices) 25 | - for every incoming edge (u,v), a communication cost X.c_forward(u,v), 26 | - an analogous quantity X.c_backward(v,w) for every outgoing edge(u,w), 27 | - the size X.w of parameters/weights on each device, 28 | - memory usage: X.a and X.b such that if layer v is located on a stage with data-parallel degree *d* and data-parallel 29 | degrees of this and later stages sum up to s, then the memory consumption on each device is X.a*[s/d] + X.b ( 30 | PipeDream) 31 | 32 | ## Outputs 33 | 34 | - a collection of contiguous subgraphs/stages S1, S2,...,S_l. and for each stage i: 35 | - the degree d_i of data parallelism, 36 | - the degree t_i of tensor paralleism, 37 | - for each node/layer v in S_i: the index of the TMPC selected in T(v,t_i). 38 | 39 | All nodes in a stage use the same *d* and *t*, but different nodes (even if they are the same type) can use different 40 | TMPCs. For example, some but not all layers in a stage might employ activation recomputation. 41 | 42 | ## Search Space 43 | 44 | Piper find partitions of G, also called stages. Each node/layer is assigned to only one stage. Each stage is executed 45 | using some number *d\*t* devices. `d` is the data parallelism degree, 46 | `t` is the tensor parallelism degree. Also Piper will specify how the tensor parallelism is carried out. 47 | 48 | To determine how to tensor-parallelize every operator in a DNN computation graph is very complex. So Piper reduce this 49 | task to the problem of coming up with good tensor parallelization techniques *for individual layers*. Solving this 50 | problem is not a goal of Piper. Piper directly use some good configurations like Megatron-LM after solving *d* and *t* 51 | out. 52 | 53 | ## Algorithm 54 | 55 | Based on dynamic programming on downsets, extending prior work `DNN partitioning` 56 | 57 | ## Running time 58 | 59 | The compution complexity is O(|V|^2 NK^2), V is the number of nodes, K is the number of devices, and N <= K is the 60 | maximum sum of data-parallel degrees. 61 | 62 | ![img.png](runtime.png) 63 | Bert-32, batch-size=512, devices=2048 : 13 minutes, Bert-64, batch-size=2048,devices=2048 : 2 hour. 64 | 65 | This algorithm can be parallelized, but has not been done yet. 66 | 67 | ## Cost Model Performance 68 | 69 | very close to real performance. 70 | ![img.png](costmodel.png) -------------------------------------------------------------------------------- /torchgpipe_/tbd_resnet50.py: -------------------------------------------------------------------------------- 1 | ''' 2 | An example of using torchgpipe. 3 | 4 | TODO: This file is currently not executing correctly due to multiple submodule exists. 5 | 6 | torchgpipe only support model implemented by nn.Sequential. 
7 | 8 | ''' 9 | from collections import OrderedDict 10 | 11 | import torch 12 | import torch.nn as nn 13 | import torchvision 14 | import torchvision.transforms as transforms 15 | from torchvision.models.resnet import ResNet, Bottleneck 16 | 17 | num_classes = 10 18 | 19 | from torchgpipe import GPipe 20 | from torchgpipe.balance import balance_by_time 21 | 22 | 23 | class ResNet50(ResNet): 24 | def __init__(self, *args, **kwargs): 25 | super(ResNet50, self).__init__( 26 | Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs) 27 | 28 | self.seq1 = nn.Sequential( 29 | self.conv1, 30 | self.bn1, 31 | self.relu, 32 | self.maxpool, 33 | 34 | self.layer1, 35 | self.layer2 36 | ) 37 | 38 | self.seq2 = nn.Sequential( 39 | self.layer3, 40 | self.layer4, 41 | self.avgpool, 42 | ) 43 | 44 | def forward(self, x): 45 | x = self.seq2(self.seq1(x)) 46 | return self.fc(x.view(x.size(0), -1)) 47 | 48 | 49 | def flatten_sequential(module): 50 | def _flatten(module): 51 | for name, child in module.named_children(): 52 | if isinstance(child, nn.Sequential): 53 | for sub_name, sub_child in _flatten(child): 54 | yield f'{name}_{sub_name}', sub_child 55 | else: 56 | yield name, child 57 | 58 | return nn.Sequential(OrderedDict(_flatten(module))) 59 | 60 | 61 | if __name__ == '__main__': 62 | # Data Preparation 63 | transform = transforms.Compose( 64 | [transforms.ToTensor(), 65 | transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) 66 | 67 | batch_size = 4 68 | 69 | trainset = torchvision.datasets.CIFAR10(root='./data', train=True, 70 | download=True, transform=transform) 71 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, 72 | shuffle=True, num_workers=2) 73 | 74 | classes = ('plane', 'car', 'bird', 'cat', 75 | 'deer', 'dog', 'frog', 'horse', 'ship', 'truck') 76 | 77 | # Model Definition 78 | model = ResNet50() 79 | 80 | model = flatten_sequential(model) 81 | 82 | optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9) 83 | 84 | # Get device_numbers 85 | partitions = torch.cuda.device_count() 86 | 87 | # Prepare sample data 88 | sample = torch.rand(4, 3, 224, 224) 89 | 90 | # balance the partitions by time. 
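    # balance_by_time profiles `sample` through the flattened model for a short time and
    # returns per-partition layer counts chosen so the partitions take roughly equal time;
    # balance_by_size would balance on memory footprint instead.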
91 | balance = balance_by_time(partitions, model, sample) 92 | print("The balance is: ", balance) 93 | model = GPipe(model, balance, chunks=8) 94 | 95 | # loss definition 96 | criterion = torch.nn.CrossEntropyLoss() 97 | 98 | for epoch in range(1): # loop over the dataset multiple times 99 | 100 | running_loss = 0.0 101 | for i, data in enumerate(trainloader, 0): 102 | # get the inputs; data is a list of [inputs, labels] 103 | inputs, labels = data 104 | 105 | # zero the parameter gradients 106 | optimizer.zero_grad() 107 | 108 | # forward + backward + optimize 109 | outputs = model(inputs) 110 | loss = criterion(outputs, labels) 111 | loss.backward() 112 | optimizer.step() 113 | 114 | # print statistics 115 | running_loss += loss.item() 116 | if i % 2000 == 1999: # print every 2000 mini-batches 117 | print('[%d, %5d] loss: %.3f' % 118 | (epoch + 1, i + 1, running_loss / 2000)) 119 | running_loss = 0.0 120 | 121 | print('Finished Training') 122 | -------------------------------------------------------------------------------- /P^2/README.md: -------------------------------------------------------------------------------- 1 | Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems For Deep Learning 2 | 3 | From University of Cambridge and DeepMind 4 | 5 | # Summary 6 | 7 | - A new method that reduce the space of software-to-hardware mapping. 8 | 9 | - Offer a synthesis framework %P^2% that can decompose reductions over one or more parallelism axes to sequences of 10 | collectives in a hierarchy- and mapping-aware way. 11 | 12 | - For 69 % of parallelism placements, proposed framework synthesized programs **outperform** default all-reduce 13 | implementation. 14 | 15 | - Use a simulator that exceeds 90% top-10 accuracy to help complement synthesis tool, which therefore reduces the need 16 | for massive evaluations of synthesis results to determine a small set of optimal programs and mappings. 17 | 18 | ## Some concepts explanations 19 | 20 | - **Parallelism axis**: each form of parallelism 21 | - **Parallelism placement**: How we map parallelism over devices. This decides the communication groups as well as 22 | communication cost. 23 | - **reduction axis**: Reduction along the axis of model parallelism or data parallelism. 24 | 25 | ## Contributions 26 | 27 | - Parallelism placement synthesis: Given parallelism axes, reduction axes and a hierarchical system topology, P^2 can 28 | automatically synthesizes hierarchical parallelism placements, where parallelism placement is modelled as a 29 | parallelism matrix mapping from parallelism axes to the system hierarchy. 30 | - Reduction strategy synthesis: For each parallelism placement, P^2 utilizes the system hierarchy to further synthesize 31 | a wide variety of reduction strategies to implement reductions using common collective operations. 32 | - Synthesis hierarchy: parallelism matrix determines a candidate parallelism placement, and it can be used by the 33 | synthesizer to massively reduce the space of programs. 34 | - Simulator: Use a simulator to help evaluate reduction strategies produced by P^2. 35 | 36 | ## Key design 37 | 38 | 1. Parallelism Placement 39 | 40 | Critical idea: To reduce search space, the critical idea of P^2 is to **partition parallelism axes over the system 41 | hierarchy to generate topology-aware parallelism placements**, while still being able to systematically generate a 42 | wide range of parallelism placements. 
43 | 44 | Results: result of parallelism placement synthesis is a parallelism matrix, where each element is a parallelism 45 | factor representing the number of a specific level in the hierarchy that a parallelism form splits the computation 46 | across. 47 | 48 | Parallelism matrices decide communication requirements. 49 | 50 | 2. Reduction Strategy 51 | 52 | For each parallelism matrix, P^2 further synthesizes topology-aware reduction strategies using common collective 53 | operations, which allows us to find the optimal reduction strategy for any given parallelism matrix. 54 | 55 | Below figure shows different reduction strategy on a produced placement. 56 | 57 | ![Figure3](../Image/P2/1.png) 58 | 59 | 3. Formalism of Collective Operations 60 | 61 | **Semantically Invalid reduction steps**:Reduction steps which result in device states that can never reach the final 62 | wanted state. 63 | 64 | P^2 provides a formalism of common collective operations that captures semantic correctness and rules out 65 | semantically invalid programs, massively reducing the synthesis space. 66 | 67 | Each device state is defined as a **state matrix** describing what kind of data a device has. 68 | 69 | 4. Reduction Communication Patterns 70 | 71 | To synthesize reduction strategies effectively, P^2 uses a domain-specific language(DSL) that explores the hierarchy 72 | to generate hierarchical communication patterns. 73 | 74 | 5. Synthesis Hierarchy 75 | 76 | Exploration with multiple reduction axes. 77 | 78 | ## Program Synthesis Algorithm 79 | 80 | ## Results 81 | 82 | - The performance of AllReduce differs significantly among parallelism matrices, up to 448.5x. (Intra-node communication 83 | is faster than inter-node communication) 84 | 85 | - Pruning techniques are effective for the synthesizer to achieve fast synthesis time. 86 | 87 | - If the reduction axes can be put within one node, then a **single step AllReduce** inside that node is the most 88 | effective reduction due to fast local bandwidth. 89 | 90 | - Synthesized programs can help alleviate the impact of parallelism placement. 91 | 92 | - For inter-node reduction, a **topology-aware reduction program** tends to outperform a single step AllReduce, with speedup 93 | on average 1.28x , upto 2.04x. 94 | -------------------------------------------------------------------------------- /utils/bert_cls.py: -------------------------------------------------------------------------------- 1 | ''' 2 | An implementation of Bert_CLS model 3 | ''' 4 | 5 | import torch 6 | from torch import nn 7 | from transformers import BertModel 8 | 9 | 10 | class BertPreTrainingHeads(nn.Module): 11 | def __init__(self, hidden_size, vocab_size, hidden_act=nn.GELU()): 12 | super().__init__() 13 | self.predictions = BertLMPredictionHead( 14 | hidden_size, vocab_size, hidden_act) 15 | self.seq_relationship = nn.Linear(hidden_size, 2) 16 | 17 | def forward(self, sequence_output, pooled_output): 18 | prediction_scores = self.predictions(sequence_output) 19 | seq_relationship_scores = self.seq_relationship(pooled_output) 20 | return prediction_scores, seq_relationship_scores 21 | 22 | 23 | class BertLMPredictionHead(nn.Module): 24 | def __init__(self, hidden_size, vocab_size, hidden_act=nn.GELU()): 25 | super().__init__() 26 | self.transform = BertPredictionHeadTransform(hidden_size, hidden_act) 27 | 28 | # The output weights are the same as the input embeddings, but there is 29 | # an output-only bias for each token. 
30 | self.decoder = nn.Linear(hidden_size, vocab_size, bias=False) 31 | 32 | self.output_bias = nn.Parameter(torch.zeros(vocab_size)) 33 | 34 | # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings` 35 | self.decoder.bias = self.output_bias 36 | 37 | 38 | def forward(self, sequence_output): 39 | sequence_output = self.transform(sequence_output) 40 | sequence_output = self.decoder(sequence_output) 41 | return sequence_output 42 | 43 | 44 | class BertPredictionHeadTransform(nn.Module): 45 | def __init__(self, hidden_size, hidden_act=nn.GELU()): 46 | super().__init__() 47 | self.dense = nn.Linear(hidden_size, hidden_size) 48 | self.transform_act_fn = hidden_act 49 | self.LayerNorm = nn.LayerNorm(hidden_size) 50 | 51 | def forward(self, sequence_output): 52 | sequence_output = self.dense(sequence_output) 53 | sequence_output = self.transform_act_fn(sequence_output) 54 | sequence_output = self.LayerNorm(sequence_output) 55 | return sequence_output 56 | 57 | 58 | class BertLargeCls(nn.Module): 59 | def __init__(self, config): 60 | super().__init__() 61 | 62 | mlm_criterion = nn.CrossEntropyLoss(reduction="none") 63 | self.max_predictions_per_seq = 80 64 | 65 | def get_masked_lm_loss( 66 | logit_blob, 67 | masked_lm_positions, 68 | masked_lm_labels, 69 | label_weights, 70 | max_predictions_per_seq, 71 | ): 72 | # gather valid position indices 73 | logit_blob = torch.gather( 74 | logit_blob, 75 | index=masked_lm_positions.unsqueeze(2).to( 76 | dtype=torch.int64).repeat(1, 1, 30522), 77 | dim=1, 78 | ) 79 | logit_blob = torch.reshape(logit_blob, [-1, 30522]) 80 | label_id_blob = torch.reshape(masked_lm_labels, [-1]) 81 | 82 | # The `positions` tensor might be zero-padded (if the sequence is too 83 | # short to have the maximum number of predictions). The `label_weights` 84 | # tensor has a value of 1.0 for every real prediction and 0.0 for the 85 | # padding predictions. 
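            # The weighted mean below divides the sum of per-token losses weighted by
            # label_weights by the sum of the weights, so padded prediction slots
            # (weight 0.0) contribute nothing to the final loss.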
86 | pre_example_loss = mlm_criterion(logit_blob, label_id_blob.long()) 87 | pre_example_loss = torch.reshape( 88 | pre_example_loss, [-1, max_predictions_per_seq]) 89 | sum_label_weight = torch.sum(label_weights, dim=-1) 90 | sum_label_weight = sum_label_weight / label_weights.shape[0] 91 | numerator = torch.sum(pre_example_loss * label_weights) 92 | denominator = torch.sum(label_weights) + 1e-5 93 | loss = numerator / denominator 94 | return loss 95 | 96 | self.bert = BertModel(config) 97 | self.cls = BertPreTrainingHeads(config.hidden_size, config.vocab_size) 98 | self.ns_criterion = nn.CrossEntropyLoss(reduction="mean") 99 | self.masked_lm_criterion = get_masked_lm_loss 100 | 101 | def forward(self, labels, id, pos, weight, inputs, return_outputs=False): 102 | outputs = self.bert(**inputs) 103 | prediction_scores, seq_relationship_scores = self.cls( 104 | outputs[0], outputs[1]) # last_hidden_state, pooler_output 105 | next_sentence_loss = self.ns_criterion( 106 | seq_relationship_scores.view(-1, 2), labels.long().view(-1) 107 | ) 108 | masked_lm_loss = self.masked_lm_criterion( 109 | prediction_scores, pos, id, weight, max_predictions_per_seq=self.max_predictions_per_seq 110 | ) 111 | 112 | total_loss = next_sentence_loss + masked_lm_loss 113 | return (total_loss, outputs) if return_outputs else total_loss -------------------------------------------------------------------------------- /huawei/tensor_opt.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================ 15 | """ 16 | Resnet50_distributed_training 17 | """ 18 | import os 19 | import mindspore.nn as nn 20 | from mindspore import dtype as mstype 21 | import mindspore.ops as ops 22 | import mindspore.dataset as ds 23 | import mindspore.dataset.vision.c_transforms as vision 24 | import mindspore.dataset.transforms.c_transforms as C 25 | from mindspore.communication import init, get_rank, get_group_size 26 | from mindspore import Tensor, Model 27 | from mindspore.nn import Momentum 28 | from mindspore.context import ParallelMode 29 | from mindspore import context 30 | from mindspore.train.callback import LossMonitor 31 | from resnet import resnet50 32 | 33 | context.set_context(mode=context.GRAPH_MODE, device_target="GPU") 34 | init("nccl") 35 | 36 | 37 | def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1): # pylint: disable=missing-docstring 38 | resize_height = 224 39 | resize_width = 224 40 | rescale = 1.0 / 255.0 41 | shift = 0.0 42 | 43 | # get rank_id and rank_size 44 | rank_id = get_rank() 45 | rank_size = get_group_size() 46 | data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id) 47 | 48 | # define map operations 49 | random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4)) 50 | random_horizontal_op = vision.RandomHorizontalFlip() 51 | resize_op = vision.Resize((resize_height, resize_width)) 52 | rescale_op = vision.Rescale(rescale, shift) 53 | normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023)) 54 | changeswap_op = vision.HWC2CHW() 55 | type_cast_op = C.TypeCast(mstype.int32) 56 | 57 | c_trans = [random_crop_op, random_horizontal_op] 58 | c_trans += [resize_op, rescale_op, normalize_op, changeswap_op] 59 | 60 | # apply map operations on images 61 | data_set = data_set.map(operations=type_cast_op, input_columns="label") 62 | data_set = data_set.map(operations=c_trans, input_columns="image") 63 | 64 | # apply shuffle operations 65 | data_set = data_set.shuffle(buffer_size=10) 66 | 67 | # apply batch operations 68 | data_set = data_set.batch(batch_size=batch_size, drop_remainder=True) 69 | 70 | # apply repeat operations 71 | data_set = data_set.repeat(repeat_num) 72 | 73 | return data_set 74 | 75 | 76 | class SoftmaxCrossEntropyExpand(nn.Cell): # pylint: disable=missing-docstring 77 | def __init__(self, sparse=False): 78 | super(SoftmaxCrossEntropyExpand, self).__init__() 79 | self.exp = ops.Exp() 80 | self.sum = ops.ReduceSum(keep_dims=True) 81 | self.onehot = ops.OneHot() 82 | self.on_value = Tensor(1.0, mstype.float32) 83 | self.off_value = Tensor(0.0, mstype.float32) 84 | self.div = ops.RealDiv() 85 | self.log = ops.Log() 86 | self.sum_cross_entropy = ops.ReduceSum(keep_dims=False) 87 | self.mul = ops.Mul() 88 | self.mul2 = ops.Mul() 89 | self.mean = ops.ReduceMean(keep_dims=False) 90 | self.sparse = sparse 91 | self.max = ops.ReduceMax(keep_dims=True) 92 | self.sub = ops.Sub() 93 | self.eps = Tensor(1e-24, mstype.float32) 94 | 95 | def construct(self, logit, label): # pylint: disable=missing-docstring 96 | logit_max = self.max(logit, -1) 97 | exp = self.exp(self.sub(logit, logit_max)) 98 | exp_sum = self.sum(exp, -1) 99 | softmax_result = self.div(exp, exp_sum) 100 | if self.sparse: 101 | label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value) 102 | 103 | softmax_result_log = self.log(softmax_result + self.eps) 104 | loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1) 
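        # Cross-entropy: negate the summed label * log(softmax) terms and average over the batch.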
105 | loss = self.mul2(ops.scalar_to_array(-1.0), loss) 106 | loss = self.mean(loss, -1) 107 | 108 | return loss 109 | 110 | 111 | def test_train_cifar(epoch_size=10): # pylint: disable=missing-docstring 112 | context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True, 113 | auto_parallel_search_mode='dynamic_programming') 114 | loss_cb = LossMonitor() 115 | data_path = os.getenv('DATA_PATH') 116 | dataset = create_dataset(data_path) 117 | batch_size = 32 118 | num_classes = 10 119 | net = resnet50(batch_size, num_classes) 120 | loss = SoftmaxCrossEntropyExpand(sparse=True) 121 | opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) 122 | model = Model(net, loss_fn=loss, optimizer=opt) 123 | model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=False) 124 | -------------------------------------------------------------------------------- /huawei/double_recursive.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================ 15 | """ 16 | Resnet50_distributed_training 17 | """ 18 | import os 19 | import mindspore.nn as nn 20 | from mindspore import dtype as mstype 21 | import mindspore.ops as ops 22 | import mindspore.dataset as ds 23 | import mindspore.dataset.vision.c_transforms as vision 24 | import mindspore.dataset.transforms.c_transforms as C 25 | from mindspore.communication import init, get_rank, get_group_size 26 | from mindspore import Tensor, Model 27 | from mindspore.nn import Momentum 28 | from mindspore.context import ParallelMode 29 | from mindspore import context 30 | from mindspore.train.callback import LossMonitor 31 | from resnet import resnet50 32 | 33 | context.set_context(mode=context.GRAPH_MODE, device_target="GPU") 34 | init("nccl") 35 | 36 | 37 | def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1): # pylint: disable=missing-docstring 38 | resize_height = 224 39 | resize_width = 224 40 | rescale = 1.0 / 255.0 41 | shift = 0.0 42 | 43 | # get rank_id and rank_size 44 | rank_id = get_rank() 45 | rank_size = get_group_size() 46 | data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id) 47 | 48 | # define map operations 49 | random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4)) 50 | random_horizontal_op = vision.RandomHorizontalFlip() 51 | resize_op = vision.Resize((resize_height, resize_width)) 52 | rescale_op = vision.Rescale(rescale, shift) 53 | normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023)) 54 | changeswap_op = vision.HWC2CHW() 55 | type_cast_op = C.TypeCast(mstype.int32) 56 | 57 | c_trans = [random_crop_op, random_horizontal_op] 58 | c_trans += [resize_op, rescale_op, normalize_op, changeswap_op] 59 | 60 | # apply map operations on images 61 | data_set = 
data_set.map(operations=type_cast_op, input_columns="label") 62 | data_set = data_set.map(operations=c_trans, input_columns="image") 63 | 64 | # apply shuffle operations 65 | data_set = data_set.shuffle(buffer_size=10) 66 | 67 | # apply batch operations 68 | data_set = data_set.batch(batch_size=batch_size, drop_remainder=True) 69 | 70 | # apply repeat operations 71 | data_set = data_set.repeat(repeat_num) 72 | 73 | return data_set 74 | 75 | 76 | class SoftmaxCrossEntropyExpand(nn.Cell): # pylint: disable=missing-docstring 77 | def __init__(self, sparse=False): 78 | super(SoftmaxCrossEntropyExpand, self).__init__() 79 | self.exp = ops.Exp() 80 | self.sum = ops.ReduceSum(keep_dims=True) 81 | self.onehot = ops.OneHot() 82 | self.on_value = Tensor(1.0, mstype.float32) 83 | self.off_value = Tensor(0.0, mstype.float32) 84 | self.div = ops.RealDiv() 85 | self.log = ops.Log() 86 | self.sum_cross_entropy = ops.ReduceSum(keep_dims=False) 87 | self.mul = ops.Mul() 88 | self.mul2 = ops.Mul() 89 | self.mean = ops.ReduceMean(keep_dims=False) 90 | self.sparse = sparse 91 | self.max = ops.ReduceMax(keep_dims=True) 92 | self.sub = ops.Sub() 93 | self.eps = Tensor(1e-24, mstype.float32) 94 | 95 | def construct(self, logit, label): # pylint: disable=missing-docstring 96 | logit_max = self.max(logit, -1) 97 | exp = self.exp(self.sub(logit, logit_max)) 98 | exp_sum = self.sum(exp, -1) 99 | softmax_result = self.div(exp, exp_sum) 100 | if self.sparse: 101 | label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value) 102 | 103 | softmax_result_log = self.log(softmax_result + self.eps) 104 | loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1) 105 | loss = self.mul2(ops.scalar_to_array(-1.0), loss) 106 | loss = self.mean(loss, -1) 107 | 108 | return loss 109 | 110 | 111 | def test_train_cifar(epoch_size=10): # pylint: disable=missing-docstring 112 | context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True, 113 | auto_parallel_search_mode='recursive_programming') 114 | loss_cb = LossMonitor() 115 | data_path = os.getenv('DATA_PATH') 116 | dataset = create_dataset(data_path) 117 | batch_size = 32 118 | num_classes = 10 119 | net = resnet50(batch_size, num_classes) 120 | loss = SoftmaxCrossEntropyExpand(sparse=True) 121 | opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) 122 | model = Model(net, loss_fn=loss, optimizer=opt) 123 | model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=False) 124 | -------------------------------------------------------------------------------- /Models.md: -------------------------------------------------------------------------------- 1 | |Time | Name | Organization | Model size | Dataset size | Devices| Backbone| paper | Parallel Methods | Optimizer | 2 | | ---|--- | --- | --- | --- | --- | --- | --- | --- |---| 3 | |2022.4.5 | PaLM | Google | 540 B | 780 B tokens | 2x TPU v4 pod(2*3072)| Singleton | [arxiv](https://arxiv.org/pdf/2204.02311.pdf) | Pathways, inter-pod: DP=2。intra-pod: DP=256,TP=12 | Adafactor 4 | |2022.2.4 | Megatron-Turing NLG | Microsoft & Nvidia | 530 B |339 B tokens | 560 80GB A100| Singleton| [arxiv](https://arxiv.org/pdf/2201.11990.pdf) |ZeRO-D=2, T=8, P=35| Adam 1.6e-4, beta1=0.9, beta2=0.95 5 | |2021.12.9 | GLaM | Google Brain | 1.162T | 1.6 T tokens | [1024 TPUv3?](\textsc{https://ai.googleblog.com/2021/12/general-and-scalable-parallelization.html}) |GShard MoE Transformer | 
[blog](https://ai.googleblog.com/2021/12/more-efficient-in-context-learnning-with.html) | GShard, 64 experts per MoE layer with 32 MoE layers in total | Unknown 6 | |2021.12.8 |Gopher | Google DeepMind | 280 B | 5GB?? | 4096 16 GB TPUv3 | Singleton | [deepmind](https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf) | T,D,P ZeRO-stage1. details ungiven. Maybe given by [Automap](https://arxiv.org/pdf/2112.02958.pdf)| Adam in pre-traininig, adafactor in fine-tuning 7 | |2021.12.8| Wenxin | Baidu and PengCheng Lab | 260 B | 68 datasets | 1920 Ascend 910 NPU |ERNIE 3.0 Titan| [arxiv](https://arxiv.org/pdf/2112.02752.pdf) | Resource-aware acceleration. D=4, 4D parallelism (DP+MP+PP+ZeRO) | Adam 1e-4 beta1=0.9 beta2=0.95 8 | |2021.10.25| M6-10T | Alibaba | 10T | 16 GB | 512 32GB V100 | M6 | [arxiv](https://arxiv.org/pdf/2110.03888.pdf) | ZeRO-Stage3,Offload, T, P, E | possibly Adafactor (Not given) 9 | |2021.9.28 | Yuan 1.0 | Inspur | 245.7 B | 5 TB | 2128 GPUs (type unknown)| Singleton | [arxiv](https://arxiv.org/pdf/2110.04725.pdf) | T=8, P=38, D=7 | Adam 1.6e-4, beta1=0.9, beta2=0.95 10 | |2021.8 | Jurassic-1 | AI21 | 178B | 300B tokens | Thousands of GPU | Singleton | [tech paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) | Megatron and ZeRO | batch=3.2M tokens, lr: 0.6e-4 11 | |2021.5 |Wudao 2.0 | BAAI | 1.75 T | 4.9 TB | ? | Cogview, CPM | ? | Zero-Stage-2, expert ? | ? 12 | |2021.4.26 | Pangu-alpha | Huawei and PengCheng Lab | 207 B | 1.1 TB | 2048 Ascend 910 NPU| Singleton | [arxiv](https://arxiv.org/pdf/2104.12369.pdf) | T=8, P=16, ZeRO-Stage1-D=16 | Adam 2e-4, beta1=0.9, beta2=0.95 13 | |2020.5 | GPT3 | Open-AI | 175 B | 570 GB | 10000 V100 GPUs? | Singleton |[arxiv](https://arxiv.org/pdf/2005.14165.pdf) | Model and Data parallelism, details unknown| 0.6e10-4, beta1=0.9, beta2=0.95 14 | 15 | ## PaLM 16 | First model that uses Pathways to train. 17 | 18 | | layers | hidden_size | num_heads | FFN_size | seq_length 19 | | ------ | ----------- | --------- | --------- | -----| 20 | | 118 | 18432 | 48 | 73728 | 2048| 21 | 22 | ## Megatron-Turing NLG 23 | 24 | Transformer decoder. 25 | 26 | Use PipeDream pipeline. 27 | 28 | | layers | hidden_size | num_heads | FFN_size | seq_length 29 | | ------ | ----------- | --------- | --------- | -----| 30 | | 105 | 20480 | 128 | ? | 2048| 31 | 32 | ## GLaM 33 | 34 | It only activates a subnetwork of 97B (8%) parameter per token during inference. 35 | 36 | The power consumption is about 1/3 of GPT-3's. 37 | 38 | Use Zero-shot and one-shot setting where the tasks are never seen during training. 39 | 40 | In evaluation, outperform GPT-3 and use less training time to converge. 41 | 42 | ## Gopher 43 | 44 | Perspective: **It is still effective to enlarge model size.** 45 | 46 | ## Wenxin (Ernie 3.0 Titan) 47 | 48 | Universal Representation Module and Task-specific representation module. 49 | 50 | ![wenxin.png](Image/models/wenxin.png) 51 | 52 | Universal Representation Module setup: Scales up the FFN_size to increase model capacity. 53 | 54 | | layers | hidden_size | num_heads | FFN_size | 55 | | ------ | ----------- | --------- | --------- | 56 | | 48 | 12288 | 192 | 196608 | 57 | 58 | Task-specific representation module setup: 59 | 60 | | layers | hidden_size | num_heads | FFN_size | 61 | | ------ | ----------- | --------- | ---------- | 62 | | 12 | 768 | 12 | unknown | 63 | 64 | 65 | ## Yuan 1.0 66 | 67 | 76 layers, hidden size is 16384. 
Global batch size is 3360 and micro batch size is 1. Sequence length is 2048. 68 | 69 | The dataset consists of filtered data: crawled web pages (4200 GB), public dataset (268 GB), Encyclopedia(10.5 GB) and 70 | Books(655 GB). 71 | 72 | Trained using zero-shot and few-shot. 73 | 74 | ## M6-10T 75 | 76 | | layers | hidden_size | num_heads | FFN_size | expert_num | training_batch_size 77 | | ------ | ----------- | --------- | -------- | ---| --- | 78 | | 48 | 1024 | 16 | 9984 | 10240 experts, 80 prototypes | 8 per GPU 79 | 80 | Use Pseudo-to-Real training skill, and offloading to train M6-10T on 512 V100 in ten days. 81 | 82 | This model has no downstream test now. 83 | 84 | ## Jurassic-1 Jumbo 85 | 86 | has a large vocabulary up to 256K while GPT-3 has 50K. 87 | 88 | 89 | | layers | hidden_size | num_heads | FFN_size | 90 | | ------ | ----------- | --------- | -------- | 91 | | 76 | 13824 | 96 | 65536 | 92 | 93 | ## Pangu-alpha 94 | 95 | | layers | hidden_size | num_heads | FFN_size | 96 | | ------ | ----------- | --------- | -------- | 97 | | 64 | 16384 | 128 | 65536 | 98 | 99 | ## GPT-3 175B 100 | 101 | | layers | hidden_size | num_heads | batch_size | 102 | | ------ | ----------- | --------- | ---------- | 103 | | 96 | 12288 | 96 | 3.2 M tokens | 104 | -------------------------------------------------------------------------------- /torchgpipe_/transformer_.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This example is from https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html 3 | ''' 4 | 5 | import sys 6 | import math 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | import tempfile 11 | from torch.nn import TransformerEncoder, TransformerEncoderLayer 12 | 13 | import torch 14 | from torchtext.datasets import WikiText2 15 | from torchtext.data.utils import get_tokenizer 16 | from torchtext.vocab import build_vocab_from_iterator 17 | import time 18 | 19 | if torch.cuda.device_count() < 2: 20 | print('Need at least two GPU devices for this tutorial') 21 | sys.exit(0) 22 | 23 | 24 | class Encoder(nn.Module): 25 | def __init__(self, ntoken, ninp, dropout=0.5): 26 | super(Encoder, self).__init__() 27 | self.pos_encoder = PositionalEncoding(ninp, dropout) 28 | self.encoder = nn.Embedding(ntoken, ninp) 29 | self.ninp = ninp 30 | self.init_weights() 31 | 32 | def init_weights(self): 33 | initrange = 0.1 34 | self.encoder.weight.data.uniform_(-initrange, initrange) 35 | 36 | def forward(self, src): 37 | # Need (S, N) format for encoder. 38 | src = src.t() 39 | src = self.encoder(src) * math.sqrt(self.ninp) 40 | return self.pos_encoder(src) 41 | 42 | 43 | class Decoder(nn.Module): 44 | def __init__(self, ntoken, ninp): 45 | super(Decoder, self).__init__() 46 | self.decoder = nn.Linear(ninp, ntoken) 47 | self.init_weights() 48 | 49 | def init_weights(self): 50 | initrange = 0.1 51 | self.decoder.bias.data.zero_() 52 | self.decoder.weight.data.uniform_(-initrange, initrange) 53 | 54 | def forward(self, inp): 55 | # Need batch dimension first for output of pipeline. 
56 | return self.decoder(inp).permute(1, 0, 2) 57 | 58 | 59 | class PositionalEncoding(nn.Module): 60 | 61 | def __init__(self, d_model, dropout=0.1, max_len=5000): 62 | super(PositionalEncoding, self).__init__() 63 | self.dropout = nn.Dropout(p=dropout) 64 | 65 | pe = torch.zeros(max_len, d_model) 66 | position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) 67 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) 68 | pe[:, 0::2] = torch.sin(position * div_term) 69 | pe[:, 1::2] = torch.cos(position * div_term) 70 | pe = pe.unsqueeze(0).transpose(0, 1) 71 | self.register_buffer('pe', pe) 72 | 73 | def forward(self, x): 74 | x = x + self.pe[:x.size(0), :] 75 | return self.dropout(x) 76 | 77 | 78 | def data_process(raw_text_iter): 79 | data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter] 80 | return torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) 81 | 82 | 83 | def batchify(data, bsz): 84 | # Divide the dataset into bsz parts. 85 | nbatch = data.size(0) // bsz 86 | # Trim off any extra elements that wouldn't cleanly fit (remainders). 87 | data = data.narrow(0, 0, nbatch * bsz) 88 | # Evenly divide the data across the bsz batches. 89 | data = data.view(bsz, -1).t().contiguous() 90 | return data.to(device) 91 | 92 | 93 | def get_total_params(module: torch.nn.Module): 94 | total_params = 0 95 | for param in module.parameters(): 96 | total_params += param.numel() 97 | return total_params 98 | 99 | 100 | def train(): 101 | model.train() # Turn on the train mode 102 | total_loss = 0. 103 | start_time = time.time() 104 | ntokens = len(vocab) 105 | 106 | # Train only for 50 batches to keep script execution time low. 107 | nbatches = min(50 * bptt, train_data.size(0) - 1) 108 | 109 | for batch, i in enumerate(range(0, nbatches, bptt)): 110 | data, targets = get_batch(train_data, i) 111 | optimizer.zero_grad() 112 | # Since the Pipe is only within a single host and process the ``RRef`` 113 | # returned by forward method is local to this node and can simply 114 | # retrieved via ``RRef.local_value()``. 115 | output = model(data).local_value() 116 | # Need to move targets to the device where the output of the 117 | # pipeline resides. 118 | loss = criterion(output.view(-1, ntokens), targets.cuda(1)) 119 | loss.backward() 120 | torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) 121 | optimizer.step() 122 | 123 | total_loss += loss.item() 124 | log_interval = 10 125 | if batch % log_interval == 0 and batch > 0: 126 | cur_loss = total_loss / log_interval 127 | elapsed = time.time() - start_time 128 | print('| epoch {:3d} | {:5d}/{:5d} batches | ' 129 | 'lr {:02.2f} | ms/batch {:5.2f} | ' 130 | 'loss {:5.2f} | ppl {:8.2f}'.format( 131 | epoch, batch, nbatches // bptt, scheduler.get_lr()[0], 132 | elapsed * 1000 / log_interval, 133 | cur_loss, math.exp(cur_loss))) 134 | total_loss = 0 135 | start_time = time.time() 136 | 137 | 138 | def evaluate(eval_model, data_source): 139 | eval_model.eval() # Turn on the evaluation mode 140 | total_loss = 0. 141 | ntokens = len(vocab) 142 | # Evaluate only for 50 batches to keep script execution time low. 143 | nbatches = min(50 * bptt, data_source.size(0) - 1) 144 | with torch.no_grad(): 145 | for i in range(0, nbatches, bptt): 146 | data, targets = get_batch(data_source, i) 147 | output = eval_model(data).local_value() 148 | output_flat = output.view(-1, ntokens) 149 | # Need to move targets to the device where the output of the 150 | # pipeline resides. 
151 | total_loss += len(data) * criterion(output_flat, targets.cuda(1)).item() 152 | return total_loss / (len(data_source) - 1) 153 | 154 | 155 | if __name__ == "__main__": 156 | train_iter = WikiText2(split='train') 157 | tokenizer = get_tokenizer('basic_english') 158 | vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=[""]) 159 | vocab.set_default_index(vocab[""]) 160 | 161 | train_iter, val_iter, test_iter = WikiText2() 162 | train_data = data_process(train_iter) 163 | val_data = data_process(val_iter) 164 | test_data = data_process(test_iter) 165 | 166 | device = torch.device("cuda") 167 | batch_size = 20 168 | eval_batch_size = 10 169 | train_data = batchify(train_data, batch_size) 170 | val_data = batchify(val_data, eval_batch_size) 171 | test_data = batchify(test_data, eval_batch_size) 172 | 173 | bptt = 25 174 | 175 | 176 | def get_batch(source, i): 177 | seq_len = min(bptt, len(source) - 1 - i) 178 | data = source[i:i + seq_len] 179 | target = source[i + 1:i + 1 + seq_len].view(-1) 180 | # Need batch dimension first for pipeline parallelism. 181 | return data.t(), target 182 | 183 | 184 | ntokens = len(vocab) # the size of vocabulary 185 | emsize = 4096 # embedding dimension 186 | nhid = 4096 # the dimension of the feedforward network model in nn.TransformerEncoder 187 | nlayers = 12 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder 188 | nhead = 16 # the number of heads in the multiheadattention models 189 | dropout = 0.2 # the dropout value 190 | 191 | from torch.distributed import rpc 192 | 193 | tmpfile = tempfile.NamedTemporaryFile() 194 | rpc.init_rpc( 195 | name="worker", 196 | rank=0, 197 | world_size=1, 198 | rpc_backend_options=rpc.TensorPipeRpcBackendOptions( 199 | init_method="file://{}".format(tmpfile.name), 200 | # Specifying _transports and _channels is a workaround and we no longer 201 | # will have to specify _transports and _channels for PyTorch 202 | # versions >= 1.8.1 203 | _transports=["ibv", "uv"], 204 | _channels=["cuda_ipc", "cuda_basic"], 205 | ) 206 | ) 207 | 208 | num_gpus = 2 209 | partition_len = ((nlayers - 1) // num_gpus) + 1 210 | 211 | # Add encoder in the beginning. 212 | tmp_list = [Encoder(ntokens, emsize, dropout).cuda(0)] 213 | module_list = [] 214 | 215 | # Add all the necessary transformer blocks. 216 | for i in range(nlayers): 217 | transformer_block = TransformerEncoderLayer(emsize, nhead, nhid, dropout) 218 | if i != 0 and i % (partition_len) == 0: 219 | module_list.append(nn.Sequential(*tmp_list)) 220 | tmp_list = [] 221 | device = i // (partition_len) 222 | tmp_list.append(transformer_block.to(device)) 223 | 224 | # Add decoder in the end. 225 | tmp_list.append(Decoder(ntokens, emsize).cuda(num_gpus - 1)) 226 | module_list.append(nn.Sequential(*tmp_list)) 227 | 228 | from torch.distributed.pipeline.sync import Pipe 229 | 230 | # Build the pipeline. 
231 | chunks = 8 232 | model = Pipe(torch.nn.Sequential(*module_list), chunks=chunks) 233 | 234 | print('Total parameters in model: {:,}'.format(get_total_params(model))) 235 | 236 | criterion = nn.CrossEntropyLoss() 237 | lr = 5.0 # learning rate 238 | optimizer = torch.optim.SGD(model.parameters(), lr=lr) 239 | scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95) 240 | 241 | best_val_loss = float("inf") 242 | epochs = 3 # The number of epochs 243 | best_model = None 244 | 245 | for epoch in range(1, epochs + 1): 246 | epoch_start_time = time.time() 247 | train() 248 | val_loss = evaluate(model, val_data) 249 | print('-' * 89) 250 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | ' 251 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time), 252 | val_loss, math.exp(val_loss))) 253 | print('-' * 89) 254 | 255 | if val_loss < best_val_loss: 256 | best_val_loss = val_loss 257 | best_model = model 258 | 259 | scheduler.step() 260 | 261 | test_loss = evaluate(best_model, test_data) 262 | print('=' * 89) 263 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format( 264 | test_loss, math.exp(test_loss))) 265 | print('=' * 89) 266 | -------------------------------------------------------------------------------- /huawei/resnet.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Huawei Technologies Co., Ltd 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================ 15 | ''' 16 | Resnet 17 | ''' 18 | import numpy as np 19 | import mindspore.nn as nn 20 | from mindspore import Tensor 21 | import mindspore.ops as ops 22 | 23 | 24 | def weight_variable_0(shape): 25 | """weight_variable_0""" 26 | zeros = np.zeros(shape).astype(np.float32) 27 | return Tensor(zeros) 28 | 29 | 30 | def weight_variable_1(shape): 31 | """weight_variable_1""" 32 | ones = np.ones(shape).astype(np.float32) 33 | return Tensor(ones) 34 | 35 | 36 | def conv3x3(in_channels, out_channels, stride=1, padding=0): 37 | """3x3 convolution """ 38 | return nn.Conv2d(in_channels, out_channels, 39 | kernel_size=3, stride=stride, padding=padding, weight_init='XavierUniform', 40 | has_bias=False, pad_mode="same") 41 | 42 | 43 | def conv1x1(in_channels, out_channels, stride=1, padding=0): 44 | """1x1 convolution""" 45 | return nn.Conv2d(in_channels, out_channels, 46 | kernel_size=1, stride=stride, padding=padding, weight_init='XavierUniform', 47 | has_bias=False, pad_mode="same") 48 | 49 | 50 | def conv7x7(in_channels, out_channels, stride=1, padding=0): 51 | """1x1 convolution""" 52 | return nn.Conv2d(in_channels, out_channels, 53 | kernel_size=7, stride=stride, padding=padding, weight_init='XavierUniform', 54 | has_bias=False, pad_mode="same") 55 | 56 | 57 | def bn_with_initialize(out_channels): 58 | """bn_with_initialize""" 59 | shape = (out_channels) 60 | mean = weight_variable_0(shape) 61 | var = weight_variable_1(shape) 62 | beta = weight_variable_0(shape) 63 | bn = nn.BatchNorm2d(out_channels, momentum=0.99, eps=0.00001, gamma_init='Uniform', 64 | beta_init=beta, moving_mean_init=mean, moving_var_init=var) 65 | return bn 66 | 67 | 68 | def bn_with_initialize_last(out_channels): 69 | """bn_with_initialize_last""" 70 | shape = (out_channels) 71 | mean = weight_variable_0(shape) 72 | var = weight_variable_1(shape) 73 | beta = weight_variable_0(shape) 74 | bn = nn.BatchNorm2d(out_channels, momentum=0.99, eps=0.00001, gamma_init='Uniform', 75 | beta_init=beta, moving_mean_init=mean, moving_var_init=var) 76 | return bn 77 | 78 | 79 | def fc_with_initialize(input_channels, out_channels): 80 | """fc_with_initialize""" 81 | return nn.Dense(input_channels, out_channels, weight_init='XavierUniform', bias_init='Uniform') 82 | 83 | 84 | class ResidualBlock(nn.Cell): 85 | """ResidualBlock""" 86 | expansion = 4 87 | 88 | def __init__(self, 89 | in_channels, 90 | out_channels, 91 | stride=1): 92 | """init block""" 93 | super(ResidualBlock, self).__init__() 94 | 95 | out_chls = out_channels // self.expansion 96 | self.conv1 = conv1x1(in_channels, out_chls, stride=stride, padding=0) 97 | self.bn1 = bn_with_initialize(out_chls) 98 | 99 | self.conv2 = conv3x3(out_chls, out_chls, stride=1, padding=0) 100 | self.bn2 = bn_with_initialize(out_chls) 101 | 102 | self.conv3 = conv1x1(out_chls, out_channels, stride=1, padding=0) 103 | self.bn3 = bn_with_initialize_last(out_channels) 104 | 105 | self.relu = ops.ReLU() 106 | self.add = ops.Add() 107 | 108 | def construct(self, x): 109 | """construct""" 110 | identity = x 111 | 112 | out = self.conv1(x) 113 | out = self.bn1(out) 114 | out = self.relu(out) 115 | 116 | out = self.conv2(out) 117 | out = self.bn2(out) 118 | out = self.relu(out) 119 | 120 | out = self.conv3(out) 121 | out = self.bn3(out) 122 | 123 | out = self.add(out, identity) 124 | out = self.relu(out) 125 | 126 | return out 127 | 128 | 129 | class ResidualBlockWithDown(nn.Cell): 130 | """ResidualBlockWithDown""" 131 | 
expansion = 4 132 | 133 | def __init__(self, 134 | in_channels, 135 | out_channels, 136 | stride=1, 137 | down_sample=False): 138 | """init block with down""" 139 | super(ResidualBlockWithDown, self).__init__() 140 | 141 | out_chls = out_channels // self.expansion 142 | self.conv1 = conv1x1(in_channels, out_chls, stride=stride, padding=0) 143 | self.bn1 = bn_with_initialize(out_chls) 144 | 145 | self.conv2 = conv3x3(out_chls, out_chls, stride=1, padding=0) 146 | self.bn2 = bn_with_initialize(out_chls) 147 | 148 | self.conv3 = conv1x1(out_chls, out_channels, stride=1, padding=0) 149 | self.bn3 = bn_with_initialize_last(out_channels) 150 | 151 | self.relu = ops.ReLU() 152 | self.down_sample = down_sample 153 | 154 | self.conv_down_sample = conv1x1(in_channels, out_channels, stride=stride, padding=0) 155 | self.bn_down_sample = bn_with_initialize(out_channels) 156 | self.add = ops.Add() 157 | 158 | def construct(self, x): 159 | """construct""" 160 | identity = x 161 | 162 | out = self.conv1(x) 163 | out = self.bn1(out) 164 | out = self.relu(out) 165 | 166 | out = self.conv2(out) 167 | out = self.bn2(out) 168 | out = self.relu(out) 169 | 170 | out = self.conv3(out) 171 | out = self.bn3(out) 172 | 173 | identity = self.conv_down_sample(identity) 174 | identity = self.bn_down_sample(identity) 175 | 176 | out = self.add(out, identity) 177 | out = self.relu(out) 178 | 179 | return out 180 | 181 | 182 | class MakeLayer0(nn.Cell): 183 | """MakeLayer0""" 184 | 185 | def __init__(self, block, in_channels, out_channels, stride): 186 | """init""" 187 | super(MakeLayer0, self).__init__() 188 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=1, down_sample=True) 189 | self.b = block(out_channels, out_channels, stride=stride) 190 | self.c = block(out_channels, out_channels, stride=1) 191 | 192 | def construct(self, x): 193 | """construct""" 194 | x = self.a(x) 195 | x = self.b(x) 196 | x = self.c(x) 197 | 198 | return x 199 | 200 | 201 | class MakeLayer1(nn.Cell): 202 | """MakeLayer1""" 203 | 204 | def __init__(self, block, in_channels, out_channels, stride): 205 | """init""" 206 | super(MakeLayer1, self).__init__() 207 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 208 | self.b = block(out_channels, out_channels, stride=1) 209 | self.c = block(out_channels, out_channels, stride=1) 210 | self.d = block(out_channels, out_channels, stride=1) 211 | 212 | def construct(self, x): 213 | """construct""" 214 | x = self.a(x) 215 | x = self.b(x) 216 | x = self.c(x) 217 | x = self.d(x) 218 | 219 | return x 220 | 221 | 222 | class MakeLayer2(nn.Cell): 223 | """MakeLayer2""" 224 | 225 | def __init__(self, block, in_channels, out_channels, stride): 226 | """init""" 227 | super(MakeLayer2, self).__init__() 228 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 229 | self.b = block(out_channels, out_channels, stride=1) 230 | self.c = block(out_channels, out_channels, stride=1) 231 | self.d = block(out_channels, out_channels, stride=1) 232 | self.e = block(out_channels, out_channels, stride=1) 233 | self.f = block(out_channels, out_channels, stride=1) 234 | 235 | def construct(self, x): 236 | """construct""" 237 | x = self.a(x) 238 | x = self.b(x) 239 | x = self.c(x) 240 | x = self.d(x) 241 | x = self.e(x) 242 | x = self.f(x) 243 | 244 | return x 245 | 246 | 247 | class MakeLayer3(nn.Cell): 248 | """MakeLayer3""" 249 | 250 | def __init__(self, block, in_channels, out_channels, stride): 251 | """init""" 252 | 
super(MakeLayer3, self).__init__() 253 | self.a = ResidualBlockWithDown(in_channels, out_channels, stride=stride, down_sample=True) 254 | self.b = block(out_channels, out_channels, stride=1) 255 | self.c = block(out_channels, out_channels, stride=1) 256 | 257 | def construct(self, x): 258 | """construct""" 259 | x = self.a(x) 260 | x = self.b(x) 261 | x = self.c(x) 262 | 263 | return x 264 | 265 | 266 | class Head(nn.Cell): 267 | """Head""" 268 | def __init__(self): 269 | super(Head, self).__init__() 270 | self.conv1 = conv7x7(3, 64, stride=2, padding=0) 271 | self.bn1 = bn_with_initialize(64) 272 | self.relu = ops.ReLU() 273 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same") 274 | 275 | def construct(self, x): 276 | x = self.conv1(x) 277 | x = self.bn1(x) 278 | x = self.relu(x) 279 | x = self.maxpool(x) 280 | return x 281 | 282 | 283 | class ResNet(nn.Cell): 284 | """ResNet""" 285 | 286 | def __init__(self, block, num_classes=100, batch_size=32): 287 | """init""" 288 | super(ResNet, self).__init__() 289 | self.batch_size = batch_size 290 | self.num_classes = num_classes 291 | self.head = Head() 292 | 293 | self.layer1 = MakeLayer0(block, in_channels=64, out_channels=256, stride=1) 294 | self.layer2 = MakeLayer1(block, in_channels=256, out_channels=512, stride=2) 295 | self.layer3 = MakeLayer2(block, in_channels=512, out_channels=1024, stride=2) 296 | self.layer4 = MakeLayer3(block, in_channels=1024, out_channels=2048, stride=2) 297 | 298 | self.pool = ops.ReduceMean(keep_dims=True) 299 | self.squeeze = ops.Squeeze(axis=(2, 3)) 300 | self.fc = fc_with_initialize(512 * block.expansion, num_classes) 301 | 302 | # pipeline parallel config 303 | self.head.pipeline_stage = 0 304 | self.layer1.pipeline_stage = 0 305 | self.layer2.pipeline_stage = 0 306 | self.layer3.pipeline_stage = 1 307 | self.layer4.pipeline_stage = 1 308 | self.fc.pipeline_stage = 1 309 | 310 | def construct(self, x): 311 | """construct""" 312 | x = self.head(x) 313 | x = self.layer1(x) 314 | x = self.layer2(x) 315 | x = self.layer3(x) 316 | x = self.layer4(x) 317 | 318 | x = self.pool(x, (2, 3)) 319 | x = self.squeeze(x) 320 | x = self.fc(x) 321 | return x 322 | 323 | 324 | def resnet50(batch_size, num_classes): 325 | """create resnet50""" 326 | return ResNet(ResidualBlock, num_classes, batch_size) 327 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Concept Explanation 2 | 3 | ## Data Parallelism (DP) 4 | 5 | ## Model Parallelism 6 | 7 | Model Parallelism has two types: Inter-layer and intra-layer. We note Inter-layer model parallelism as MP, and 8 | intra-layer model parallelism as TP (tensor parallelism). 9 | 10 | some researchers may call TP parameter parallelism or intra-layer model parallelism. 11 | 12 | Popular intra-model parallelism methods include 2D, 2.5D, 3D model-parallelism as well as Megatron(1D). There are only 13 | few work related to 2D, 2.5D and 3D now (only Colossal-AI). 14 | 15 | ## Pipeline Parallelism 16 | 17 | The partition of PP and MP are similar, but has different executing behaviors. Basically pipeline parallelism has two 18 | families: PipeDream family and GPipe family. 19 | 20 | # Published methods of auto-parallelism, including: 21 | 22 | I classify parallelism methods according to their partition ways. 
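Before the tables, a short illustrative sketch of what "intra-layer model parallelism (TP)" means in code may help, since the distinction from DP and PP is what the classification below hinges on. This is not code from any system listed here; the class name `ColumnParallelLinear` and the single-process simulation of "shards" are assumptions made purely for illustration (in a real Megatron-style setup each shard lives on its own device and the concatenation is an all-gather collective).

```python
# Minimal sketch of 1D (Megatron-style) tensor parallelism, simulated in one
# process: the Linear layer's output dimension is split into column blocks.
import torch
import torch.nn as nn


class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, num_shards):
        super().__init__()
        assert out_features % num_shards == 0, "output dim must divide evenly"
        shard_out = out_features // num_shards
        # Each shard holds one column block of the full weight matrix.
        self.shards = nn.ModuleList(
            [nn.Linear(in_features, shard_out) for _ in range(num_shards)]
        )

    def forward(self, x):
        # Every shard consumes the full input; the partial outputs are
        # concatenated, which in a distributed run corresponds to an
        # all-gather across the tensor-parallel group.
        return torch.cat([shard(x) for shard in self.shards], dim=-1)


if __name__ == "__main__":
    layer = ColumnParallelLinear(in_features=512, out_features=2048, num_shards=4)
    y = layer(torch.randn(8, 512))
    print(y.shape)  # torch.Size([8, 2048])
```

By contrast, DP would replicate the whole layer and split the batch, and PP (inter-layer model parallelism) would place whole layers or blocks on different devices, which is exactly how the tables below are grouped.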
23 | 24 | ## Pipeline Parallelism or Inter-layer Model Parallelism only: 25 | 26 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods 27 | | --- | --- | --- | --- | --- | --- | --- | 28 | | ColocRL(REINFORCE) | Uses reinforcement learning to discover model partitions | Google Brain | [mlr.press](http://proceedings.mlr.press/v70/mirhoseini17a/mirhoseini17a.pdf) | Tensorflow | PMLR 70, 2017 | Reinforce 29 | | A hierarchical model for device placement (HDP)| Uses Scotch to do graph partitioning | Google |[link](https://openreview.net/pdf?id=Hkc-TeZ0W) | Tensorflow | ICLR 2018 | Reinforce LSTM 30 | | GPipe| No implementation, see torchgpipe | Google | [arxiv](https://arxiv.org/abs/1811.06965) | None| 2018 on arxiv, NIPS2019 | even partitioning or manual assignment 31 | |[torchgpipe](https://github.com/kakaobrain/torchgpipe)| A GPipe implementation in PyTorch | UNIST | [arxiv](https://arxiv.org/pdf/2004.09910.pdf) | PyTorch | 2020 on arxiv | balance stages by profiling 32 | | GDP | A general deep RL method for automating device placement on arbitrary graphs. Orthogonal to DP, MP, PP | Google| [arxiv](https://export.arxiv.org/pdf/1910.01578.pdf) | Unknown | 2019 on arxiv | Reinforce Transformer 33 | | Pesto | Partitions models based on inter-layer model parallelism | Stony Brook University | [acm](https://www3.cs.stonybrook.edu/~anshul/middleware21_pesto.pdf) | Tensorflow | Middleware '21 | integer linear program 34 | | [vPipe](https://github.com/hku-systems/vpipe) | A pipeline-only system designed for NAS networks. Complementary to hybrid parallelism| HKU | [ieee](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9472938) | PyTorch | TPDS vol.33 no.3 2022 |Swap, Recompute, Partition (SRP) planner. P: Kernighan-Lin algorithm 35 | 36 | ## Data Parallelism + Pipeline Parallelism (or Inter-layer Model Parallelism): 37 | 38 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods | 39 | | --- | --- | --- | --- | --- | --- | --- | 40 | | Spotlight| Models device placement as a Markov decision process (MDP). | University of Toronto | [mlr.press](http://proceedings.mlr.press/v80/gao18a/gao18a.pdf) | Unknown|PMLR 80, 2018 | Reinforce LSTM 41 | | Placeto | Similar to Spotlight's MDP formulation, but with a different policy. | MIT |[nips](https://proceedings.neurips.cc/paper/2019/file/71560ce98c8250ce57a6a970c9991a5f-Paper.pdf) | Tensorflow | NIPS 2019 | Reinforce 42 | |[REGAL](https://github.com/deepmind/deepmind-research/tree/master/regal)|A deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. |Google|[openreview](https://openreview.net/pdf?id=rkxDoJBYPB) | Unknown |ICLR 2020 |RL with Genetic Algorithm 43 | |[PipeDream](https://github.com/msr-fiddle/pipedream) |This repository contains the source code implementation of PipeDream and PipeDream-2BW | Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/1806.03377.pdf) | PyTorch | 2018 on arxiv, SOSP 2019 | Dynamic Programming with Profile 44 | |PipeDream-2BW | See the entry above | Microsoft |[arxiv](https://arxiv.org/pdf/2006.09503.pdf), [mlr.press](http://proceedings.mlr.press/v139/narayanan21a/narayanan21a.pdf) | PyTorch | PMLR 139, 2021 | Dynamic Programming with Profile 45 | |[DNN-partitioning](https://github.com/msr-fiddle/dnn-partitioning)| DNN partitioning algorithms published at NeurIPS 2020.
| Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/2006.16423.pdf) | proof-of-concept implementation | NIPS 2020 |Dynamic Programming and Integer Programming 46 | |HetPipe| Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism | UNIST | [usenix](https://www.usenix.org/system/files/atc20-park.pdf) | PyTorch (not open sourced) | USENIX 2020 | use CPLEX to solve linear programming problem 47 | |[DAPPLE](https://github.com/AlibabaPAI/DAPPLE) | An Efficient Pipelined Data Parallel Approach for Training Large Model. Succeed from GPipe | Alibaba | [arxiv](https://arxiv.org/pdf/2007.01045.pdf) | DAPPLE | 2020 on arxiv; PPoPP 21 | Dynamic Programming 48 | |[PipeTransformer](https://github.com/Distributed-AI/PipeTransformer) |Automated Elastic Pipelining for Distributed Training of Transformers | University of South California | [arxiv](https://arxiv.org/pdf/2102.03161.pdf) |PyTorch | ICML 21 | Dynamic Programming 49 | |[Chimera](https://github.com/Shigangli/Chimera) | Efficiently training large-scale neural networks with bidirectional pipelines | Department of Computer Science, ETH Zurich Switzerland | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3458817.3476145) | PyTorch | SC 2021 | Performance model with brute force 50 | | TAPP | Use a Seq2Seq based on attention mechanism to predict stage for layers. | Hohai University | [mdpi](https://www.mdpi.com/2076-3417/11/11/4785/pdf) | Unknown |Appl.sci. 2021, 11 | Reinforce Seq2Seq based on attention 51 | |[RaNNC](https://github.com/nict-wisdom/rannc/tree/main) | RaNNC is an automatic parallelization middleware used to train very large-scale neural networks. | DIRECT and University of Tokyo | [arxiv](http://arxiv.org/abs/2103.16063) | PyTorch | IPDPS 2021 | dynamic programming 52 | |[HeterPS](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/framework/fleet/heter_ps)| distributed deep learning with RL based scheduling in heterogeneous environment. | Baidu | [arxiv](https://arxiv.org/pdf/2111.10635.pdf) | Paddle | 2021 | Reinforce learning based 53 | |[FTPipe](https://github.com/saareliad/FTPipe) | FTPipe can automatically transform sequential implementation into a multi-GPU one. | Technion-Israel Institute of Technology | [usenix](https://usenix.org/system/files/atc21-eliad.pdf) | PyTorch | 2021 | multiprocessor scheduling problem with profiling. 
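Several planners in the table above (the "Dynamic Programming with Profile" entries in particular) share a common core: profile per-layer costs, then use dynamic programming to cut the layer sequence into pipeline stages. The sketch below is a toy, single-objective version of that idea; it is not the actual algorithm of any listed system, and the function name `partition_layers` and the example timings are made up for illustration.

```python
# Toy DP partitioner: split a profiled layer sequence into `num_stages`
# contiguous pipeline stages while minimizing the slowest (bottleneck) stage.
# Real planners also model inter-stage communication, memory limits, and the
# data-parallel width of each stage.
from functools import lru_cache


def partition_layers(layer_times, num_stages):
    """Return (bottleneck_time, stage_end_indices) for a min-max split."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # best(i, k): minimal bottleneck for layers[i:] using k stages.
        if k == 1:
            return prefix[n] - prefix[i], (n,)
        best_cost, best_cuts = float("inf"), None
        for j in range(i + 1, n - k + 2):  # first stage is layers[i:j]
            stage_cost = prefix[j] - prefix[i]
            rest_cost, rest_cuts = best(j, k - 1)
            cost = max(stage_cost, rest_cost)
            if cost < best_cost:
                best_cost, best_cuts = cost, (j,) + rest_cuts
        return best_cost, best_cuts

    return best(0, num_stages)


if __name__ == "__main__":
    # Hypothetical profiled forward+backward times (ms) for 8 layers.
    times = [4.0, 7.0, 3.0, 6.0, 5.0, 5.0, 2.0, 8.0]
    bottleneck, cuts = partition_layers(times, num_stages=3)
    print(bottleneck, cuts)  # 15.0 (2, 5, 8)
```

The min-max objective is only the simplest instance: systems such as PipeDream and DAPPLE extend the recurrence to decide, for each stage, how many data-parallel replicas it gets and what the communication between adjacent stages costs.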
54 | 55 | ## Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism): 56 | 57 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods | 58 | | --- | --- | --- | --- | --- | --- | --- | 59 | |[OptCNN](https://github.com/flexflow/FlexFlow) | auto parallelism method for CNN |Zhihao Jia | [mlr.press](http://proceedings.mlr.press/v80/jia18a/jia18a.pdf) | FlexFlow | PMLR 80, 2018 | Dynamic Programming based graph search algorithm 60 | |[FlexFlow](https://github.com/flexflow/FlexFlow) | a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies | Zhihao Jia | [stanford](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf) |FlexFlow, compatible with PyTorch, Keras | SysML 2019 | MCMC 61 | |Tofu| Supporting Very Large Models using Automatic Dataflow Graph Partitioning | New York University | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3302424.3303953) | Not OpenSourced | Euro-Sys 2019 | same as OptCNN 62 | |[AccPar](https://github.com/linghaosong/AccPar) |Tensor partitioning for heterogeneous deep learning accelerators. | Linghao Song from USC| [usc.edu](http://alchem.usc.edu/portal/static/download/accpar.pdf) | Need Manually Deploy | 2019 on arxiv, HPCA 2020 | Dynamic Programming 63 | |[TensorOpt](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel) | Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism | CUHK & Huawei | [arxiv](https://arxiv.org/pdf/2004.10856.pdf) | MindSpore | 2020 on arxiv | Dynamic Programming based graph search algorithm 64 | |[ROC](https://github.com/jiazhihao/ROC) | Another paper from Zhihao, Jia. Designed for GNN | Zhihao Jia | [mlsys](https://proceedings.mlsys.org/paper/2020/file/fe9fc289c3ff0af142b6d3bead98a923-Paper.pdf) | On top of Flexflow | MLSys 2020 | uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost. 65 | |[Double Recursive](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel/rec_core) | A Double recursive algorithm to search strategies | Huawei | [link](https://link.springer.com/chapter/10.1007/978-3-030-85665-6_13) | MindSpore | Euro-Par 2021 | Double Recursive 66 | |[PaSE](https://github.com/baidu-research/PaSE) |PaSE uses a dynamic programming based approach to find an efficient strategy within a reasonable time. 
| Baidu Research | [ieee](https://github.com/baidu-research/PaSE/raw/master/docs/PaSE_ipdps2021.pdf) | prototype | IPDPS 2021 | Dynamic Programming 67 | |P^2| Offers a novel syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way |University of Cambridge & DeepMind | [arxiv](https://arxiv.org/pdf/2110.10548.pdf) | Simulation Experiment |2021 on arxiv, MLSys 2022 | Synthesize tool with simulation 68 | |AutoMap| Uses search and learning to find Megatron-like strategies | DeepMind | [arxiv](https://arxiv.org/pdf/2112.02958.pdf) | JAX python API, XLA backend | 2021 on arxiv, NIPS 2021 | Search: Monte Carlo Tree Search; Learn: Interaction Network 69 | 70 | ## Data Parallelism + Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism: 71 | 72 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods| 73 | | --- | --- | --- | --- | --- | --- | --- | 74 | |Auto-MAP| Works on HLO IR. Uses linkage groups to prune the search space and DQN RL to search DP, MP, and PP strategies. | Alibaba | [arxiv](https://arxiv.org/pdf/2007.04069.pdf) | RAINBOW DQN | 2020 | Reinforcement Learning 75 | |[Piper](https://github.com/msr-fiddle/piper) | This code package contains algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021. An extension of DNN partitioning| Microsoft Fiddle| [link](https://www.microsoft.com/en-us/research/publication/piper-multidimensional-planner-for-dnn-parallelization/) | proof-of-concept implementation | NIPS 2021 | two-level dynamic programming 76 | |[GSPMD](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla) |A system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way | Google | [arxiv](https://arxiv.org/pdf/2105.04663.pdf) | Tensorflow XLA | 2021 | sharding propagation 77 | |[DistIR](https://github.com/microsoft/dist-ir) | Horizontal TP. An intermediate representation and simulator for efficient neural network distribution | Stanford University & Microsoft Fiddle| [arxiv](https://arxiv.org/abs/2111.05426) | PyTorch | MLSys 2021 | Grid-Search Simulator 78 | |Neo | A software-hardware co-designed system for high-performance distributed training of large-scale DLRM. | Facebook | [arxiv](https://export.arxiv.org/pdf/2104.05158.pdf) | PyTorch | 2021 | 1. Greedy 2. Karmarker-Karp Algorithm 79 | |Adaptive Paddle| Elastic training, fault tolerance, and cost-model based sharding propagation |Baidu | [arxiv](https://arxiv.org/pdf/2112.02752.pdf) | Paddle | 2021 | Cost model based. Details not given. 80 | |[Alpa](https://github.com/alpa-projects/alpa) | Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | UC Berkeley, Google, etc.
| [arxiv](https://arxiv.org/pdf/2201.12023.pdf) | Jax, XLA | 2022 | Integer Linear for Intra, Dynamic programming for inter 81 | 82 | ## Other Interesting automatic work 83 | 84 | | Name | Description | Organization or author | Paper| Framework| Year | Auto Methods| 85 | | --- | --- | --- | --- | --- | --- | --- | 86 | |[TASO](https://github.com/jiazhihao/TASO) | automatically optimize DNN computation with graph substitution | Zhihao Jia | 87 | 88 | --- 89 | 90 | # Classify with Machine-Learning Based Methods and Classic Algorithm Based Methods 91 | 92 | ## Machine-Learning Based Methods 93 | 94 | | Name | Method Type | Parallelism | Year | 95 | | --- | --- | --- | --- | 96 | | ColocRL | Reinforcement | MP | 2017 | 97 | | HDP | Reinforcement | MP | 2018 | 98 | | GDP | Reinforcement | MP | 2019 | 99 | | REGAL | Reinforcement | MP | 2020 | 100 | | TAPP | Reinforcement | DP+PP | 2021 | 101 | | Spotlight | Reinforcement | DP+MP | 2018 | 102 | | Placeto | Reinforcement | DP+MP | 2019 | 103 | | HeterPS | Reinforcement | DP+PP | 2021 | 104 | | AutoMap | Deep Learning to predict rank | DP+TP | 2021 | 105 | | Auto-MAP | Reinforcement | DP or TP or PP | 2020 | 106 | | FlexFlow | MCMC | DP+TP | 2019 | 107 | | ROC | uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost. | DP+TP | 2020 | 108 | 109 | ## Classic Algorithm Based Methods 110 | 111 | | Name | Method Type | Parallelism | Year | 112 | | --- | --- | --- | --- | 113 | | Pesto | integer linear | MP | 2021 | 114 | | vpipe | SRP algorithm + KL (DP) | PP | 2022 | 115 | | PipeDream | dynamic programming | DP+PP | 2019 | 116 | | DNN-partitioning | dynamic programming + integer programming | DP+PP | 2020 | 117 | | PipeDream-2BW | dynamic programming | DP+PP | 2021 | 118 | | HetPipe | dynamic programming | DP+PP | 2020 | 119 | | DAPPLE | dynamic programming | DP+PP | 2021 | 120 | | PipeTransformer | dynamic programming | DP+PP | 2021 | 121 | | Chimera | Grid-Search| DP+PP | 2021 | 122 | | RaNNC | dynamic programming | DP+PP | 2021 | 123 | | FTPipe | Multiprocessor scheduling problem with profiling | DP+PP | 2021 | 124 | | OptCNN | dynamic programming | DP+TP | 2018 | 125 | | Tofu | dynamic programming | DP+TP | 2019 | 126 | | AccPar | dynamic programming | DP+TP | 2020 | 127 | | TensorOpt | dynamic programming | DP+TP | 2020 | 128 | | Double Recursive | Double recursive | DP+TP | 2021 | 129 | | PaSE | dynamic programming | DP+TP | 2021 | 130 | | P^2 | Synthesize tool with simulation | DP+TP | 2021 | 131 | | Piper | two-level dynamic programming | DP+TP+PP | 2021 | 132 | | GSPMD | heuristic-propagation | DP+TP+PP | 2021 | 133 | | DistIR | grid search | DP+TP+PP | 2021 | 134 | | Neo | Greedy + Karmarker-karp algorithm | DP+TP+PP | 2021 | 135 | | Alpa | Integer programming + Dynamic Programming | DP+TP+PP | 2022 | 136 | 137 | --- 138 | 139 | ## Pictures 140 | 141 | ### REINFORCE 142 | 143 | ![img.png](Image/overall/reinforce.png) 144 | 145 | ### Spotlight 146 | 147 | ![img.png](Image/overall/spotlight.png) 148 | 149 | ### GPipe 150 | 151 | ![img.png](Image/overall/gpipe.png) 152 | 153 | ### GDP 154 | 155 | ![img.png](Image/overall/gdp.png) 156 | 157 | ### Placeto 158 | 159 | ![img.png](Image/overall/placeto.png) 160 | 161 | ### REGAL 162 | 163 | ![img.png](Image/overall/REGAL.png) 164 | 165 | # News 166 | 167 | 2021.12.9 DeepMind proposes Gopher, a 280 billion parameter transformer language model. Trained by 4096 16GB 168 | TPUv3. 
[link](https://deepmind.com/blog/article/language-modelling-at-scale) 169 | 170 | 2021.12.8 Baidu and PengCheng Lab propose Wenxin (文心), a 260 billion parameter knowledge-aware pretrained model (a.k.a. 171 | ERNIE 3.0 Titan), trained with Adaptive Paddle (see the table above). 172 | 173 | 2021.10.26 Inspur announces Yuan 1.0, a 245.7 billion parameter model, at AICC 2021. --------------------------------------------------------------------------------