## Open Source Projects
- Cost Model
  - [nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices](https://github.com/microsoft/nn-Meter)
  - [Paleo: A Performance Model for Deep Neural Networks](https://github.com/TalwalkarLab/paleo)
- Distributed Training
  - [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://github.com/microsoft/DeepSpeed)
  - [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://github.com/NVIDIA/Megatron-LM)
  - [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://github.com/alpa-projects)

## Memory Cost Model
- [Estimating GPU Memory Consumption of Deep Learning Models](https://www.microsoft.com/en-us/research/uploads/prod/2020/09/dnnmem.pdf) by Yanjie Gao et al., ESEC/FSE 2020

## Computation Cost Model
- Cost Model for NAS/Cloud
  - [Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training](https://www.usenix.org/system/files/atc20-zhu-hongyu.pdf) by Hongyu Zhu et al., USENIX ATC 2020
  - [Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training](https://www.usenix.org/system/files/atc21-yu.pdf) by Geoffrey X. Yu et al., USENIX ATC 2021
  - [To bridge neural network design and real-world performance: A behaviour study for neural networks](https://proceedings.mlsys.org/paper/2021/file/02522a2b2726fb0a03bb19f2d8d9524d-Paper.pdf) by Xiaohu Tang et al., MLSys 2021
  - [perf4sight: A toolflow to model CNN training performance on Edge GPUs](https://arxiv.org/pdf/2108.05580.pdf) by Aditya Rajagopal et al., arXiv 2021
  - [nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices](https://dl.acm.org/doi/pdf/10.1145/3458864.3467882?casa_token=x0qNEhcP_wAAAAAA:uCTMD3yLynIaS7PwFvxzT65oxmrKz6EyOClSjYNCr-t036yn8VsqJcNjygQDkhR_04NeyZvRWS0e) by Li Lyna Zhang et al., MobiSys 2021
  - [Empirical Analysis and Modeling of Compute Times of CNN Operations on AWS Cloud](https://ieeexplore.ieee.org/abstract/document/9251263) by Ubaid Ullah Hafeez et al., IISWC 2020
  - [Paleo: A Performance Model for Deep Neural Networks](https://openreview.net/pdf?id=SyVVJ85lg) by Hang Qi et al., ICLR 2017
  - [Augur: Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices](https://arxiv.org/pdf/1709.09503.pdf) by Zongqing Lu et al., ACM Multimedia 2017
  - [Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures](https://arxiv.org/pdf/1906.01992.pdf) by Andre Viebke et al., arXiv 2019
- Cost Model for Kernel Compilation
  - [A Learned Performance Model for Tensor Processing Units](https://arxiv.org/abs/2008.01040) by Samuel J. Kaufman et al., MLSys 2021
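
Papers in this section (Paleo, Habitat, nn-Meter, ...) predict operator or iteration latency either analytically or with learned models. As a point of reference, below is a minimal roofline-style analytical estimate for a single dense layer; it is not the model of any listed paper, and `peak_flops` / `mem_bandwidth` are made-up device constants used only for illustration.

```python
# Minimal roofline-style latency estimate for a dense (fully connected) layer.
# Illustrative only: peak_flops and mem_bandwidth are assumed device numbers.

def dense_layer_latency(batch, d_in, d_out,
                        peak_flops=15e12,      # assumed device peak, FLOP/s
                        mem_bandwidth=900e9,   # assumed DRAM bandwidth, B/s
                        bytes_per_elem=4):     # fp32
    flops = 2 * batch * d_in * d_out                 # multiply-accumulate count
    traffic = bytes_per_elem * (batch * d_in         # read activations
                                + d_in * d_out       # read weights
                                + batch * d_out)     # write outputs
    compute_time = flops / peak_flops
    memory_time = traffic / mem_bandwidth
    return max(compute_time, memory_time)            # bound by the slower resource

if __name__ == "__main__":
    # e.g., a 4096x4096 layer at batch 32
    print(f"predicted latency: {dense_layer_latency(32, 4096, 4096) * 1e6:.1f} us")
```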

## Communication Cost Model
- [Iteration Time Prediction for CNN in Multi-GPU Platform: Modeling and Analysis](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8713989) by Ziqian Pei et al., IEEE Access 2019

## Distributed Training
- [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://arxiv.org/pdf/2201.12023.pdf) by Lianmin Zheng et al., arXiv 2022
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) by Mohammad Shoeybi et al., arXiv 2019
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/pdf/2104.07857.pdf) by Samyam Rajbhandari et al., SC 2021
- [Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc](https://cs.stanford.edu/~zhihao/papers/mlsys20.pdf) by Zhihao Jia et al., MLSys 2020
- [A Distributed Multi-GPU System for Fast Graph Processing](http://www.vldb.org/pvldb/vol11/p297-jia.pdf) by Zhihao Jia et al., VLDB 2017

## Device Placement
- [Device Placement Optimization with Reinforcement Learning](https://arxiv.org/pdf/1706.04972.pdf) by Azalia Mirhoseini et al., ICML 2017
- [DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9460468&casa_token=2gAY08LpV_oAAAAA:CPc0zg6FF4hQ9AfoW2X5SpyxYWcQQpn0G_kxQ-5QXwCHYhD--lf5A4-ELiSlXrKcDTXbsI2sEKg) by Minjia Zhang et al., IEEE IPDPS 2021

## Memory Optimization for Training
- Gradient checkpointing (see the PyTorch sketch after this section)
  - [Training Deep Nets with Sublinear Memory Cost](https://arxiv.org/pdf/1604.06174.pdf) by Tianqi Chen et al., arXiv 2016
  - [Efficient Rematerialization for Deep Networks](https://proceedings.neurips.cc/paper/2019/file/ffe10334251de1dc98339d99ae4743ba-Paper.pdf) by Ravi Kumar et al., NeurIPS 2019
  - [Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization](https://arxiv.org/pdf/1910.02653.pdf) by Paras Jain et al., MLSys 2020
  - [Dynamic Tensor Rematerialization](https://arxiv.org/pdf/2006.09616.pdf) by Marisa Kirisame et al., ICLR 2021
- Gradient checkpointing + distributed training
  - [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf) by Vijay Korthikanti et al., arXiv 2022
- Kernel fusion
  - [Data movement is all you need: A case study on optimizing transformers](https://proceedings.mlsys.org/paper/2021/file/c9e1074f5b3f9fc8ea15d152add07294-Paper.pdf) by Andrei Ivanov et al., MLSys 2021
- Compression/quantization
  - [Gist: Efficient Data Encoding for Deep Neural Network Training](https://www.microsoft.com/en-us/research/uploads/prod/2018/04/fiddle-gist-isca18.pdf) by Animesh Jain et al., ISCA 2018
  - [Gradient Compression Supercharged High-Performance Data Parallel DNN Training](https://www.ruichuan.org/papers/hipress-sosp21.pdf) by Youhui Bai et al., SOSP 2021
  - [GACT: Activation compressed training for generic network architectures](https://proceedings.mlr.press/v162/liu22v/liu22v.pdf) by Xiaoxuan Liu et al., ICML 2022
  - [On the Utility of Gradient Compression in Distributed Training Systems](https://proceedings.mlsys.org/paper/2022/hash/cedebb6e872f539bef8c3f919874e9d7-Abstract.html) by Saurabh Agarwal et al., MLSys 2022
- Swapping (a DeepSpeed offload config sketch follows this section)
  - [Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training](https://hal.inria.fr/hal-02316266/document) by Olivier Beaumont et al., Euro-Par 2020
  - [SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping](http://www.news.cs.nyu.edu/~jinyang/pub/swapadvisor-asplos20.pdf) by Chien-Chin Huang et al., ASPLOS 2020
  - [Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers](https://arxiv.org/pdf/2202.01306.pdf) by Youjie Li et al., VLDB 2022
  - [STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training](https://github.com/strongh2/sc22-ae) by Xiaoyang Sun et al., SC 2022
  - [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://www.usenix.org/conference/atc21/presentation/ren-jie) by Jie Ren et al., USENIX ATC 2021
- Swapping + pipeline parallelism
  - [Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers](https://arxiv.org/pdf/2202.01306.pdf) by Youjie Li et al., VLDB 2022
- Swapping + gradient checkpointing
  - [Capuchin: Tensor-based GPU memory management for deep learning](https://dl.acm.org/doi/pdf/10.1145/3373376.3378505?casa_token=Fa8ZayNjRk0AAAAA:8Bc7PzTe0SrH_edARFzh1vi7ll7CNzUDHsk4pHiOu8dwbmHExYFtYeQGKCKIqtPhS-tSXN1q_kn1KA) by Xuan Peng et al., ASPLOS 2020
  - [Efficient Combination of Rematerialization and Offloading for Training DNNs](https://proceedings.neurips.cc/paper/2021/file/c8461bf13fca8a2b9912ab2eb1668e4b-Paper.pdf) by Olivier Beaumont et al., NeurIPS 2021
  - [POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging](https://proceedings.mlr.press/v162/patil22b.html) by Shishir G. Patil et al., ICML 2022
- Memory allocator
  - [OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks](https://arxiv.org/abs/2210.12924) by Benoit Steiner et al., arXiv 2022
- Efficient optimizer
  - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054) by Samyam Rajbhandari et al., SC 2020
  - [1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed](http://proceedings.mlr.press/v139/tang21a.html) by Hanlin Tang et al., ICML 2021
- Hardware related
  - [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135) by Tri Dao et al., NeurIPS 2022
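
The gradient-checkpointing papers above trade extra recomputation for lower activation memory: most intermediate activations are discarded in the forward pass and recomputed during backward. Below is a minimal PyTorch sketch using `torch.utils.checkpoint.checkpoint_sequential`; the toy model and the choice of 4 segments are arbitrary and not taken from any of the papers.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep MLP; the depth and layer sizes are arbitrary placeholders.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])

x = torch.randn(64, 1024, requires_grad=True)

# Split the network into 4 segments; only segment-boundary activations are
# kept, and interior activations are recomputed during the backward pass.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```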
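
For the swapping line of work, ZeRO-Offload moves optimizer state to host memory and is driven by a DeepSpeed configuration rather than model changes. The sketch below shows roughly what such a configuration looks like; the model, batch size, and optimizer settings are placeholders, the exact schema can differ across DeepSpeed versions, and running it requires a GPU plus the `deepspeed` launcher.

```python
# Sketch of ZeRO-Offload-style CPU offloading via DeepSpeed.
# Values are placeholders; consult the DeepSpeed docs for the full schema.
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 32,                      # placeholder
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in host memory
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```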

## Framework Introduction
- [PyTorch Internals](http://blog.ezyang.com/2019/05/pytorch-internals/)
- [Profiler Trace File](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview)
- [Characterizing Deep Learning Training Workloads on Alibaba-PAI](https://arxiv.org/pdf/1910.05930.pdf) by Mengdi Wang et al., IISWC 2019
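
The profiler trace document above describes Chrome-trace files of the kind PyTorch's profiler emits. Below is a minimal sketch that records a few forward/backward steps and exports such a trace; the toy model and step count are arbitrary. The resulting `trace.json` can be opened in `chrome://tracing` or Perfetto.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(32, 512)

# Record CPU (and, if available, CUDA) activity for a few training-like steps.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    for _ in range(5):
        model(x).sum().backward()

prof.export_chrome_trace("trace.json")   # Chrome-trace file for chrome://tracing
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```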