├── img
│   ├── llm-evolutionary-tree.png
│   ├── ai-training-computation-202206.png
│   ├── ai-training-computation-202303.png
│   └── ai-training-computation-202306.png
├── .gitignore
└── README.md
/img/llm-evolutionary-tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/llm-evolutionary-tree.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202206.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202206.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202303.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202303.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202306.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202306.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | # Created by https://www.toptal.com/developers/gitignore/api/macos
3 | # Edit at https://www.toptal.com/developers/gitignore?templates=macos
4 |
5 | ### macOS ###
6 | # General
7 | .DS_Store
8 | .AppleDouble
9 | .LSOverride
10 |
11 | # Icon must end with two \r
12 | Icon
13 |
14 | # Thumbnails
15 | ._*
16 |
17 | # Files that might appear in the root of a volume
18 | .DocumentRevisions-V100
19 | .fseventsd
20 | .Spotlight-V100
21 | .TemporaryItems
22 | .Trashes
23 | .VolumeIcon.icns
24 | .com.apple.timemachine.donotpresent
25 |
26 | # Directories potentially created on remote AFP share
27 | .AppleDB
28 | .AppleDesktop
29 | Network Trash Folder
30 | Temporary Items
31 | .apdisk
32 |
33 | ### macOS Patch ###
34 | # iCloud generated files
35 | *.icloud
36 |
37 | # End of https://www.toptal.com/developers/gitignore/api/macos
38 |
39 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # awesome-huge-models [](https://awesome.re)
4 |
5 |
6 |
7 | A collection of AWESOME things about HUGE AI models.
8 |
9 | **[2023.06]** We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly: training and inference code, instruction-tuned weights and datasets, pretrained weights, and [the datasets used for pretraining LLMs](#open-llm-training-dataset). In this update, I try to catch up with the latest developments in the open-source wave of LLMs.
10 |
11 | **[2023.03]** Only pretrained models are recorded here. Models are sorted by first release date. To support the open-sourcing of LLMs, we highlight open-sourced models with [[open]]().
12 |
13 | **[2022.06]** There is a trend, led by big companies, of training large-scale deep learning models (w.r.t. params, dataset, FLOPs). These models achieve SoTA performance at a high price, relying on bags of training tricks and distributed training systems. Keeping an eye on this trend shows us the current boundaries of AI models. [[Intro in Chinese](https://zhuanlan.zhihu.com/p/529863941)]
14 |
15 |
16 |
17 | ## Contents
18 |
19 | - [awesome-huge-models ](#awesome-huge-models-)
20 | - [Contents](#contents)
21 | - [Survey](#survey)
22 | - [Models](#models)
23 | - [Language Model](#language-model)
24 | - [Vision Models](#vision-models)
25 | - [Reinforcement Learning](#reinforcement-learning)
26 | - [Speech](#speech)
27 | - [Science](#science)
28 | - [Open LLM Training Dataset](#open-llm-training-dataset)
29 | - [Distributed Training Framework](#distributed-training-framework)
30 | - [PyTorch Ecosystem](#pytorch-ecosystem)
31 | - [XLA Ecosystem](#xla-ecosystem)
32 | - [Other Frameworks](#other-frameworks)
33 | - [Inference Frameworks](#inference-frameworks)
34 | - [Recommendation Training Framework](#recommendation-training-framework)
35 | - [Keys Explanations](#keys-explanations)
36 |
37 | ## Survey
38 |
39 |
40 |
41 |
42 |
43 | - [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) [2023.03]
44 | - [A Dive into Vision-Language Models](https://huggingface.co/blog/vision_language_pretraining) [2023.02]
45 | - [Compute Trends Across Three Eras of Machine Learning](https://arxiv.org/abs/2202.05924) [[chart](https://ourworldindata.org/grapher/ai-training-computation)] [2022.02]
46 | - [Vision-and-Language Pretrained Models: A Survey](https://arxiv.org/abs/2204.07356) [2022.04]
47 | - [A Roadmap to Big Model](https://arxiv.org/abs/2203.14101) [2022.03]
48 | - [A Survey of Vision-Language Pre-trained Models](https://arxiv.org/abs/2202.10936) [2022.02]
49 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) [2022.01]
50 | - [On the Opportunities and Risk of Foundation Models](https://arxiv.org/abs/2108.07258) [2021.08]
51 | - [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [2021.06]
52 |
53 | Resource lists:
54 |
55 | - [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
56 | - [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM)
57 | - [Open-LLM](https://github.com/eugeneyan/open-llms)
58 | - [LLMDataHub](https://github.com/Zjh-819/LLMDataHub)
59 |
60 | ## Models
61 |
62 | ### Language Model
63 |
64 |
65 |
66 |
67 |
68 | - **Baichuan** [[Baichuan]]() Jun. 2023 [[open]](https://github.com/baichuan-inc/baichuan-7B)
69 |
70 | ```yaml
71 | Field: Language
72 | Params: 7B
73 | Training Data: 1.2T tokens (English, Chinese, Private)
74 | License: Apache 2.0
75 | Context Length: 4096
76 | ```
77 |
78 | - **Falcon** [[TII]]() Jun. 2023 [[open]](https://huggingface.co/tiiuae/falcon-40b)
79 |
80 | ```yaml
81 | Field: Language
82 | Params: 40B
83 | Training Data: 1T tokens (RefinedWeb)
84 | License: Apache 2.0
85 | Context Length: 2048
86 | ```
87 |
88 | - **OpenLLaMA** [[OpenLM]]() May. 2023 [[open]](https://github.com/openlm-research/open_llama)
89 |
90 | ```yaml
91 | Field: Language
92 | Params: 13B, 7B, 3B
93 | Training Data: 1T tokens (RedPajama)
94 | License: Apache 2.0
95 | Context Length: 2048
96 | ```
97 |
98 | - **Redpajama-INCITE** [[Together]](https://github.com/togethercomputer/RedPajama-Data) May. 2023 [[open]](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1)
99 |
100 | ```yaml
101 | Field: Language
102 | Params: 7B, 3B
103 | Training Data: 1T tokens (Redpajama)
104 | License: Apache 2.0
105 | Context Length: 2048
106 | ```
107 |
108 | - **MPT** [[MosaicML]](https://www.mosaicml.com/blog/mpt-7b) May. 2023 [[open]](https://github.com/mosaicml/llm-foundry)
109 |
110 | ```yaml
111 | Field: Language
112 | Params: 30B, 7B
113 | Training Data: 1T tokens (Private)
114 | License: Apache 2.0, CC BY-SA-3.0
115 | Context Length: 84k
116 | ```
117 |
118 | - **Stable-LM** [[Stability-AI]](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) Apr. 2023 [[open]](https://github.com/Stability-AI/StableLM#stablelm-alpha)
119 |
120 | ```yaml
121 | Field: Language
122 | Params: 7B, 3B
123 | Training Data: 1.5T tokens
124 | License: CC BY-SA-4.0
125 | ```
126 |
127 | - **Lit-LLaMA** [[Lightning-AI]]() Apr. 2023 [[open]](https://github.com/Lightning-AI/lit-llama)
128 |
129 | ```yaml
130 | Field: Language
131 | Params: 13B, 7B
132 | Training Data: 1.2T tokens (Redpajama)
133 | License: Apache 2.0
134 | ```
135 |
136 | - **h2oGPT** [[H2O.ai]](https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/) Jun. 2023 [[open]](https://github.com/h2oai/h2ogpt)
137 | [h2oGPT: Democratizing Large Language Models](https://arxiv.org/pdf/2306.08161.pdf)
138 |
139 | ```yaml
140 | Field: Language
141 | Params: 13B, 7B
142 | Training Data: 1.0T tokens
143 | License: Apache 2.0
144 | Context Length: 2048
145 | ```
146 |
147 | - **Cerebras-GPT** [[Cerebras]]() Mar. 2023 [[open]](https://huggingface.co/cerebras/Cerebras-GPT-13B)
148 | Training Compute-Optimal Large Language Models [[preprint]](https://arxiv.org/abs/2203.15556)
149 |
150 | ```yaml
151 | Field: Language
152 | Params: 13B
153 | Training Data: 371B tokens (The Pile)
154 | License: Apache 2.0
155 | Context Length: 2048
156 | ```
157 |
158 | - **Claude** [[Anthropic]](https://www.anthropic.com/index/introducing-claude) Mar. 2023 [close]
159 |
160 | ```yaml
161 | Field: Language-Vision
162 | ```
163 |
164 | - **GPT-4** [[OpenAI]](https://openai.com/product/gpt-4) Mar. 2023 [close]
165 | GPT-4 Technical Report [[Preprint]](https://cdn.openai.com/papers/gpt-4.pdf)
166 |
167 | ```yaml
168 | Field: Language-Vision
169 | Params: 1.7T
170 | Architecture: De, MoE
171 | ```
172 |
173 | - **Bard** [[Google]](https://blog.google/technology/ai/bard-google-ai-search-updates/) Feb. 2023 [close]
174 |
175 | ```yaml
176 | Field: Language-Vision
177 | ```
178 |
179 | - **LLaMA** [[Meta]]() Feb. 2023 [[open]](https://github.com/facebookresearch/llama)
180 | Open and Efficient Foundation Language Models [[Preprint]](https://arxiv.org/pdf/2302.13971v1.pdf)
181 |
182 | ```yaml
183 | Field: Language
184 | Params: 65B, 33B, 13B, 7B
185 | Training Data: 4TB (1.4T tokens)
186 | Training Cost: 1,022,362 A100 GPU hours (2048 80G-A100 x 21 days)
187 | Training Power Consumption: 449 MWh
188 | Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc.
189 | License: GPL
190 | ```
191 |
192 | - **RWKV-4** [[Personal]]() Dec. 2022 [[open]](https://github.com/BlinkDL/RWKV-LM)
193 |
194 | ```yaml
195 | Field: Language
196 | Params: 14B, 7B, 3B, 1.5B
197 | Training Data: 332B tokens
198 | Architecture: De, RNN
199 | License: Apache 2.0
200 | ```
201 |
202 | - **AnthropicLM** [[Anthropic]]() Dec. 2022 [close]
203 | Constitutional AI: Harmlessness from AI Feedback
204 |
205 | ```yaml
206 | Field: Language
207 | Params: 52B
208 | ```
209 |
210 | - **BLOOM** [[BigScience]]() Nov. 2022 [[open]](https://huggingface.co/bigscience/bloom)
211 | A 176B-Parameter Open-Access Multilingual Language Model [[Preprint]](https://arxiv.org/pdf/2211.05100.pdf)
212 |
213 | ```yaml
214 | Field: Language
215 | Params: 176B
216 | Training Data: 174GB (336B tokens)
217 | Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months
218 | Training Power Consumption: 475 MWh
219 | Training Framework: Megatron + Deepspeed
220 | Instruction-tuned Variants: BLOOMZ
221 | License: OpenRAIL-M v1
222 | Context Length: 2048
223 | ```
224 |
225 | - **Galactica** [[Meta]]() Nov. 2022 [[open]](https://huggingface.co/facebook/galactica-1.3b)
226 | A scientific language model trained on over 48 million scientific texts [[Preprint]](https://arxiv.org/pdf/2211.09085.pdf)
227 |
228 | ```yaml
229 | Field: Language
230 | Params: 125M, 1.3B, 6.7B, 30B, 120B
231 | ```
232 |
233 | - **Pythia** [[EleutherAI]]() Oct. 2022 [[open]](https://github.com/EleutherAI/pythia)
234 |
235 | ```yaml
236 | Field: Language
237 | Params: 12B
238 | Instruction-tuned Variants: Dolly 2.0
239 | License: Apache 2.0
240 | Context Length: 2048
241 | ```
242 |
243 | - **GLM-130B** [[BAAI]](https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/) Oct. 2022 [[open]](https://github.com/THUDM/GLM-130B)
244 | GLM-130B: An Open Bilingual Pre-trained Model [[ICLR'23]](https://arxiv.org/pdf/2210.02414.pdf)
245 |
246 | ```yaml
247 | Field: Language
248 | Params: 130B
249 | Training Data: (400B tokens)
250 | Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days
251 | Training Framework: Megatron + Deepspeed
252 | ```
253 |
254 | - **UL2** [[Google]]() May 2022 [[open]](https://huggingface.co/google/ul2)
255 | Unifying Language Learning Paradigms [[Preprint]](https://arxiv.org/abs/2205.05131)
256 |
257 | ```yaml
258 | Field: Language
259 | Params: 20B
260 | Training Data: 800GB (1T tokens)
261 | Architecture: En-De
262 | Training Framework: Jax + T5x
263 | License: Apache 2.0
264 | Instruction-tuned Variants: Flan-UL2
265 | Context Length: 2048
266 | ```
267 |
268 | - **OPT** [[Meta]](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) May 2022 [[open]](https://github.com/facebookresearch/metaseq)
269 | OPT: Open Pre-trained Transformer Language Models [[Preprint]](https://arxiv.org/abs/2205.01068)
270 |
271 | ```yaml
272 | Field: Language
273 | Params: 175B
274 | Training Data: 800GB (180B tokens)
275 | Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days
276 | Training Power Consumption: 356 MWh
277 | Architecture: De
278 | Training Framework: Megatron + Fairscale
279 | ```
280 |
281 | - **PaLM** [[Google]](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) Apr. 2022 [close]
282 | PaLM: Scaling Language Modeling with Pathways [[Preprint]](https://arxiv.org/abs/2204.02311)
283 |
284 | ```yaml
285 | Field: Language
286 | Params: 540B
287 | Training Data: 3TB (780B tokens)
288 | Training Cost: $10M (16,809,984 TPUv4core-hours, 64 days)
289 | Training petaFLOPs: 2.5B
290 | Architecture: De
291 | Training Framework: Jax + T5x
292 | ```
293 |
294 | - **GPT-NeoX** [[EleutherAI]](https://blog.eleuther.ai/announcing-20b/) Apr. 2022 [[open]](https://github.com/EleutherAI/gpt-neox)
295 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model [[Preprint]](https://arxiv.org/abs/2204.06745)
296 |
297 | ```yaml
298 | Field: Language
299 | Params: 20B
300 | Training Data: 525GiB
301 | Training petaFLOPs: 93B
302 | Architecture: De
303 | Training Framework: Megatron + Fairscale
304 | License: Apache 2.0
305 | Context Length: 2048
306 | ```
307 |
308 | - **InstructGPT** [[OpenAI]]() Mar. 2022 [close]
309 | Training language models to follow instructions with human feedback [[Preprint]](https://arxiv.org/abs/2203.02155)
310 |
311 | ```yaml
312 | Field: Language
313 | Params: 175B
314 | ```
315 |
316 | - **Chinchilla** [[DeepMind]](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training) Mar. 2022 [close]
317 | Training Compute-Optimal Large Language Models [[Preprint]](https://arxiv.org/abs/2203.15556)
318 |
319 | ```yaml
320 | Field: Language
321 | Params: 70B
322 | Training Data: 5.2TB (1.4T tokens)
323 | Training petaFLOPs: 580M
324 | Architecture: De
325 | ```
326 |
327 | - **EVA 2.0** [[BAAI]](https://wudaoai.cn/model/detail/EVA) Mar. 2022 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
328 | EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [[Preprint]](https://arxiv.org/abs/2203.09313)
329 |
330 | ```yaml
331 | Field: Language (Dialogue)
332 | Params: 2.8B
333 | Training Data: 180G (1.4B samples, Chinese)
334 | ```
335 |
336 | - **AlphaCode** [[DeepMind]](https://www.deepmind.com/blog/competitive-programming-with-alphacode) Mar. 2022 [close]
337 | Competition-Level Code Generation with AlphaCode [[Preprint]](https://arxiv.org/abs/2203.07814)
338 |
339 | ```yaml
340 | Field: Code Generation
341 | Params: 41B
342 | Training Data: (967B tokens)
343 | Architecture: De
344 | ```
345 |
346 | - **ST-MoE** [[Google]]() Feb. 2022 [close]
347 | ST-MoE: Designing Stable and Transferable Sparse Expert Models [[Preprint]](https://arxiv.org/abs/2202.08906)
348 |
349 | ```yaml
350 | Field: Language
351 | Params: 296B
352 | Architecture: En-De, MoE
353 | ```
354 |
355 | - **LaMDA** [[Google]](https://arxiv.org/abs/2201.08239) Jan. 2022 [close]
356 | LaMDA: Language Models for Dialog Applications [[Preprint]](https://arxiv.org/abs/2201.08239)
357 |
358 | ```yaml
359 | Field: Language (Dialogue)
360 | Params: 137B
361 | Training Data: (1.56T words)
362 | Training petaFLOPs: 360M
363 | Architecture: De
364 | ```
365 |
366 | - **GLaM** [[Google]](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) Dec. 2021 [close]
367 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2112.06905)
368 |
369 | ```yaml
370 | Field: Language
371 | Params: 1.2T
372 | Architecture: De, MoE
373 | ```
374 |
375 | - **Gopher** [[DeepMind]](https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval) Dec. 2021 [close]
376 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher [[Preprint]](https://arxiv.org/abs/2112.11446)
377 |
378 | ```yaml
379 | Field: Language
380 | Params: 280B
381 | Training Data: 1.3TB (300B tokens)
382 | Training petaFLOPs: 630M
383 | Architecture: De
384 | ```
385 |
386 | - **Yuan 1.0** [[Inspur]](https://air.inspur.com/home) Oct. 2021 [close]
387 | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2110.04725)
388 |
389 | ```yaml
390 | Field: Language
391 | Params: 245B
392 | Training Data: 5TB (180B tokens, Chinese)
393 | Training petaFLOPs: 410M
394 | Architecture: De, MoE
395 | ```
396 |
397 | - **MT-NLG** [[Microsoft, Nvidia]](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) Oct. 2021 [close]
398 | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [[Preprint]](https://arxiv.org/abs/2201.11990)
399 |
400 | ```yaml
401 | Field: Language
402 | Params: 530B
403 | Training Data: 339B tokens
404 | Training petaFLOPs: 1.4B
405 | Architecture: De
406 | ```
407 |
408 | - **PLATO-XL** [[Baidu]](http://research.baidu.com/Blog/index-view?id=163) Sept. 2021 [close]
409 | PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [[Preprint]](https://arxiv.org/abs/2109.09519)
410 |
411 | ```yaml
412 | Field: Language (Dialogue)
413 | Params: 11B
414 | Training Data: (1.2B samples)
415 | ```
416 |
417 | - **GPT-J** [[EleutherAI]](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) Aug. 2021 [[open]](https://github.com/kingoflolz/mesh-transformer-jax)
418 |
419 | ```yaml
420 | Field: Language
421 | Params: 6B
422 | Programming Language: Jax
423 | ```
424 |
425 | - **Jurassic-1** [[AI21 Labs]](https://www.zdnet.com/article/watch-out-gpt-3-here-comes-ai21s-jurassic-language-model/) Aug. 2021 [close]
426 | Jurassic-1: Technical Details and Evaluation [[Preprint]](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
427 |
428 | ```yaml
429 | Field: Language
430 | Params: 178B
431 | Training petaFLOPs: 370M
432 | Architecture: De
433 | ```
434 |
435 | - **Codex** [[OpenAI]](https://openai.com/blog/openai-codex/) July 2021 [close]
436 | Evaluating Large Language Models Trained on Code [[Preprint]](https://arxiv.org/abs/2107.03374)
437 |
438 | ```yaml
439 | Field: Code Generation
440 | Params: 12B
441 | Training Data: 159GB
442 | Architecture: De
443 | ```
444 |
445 | - **ERNIE 3.0** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie) July 2021 [close]
446 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [[Preprint]](https://arxiv.org/abs/2107.02137)
447 |
448 | ```yaml
449 | Field: Language
450 | Params: 10B
451 | Training Data: 4TB (375B tokens, with knowledge graph)
452 | Architecture: En
453 | Objective: MLM
454 | ```
455 |
456 | - **CPM-2** [[BAAI]]() June 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
457 | CPM-2: Large-scale Cost-effective Pre-trained Language Models [[Preprint]](https://arxiv.org/abs/2106.10715)
458 |
459 | ```yaml
460 | Field: Language
461 | Params: 198B
462 | Training Data: 2.6TB (Chinese 2.3TB, English 300GB)
463 | Architecture: En-De
464 | Objective: MLM
465 | ```
466 |
467 | - **HyperCLOVA** [[Naver]](https://www.navercorp.com/promotion/pressReleasesView/30546) May 2021 [close]
468 | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [[Preprint]](https://arxiv.org/abs/2109.04650v1)
469 |
470 | ```yaml
471 | Field: Language
472 | Params: 82B
473 | Training Data: 562B tokens (Korean)
474 | Training petaFLOPs: 63B
475 | Architecture: De
476 | ```
477 |
478 | - **ByT5** [[Google]]() May 2021 [[open]](https://github.com/google-research/byt5)
479 | ByT5: Towards a token-free future with pre-trained byte-to-byte models [[TACL'22]](https://arxiv.org/abs/2105.13626)
480 |
481 | ```yaml
482 | Field: Language
483 | Params: 13B
484 | Training Data: (101 languages)
485 | Architecture: En-De
486 | ```
487 |
488 | - **PanGu-α** [[Huawei]]() Apr. 2021 [close]
489 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [[Preprint]](https://arxiv.org/abs/2104.12369)
490 |
491 | ```yaml
492 | Field: Language
493 | Params: 200B
494 | Training Data: 1.1TB (Chinese)
495 | Training petaFLOPs: 58M
496 | Architecture: De
497 | ```
498 |
499 | - **mT5** [[Google]]() Mar. 2021 [[open]](https://github.com/google-research/multilingual-t5)
500 | mT5: A massively multilingual pre-trained text-to-text transformer [[Preprint]](https://arxiv.org/abs/2010.11934)
501 |
502 | ```yaml
503 | Field: Language
504 | Params: 13B
505 | Training Data: (101 languages)
506 | Architecture: En-De
507 | ```
508 |
509 | - **WuDao-WenHui** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/Transformer-XL)
510 |
511 | ```yaml
512 | Field: Language
513 | Params: 2.9B
514 | Training Data: 303GB (Chinese)
515 | ```
516 |
517 | - **GLM** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/GLM)
518 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling [[Preprint]](https://arxiv.org/abs/2103.10360)
519 |
520 | ```yaml
521 | Field: Language
522 | Params: 10B
523 | Architecture: De
524 | ```
525 |
526 | - **Switch Transformer** [[Google]]() Jan. 2021 [[open]](https://github.com/google-research/t5x)
527 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [[Preprint]](https://arxiv.org/abs/2101.03961)
528 |
529 | ```yaml
530 | Field: Language
531 | Params: 1.6T
532 | Training Data: 750GB
533 | Training petaFLOPs: 82M
534 | Architecture: En-De, MoE
535 | Objective: MLM
536 | ```
537 |
538 | - **CPM** [[BAAI]]() Dec. 2020 [[open]](https://github.com/TsinghuaAI/CPM)
539 | CPM: A Large-scale Generative Chinese Pre-trained Language Model [[Preprint]](https://arxiv.org/abs/2012.00413)
540 |
541 | ```yaml
542 | Field: Language
543 | Params: 2.6B
544 | Training Data: 100G (Chinese)
545 | Training petaFLOPs: 1.8M
546 | Architecture: De
547 | Objective: LTR
548 | ```
549 |
550 | - **GPT-3** [[OpenAI]](https://openai.com/api/) May 2020 [close]
551 | Language Models are Few-Shot Learners [[NeurIPS'20]](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
552 |
553 | ```yaml
554 | Field: Language
555 | Params: 175B
556 | Training Data: 45TB (680B Tokens)
557 | Training Time: 95 A100 GPU years (835584 A100 GPU hours, 355 V100 GPU years)
558 | Training Cost: $4.6M
559 | Training petaFLOPs: 310M
560 | Architecture: De
561 | Objective: LTR
562 | Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
563 | ```
564 |
565 | - **Blender** [[Meta]](https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/) Apr. 2020 [[close]](https://huggingface.co/facebook/blenderbot-90M?text=Hey+my+name+is+Thomas%21+How+are+you%3F)
566 | Recipes for building an open-domain chatbot [[Preprint]](https://arxiv.org/abs/2004.13637)
567 |
568 | ```yaml
569 | Field: Language (Dialogue)
570 | Params: 9.4B
571 | ```
572 |
573 | - **T-NLG** [[Microsoft]](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) Feb. 2020 [close]
574 |
575 | ```yaml
576 | Field: Language
577 | Params: 17B
578 | Training petaFLOPs: 16M
579 | Architecture: De
580 | Objective: LTR
581 | ```
582 |
583 | - **Meena** [[Google]](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html) Jan. 2020 [close]
584 | Towards a Human-like Open-Domain Chatbot [[Preprint]](https://arxiv.org/abs/2001.09977)
585 |
586 | ```yaml
587 | Field: Language (Dialogue)
588 | Params: 2.6B
589 | Training Data: 341GB (40B words)
590 | Training petaFLOPs: 110M
591 | ```
592 |
593 | - **DialoGPT** [[Microsoft]](https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/) Nov. 2019 [[open]](https://github.com/microsoft/DialoGPT)
594 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [[ACL'20]](https://arxiv.org/abs/1911.00536)
595 |
596 | ```yaml
597 | Field: Language (Dialogue)
598 | Params: 762M
599 | Training Data: (147M conversations)
600 | Architecture: De
601 | ```
602 |
603 | - **T5** [[Google]](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) Oct. 2019 [[open]](https://github.com/google-research/text-to-text-transfer-transformer)
604 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [[JMLR'20]](https://arxiv.org/abs/1910.10683)
605 |
606 | ```yaml
607 | Field: Language
608 | Params: 11B
609 | Training Data: 800GB
610 | Training Cost: $1.5M
611 | Training petaFLOPs: 41M
612 | Architecture: En-De
613 | Objective: MLM
614 | License: Apache 2.0
615 | Instruction-tuned Variants: Flan-T5
616 | Context Length: 512
617 | ```
618 |
619 | - **Megatron-LM** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
620 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
621 |
622 | ```yaml
623 | Field: Language
624 | Params: 8.3B
625 | Training Data: 174GB
626 | Training petaFLOPs: 9.1M
627 | Architecture: De
628 | Objective: LTR
629 | Training Framework: Megatron
630 | ```
631 |
632 | - **Megatron-BERT** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
633 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
634 |
635 | ```yaml
636 | Field: Language
637 | Params: 3.9B
638 | Training Data: 174GB
639 | Training petaFLOPs: 57M
640 | Architecture: En
641 | Objective: MLM
642 | Training Framework: Megatron
643 | ```
644 |
645 | - **RoBERTa** [[Meta]](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) July 2019 [[open]](https://github.com/facebookresearch/fairseq)
646 | RoBERTa: A Robustly Optimized BERT Pretraining Approach [[Preprint]](https://arxiv.org/abs/1907.11692)
647 |
648 | ```yaml
649 | Field: Language
650 | Params: 354M
651 | Training Data: 160GB
652 | Training Time: 1024 V100 GPU days
653 | Architecture: En
654 | Objective: MLM
655 | ```
656 |
657 | - **XLNet** [[Google]]() June 2019 [[open]](https://github.com/zihangdai/xlnet)
658 | XLNet: Generalized Autoregressive Pretraining for Language Understanding [[NeurIPS'19]](https://papers.nips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html)
659 |
660 | ```yaml
661 | Field: Language
662 | Params: 340M
663 | Training Data: 113GB (33B words)
664 | Training Time: 1280 TPUv3 days
665 | Training Cost: $245k
666 | Architecture: En
667 | Objective: PLM
668 | ```
669 |
670 | - **GPT-2** [[OpenAI]](https://openai.com/blog/better-language-models/) Feb. 2019 [[open]](https://github.com/openai/gpt-2)
671 | Language Models are Unsupervised Multitask Learners [[Preprint]](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
672 |
673 | ```yaml
674 | Field: Language
675 | Params: 1.5B
676 | Training Data: 40GB (8M web pages)
677 | Training Cost: $43k
678 | Training petaFLOPs: 1.5M
679 | Architecture: De
680 | Objective: LTR
681 | ```
682 |
683 | - **BERT** [[Google]]() Oct. 2018 [[open]](https://github.com/google-research/bert)
684 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [[NAACL'19]](https://arxiv.org/abs/1810.04805)
685 |
686 | ```yaml
687 | Field: Language
688 | Params: 330M
689 | Training Data: 16GB (3.3B words)
690 | Training Time: 64 TPUv2 days (280 V100 GPU days)
691 | Training Cost: $7k
692 | Training petaFLOPs: 290k
693 | Architecture: En
694 | Objective: MLM, NSP
695 | ```
696 |
697 | - **GPT** [[OpenAI]](https://openai.com/blog/language-unsupervised/) June 2018 [open]
698 | Improving Language Understanding by Generative Pre-Training [[Preprint]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
699 |
700 | ```yaml
701 | Field: Language
702 | Params: 117M
703 | Training Data: 1GB (7k books)
704 | Training petaFLOPs: 18k
705 | Architecture: De
706 | Objective: LTR
707 | ```
708 |
709 | ### Vision Models
710 |
711 | - **EVA-02-E** [[BAAI]]() Mar. 2023 [[open]](https://github.com/huggingface/pytorch-image-models/tree/main)
712 | EVA-02: A Visual Representation for Neon Genesis [[Preprint]](https://arxiv.org/abs/2303.11331v2)
713 |
714 | ```yaml
715 | Field: Vision-Language
716 | Params: 5B
717 | Training Data: 2B image-text pairs
718 | Architecture: Transformer
719 | Objective: MIM, CLIP Contrastive
720 | ```
721 |
722 | - **MAE->WSP-2B** [[Meta]]() Mar. 2023 [close]
723 | The effectiveness of MAE pre-pretraining for billion-scale pretraining [[Preprint]](https://arxiv.org/abs/2303.13496)
724 |
725 | ```yaml
726 | Field: Vision
727 | Params: 6.5B
728 | Training Data: 3B images
729 | Architecture: Transformer
730 | Objective: MAE, Weakly-Supervised
731 | ```
732 |
733 | - **OpenCLIP G/14** [[LAION]]() Mar. 2023 [[open]](https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K)
734 |
735 | ```yaml
736 | Field: Vision-Language
737 | Params: 2.5B
738 | Training Data: 2B images
739 | ```
740 |
741 | - **ViT-22B** [[Google]]() Feb. 2023 [close]
742 | [Scaling Vision Transformers to 22 Billion Parameters](https://arxiv.org/abs/2302.05442)
743 |
744 | ```yaml
745 | Field: Vision
746 | Params: 22B
747 | Training Data: 4B images
748 | Architecture: Transformer
749 | Objective: Supervised
750 | ```
751 |
762 | - **InternImage-G** [[Shanghai AI Lab]](https://github.com/OpenGVLab/InternImage) Nov. 2022 [[open]](https://github.com/OpenGVLab/InternImage)
763 | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [[CVPR'23 Highlight]](https://arxiv.org/abs/2211.05778)
764 |
765 | ```yaml
766 | Field: Vision
767 | Params: 3B
768 | Architecture: CNN
769 | Core Operator: Deformable Convolution v3
770 | ```
771 |
772 | - **Stable Diffusion** [[Stability AI]]() Aug. 2022 [[open]]()
773 |
774 | ```yaml
775 | Field: Image Generation (text to image)
776 | Params: 890M
777 | Training Data: 5B images
778 | Architecture: Transformer, Diffusion
779 | ```
780 |
781 | - **Imagen** [[Google]](https://imagen.research.google/) May 2022
782 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [[Preprint]](https://arxiv.org/abs/2205.11487)
783 |
784 | ```yaml
785 | Field: Image Generation (text to image)
786 | Text Encoder: T5
787 | Image Decoder: Diffusion, Upsampler
788 | ```
789 |
790 | - **Flamingo** [[DeepMind]]() Apr. 2022 [close]
791 | Flamingo: a Visual Language Model for Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2204.14198)
792 |
793 | ```yaml
794 | Field: Vision-Language
795 | Params: 80B
796 | ```
797 |
798 | - **DALL·E 2** [[OpenAI]](https://openai.com/dall-e-2/) Apr. 2022
799 | Hierarchical Text-Conditional Image Generation with CLIP Latents [[Preprint]](https://cdn.openai.com/papers/dall-e-2.pdf)
800 |
801 | ```yaml
802 | Field: Image Generation (text to image)
803 | Text Encoder: GPT2 (CLIP)
804 | Image Encoder: ViT (CLIP)
805 | Image Decoder: Diffusion, Upsampler
806 | ```
807 |
808 | - **BaGuaLu** [[BAAI, Alibaba]]() Apr. 2022
809 | BaGuaLu: targeting brain scale pretrained models with over 37 million cores [[PPoPP'22]](https://keg.cs.tsinghua.edu.cn/jietang/publications/PPOPP22-Ma%20et%20al.-BaGuaLu%20Targeting%20Brain%20Scale%20Pretrained%20Models%20w.pdf)
810 |
811 | ```yaml
812 | Field: Vision-Language
813 | Params: 174T
814 | Architecture: M6
815 | ```
816 |
817 | - **SEER** [[Meta]]() Feb. 2022 [[open]](https://github.com/facebookresearch/vissl)
818 | Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [[Preprint]](https://arxiv.org/abs/2202.08360v2)
819 |
820 | ```yaml
821 | Field: Vision
822 | Params: 10B
823 | Training Data: 1B images
824 | Architecture: Convolution
825 | Objective: SwAV
826 | ```
827 |
828 | - **ERNIE-ViLG** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie-vilg) Dec. 2021
829 | ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [[Preprint]](https://arxiv.org/abs/2112.15283)
830 |
831 | ```yaml
832 | Field: Image Generation (text to image)
833 | Params: 10B
834 | Training Data: 145M text-image pairs
835 | Architecture: Transformer, dVAE + De
836 | ```
837 |
838 | - **NUWA** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/NUWA)
839 | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [[Preprint]](https://arxiv.org/abs/2111.12417)
840 |
841 | ```yaml
842 | Field: Vision-Language
843 | Generation: Image, Video
844 | Params: 870M
845 | ```
846 |
847 | - **SwinV2-G** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/Swin-Transformer)
848 | Swin Transformer V2: Scaling Up Capacity and Resolution [[CVPR'22]](https://arxiv.org/abs/2111.09883v2)
849 |
850 | ```yaml
851 | Field: Vision
852 | Params: 3B
853 | Training Data: 70M images
854 | Architecture: Transformer
855 | Objective: Supervised
856 | ```
857 |
858 | - **Zidongtaichu** [[CASIA]](http://www.ia.cas.cn/xwzx/kydt/202109/t20210927_6215538.html) Sept. 2021 [close]
859 |
860 | ```yaml
861 | Field: Image, Video, Language, Speech
862 | Params: 100B
863 | ```
864 |
865 | - **ViT-G/14** [[Google]]() June 2021
866 | Scaling Vision Transformers [[Preprint]](https://arxiv.org/abs/2106.04560)
867 |
868 | ```yaml
869 | Field: Vision
870 | Params: 1.8B
871 | Training Data: 300M images
872 | Training petaFLOPs: 3.4M
873 | Architecture: Transformer
874 | Objective: Supervised
875 | ```
876 |
877 | - **CoAtNet** [[Google]](https://ai.googleblog.com/2021/09/toward-fast-and-accurate-neural.html) June 2021 [[open]](https://github.com/chinhsuanwu/coatnet-pytorch)
878 | CoAtNet: Marrying Convolution and Attention for All Data Sizes [[NeurIPS'21]](https://arxiv.org/abs/2106.04803)
879 |
880 | ```yaml
881 | Field: Vision
882 | Params: 2.4B
883 | Training Data: 300M images
884 | Architecture: Transformer, Convolution
885 | Objective: Supervised
886 | ```
887 |
888 | - **V-MoE** [[Google]](https://ai.googleblog.com/2022/01/scaling-vision-with-sparse-mixture-of.html) June 2021
889 | Scaling Vision with Sparse Mixture of Experts [[NeurIPS'21]](https://proceedings.neurips.cc//paper/2021/file/48237d9f2dea8c74c2a72126cf63d933-Paper.pdf)
890 |
891 | ```yaml
892 | Field: Vision
893 | Params: 15B
894 | Training Data: 300M images
895 | Training Time: 16.8k TPUv3 days
896 | Training petaFLOPs: 33.9M
897 | Architecture: Transformer, MoE
898 | Objective: Supervised
899 | ```
900 |
901 | - **CogView** [[BAAI, Alibaba]](https://wudao.aminer.cn/CogView/index.html) May 2021 [[open]](https://github.com/THUDM/CogView)
902 | CogView: Mastering Text-to-Image Generation via Transformers [[NeurIPS'21]](https://arxiv.org/abs/2105.13290)
903 |
904 | ```yaml
905 | Field: Vision-Language
906 | Params: 4B
907 | Training Data: 30M text-image pairs
908 | Training petaFLOPs: 27M
909 | Image Encoder: VAE
910 | Text Encoder & Image Decoder: GPT2
911 | ```
912 |
913 | - **M6** [[Alibaba]](https://m6.aliyun.com/#/) Mar. 2021
914 | M6: A Chinese Multimodal Pretrainer [[Preprint]](https://arxiv.org/abs/2103.00823)
915 |
916 | ```yaml
917 | Field: Vision-Language
918 | Params: 10T
919 | Training Data: 300G Texts + 2TB Images
920 | Training petaFLOPs: 5.5M
921 | Fusion: Single-stream
922 | Objective: MLM, IC
923 | ```
924 |
925 | - **DALL·E** [[OpenAI]](https://openai.com/blog/dall-e/) Feb. 2021
926 | Zero-Shot Text-to-Image Generation [[ICML'21]](https://arxiv.org/abs/2102.12092)
927 |
928 | ```yaml
929 | Field: Image Generation (text to image)
930 | Params: 12B
931 | Training Data: 250M text-image pairs
932 | Training petaFLOPs: 47M
933 | Image Encoder: dVAE
934 | Text Encoder & Image Decoder: GPT2
935 | ```
936 |
937 | - **CLIP** [[OpenAI]](https://openai.com/blog/clip/) Jan. 2021
938 | Learning Transferable Visual Models From Natural Language Supervision [[ICML'21]](https://arxiv.org/abs/2103.00020)
939 |
940 | ```yaml
941 | Field: Vision-Language
942 | Training Data: 400M text-image pairs
943 | Training petaFLOPs: 11M
944 | Image Encoder: ViT
945 | Text Encoder: GPT-2
946 | Fusion: Dual Encoder
947 | Objective: CMCL
948 | ```
949 |
950 | - **ViT-H/14** [[Google]](https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html) Oct. 2020 [[open]](https://github.com/google-research/vision_transformer)
951 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[ICLR'21]](https://arxiv.org/abs/2010.11929)
952 |
953 | ```yaml
954 | Field: Vision
955 | Params: 632M
956 | Training Data: 300M images
957 | Training petaFLOPs: 13M
958 | Architecture: Transformer
959 | Objective: Supervised
960 | ```
961 |
962 | - **iGPT-XL** [[OpenAI]](https://openai.com/blog/image-gpt/) June 2020 [[open]](https://github.com/openai/image-gpt)
963 | Generative Pretraining From Pixels [[ICML'20]](https://proceedings.mlr.press/v119/chen20s.html)
964 |
965 | ```yaml
966 | Field: Image Generation
967 | Params: 6.8B
968 | Training Data: 1M images
969 | Training petaFLOPs: 33M
970 | Architecture: Transformer, De
971 | ```
972 |
973 | - **BigGAN-deep** [[DeepMind]]() Sept. 2018 [[open]](https://github.com/ajbrock/BigGAN-PyTorch)
974 | Large Scale GAN Training for High Fidelity Natural Image Synthesis [[ICLR'19]](https://arxiv.org/abs/1809.11096)
975 |
976 | ```yaml
977 | Field: Image Generation
978 | Params: 158M
979 | Training Data: 300M images
980 | Training petaFLOPs: 3M
981 | Architecture: Convolution, GAN
982 | Resolution: 512x512
983 | ```
984 |
985 | ### Reinforcement Learning
986 |
987 | - **PaLM-E** [[Google]](https://palm-e.github.io/) Mar. 2023 [close]
988 | PaLM-E: An Embodied Multimodal Language Model [[Preprint]](https://palm-e.github.io/assets/palm-e.pdf)
989 |
990 | ```yaml
991 | Field: Reinforcement Learning
992 | Params: 562B (540B LLM + 22B ViT)
993 | ```
994 |
995 | - **Gato** [[DeepMind]](https://www.deepmind.com/publications/a-generalist-agent) May 2022 [close]
996 | A Generalist Agent [[Preprint]](https://arxiv.org/abs/2205.06175)
997 |
998 | ```yaml
999 | Field: Reinforcement Learning
1000 | Params: 1.2B
1001 | Training Data: (604 Tasks)
1002 | Objective: Supervised
1003 | ```
1004 |
1005 | ### Speech
1006 |
1007 | - **USM** [[Google]](https://sites.research.google/usm/) Mar. 2023 [close]
1008 | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [[Preprint]](https://arxiv.org/pdf/2303.01037v2.pdf)
1009 |
1010 | ```yaml
1011 | Field: Speech
1012 | Params: 2B
1013 | Training Data: 12,000,000 hours
1014 | ```
1015 |
1016 | - **Whisper** [[OpenAI]](https://openai.com/research/whisper) Sept. 2022 [[open]](https://github.com/openai/whisper)
1017 | Robust Speech Recognition via Large-Scale Weak Supervision [[Preprint]](https://arxiv.org/pdf/2212.04356.pdf)
1018 |
1019 | ```yaml
1020 | Field: Speech
1021 | Params: 1.55B
1022 | Training Data: 680,000 hours
1023 | Objective: Weakly Supervised
1024 | ```
1025 |
1026 | - **HuBERT** [[Meta]](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) June 2021 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert)
1027 | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [[Preprint]](https://arxiv.org/abs/2106.07447)
1028 |
1029 | ```yaml
1030 | Field: Speech
1031 | Params: 1B
1032 | Training Data: 60,000 hours
1033 | Objective: MLM
1034 | ```
1035 |
1036 | - **wav2vec 2.0** [[Meta]]() Oct. 2020 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec)
1037 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [[NeurIPS'20]](https://arxiv.org/abs/2006.11477)
1038 |
1039 | ```yaml
1040 | Field: Speech
1041 | Params: 317M
1042 | Training Data: 50,000 hours
1043 | Training petaFLOPs: 430M
1044 | Objective: MLM
1045 | ```
1046 |
1047 | - **DeepSpeech 2** [[Baidu]]() Dec. 2015 [[open]](https://github.com/PaddlePaddle/PaddleSpeech)
1048 | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [[ICML'16]](https://arxiv.org/pdf/1512.02595.pdf)
1050 |
1051 | ```yaml
1052 | Field: Speech
1053 | Params: 300M
1054 | Training Data: 21,340 hours
1055 | ```
1056 |
1057 | ### Science
1058 |
1059 | - **AlphaFold 2** [[DeepMind]](https://www.deepmind.com/research/highlighted-research/alphafold) July 2021 [[open]](https://github.com/deepmind/alphafold)
1060 | Highly accurate protein structure prediction with AlphaFold [[Nature]](https://www.nature.com/articles/s41586-021-03819-2)
1061 |
1062 | ```yaml
1063 | Field: Biology
1064 | Params: 21B
1065 | Training petaFLOPs: 100k
1066 | ```
1067 |
1068 | ## Open LLM Training Dataset
1069 |
1070 | This section will be reorganized. For now, as LLMs prevail and data quality is key to their performance, we keep track of open pretraining datasets here. A minimal loading sketch follows the list.
1071 |
1072 | - [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B): 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
1073 | - [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
1074 | - [MNBVC](https://github.com/esbatmop/MNBVC): 5TB (on-going, target 40TB), Chinese, MIT License
1075 | - [The Pile](https://pile.eleuther.ai/): 825G
1076 | - [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T): 1.2T tokens
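
These corpora are hosted on the Hugging Face Hub or GitHub. Below is a minimal sketch of streaming one of them with the `datasets` library so you can inspect samples without downloading hundreds of gigabytes; the dataset ID comes from the SlimPajama entry above, while the `text` field name is an assumption based on common pretraining-corpus layouts.

```python
# Minimal streaming sketch (assumes `pip install datasets`).
from itertools import islice

from datasets import load_dataset

# Dataset ID from the SlimPajama entry above; the `text` field is assumed.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for sample in islice(ds, 3):
    print(sample["text"][:200])  # peek at the first few documents
```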
1077 |
1078 | ## Distributed Training Framework
1079 |
1080 | > Deep Learning frameworks supporting distributed training are marked with \*.
1081 |
1082 | ### PyTorch Ecosystem
1083 |
1084 | - **Accelerate** [[Huggingface]]() Oct. 2020 [[open]](https://github.com/huggingface/accelerate) (a minimal usage sketch follows this list)
1085 | - **Hivemind** Aug. 2020 [[open]](https://github.com/learning-at-home/hivemind)
1086 | Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2002.04013)
1087 | - **FairScale** [[Meta]]() July 2020 [[open]](https://github.com/facebookresearch/fairscale)
1088 | - **DeepSpeed** [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
1089 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [[SC'20]](https://arxiv.org/abs/1910.02054)
1090 | - **Megatron** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
1091 | Megatron: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
1092 | - **PyTorch\*** [[Meta]](https://pytorch.org/) Sept. 2016 [[open]](https://github.com/pytorch/pytorch)
1093 | PyTorch: An Imperative Style, High-Performance Deep Learning Library [[NeurIPS'19]](http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)
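
As a concrete reference for how these wrappers are used, here is a minimal data-parallel training sketch with Accelerate; the toy model, data, and hyperparameters are placeholders, and the same loop can be dispatched to DDP, DeepSpeed, or FSDP backends without code changes.

```python
# Minimal Accelerate sketch (toy model/data are placeholders).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # reads world size/devices from the launcher

model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32)

# prepare() wraps everything for the selected backend (DDP/DeepSpeed/FSDP)
# so the training loop itself stays unchanged.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

loss_fn = nn.CrossEntropyLoss()
for x, y in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

Launch with `accelerate launch train.py` to run the same script across multiple GPUs or nodes.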
1094 |
1095 | ### XLA Ecosystem
1096 |
1097 | - **T5x** [[Google]]() Mar. 2022 [[open]](https://github.com/google-research/t5x)
1098 | Scaling Up Models and Data with 𝚝𝟻𝚡 and 𝚜𝚎𝚚𝚒𝚘 [[Preprint]](https://arxiv.org/abs/2203.17189)
1099 | - **Alpa** [[Google]]() Jan. 2022 [[open]](https://github.com/alpa-projects/alpa)
1100 | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [[OSDI'22]](https://arxiv.org/pdf/2201.12023.pdf)
1101 | - **Pathways** [[Google]](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/) Mar. 2021 [close]
1102 | Pathways: Asynchronous Distributed Dataflow for ML [[Preprint]](https://arxiv.org/abs/2203.12533)
1103 | - **Colossal-AI** [[HPC-AI TECH]](https://colossalai.org/) Nov. 2021 [[open]](https://github.com/hpcaitech/ColossalAI)
1104 | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [[Preprint]](https://arxiv.org/abs/2110.14883)
1105 | - **GShard** [[Google]](https://arxiv.org/abs/2006.16668) June 2020
1106 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [[Preprint]](https://arxiv.org/abs/2006.16668)
1107 | - **Jax\*** [[Google]]() Oct. 2019 [[open]](https://github.com/google/jax) (a minimal data-parallel sketch follows this list)
1108 | - **Mesh Tensorflow** [[Google]]() Nov. 2018 [[open]](https://github.com/tensorflow/mesh)
1109 | - **Horovod** [[Uber]](https://horovod.ai/) Feb. 2018 [[open]](https://github.com/horovod/horovod)
1110 | Horovod: fast and easy distributed deep learning in TensorFlow [[Preprint]](https://arxiv.org/abs/1802.05799)
1111 | - **Tensorflow\*** [[Google]](https://www.tensorflow.org/) Nov. 2015 [[open]](https://github.com/tensorflow/tensorflow)
1112 | TensorFlow: A system for large-scale machine learning [[OSDI'16]](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
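
For comparison with the PyTorch-side sketch above, here is a minimal JAX data-parallel example using `jax.pmap`, which replicates a computation across all local devices and lets XLA compile it per device; the function and inputs are toy placeholders.

```python
# Minimal jax.pmap sketch (toy computation as a placeholder).
import jax
import jax.numpy as jnp

@jax.pmap  # replicate across all local devices (falls back to 1 device on CPU)
def step(x):
    return jnp.sum(x ** 2)

n = jax.local_device_count()
xs = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # one shard per device
print(step(xs))  # one result per device
```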
1113 |
1114 | ### Other Frameworks
1115 |
1116 | - **OneFlow\*** [[OneFlow]](https://docs.oneflow.org/master/index.html) July 2020 [[open]](https://github.com/OneFlow-Inc/oneflow)
1117 | OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [[Preprint]](https://arxiv.org/abs/2110.15032)
1118 | - **MindSpore\*** [[Huawei]](https://e.huawei.com/en/products/cloud-computing-dc/atlas/mindspore) Mar. 2020 [[open]](https://github.com/mindspore-ai/mindspore)
1119 | - **PaddlePaddle\*** [[Baidu]](https://www.paddlepaddle.org.cn/) Nov. 2018 [[open]](https://github.com/PaddlePaddle/Paddle)
1120 | End-to-end Adaptive Distributed Training on PaddlePaddle [[Preprint]](https://arxiv.org/abs/2112.02752)
1121 | - **Ray** [[Berkeley]]() Dec. 2017 [[open]](https://github.com/ray-project/ray)
1122 | Ray: A Distributed Framework for Emerging AI Applications [[OSDI'17]](https://arxiv.org/pdf/1712.05889.pdf)
1123 |
1124 | ### Inference Frameworks
1125 |
1126 | - Petals [[BigScience]]() Dec. 2022 [[open]](https://github.com/bigscience-workshop/petals)
1127 | - FlexGen [[Stanford, Berkeley, CMU, etc.]]() May 2022 [[open]](https://github.com/FMInference/FlexGen)
1128 | - FasterTransformer [[NVIDIA]]() Apr. 2021 [[open]](https://github.com/NVIDIA/FasterTransformer)
1129 | - MegEngine [[MegEngine]](https://www.megengine.org.cn/) Mar. 2020
1130 | - DeepSpeed-Inference [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
1131 | - MediaPipe [[Google]](https://google.github.io/mediapipe/) July 2019 [[open]](https://github.com/google/mediapipe)
1132 | - TensorRT [[Nvidia]]() Jun 2019 [[open]](https://github.com/NVIDIA/TensorRT)
1133 | - MNN [[Alibaba]]() May 2019 [[open]](https://github.com/alibaba/MNN)
1134 | - OpenVINO [[Intel]](https://docs.openvino.ai/latest/index.html) Oct. 2019 [[open]](https://github.com/openvinotoolkit/openvino)
1135 | - ONNX [[Linux Foundation]](https://onnx.ai/) Sep 2017 [[open]](https://github.com/onnx/onnx)
1136 | - ncnn [[Tencent]]() July 2017 [[open]](https://github.com/Tencent/ncnn)
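
Many of these runtimes consume an exported graph rather than framework code. The sketch below shows the common export-then-deploy pattern with ONNX and ONNX Runtime; the toy model and tensor shapes are placeholders (assumes `pip install torch onnx onnxruntime`).

```python
# Minimal ONNX export + inference sketch (toy model as a placeholder).
import numpy as np
import torch
from torch import nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
dummy = torch.randn(1, 16)

# Export a traced graph; dynamic_axes lets the batch size vary at runtime.
torch.onnx.export(
    model, dummy, "toy.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

session = ort.InferenceSession("toy.onnx")
(logits,) = session.run(None, {"input": np.random.randn(8, 16).astype(np.float32)})
print(logits.shape)  # (8, 4)
```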
1137 |
1138 | ### Recommendation Training Framework
1139 |
1140 | - **HET** [[Tencent]]() Dec. 2021
1141 | HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [[VLDB'22]](https://arxiv.org/abs/2112.07221)
1142 | - **Persia** [[Kuaishou]]() Nov. 2021
1143 | Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [[Preprint]](https://arxiv.org/abs/2111.05897)
1144 |
1145 | ```yaml
1146 | Embeddings Params: 100T
1147 | ```
1148 |
1149 | - **ZionEX** [[Meta]]() Apr. 2021
1150 | Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [[ISCA'21]](https://arxiv.org/abs/2104.05158)
1151 |
1152 | ```yaml
1153 | Embeddings Params: 10T
1154 | ```
1155 |
1156 | - **ScaleFreeCTR** [[Huawei]]() Apr. 2021
1157 | ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [[SIGIR'21]](https://arxiv.org/abs/2104.08542)
1158 | - **Kraken** [[Kuaishou]]() Nov. 2020
1159 | Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [[SC'20]](http://storage.cs.tsinghua.edu.cn/papers/sc20-kraken.pdf/)
1160 | - **TensorNet** [[Qihoo360]]() Sept. 2020 [[open]](https://github.com/Qihoo360/tensornet)
1161 | - **HierPS** [[Baidu]]() Mar. 2020
1162 | Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [[MLSys'20]](https://arxiv.org/abs/2003.05622)
1163 | - **AIBox** [[Baidu]]() Oct. 2019
1164 | AIBox: CTR Prediction Model Training on a Single Node [[CIKM'20]](https://dl.acm.org/doi/pdf/10.1145/3357384.3358045)
1165 |
1166 | ```yaml
1167 | Embeddings Params: 0.1T
1168 | ```
1169 |
1170 | - **XDL** [[Alibaba]]() Aug. 2019
1171 | XDL: an industrial deep learning framework for high-dimensional sparse data [[DLP-KDD'21]](https://dlp-kdd.github.io/dlp-kdd2019/assets/pdf/a6-jiang.pdf)
1172 |
1173 | ```yaml
1174 | Embeddings Params: 0.01T
1175 | ```
1176 |
1177 | ## Keys Explanations
1178 |
1179 | - Company tags: the primary affiliated company or lab. Other institutes may also be involved in the work.
1180 | - Params: number of parameters of the largest model
1181 | - Training data size, training cost and training petaFLOPs may have some uncertainty.
1182 | - Training cost (reference prices; a worked example follows this list)
1183 |   - TPUv2 hour: $4.5
1184 |   - TPUv3 hour: $8
1185 |   - V100 GPU hour: $0.55 (2022)
1186 |   - A100 GPU hour: $1.10 (2022)
1187 | - Architecture
1188 |   - En: Encoder-based Language Model
1189 |   - De: Decoder-based Language Model
1190 |   - En-De: Encoder-Decoder-based Language Model
1191 |   - All three architectures above are Transformer-based.
1192 |   - MoE: Mixture of Experts
1193 | - Objective (see the explanations in Sections 6–8 of [this paper](https://arxiv.org/pdf/2203.14101v3.pdf))
1194 |   - MLM: Masked Language Modeling
1195 |   - LTR: Left-To-Right Language Modeling
1196 |   - NSP: Next Sentence Prediction
1197 |   - PLM: Permuted Language Modeling
1198 |   - IC: Image Captioning
1199 |   - VLM: Vision-Language Matching
1200 |   - CMCL: Cross-Modal Contrastive Learning
1201 | - FLOPs: number of FLOating-Point operations [[explanation]](https://openai.com/blog/ai-and-compute/)
1202 |   - 1 petaFLOPs = 1e15 FLOPs
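
The listed petaFLOPs and dollar figures can be sanity-checked with a rough back-of-the-envelope calculation. The sketch below uses the common 6·N·D approximation for dense decoder-only models together with the reference GPU prices above; it illustrates the conventions and is not the method used to produce the numbers in this list.

```python
# Back-of-the-envelope sketch for the conventions above.
PETA = 1e15

def approx_training_petaflops(params: float, tokens: float) -> float:
    """Common 6 * params * tokens approximation for dense decoder-only models."""
    return 6 * params * tokens / PETA

# GPT-3: 175B params trained on ~300B tokens (per the GPT-3 paper) gives
# ~315M petaFLOPs, close to the 310M petaFLOPs recorded above.
print(f"GPT-3: {approx_training_petaflops(175e9, 300e9) / 1e6:.0f}M petaFLOPs")

# BLOOM: ~1M A100 GPU hours at the $1.10/hour reference price suggests a
# hardware cost on the order of $1.1M.
print(f"BLOOM: ~${1e6 * 1.10 / 1e6:.1f}M in A100 time")
```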
1203 |
--------------------------------------------------------------------------------