├── img ├── llm-evolutionary-tree.png ├── ai-training-computation-202206.png ├── ai-training-computation-202303.png └── ai-training-computation-202306.png ├── .gitignore └── README.md /img/llm-evolutionary-tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/llm-evolutionary-tree.png -------------------------------------------------------------------------------- /img/ai-training-computation-202206.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202206.png -------------------------------------------------------------------------------- /img/ai-training-computation-202303.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202303.png -------------------------------------------------------------------------------- /img/ai-training-computation-202306.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202306.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Created by https://www.toptal.com/developers/gitignore/api/macos 3 | # Edit at https://www.toptal.com/developers/gitignore?templates=macos 4 | 5 | ### macOS ### 6 | # General 7 | .DS_Store 8 | .AppleDouble 9 | .LSOverride 10 | 11 | # Icon must end with two \r 12 | Icon 13 | 14 | # Thumbnails 15 | ._* 16 | 17 | # Files that might appear in the root of a volume 18 | .DocumentRevisions-V100 19 | .fseventsd 20 | .Spotlight-V100 21 | .TemporaryItems 22 | .Trashes 23 | .VolumeIcon.icns 24 | .com.apple.timemachine.donotpresent 25 | 26 | # Directories potentially created on remote AFP share 27 | .AppleDB 28 | .AppleDesktop 29 | Network Trash Folder 30 | Temporary Items 31 | .apdisk 32 | 33 | ### macOS Patch ### 34 | # iCloud generated files 35 | *.icloud 36 | 37 | # End of https://www.toptal.com/developers/gitignore/api/macos 38 | 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # awesome-huge-models [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) 4 | 5 | 6 | 7 | A collection of AWESOME things about HUGE AI models. 8 | 9 | **[2023.06]** We are now in the post-GPT-4 era, where LLMs are thriving and new models are emerging from GitHub repositories rather than traditional papers. People are striving to release everything openly, including training and inference code, instruction-tuned weights and datasets, pretrained weights, and [the datasets used for pretraining LLMs](#open-llm-training-dataset). In this update, I try to catch up with the latest developments in the open-source wave of LLMs. 10 | 11 | **[2023.03]** Only pretrained models are recorded here. Models are sorted by first release date. To support the open-sourcing of LLMs, we highlight open-sourced models with [[open]](). 12 | 13 | **[2022.06]** There is a trend of training large-scale deep learning models (w.r.t. params, dataset, FLOPs) led by big companies.
These models achieve SoTA performance at a high price, relying on bags of training tricks and distributed training systems. Keeping an eye on this trend informs us of the current boundaries of AI models. [[Intro in Chinese](https://zhuanlan.zhihu.com/p/529863941)] 14 | 15 | 16 | 17 | ## Contents 18 | 19 | - [awesome-huge-models ](#awesome-huge-models-) 20 | - [Contents](#contents) 21 | - [Survey](#survey) 22 | - [Models](#models) 23 | - [Language Model](#language-model) 24 | - [Vision Models](#vision-models) 25 | - [Reinforcement Learning](#reinforcement-learning) 26 | - [Speech](#speech) 27 | - [Science](#science) 28 | - [Open LLM Training Dataset](#open-llm-training-dataset) 29 | - [Distributed Training Framework](#distributed-training-framework) 30 | - [PyTorch Ecosystem](#pytorch-ecosystem) 31 | - [XLA Ecosystem](#xla-ecosystem) 32 | - [Other Frameworks](#other-frameworks) 33 | - [Inference Frameworks](#inference-frameworks) 34 | - [Recommendation Training Framework](#recommendation-training-framework) 35 | - [Keys Explanations](#keys-explanations) 36 | 37 | ## Survey 38 | 39 |

40 | Big models in NLP (see the `img/ai-training-computation-*.png` charts) 41 |

42 | 43 | - [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) [2023.03] 44 | - [A Dive into Vision-Language Models](https://huggingface.co/blog/vision_language_pretraining) [2023.02] 45 | - [Compute Trends Across Three Eras of Machine Learning](https://arxiv.org/abs/2202.05924) [[chart](https://ourworldindata.org/grapher/ai-training-computation)] [2022.02] 46 | - [Vision-and-Language Pretrained Models: A Survey](https://arxiv.org/abs/2204.07356) [2022.04] 47 | - [A Roadmap to Big Model](https://arxiv.org/abs/2203.14101) [2022.03] 48 | - [A Survey of Vision-Language Pre-trained Models](https://arxiv.org/abs/2202.10936) [2022.02] 49 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) [2022.01] 50 | - [On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258) [2021.08] 51 | - [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [2021.06] 52 | 53 | Resource lists: 54 | 55 | - [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) 56 | - [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM) 57 | - [Open-LLM](https://github.com/eugeneyan/open-llms) 58 | - [LLMDataHub](https://github.com/Zjh-819/LLMDataHub) 59 | 60 | ## Models 61 | 62 | ### Language Model 63 |

64 | LLM evolutionary tree (see `img/llm-evolutionary-tree.png`) 65 |

67 | 68 | - **Baichuan** [[Baichuan]]() Jun. 2023 [[open]](https://github.com/baichuan-inc/baichuan-7B) 69 | 70 | ```yaml 71 | Field: Language 72 | Params: 7B 73 | Training Data: 1.2T tokens (English, Chinese, Private) 74 | License: Apache 2.0 75 | Context: 4096 76 | ``` 77 | 78 | - **Falcon** [[TII]]() Jun. 2023 [[open]](https://huggingface.co/tiiuae/falcon-40b) 79 | 80 | ```yaml 81 | Field: Language 82 | Params: 40B 83 | Training Data: 1T tokens (RefinedWeb) 84 | License: Apache 2.0 85 | Context Length: 2048 86 | ``` 87 | 88 | - **OpenLLaMA** [[OpenLM]]() May. 2023 [[open]](https://github.com/openlm-research/open_llama) 89 | 90 | ```yaml 91 | Field: Language 92 | Params: 13B, 7B, 3B 93 | Training Data: 1T tokens (RedPajama) 94 | License: Apache 2.0 95 | Context Length: 2048 96 | ``` 97 | 98 | - **Redpajama-INCITE** [[Together]](https://github.com/togethercomputer/RedPajama-Data) May. 2023 [[open]](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) 99 | 100 | ```yaml 101 | Field: Language 102 | Params: 7B, 3B 103 | Training Data: 1T tokens (Redpajama) 104 | License: Apache 2.0 105 | Context Length: 2048 106 | ``` 107 | 108 | - **MPT** [[MosaicML]](https://www.mosaicml.com/blog/mpt-7b) May. 2023 [[open]](https://github.com/mosaicml/llm-foundry) 109 | 110 | ```yaml 111 | Field: Language 112 | Params: 30B, 7B 113 | Training Data: 1T tokens (Private) 114 | License: Apache 2.0, CC BY-SA-3.0 115 | Context Length: 84k 116 | ``` 117 | 118 | - **Stable-LM** [[Stability-AI]](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) Apr. 2023 [[open]](https://github.com/Stability-AI/StableLM#stablelm-alpha) 119 | 120 | ```yaml 121 | Field: Language 122 | Params: 7B, 3B 123 | Training Data: 1.5T tokens 124 | License: CC BY-SA-4.0 125 | ``` 126 | 127 | - **LiT-LLaMa** [[Lightning-AI]]() Apr. 2023 [[open]](https://github.com/Lightning-AI/lit-llama) 128 | 129 | ```yaml 130 | Field: Language 131 | Params: 13B, 7B 132 | Training Data: 1.2T tokens (Redpajama) 133 | License: Apache 2.0 134 | ``` 135 | 136 | - **h2oGPT** [[H2O.ai]](https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/) [[open]](https://github.com/h2oai/h2ogpt) 137 | [h2oGPT: Democratizing Large Language Models](https://arxiv.org/pdf/2306.08161.pdf) 138 | 139 | ```yaml 140 | Field: Language 141 | Params: 13B, 7B 142 | Training Data: 1.0T tokens 143 | License: Apache 2.0 144 | Context Length: 2048 145 | ``` 146 | 147 | - **Cerabras-GPT** [[Cerabras]]() Mar. 2023 [[open]](https://huggingface.co/cerebras/Cerebras-GPT-13B) 148 | Training Compute-Optimal Large Language Models [[preprint]](https://arxiv.org/abs/2203.15556) 149 | 150 | ```yaml 151 | Field: Language 152 | Params: 13B 153 | Training Data: 371B tokens (Redpajama) 154 | License: Apache 2.0 155 | Context Length: 2048 156 | ``` 157 | 158 | - **Claude** [[Anthropic]](https://www.anthropic.com/index/introducing-claude) Mar. 2023 [close] 159 | 160 | ```yaml 161 | Field: Language-Vision 162 | ``` 163 | 164 | - **GPT-4** [[OpenAI]](https://openai.com/product/gpt-4) Mar. 2023 [close] 165 | GPT-4 Technical Report [[Preprint]](https://cdn.openai.com/papers/gpt-4.pdf) 166 | 167 | ```yaml 168 | Field: Language-Vision 169 | Params: 1.7T 170 | Architecture: De, MoE 171 | ``` 172 | 173 | - **Bard** [[Google]](https://blog.google/technology/ai/bard-google-ai-search-updates/) 174 | 175 | ```yaml 176 | Field: Language-Vision 177 | ``` 178 | 179 | - **LLaMa** [[Meta]]() Feb. 
2023 [[open]](https://github.com/facebookresearch/llama) 180 | Open and Efficient Foundation Language Models [[Preprint]](https://arxiv.org/pdf/2302.13971v1.pdf) 181 | 182 | ```yaml 183 | Field: Language 184 | Params: 65B, 33B, 13B, 7B 185 | Training Data: 4TB (1.4T tokens) 186 | Training Cost: 1,022,362 (2048 80G-A100 x 21 days) 187 | Training Power Consumption: 449 MWh 188 | Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc. 189 | License: GPL 190 | ``` 191 | 192 | - **RWKV-4** [[Personal]]() Dec. 2022 [[open]](https://github.com/BlinkDL/RWKV-LM) 193 | 194 | ```yaml 195 | Field: Language 196 | Params: 14B, 7B, 3B, 1.5B 197 | Training Data: 332B tokens 198 | Architecture: De, RNN 199 | License: Apache 2.0 200 | ``` 201 | 202 | - **AnthropicLM** [[Anthropic]]() Dec. 2022 [close] 203 | Constitutional AI: Harmlessness from AI Feedback 204 | 205 | ```yaml 206 | Field: Language 207 | Params: 52B 208 | ``` 209 | 210 | - **BLOOM** [[BigScience]]() Nov. 2022 [[open]](https://huggingface.co/bigscience/bloom) 211 | A 176B-Parameter Open-Access Multilingual Language Model [[Preprint]](https://arxiv.org/pdf/2211.05100.pdf) 212 | 213 | ```yaml 214 | Field: Language 215 | Params: 176B 216 | Training Data: 174GB (336B tokens) 217 | Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months 218 | Training Power Consumption: 475 MWh 219 | Training Framework: Megatron + Deepspeed 220 | Instruction-tuned Variants: BLOOMZ 221 | License: OpenRAIL-M v1 222 | Context Length: 2048 223 | ``` 224 | 225 | - **Galactica** [[Meta]]() Nov. 2022 [[open]](https://huggingface.co/facebook/galactica-1.3b) 226 | A scientific language model trained on over 48 million scientific texts [[Preprint]](https://arxiv.org/pdf/2211.09085.pdf) 227 | 228 | ```yaml 229 | Field: Language 230 | Params: 125M, 1.3B, 6.7B, 30B, 120B 231 | ``` 232 | 233 | - **Pythia** [[EleutherAI]]() Oct. 2022 [[open]](https://github.com/EleutherAI/pythia) 234 | 235 | ```yaml 236 | Field: Language 237 | Params: 12B 238 | Instruction-tuned Variants: Dolly 2.0 239 | License: Apache 2.0 240 | Context Length: 2048 241 | ``` 242 | 243 | - **GLM-130B** [[BAAI]](https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/) Oct. 
2022 [[open]](https://github.com/THUDM/GLM-130B) 244 | GLM-130B: An Open Bilingual Pre-trained Model [[ICLR'23]](https://arxiv.org/pdf/2210.02414.pdf) 245 | 246 | ```yaml 247 | Field: Language 248 | Params: 130B 249 | Training Data: (400B tokens) 250 | Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days 251 | Training Framework: Megatron + Deepspeed 252 | ``` 253 | 254 | - **UL2** [[Google]]() May 2022 [[open]](https://huggingface.co/google/ul2) 255 | Unifying Language Learning Paradigms [[Preprint]](https://arxiv.org/abs/2205.05131) 256 | 257 | ```yaml 258 | Field: Language 259 | Params: 20B (1T tokens) 260 | Training Data: 800GB 261 | Achitecture: En-De 262 | Training Framework: Jax + T5x 263 | License: Apache 2.0 264 | Instruction-tuned Variants: Flan-UL2 265 | Context Length: 2048 266 | ``` 267 | 268 | - **OPT** [[Meta]](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) May 2022 [[open]](https://github.com/facebookresearch/metaseq) 269 | OPT: Open Pre-trained Transformer Language Models [[Preprint]](https://arxiv.org/abs/2205.01068) 270 | 271 | ```yaml 272 | Field: Language 273 | Params: 175B 274 | Training Data: 800GB (180B tokens) 275 | Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days 276 | Training Power Consumption: 356 MWh 277 | Architecutre: De 278 | Training Framework: Megatron + Fairscale 279 | ``` 280 | 281 | - **PaLM** [[Google]](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) Apr. 2022 [close] 282 | PaLM: Scaling Language Modeling with Pathways [[Preprint]](https://arxiv.org/abs/2204.02311) 283 | 284 | ```yaml 285 | Field: Language 286 | Params: 550B 287 | Training Data: 3TB (780B tokens) 288 | Training Cost: $10M (16,809,984 TPUv4core-hours, 64 days) 289 | Training petaFLOPs: 2.5B 290 | Architecture: De 291 | Training Framework: Jax + T5x 292 | ``` 293 | 294 | - **GPT-NeoX** [[EleutherAI]](https://blog.eleuther.ai/announcing-20b/) Apr. 2022 [[open]](https://github.com/EleutherAI/gpt-neox) 295 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model [[Preprint]](https://arxiv.org/abs/2204.06745) 296 | 297 | ```yaml 298 | Field: Language 299 | Params: 20B 300 | Training Data: 525GiB 301 | Training petaFLOPs: 93B 302 | Architecture: De 303 | Training Framework: Megatron + Fairscale 304 | License: Apache 2.0 305 | Context Length: 2048 306 | ``` 307 | 308 | - **InstructGPT** [[OpenAI]]() Mar. 2022 [close] 309 | Training language models to follow instructions with human feedback [[Preprint]](https://arxiv.org/abs/2203.02155) 310 | 311 | ```yaml 312 | Field: Language 313 | Params: 175B 314 | ``` 315 | 316 | - **Chinchilla** [[DeepMind]](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training) Mar. 2022 [close] 317 | Training Compute-Optimal Large Language Models [[Preprint]](https://arxiv.org/abs/2203.15556) 318 | 319 | ```yaml 320 | Field: Language 321 | Params: 70B 322 | Training Data: 5.2TB (1.4T tokens) 323 | Training petaFLOPs: 580M 324 | Architecture: De 325 | ``` 326 | 327 | - **EVA 2.0** [[BAAI]](https://wudaoai.cn/model/detail/EVA) Mar. 
2022 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master) 328 | EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [[Preprint]](https://arxiv.org/abs/2203.09313) 329 | 330 | ```yaml 331 | Field: Language (Dialogue) 332 | Params: 2.8B 333 | Training Data: 180G (1.4B samples, Chinese) 334 | ``` 335 | 336 | - **AlphaCode** [[DeepMind]](https://www.deepmind.com/blog/competitive-programming-with-alphacode) Mar. 2022 [close] 337 | Competition-Level Code Generation with AlphaCode [[Preprint]](https://arxiv.org/abs/2203.07814) 338 | 339 | ```yaml 340 | Field: Code Generation 341 | Params: 41B 342 | Training Data: (967B tokens) 343 | Architecture: De 344 | ``` 345 | 346 | - **ST-MoE** [[Google]]() Feb. 2022 [close] 347 | ST-MoE: Designing Stable and Transferable Sparse Expert Models [[Preprint]](https://arxiv.org/abs/2202.08906) 348 | 349 | ```yaml 350 | Field: Language 351 | Params: 296B 352 | Architecture: En-De, MoE 353 | ``` 354 | 355 | - **LaMDA** [[Google]](https://arxiv.org/abs/2201.08239) Jan. 2022 [close] 356 | LaMDA: Language Models for Dialog Applications [[Preprint]](https://arxiv.org/abs/2201.08239) 357 | 358 | ```yaml 359 | Field: Language (Dialogue) 360 | Params: 137B 361 | Training Data: (1.56T words) 362 | Training petaFLOPs: 360M 363 | Architecture: De 364 | ``` 365 | 366 | - **GLaM** [[Google]](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) Dec. 2021 [close] 367 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2112.06905) 368 | 369 | ```yaml 370 | Field: Language 371 | Params: 1.2T 372 | Architecture: De, MoE 373 | ``` 374 | 375 | - **Gopher** [[DeepMind]](https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval) Dec. 2021 [close] 376 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher [[Preprint]](https://arxiv.org/abs/2112.11446) 377 | 378 | ```yaml 379 | Field: Language 380 | Params: 280B 381 | Training Data: 1.3TB (300B tokens) 382 | Training petaFLOPs: 630M 383 | Architecture: De 384 | ``` 385 | 386 | - **Yuan 1.0** [[inspur]](https://air.inspur.com/home) Oct. 2021 [close] 387 | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2110.04725) 388 | 389 | ```yaml 390 | Field: Language 391 | Params: 245B 392 | Training Data: 5TB (180B tokens, Chinese) 393 | Training petaFLOPs: 410M 394 | Architecture: De, MoE 395 | ``` 396 | 397 | - **MT-NLG** [[Microsoft, Nvidia]](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) Oct. 2021 [close] 398 | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [[Preprint]](https://arxiv.org/abs/2201.11990) 399 | 400 | ```yaml 401 | Field: Language 402 | Params: 530B 403 | Training Data: 339B tokens 404 | Training petaFLOPs: 1.4B 405 | Architecture: De 406 | ``` 407 | 408 | - **Plato-XL** [[Baidu]](http://research.baidu.com/Blog/index-view?id=163) Sept. 2021 [close] 409 | PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [[Preprint]](https://arxiv.org/abs/2109.09519) 410 | 411 | ```yaml 412 | Field: Language (Dialogue) 413 | Params: 11B 414 | Training Data: (1.2B samples) 415 | ``` 416 | 417 | - **GPT-J** [[EleutherAI]](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) Aug. 
2021 [[open]](https://github.com/kingoflolz/mesh-transformer-jax) 418 | 419 | ```yaml 420 | Field: Language 421 | Params: 6B 422 | Programming Language: Jax 423 | ``` 424 | 425 | - **Jurassic-1** [[AI21 Labs]](https://www.zdnet.com/article/watch-out-gpt-3-here-comes-ai21s-jurassic-language-model/) Aug. 2021 [close] 426 | Jurassic-1: Technical Details and Evaluation [[Preprint]](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) 427 | 428 | ```yaml 429 | Field: Language 430 | Params: 178B 431 | Training petaFLOPs: 370M 432 | Architecture: De 433 | ``` 434 | 435 | - **Codex** [[OpenAI]](https://openai.com/blog/openai-codex/) July 2021 [close] 436 | Evaluating Large Language Models Trained on Code [[Preprint]](https://arxiv.org/abs/2107.03374) 437 | 438 | ```yaml 439 | Field: Code Generation 440 | Params: 12B 441 | Training Data: 159GB 442 | Architecture: De 443 | ``` 444 | 445 | - **ERNIE 3.0** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie) July 2021 [close] 446 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [[Preprint]](https://arxiv.org/abs/2107.02137) 447 | 448 | ```yaml 449 | Field: Language 450 | Params: 10B 451 | Training Data: 4TB (375B tokens, with knowledge graph) 452 | Architecture: En 453 | Objective: MLM 454 | ``` 455 | 456 | - **CPM-2** [[BAAI]]() June 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master) 457 | CPM-2: Large-scale Cost-effective Pre-trained Language Models [[Preprint]](https://arxiv.org/abs/2106.10715) 458 | 459 | ```yaml 460 | Field: Language 461 | Params: 198B 462 | Training Data: 2.6TB (Chinese 2.3TB, English 300GB) 463 | Architecture: En-De 464 | Objective: MLM 465 | ``` 466 | 467 | - **HyperClova** [[Naver]](https://www.navercorp.com/promotion/pressReleasesView/30546) May 2021 [close] 468 | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [[Preprint]](https://arxiv.org/abs/2109.04650v1) 469 | 470 | ```yaml 471 | Field: Language 472 | Params: 82B 473 | Training Data: 562B tokens (Korean) 474 | Training petaFLOPs: 63B 475 | Architecture: De 476 | ``` 477 | 478 | - **ByT5** [[Google]]() May 2021 [[open]](https://github.com/google-research/byt5) 479 | ByT5: Towards a token-free future with pre-trained byte-to-byte models [[TACL'22]](https://arxiv.org/abs/2105.13626) 480 | 481 | ```yaml 482 | Field: Language 483 | Params: 13B 484 | Training Data: (101 languages) 485 | Architecture: En-De 486 | ``` 487 | 488 | - **PanGu-α** [[Huawei]]() Apr. 2021 [close] 489 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [[Preprint]](https://arxiv.org/abs/2104.12369) 490 | 491 | ```yaml 492 | Field: Language 493 | Params: 200B 494 | Training Data: 1.1TB (Chinese) 495 | Training petaFLOPs: 58M 496 | Architecture: De 497 | ``` 498 | 499 | - **mT5** [[Google]]() Mar. 2021 [[open]](https://github.com/google-research/multilingual-t5) 500 | mT5: A massively multilingual pre-trained text-to-text transformer [[Preprint]](https://arxiv.org/abs/2010.11934) 501 | 502 | ```yaml 503 | Field: Language 504 | Params: 13B 505 | Training Data: (101 languages) 506 | Architecture: En-De 507 | ``` 508 | 509 | - **WuDao-WenHui** [[BAAI]]() Mar. 
2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/Transformer-XL) 510 | 511 | ```yaml 512 | Field: Language 513 | Params: 2.9B 514 | Training Data: 303GB (Chinese) 515 | ``` 516 | 517 | - **GLM** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/GLM) 518 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling [[Preprint]](https://arxiv.org/abs/2103.10360) 519 | 520 | ```yaml 521 | Field: Language 522 | Params: 10B 523 | Architecture: De 524 | ``` 525 | 526 | - **Switch Transformer** [[Google]]() Jan. 2021 [[open]](https://github.com/google-research/t5x) 527 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [[Preprint]](https://arxiv.org/abs/2101.03961) 528 | 529 | ```yaml 530 | Field: Language 531 | Params: 1.6T 532 | Training Data: 750GB 533 | Training petaFLOPs: 82M 534 | Architecture: En-De, MoE 535 | Objective: MLM 536 | ``` 537 | 538 | - **CPM** [[BAAI]]() Dec. 2020 [[open]](https://github.com/TsinghuaAI/CPM) 539 | CPM: A Large-scale Generative Chinese Pre-trained Language Model [[Preprint]](https://arxiv.org/abs/2012.00413) 540 | 541 | ```yaml 542 | Field: Language 543 | Params: 2.6B 544 | Training Data: 100G (Chinese) 545 | Training petaFLOPs: 1.8M 546 | Architecture: De 547 | Objective: LTR 548 | ``` 549 | 550 | - **GPT-3** [[OpenAI]](https://openai.com/api/) May 2020 [close] 551 | Language Models are Few-Shot Learners [[NeurIPS'20]](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf) 552 | 553 | ```yaml 554 | Field: Language 555 | Params: 175B 556 | Training Data: 45TB (680B Tokens) 557 | Training Time: 95 A100 GPU years (835584 A100 GPU hours, 355 V100 GPU years) 558 | Training Cost: $4.6M 559 | Training petaFLOPs: 310M 560 | Architecture: De 561 | Obective: LTR 562 | Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT 563 | ``` 564 | 565 | - **Blender** [[Meta]](https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/) Apr. 2020 [[close]](https://huggingface.co/facebook/blenderbot-90M?text=Hey+my+name+is+Thomas%21+How+are+you%3F) 566 | Recipes for building an open-domain chatbot [[Preprint]](https://arxiv.org/abs/2004.13637) 567 | 568 | ```yaml 569 | Field: Language (Dialogue) 570 | Params: 9.4B 571 | ``` 572 | 573 | - **T-NLG** [[Microsoft]](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) Feb. 2020 [close] 574 | 575 | ```yaml 576 | Field: Language 577 | Params: 17B 578 | Training petaFLOPs: 16M 579 | Architecture: De 580 | Obective: LTR 581 | ``` 582 | 583 | - **Meena** [[Google]](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html) Jan. 2020 [close] 584 | Towards a Human-like Open-Domain Chatbot [[Preprint]](https://arxiv.org/abs/2001.09977) 585 | 586 | ```yaml 587 | Field: Language (Dialogue) 588 | Params: 2.6B 589 | Training Data: 341GB (40B words) 590 | Training petaFLOPs: 110M 591 | ``` 592 | 593 | - **DialoGPT** [[Microsoft]](https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/) Nov. 
2019 [[open]](https://github.com/microsoft/DialoGPT) 594 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [[ACL'20]](https://arxiv.org/abs/1911.00536) 595 | 596 | ```yaml 597 | Field: Language (Dialogue) 598 | Params: 762M 599 | Training Data: (147M conversation) 600 | Architecture: De 601 | ``` 602 | 603 | - **T5** [[Google]](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) Oct. 2019 [[open]](https://github.com/google-research/text-to-text-transfer-transformer) 604 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [[JMLR'19]](https://arxiv.org/abs/1910.10683) 605 | 606 | ```yaml 607 | Field: Language 608 | Params: 11B 609 | Training Data: 800GB 610 | Training Cost: $1.5M 611 | Training petaFLOPs: 41M 612 | Architecture: En-De 613 | Obective: MLM 614 | License: Apache 2.0 615 | Instruction-tuned Variants: Flan-T5 616 | Context-Length: 512 617 | ``` 618 | 619 | - **Megatron-LM** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM) 620 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053) 621 | 622 | ```yaml 623 | Field: Language 624 | Params: 8.3B 625 | Training Data: 174GB 626 | Training petaFLOPs: 9.1M 627 | Architecture: De 628 | Obective: LTR 629 | Training Framework: Megatron 630 | ``` 631 | 632 | - **Megatron-BERT** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM) 633 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053) 634 | 635 | ```yaml 636 | Field: Language 637 | Params: 3.9B 638 | Training Data: 174GB 639 | Training petaFLOPs: 57M 640 | Architecture: En 641 | Obective: MLM 642 | Training Framework: Megatron 643 | ``` 644 | 645 | - **RoBERTa** [[Meta]](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) July 2019 [[open]](https://github.com/facebookresearch/fairseq) 646 | RoBERTa: A Robustly Optimized BERT Pretraining Approach [[Preprint]](https://arxiv.org/abs/1907.11692) 647 | 648 | ```yaml 649 | Field: Language 650 | Params: 354M 651 | Training Data: 160GB 652 | Training Time: 1024 V100 GPU days 653 | Architecture: En 654 | Objective: MLM 655 | ``` 656 | 657 | - **XLNet** [[Google]]() June 2019 [[open]](https://github.com/zihangdai/xlnet) 658 | XLNet: Generalized Autoregressive Pretraining for Language Understanding [[NeurIPS'19]](https://papers.nips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html) 659 | 660 | ```yaml 661 | Field: Language 662 | Params: 340M 663 | Training Data: 113GB (33B words) 664 | Training Time: 1280 TPUv3 days 665 | Training Cost: $245k 666 | Architecture: En 667 | Objective: PLM 668 | ``` 669 | 670 | - **GPT-2** [[OpenAI]](https://openai.com/blog/better-language-models/) Feb. 2019 [[open]](https://github.com/openai/gpt-2) 671 | Language Models are Unsupervised Multitask Learners [[Preprint]](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) 672 | 673 | ```yaml 674 | Field: Language 675 | Params: 1.5B 676 | Training Data: 40GB (8M web pages) 677 | Training Cost: $43k 678 | Training petaFLOPs: 1.5M 679 | Architecture: De 680 | Objective: LTR 681 | ``` 682 | 683 | - **BERT** [[Google]]() Oct. 
2018 [[open]](https://github.com/google-research/bert) 684 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [[NAACL'18]](https://arxiv.org/abs/1810.04805) 685 | 686 | ```yaml 687 | Field: Language 688 | Params: 330M 689 | Training Data: 16GB (3.3B words) 690 | Training Time: 64 TPUv2 days (280 V100 GPU days) 691 | Training Cost: $7k 692 | Training petaFLOPs: 290k 693 | Architecture: En 694 | Objective: MLM, NSP 695 | ``` 696 | 697 | - **GPT** [[OpenAI]](https://openai.com/blog/language-unsupervised/) June 2018 [open] 698 | Improving Language Understanding by Generative Pre-Training [[Preprint]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) 699 | 700 | ```yaml 701 | Field: Language 702 | Params: 117M 703 | Training Data: 1GB (7k books) 704 | Training petaFLOPs: 18k 705 | Architecture: De 706 | Objective: LTR 707 | ``` 708 | 709 | ### Vision Models 710 | 711 | - **Eva02-E** [[BAAI]]() Mar. 2023 [[open]](https://github.com/huggingface/pytorch-image-models/tree/main) 712 | EVA-02: A Visual Representation for Neon Genesis [[Preprint]](https://arxiv.org/abs/2303.11331v2) 713 | 714 | ```yaml 715 | Field: Vision-Language 716 | Params: 5B 717 | Training Data: 2B image-text pairs 718 | Architecture: Transformer 719 | Objective: MIM, Clip Constrastive 720 | ``` 721 | 722 | - **MAE->WSP-2B** [[Meta]]() Mar. 2023 [close] 723 | The effectiveness of MAE pre-pretraining for billion-scale pretraining [[Preprint]](https://arxiv.org/abs/2303.13496) 724 | 725 | ```yaml 726 | Field: Vision 727 | Params: 6.5B 728 | Training Data: 3B images 729 | Architecture: Transformer 730 | Objective: MAE, Weakly-Supervised 731 | ``` 732 | 733 | - **OpenCLIP G/14** [[LAION]]() Mar. 2023 [[open]](https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K) 734 | 735 | ```yaml 736 | Field: Vision-Language 737 | Params: 2.5B 738 | Training Data: 2B images 739 | ``` 740 | 741 | - **ViT-22B** [[Google]]() Feb. 2023 [close] 742 | [Scaling Vision Transformers to 22 Billion Parameters](https://arxiv.org/abs/2302.05442) 743 | 744 | ```yaml 745 | Field: Vision 746 | Params: 22B 747 | Training Data: 4B images 748 | Architecture: Transformer 749 | Objective: Supervised 750 | ``` 751 | 752 | - **ERNIE-ViLG** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie-vilg) Dec. 2022 [close] 753 | ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [[Preprint]](https://arxiv.org/abs/2112.15283) 754 | 755 | ```yaml 756 | Field: Image Generation (text to image) 757 | Params: 10B 758 | Training Data: 145M text-image pairs 759 | Architecture: Transformer, dVAE + De 760 | ``` 761 | 762 | - **InternImage-G** [[Shanghai AI Lab]](https://github.com/OpenGVLab/InternImage) Nov. 2022 [[open]](https://github.com/OpenGVLab/InternImage) 763 | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [[CVPR'23 Highlight]](https://arxiv.org/abs/2211.05778) 764 | 765 | ```yaml 766 | Field: Vision 767 | Params: 3B 768 | Architecture: CNN 769 | Core Operator: Deformable Convolution v3 770 | ``` 771 | 772 | - **Stable Diffusion** [[Stability AI]]() Aug. 
2022 [[open]]() 773 | 774 | ```yaml 775 | Field: Image Generation (text to image) 776 | Params: 890M 777 | Training Data: 5B images 778 | Architecture: Transformer, Diffusion 779 | ``` 780 | 781 | - **Imagen** [[Google]](https://imagen.research.google/) May 2022 782 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [[Preprint]](https://arxiv.org/abs/2205.11487) 783 | 784 | ```yaml 785 | Field: Image Generation (text to image) 786 | Text Encoder: T5 787 | Image Decoder: Diffusion, Upsampler 788 | ``` 789 | 790 | - **Flamingo** [[DeepMind]]() Apr. 2022 [close] 791 | Flamingo: a Visual Language Model for Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2204.14198) 792 | 793 | ```yaml 794 | Field: Vision-Language 795 | Params: 80B 796 | ``` 797 | 798 | - **DALL·E 2** [[OpenAI]](https://openai.com/dall-e-2/) Apr. 2022 799 | Hierarchical Text-Conditional Image Generation with CLIP Latents [[Preprint]](https://cdn.openai.com/papers/dall-e-2.pdf) 800 | 801 | ```yaml 802 | Field: Image Generation (text to image) 803 | Text Encoder: GPT2 (CLIP) 804 | Image Encoder: ViT (CLIP) 805 | Image Decoder: Diffusion, Upsampler 806 | ``` 807 | 808 | - **BaGuaLu** [[BAAI, Alibaba]]() Apr. 2022 809 | BaGuaLu: targeting brain scale pretrained models with over 37 million cores [[PPoPP'22]](https://keg.cs.tsinghua.edu.cn/jietang/publications/PPOPP22-Ma%20et%20al.-BaGuaLu%20Targeting%20Brain%20Scale%20Pretrained%20Models%20w.pdf) 810 | 811 | ```yaml 812 | Field: Vision-Language 813 | Params: 174T 814 | Architecture: M6 815 | ``` 816 | 817 | - **SEER** [[Meta]]() Feb. 2022 [[open]](https://github.com/facebookresearch/vissl) 818 | Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [[Preprint]](https://arxiv.org/abs/2202.08360v2) 819 | 820 | ```yaml 821 | Field: Vision 822 | Params: 10B 823 | Training Data: 1B images 824 | Architecture: Convolution 825 | Objective: SwAV 826 | ``` 827 | 828 | - **ERNIE-ViLG** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie-vilg) Dec. 2021 829 | ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [[Preprint]](https://arxiv.org/abs/2112.15283) 830 | 831 | ```yaml 832 | Field: Image Generation (text to image) 833 | Params: 10B 834 | Training Data: 145M text-image pairs 835 | Architecture: Transformer, dVAE + De 836 | ``` 837 | 838 | - **NUWA** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/NUWA) 839 | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [[Preprint]](https://arxiv.org/abs/2111.12417) 840 | 841 | ```yaml 842 | Field: Vision-Language 843 | Generatioon: Image, Video 844 | Params: 870M 845 | ``` 846 | 847 | - **SwinV2-G** [[Google]]() Nov. 2021 [[open]](https://github.com/microsoft/Swin-Transformer) 848 | Swin Transformer V2: Scaling Up Capacity and Resolution [[CVPR'22]](https://arxiv.org/abs/2111.09883v2) 849 | 850 | ```yaml 851 | Field: Vision 852 | Params: 3B 853 | Training Data: 70M 854 | Architecture: Transformer 855 | Objective: Supervised 856 | ``` 857 | 858 | - **Zidongtaichu** [[CASIA]](http://www.ia.cas.cn/xwzx/kydt/202109/t20210927_6215538.html) Sept. 
2021 [close] 859 | 860 | ```yaml 861 | Field: Image, Video, Language, Speech 862 | Params: 100B 863 | ``` 864 | 865 | - **ViT-G/14** [[Google]]() June 2021 866 | Scaling Vision Transformers [[Preprint]](https://arxiv.org/abs/2106.04560) 867 | 868 | ```yaml 869 | Field: Vision 870 | Params: 1.8B 871 | Training Data: 300M images 872 | Training petaFLOPs: 3.4M 873 | Architecture: Transformer 874 | Objective: Supervised 875 | ``` 876 | 877 | - **CoAtNet** [[Google]](https://ai.googleblog.com/2021/09/toward-fast-and-accurate-neural.html) June 2021 [[open]](https://github.com/chinhsuanwu/coatnet-pytorch) 878 | CoAtNet: Marrying Convolution and Attention for All Data Sizes [[NeurIPS'21]](https://arxiv.org/abs/2106.04803) 879 | 880 | ```yaml 881 | Field: Vision 882 | Params: 2.4B 883 | Training Data: 300M images 884 | Architecture: Transformer, Convolution 885 | Objective: Supervised 886 | ``` 887 | 888 | - **V-MoE** [[Google]](https://ai.googleblog.com/2022/01/scaling-vision-with-sparse-mixture-of.html) June 2021 889 | Scaling Vision with Sparse Mixture of Experts [[NeurIPS'21]](https://proceedings.neurips.cc//paper/2021/file/48237d9f2dea8c74c2a72126cf63d933-Paper.pdf) 890 | 891 | ```yaml 892 | Field: Vision 893 | Params: 15B 894 | Training Data: 300M images 895 | Training Time: 16.8k TPUv3 days 896 | Training petaFLOPs: 33.9M 897 | Architecture: Transformer, MoE 898 | Objective: Supervised 899 | ``` 900 | 901 | - **CogView** [[BAAI, Alibaba]](https://wudao.aminer.cn/CogView/index.html) May 2021 [](https://github.com/THUDM/CogView) 902 | CogView: Mastering Text-to-Image Generation via Transformers [[NeurIPS'21]](https://arxiv.org/abs/2105.13290) 903 | 904 | ```yaml 905 | Field: Vision-Language 906 | Params: 4B 907 | Training Data: 30M text-image pairs 908 | Training petaFLOPs: 27M 909 | Image Encoder: VAE 910 | Text Encoder & Image Decoder: GPT2 911 | ``` 912 | 913 | - **M6** [[Alibaba]](https://m6.aliyun.com/#/) Mar. 2021 914 | M6: A Chinese Multimodal Pretrainer [[Preprint]](https://arxiv.org/abs/2103.00823) 915 | 916 | ```yaml 917 | Field: Vision-Language 918 | Params: 10T 919 | Training Data: 300G Texts + 2TB Images 920 | Training petaFLOPs: 5.5M 921 | Fusion: Single-stream 922 | Objective: MLM, IC 923 | ``` 924 | 925 | - **DALL·E** [[OpenAI]](https://openai.com/blog/dall-e/) Feb. 2021 926 | Zero-Shot Text-to-Image Generation [[ICML'21]](https://arxiv.org/abs/2102.12092) 927 | 928 | ```yaml 929 | Field: Image Generation (text to image) 930 | Params: 12B 931 | Training Data: 250M text-images pairs 932 | Training petaFLOPs: 47M 933 | Image Encoder: dVAE 934 | Text Encoder & Image Decoder: GPT2 935 | ``` 936 | 937 | - **CLIP** [[OpenAI]](https://openai.com/blog/clip/) Jan. 2021 938 | Learning Transferable Visual Models From Natural Language Supervision [[ICML'22]](https://arxiv.org/abs/2103.00020) 939 | 940 | ```yaml 941 | Field: Vision-Language 942 | Training Data: 400M text-image pairs 943 | Training petaFLOPs: 11M 944 | Image Encoder: ViT 945 | Text Encoder: GPT-2 946 | Fusion: Dual Encoder 947 | Objective: CMCL 948 | ``` 949 | 950 | - **ViT-H/14** [[Google]](https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html) Oct. 
2020 [[open]](https://github.com/google-research/vision_transformer) 951 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[ICLR'20]](https://arxiv.org/abs/2010.11929) 952 | 953 | ```yaml 954 | Field: Vision 955 | Params: 632M 956 | Training Data: 300M images 957 | Training petaFLOPs: 13M 958 | Architecture: Transformer 959 | Objective: Supervised 960 | ``` 961 | 962 | - **iGPT-XL** [[OpenAI]](https://openai.com/blog/image-gpt/) June 2020 [[open]](https://github.com/openai/image-gpt) 963 | Generative Pretraining From Pixels [[ICML'20]](https://proceedings.mlr.press/v119/chen20s.html) 964 | 965 | ```yaml 966 | Field: Image Generation 967 | Params: 6.8B 968 | Training Data: 1M images 969 | Training petaFLOPs: 33M 970 | Architecture: Transformer, De 971 | ``` 972 | 973 | - **BigGAN-deep** [[DeepMind]]() Sept. 2018 [[open]](https://github.com/ajbrock/BigGAN-PyTorch) 974 | Large Scale GAN Training for High Fidelity Natural Image Synthesis [[ICLR'19]](https://arxiv.org/abs/1809.11096) 975 | 976 | ```yaml 977 | Field: Image Generation 978 | Params: 158M 979 | Training Data: 300M images 980 | Training petaFLOPs: 3M 981 | Architecture: Convolution, GAN 982 | Resolution: 512x512 983 | ``` 984 | 985 | ### Reinforcement Learning 986 | 987 | - **PaLM-E** [[Google]](https://palm-e.github.io/) March 2023 [close] 988 | PaLM-E: An Embodied Multimodal Language Model [[Preprint]](https://palm-e.github.io/assets/palm-e.pdf) 989 | 990 | ```yaml 991 | Field: Reinforcement Learning 992 | Params: 562B (540B LLM + 22B Vi) 993 | ``` 994 | 995 | - **Gato** [[DeepMind]](https://www.deepmind.com/publications/a-generalist-agent) May 2022 [close] 996 | A Generalist Agent [[Preprint]](https://arxiv.org/abs/2205.06175) 997 | 998 | ```yaml 999 | Field: Reinforcement Learning 1000 | Params: 1.2B 1001 | Training Data: (604 Tasks) 1002 | Objective: Supervised 1003 | ``` 1004 | 1005 | ### Speech 1006 | 1007 | - **USM** [[Google]](https://sites.research.google/usm/) Mar. 2023 [close] 1008 | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [[Preprint]](https://arxiv.org/pdf/2303.01037v2.pdf) 1009 | 1010 | ```yaml 1011 | Field: Speech 1012 | Params: 2B 1013 | Training Data: 12,000,000 hours 1014 | ``` 1015 | 1016 | - **Whisper** [[OpenAI]](https://openai.com/research/whisper) Sept. 2022 [[close]](https://github.com/openai/whisper) 1017 | Robust Speech Recognition via Large-Scale Weak Supervision [[Preprint]](https://arxiv.org/pdf/2212.04356.pdf) 1018 | 1019 | ```yaml 1020 | Field: Speech 1021 | Params: 1.55B 1022 | Training Data: 680,000 hours 1023 | Objective: Weakly Supervised 1024 | ``` 1025 | 1026 | - **HuBERT** [[Meta]](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) June 2021 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert) 1027 | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [[Preprint]](https://arxiv.org/abs/2106.07447) 1028 | 1029 | ```yaml 1030 | Field: Speech 1031 | Params: 1B 1032 | Training Data: 60,000 hours 1033 | Objective: MLM 1034 | ``` 1035 | 1036 | - **wav2vec 2.0** [[Meta]]() Oct. 
2020 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec) 1037 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [[NeurIPS'20]](https://arxiv.org/abs/2006.11477) 1038 | 1039 | ```yaml 1040 | Field: Speech 1041 | Params: 317M 1042 | Training Data: 50,000 hours 1043 | Training petaFLOPs: 430M 1044 | Objective: MLM 1045 | ``` 1046 | 1047 | - **DeepSpeech 2** [[Baidu]]() Dec. 2015 [[open]](https://github.com/PaddlePaddle/PaddleSpeech) 1048 | Deep Speech 2: End-to-End Speech Recognition in 1049 | English and Mandarin [[ICML'16]](https://arxiv.org/pdf/1512.02595.pdf) 1050 | 1051 | ```yaml 1052 | Field: Speech 1053 | Params: 300M 1054 | Training Data: 21,340 hours 1055 | ``` 1056 | 1057 | ### Science 1058 | 1059 | - **AlphaFold 2** [[DeepMind]](https://www.deepmind.com/research/highlighted-research/alphafold) July 2021 [[open]](https://github.com/deepmind/alphafold) 1060 | Highly accurate protein structure prediction with AlphaFold [[Nature]](https://www.nature.com/articles/s41586-021-03819-2) 1061 | 1062 | ```yaml 1063 | Field: Biology 1064 | Params: 21B 1065 | Training petaFLOPs: 100k 1066 | ``` 1067 | 1068 | ## Open LLM Training Dataset 1069 | 1070 | This section will be reorganized. For now, since data quality is key to LLM performance, we keep track of open pretraining datasets here. 1071 | 1072 | - [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B): 627B tokens, 895GB compressed, primarily English, cleaned from RedPajama, Apache 2.0 1073 | - [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): ~600B tokens, 500GB compressed, English, ODC-By 1.0 license (the 5T-token version is private) 1074 | - [MNBVC](https://github.com/esbatmop/MNBVC): 5TB (on-going, target 40TB), Chinese, MIT License 1075 | - [The Pile](https://pile.eleuther.ai/): 825GB 1076 | - [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T): 1.2T tokens 1077 | 1078 | ## Distributed Training Framework 1079 | 1080 | > Deep Learning frameworks supporting distributed training are marked with \*. 1081 | 1082 | ### PyTorch Ecosystem 1083 | 1084 | - **Accelerate** [[Huggingface]]() Oct. 2020 [[open]](https://github.com/huggingface/accelerate) 1085 | - **Hivemind** Aug. 2020 [[open]](https://github.com/learning-at-home/hivemind) 1086 | Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2002.04013) 1087 | - **FairScale** [[Meta]]() July 2020 [[open]](https://github.com/facebookresearch/fairscale) 1088 | - **DeepSpeed** [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed) 1089 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [[SC'20]](https://arxiv.org/abs/1910.02054) 1090 | - **Megatron** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM) 1091 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053) 1092 | - **PyTorch\*** [[Meta]](https://pytorch.org/) Sept. 2016 [[open]](https://github.com/pytorch/pytorch) 1093 | PyTorch: An Imperative Style, High-Performance Deep Learning Library [[NeurIPS'19]](http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf) 1094 | 1095 | ### XLA Ecosystem 1096 | 1097 | - **T5x** [[Google]]() Mar.
2022 [[open]](https://github.com/google-research/t5x) 1098 | Scaling Up Models and Data with 𝚝𝟻𝚡 and 𝚜𝚎𝚚𝚒𝚘 [[Preprint]](https://arxiv.org/abs/2203.17189) 1099 | - **Alpa** [[Google]]() Jan. 2022 [[open]](https://github.com/alpa-projects/alpa) 1100 | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [[OSDI'22]](https://arxiv.org/pdf/2201.12023.pdf) 1101 | - **Pathways** [[Google]](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/) Mar. 2021 [close] 1102 | Pathways: Asynchronous Distributed Dataflow for ML [[Preprint]](https://arxiv.org/abs/2203.12533) 1103 | - **Colossal-AI** [[HPC-AI TECH]](https://colossalai.org/) Nov. 2021 [[open]](https://github.com/hpcaitech/ColossalAI) 1104 | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [[Preprint]](https://arxiv.org/abs/2110.14883) 1105 | - **GShard** [[Google]](https://arxiv.org/abs/2006.16668) June 2020 1106 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [[Preprint]](https://arxiv.org/abs/2006.16668) 1107 | - **Jax\*** [[Google]]() Oct 2019 [[open]](https://github.com/google/jax) 1108 | - **Mesh TensorFlow** [[Google]]() Nov. 2018 [[open]](https://github.com/tensorflow/mesh) 1109 | - **Horovod** [[Uber]](https://horovod.ai/) Feb. 2018 [[open]](https://github.com/horovod/horovod) 1110 | Horovod: fast and easy distributed deep learning in TensorFlow [[Preprint]](https://arxiv.org/abs/1802.05799) 1111 | - **TensorFlow\*** [[Google]](https://www.tensorflow.org/) Nov. 2015 [[open]](https://github.com/tensorflow/tensorflow) 1112 | TensorFlow: A system for large-scale machine learning [[OSDI'16]](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf) 1113 | 1114 | ### Other Frameworks 1115 | 1116 | - **OneFlow\*** [[OneFlow]](https://docs.oneflow.org/master/index.html) July 2020 [[open]](https://github.com/OneFlow-Inc/oneflow) 1117 | OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [[Preprint]](https://arxiv.org/abs/2110.15032) 1118 | - **MindSpore\*** [[Huawei]](https://e.huawei.com/en/products/cloud-computing-dc/atlas/mindspore) Mar. 2020 [[open]](https://github.com/mindspore-ai/mindspore) 1119 | - **PaddlePaddle\*** [[Baidu]](https://www.paddlepaddle.org.cn/) Nov. 2018 [[open]](https://github.com/PaddlePaddle/Paddle) 1120 | End-to-end Adaptive Distributed Training on PaddlePaddle [[Preprint]](https://arxiv.org/abs/2112.02752) 1121 | - **Ray** [[Berkeley]]() Dec. 2017 [[open]](https://github.com/ray-project/ray) 1122 | Ray: A Distributed Framework for Emerging AI Applications [[OSDI'17]](https://arxiv.org/pdf/1712.05889.pdf) 1123 | 1124 | ### Inference Frameworks 1125 | 1126 | - Petals [[BigScience]]() Dec. 2022 [[open]](https://github.com/bigscience-workshop/petals) 1127 | - FlexGen [[Stanford, Berkeley, CMU, etc.]]() May 2022 [[open]](https://github.com/FMInference/FlexGen) 1128 | - FasterTransformer [[NVIDIA]]() Apr. 2021 [[open]](https://github.com/NVIDIA/FasterTransformer) 1129 | - MegEngine [[MegEngine]](https://www.megengine.org.cn/) Mar. 2020 1130 | - DeepSpeed-Inference [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct.
2019 [[open]](https://github.com/microsoft/DeepSpeed) 1131 | - MediaPipe [[Google]](https://google.github.io/mediapipe/) July 2019 [[open]](https://github.com/google/mediapipe) 1132 | - TensorRT [[Nvidia]]() Jun 2019 [[open]](https://github.com/NVIDIA/TensorRT) 1133 | - MNN [[Alibaba]]() May 2019 [[open]](https://github.com/alibaba/MNN) 1134 | - OpenVINO [[Intel]](https://docs.openvino.ai/latest/index.html) Oct. 2019 [[open]](https://github.com/openvinotoolkit/openvino) 1135 | - ONNX [[Linux Foundation]](https://onnx.ai/) Sep 2017 [[open]](https://github.com/onnx/onnx) 1136 | - ncnn [[Tencent]]() July 2017 [[open]](https://github.com/Tencent/ncnn) 1137 | 1138 | ### Recommendation Training Framework 1139 | 1140 | - **HET** [[Tencent]]() Dec. 2021 1141 | HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [[VLDB'22]](https://arxiv.org/abs/2112.07221) 1142 | - **Persia** [[Kuaishou]]() Nov. 2021 1143 | Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [[Preprint]](https://arxiv.org/abs/2111.05897) 1144 | 1145 | ```yaml 1146 | Embeddings Params: 100T 1147 | ``` 1148 | 1149 | - **ZionEX** [[Meta]]() Apr. 2021 1150 | Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [[ISCA'21]](https://arxiv.org/abs/2104.05158) 1151 | 1152 | ```yaml 1153 | Embeddings Params: 10T 1154 | ``` 1155 | 1156 | - **ScaleFreeCTR** [[Huawei]]() Apr. 2021 1157 | ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [[SIGIR'21]](https://arxiv.org/abs/2104.08542) 1158 | - **Kraken** [[Kuaishou]]() Nov. 2020 1159 | Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [[SC'20]](http://storage.cs.tsinghua.edu.cn/papers/sc20-kraken.pdf/) 1160 | - **TensorNet** [[Qihoo360]]() Sept. 2020 [[open]](https://github.com/Qihoo360/tensornet) 1161 | - **HierPS** [[Baidu]]() Mar. 2020 1162 | Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [[MLSys'20]](https://arxiv.org/abs/2003.05622) 1163 | - **AIBox** [[Baidu]]() Oct. 2019 1164 | AIBox: CTR Prediction Model Training on a Single Node [[CIKM'19]](https://dl.acm.org/doi/pdf/10.1145/3357384.3358045) 1165 | 1166 | ```yaml 1167 | Embeddings Params: 0.1T 1168 | ``` 1169 | 1170 | - **XDL** [[Alibaba]]() Aug. 2019 1171 | XDL: an industrial deep learning framework for high-dimensional sparse data [[DLP-KDD'19]](https://dlp-kdd.github.io/dlp-kdd2019/assets/pdf/a6-jiang.pdf) 1172 | 1173 | ```yaml 1174 | Embeddings Params: 0.01T 1175 | ``` 1176 | 1177 | ## Keys Explanations 1178 | 1179 | - Company tags: the related company name. Other institutes may also have been involved in the work. 1180 | - Params: number of parameters of the largest model 1181 | - Training data size, training cost, and training petaFLOPs may have some uncertainty. 1182 | - Training cost 1183 | - TPUv2 hour: $4.5 1184 | - TPUv3 hour: $8 1185 | - V100 GPU hour: $0.55 (2022) 1186 | - A100 GPU hour: $1.10 (2022) 1187 | - Architecture 1188 | - En: Encoder-based Language Model 1189 | - De: Decoder-based Language Model 1190 | - En-De: Encoder-Decoder-based Language Model 1191 | - The above three architectures are all Transformer-based.
1192 | - MoE: Mixture of Experts 1193 | - Objective (See explanation in sections 6–8 of [this paper](https://arxiv.org/pdf/2203.14101v3.pdf)) 1194 | - MLM: Masked Language Modeling 1195 | - LTR: Left-To-Right Language Modeling 1196 | - NSP: Next Sentence Prediction 1197 | - PLM: Permuted Language Modeling 1198 | - IC: Image Captioning 1199 | - VLM: Vision-Language Matching 1200 | - CMCL: Cross-Modal Contrastive Learning 1201 | - FLOPs: number of FLOating-Point operations [[explanation]](https://openai.com/blog/ai-and-compute/) 1202 | - 1 petaFLOPs = 1e15 FLOPs (a back-of-the-envelope sketch for estimating cost and compute follows below)
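
The cost and compute keys above can be sanity-checked with simple arithmetic. The snippet below is a minimal, illustrative sketch (it is not part of this repository, and the helper names are invented here): it combines the 2022 per-hour GPU rates listed above with the common approximation of roughly 6 FLOPs per parameter per token for dense Transformer training. The BLOOM GPU-hour count and the Chinchilla parameter/token counts are taken from their entries earlier in this list, so the printed figures are rough estimates rather than reported numbers.

```python
# Rough estimators for the "Training Cost" and "Training petaFLOPs" keys above.
# The 6 * params * tokens rule is a common approximation of total training
# compute (forward + backward) for dense Transformers, not an exact FLOP count.

GPU_RATE_USD_PER_HOUR = {"V100": 0.55, "A100": 1.10}  # 2022 rates listed above


def training_cost_usd(gpu_hours: float, gpu: str = "A100") -> float:
    """Estimate training cost in USD from total GPU hours at the rates above."""
    return gpu_hours * GPU_RATE_USD_PER_HOUR[gpu]


def training_petaflops(params: float, tokens: float) -> float:
    """Approximate total training compute in petaFLOPs (~6 FLOPs/param/token)."""
    return 6 * params * tokens / 1e15  # 1 petaFLOPs = 1e15 FLOPs


if __name__ == "__main__":
    # BLOOM entry above: about 1M 80G-A100 GPU hours.
    print(f"BLOOM training cost ~ ${training_cost_usd(1_000_000, 'A100'):,.0f}")
    # Chinchilla entry above: 70B params trained on 1.4T tokens.
    print(f"Chinchilla compute  ~ {training_petaflops(70e9, 1.4e12):,.0f} petaFLOPs")
```

At the 2022 A100 rate, BLOOM's ~1M GPU hours come out to roughly $1.1M, and the 6 x params x tokens estimate for Chinchilla lands within a few percent of the 580M petaFLOPs listed in its entry.

1203 | --------------------------------------------------------------------------------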