├── img
│   ├── llm-evolutionary-tree.png
│   ├── ai-training-computation-202206.png
│   ├── ai-training-computation-202303.png
│   └── ai-training-computation-202306.png
├── .gitignore
└── README.md
/img/llm-evolutionary-tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/llm-evolutionary-tree.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202206.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202206.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202303.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202303.png
--------------------------------------------------------------------------------
/img/ai-training-computation-202306.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhengzangw/awesome-huge-models/HEAD/img/ai-training-computation-202306.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | # Created by https://www.toptal.com/developers/gitignore/api/macos
3 | # Edit at https://www.toptal.com/developers/gitignore?templates=macos
4 |
5 | ### macOS ###
6 | # General
7 | .DS_Store
8 | .AppleDouble
9 | .LSOverride
10 |
11 | # Icon must end with two \r
12 | Icon
13 |
14 | # Thumbnails
15 | ._*
16 |
17 | # Files that might appear in the root of a volume
18 | .DocumentRevisions-V100
19 | .fseventsd
20 | .Spotlight-V100
21 | .TemporaryItems
22 | .Trashes
23 | .VolumeIcon.icns
24 | .com.apple.timemachine.donotpresent
25 |
26 | # Directories potentially created on remote AFP share
27 | .AppleDB
28 | .AppleDesktop
29 | Network Trash Folder
30 | Temporary Items
31 | .apdisk
32 |
33 | ### macOS Patch ###
34 | # iCloud generated files
35 | *.icloud
36 |
37 | # End of https://www.toptal.com/developers/gitignore/api/macos
38 |
39 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # awesome-huge-models [](https://awesome.re)
4 |
5 |
6 |
7 | A collection of AWESOME things about HUGE AI models.
8 |
9 | **[2023.06]** We are now in the post-GPT-4 era, where LLMs are thriving and new models emerge from GitHub repositories rather than traditional papers. People are striving to release everything openly: training and inference code, instruction-tuned weights and datasets, pretrained weights, and [the datasets used for pretraining LLMs](#open-llm-training-dataset). In this update, I try to catch up with the latest developments in the open-source wave of LLMs.
10 |
11 | **[2023.03]** Only pretrained models are recorded here. Models are sorted by first release date. To support the open-sourcing of LLMs, we highlight open-sourced models with [[open]]().
12 |
13 | **[2022.06]** There is a trend, led by big companies, of training large-scale deep learning models (w.r.t. params, dataset, FLOPs). These models achieve SoTA performance at a high price, relying on bags of training tricks and distributed training systems. Keeping an eye on this trend shows us the current boundaries of AI models. [[Intro in Chinese](https://zhuanlan.zhihu.com/p/529863941)]
14 |
15 |
16 |
17 | ## Contents
18 |
19 | - [awesome-huge-models ](#awesome-huge-models-)
20 | - [Contents](#contents)
21 | - [Survey](#survey)
22 | - [Models](#models)
23 | - [Language Model](#language-model)
24 | - [Vision Models](#vision-models)
25 | - [Reinforcement Learning](#reinforcement-learning)
26 | - [Speech](#speech)
27 | - [Science](#science)
28 | - [Open LLM Training Dataset](#open-llm-training-dataset)
29 | - [Distributed Training Framework](#distributed-training-framework)
30 | - [PyTorch Ecosystem](#pytorch-ecosystem)
31 | - [XLA Ecosystem](#xla-ecosystem)
32 | - [Other Frameworks](#other-frameworks)
33 | - [Inference Frameworks](#inference-frameworks)
34 | - [Recommendation Training Framework](#recommendation-training-framework)
35 | - [Keys Explanations](#keys-explanations)
36 |
37 | ## Survey
38 |
39 |
40 |
41 |
42 |
43 | - [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) [2023.03]
44 | - [A Dive into Vision-Language Models](https://huggingface.co/blog/vision_language_pretraining) [2023.02]
45 | - [Compute Trends Across Three Eras of Machine Learning](https://arxiv.org/abs/2202.05924) [[chart](https://ourworldindata.org/grapher/ai-training-computation)] [2022.02]
46 | - [Vision-and-Language Pretrained Models: A Survey](https://arxiv.org/abs/2204.07356) [2022.04]
47 | - [A Roadmap to Big Model](https://arxiv.org/abs/2203.14101) [2022.03]
48 | - [A Survey of Vision-Language Pre-trained Models](https://arxiv.org/abs/2202.10936) [2022.02]
49 | - [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169) [2022.01]
50 | - [On the Opportunities and Risk of Foundation Models](https://arxiv.org/abs/2108.07258) [2021.08]
51 | - [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [2021.06]
52 |
53 | Resource lists:
54 |
55 | - [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
56 | - [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM)
57 | - [Open-LLM](https://github.com/eugeneyan/open-llms)
58 | - [LLMDataHub](https://github.com/Zjh-819/LLMDataHub)
59 |
60 | ## Models
61 |
62 | ### Language Model
63 |
64 |
65 |
66 |
67 |
68 | - **Baichuan** [[Baichuan]]() Jun. 2023 [[open]](https://github.com/baichuan-inc/baichuan-7B)
69 |
70 | ```yaml
71 | Field: Language
72 | Params: 7B
73 | Training Data: 1.2T tokens (English, Chinese, Private)
74 | License: Apache 2.0
75 | Context Length: 4096
76 | ```
77 |
78 | - **Falcon** [[TII]]() Jun. 2023 [[open]](https://huggingface.co/tiiuae/falcon-40b)
79 |
80 | ```yaml
81 | Field: Language
82 | Params: 40B
83 | Training Data: 1T tokens (RefinedWeb)
84 | License: Apache 2.0
85 | Context Length: 2048
86 | ```
87 |
88 | - **OpenLLaMA** [[OpenLM]]() May. 2023 [[open]](https://github.com/openlm-research/open_llama)
89 |
90 | ```yaml
91 | Field: Language
92 | Params: 13B, 7B, 3B
93 | Training Data: 1T tokens (RedPajama)
94 | License: Apache 2.0
95 | Context Length: 2048
96 | ```
97 |
98 | - **Redpajama-INCITE** [[Together]](https://github.com/togethercomputer/RedPajama-Data) May. 2023 [[open]](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1)
99 |
100 | ```yaml
101 | Field: Language
102 | Params: 7B, 3B
103 | Training Data: 1T tokens (Redpajama)
104 | License: Apache 2.0
105 | Context Length: 2048
106 | ```
107 |
108 | - **MPT** [[MosaicML]](https://www.mosaicml.com/blog/mpt-7b) May. 2023 [[open]](https://github.com/mosaicml/llm-foundry)
109 |
110 | ```yaml
111 | Field: Language
112 | Params: 30B, 7B
113 | Training Data: 1T tokens (Private)
114 | License: Apache 2.0, CC BY-SA-3.0
115 | Context Length: 84k
116 | ```
117 |
118 | - **Stable-LM** [[Stability-AI]](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) Apr. 2023 [[open]](https://github.com/Stability-AI/StableLM#stablelm-alpha)
119 |
120 | ```yaml
121 | Field: Language
122 | Params: 7B, 3B
123 | Training Data: 1.5T tokens
124 | License: CC BY-SA-4.0
125 | ```
126 |
127 | - **Lit-LLaMA** [[Lightning-AI]]() Apr. 2023 [[open]](https://github.com/Lightning-AI/lit-llama)
128 |
129 | ```yaml
130 | Field: Language
131 | Params: 13B, 7B
132 | Training Data: 1.2T tokens (Redpajama)
133 | License: Apache 2.0
134 | ```
135 |
136 | - **h2oGPT** [[H2O.ai]](https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/) Jun. 2023 [[open]](https://github.com/h2oai/h2ogpt)
137 | [h2oGPT: Democratizing Large Language Models](https://arxiv.org/pdf/2306.08161.pdf)
138 |
139 | ```yaml
140 | Field: Language
141 | Params: 13B, 7B
142 | Training Data: 1.0T tokens
143 | License: Apache 2.0
144 | Context Length: 2048
145 | ```
146 |
147 | - **Cerebras-GPT** [[Cerebras]]() Mar. 2023 [[open]](https://huggingface.co/cerebras/Cerebras-GPT-13B)
148 | Training Compute-Optimal Large Language Models [[preprint]](https://arxiv.org/abs/2203.15556)
149 |
150 | ```yaml
151 | Field: Language
152 | Params: 13B
153 | Training Data: 371B tokens (The Pile)
154 | License: Apache 2.0
155 | Context Length: 2048
156 | ```
157 |
158 | - **Claude** [[Anthropic]](https://www.anthropic.com/index/introducing-claude) Mar. 2023 [close]
159 |
160 | ```yaml
161 | Field: Language-Vision
162 | ```
163 |
164 | - **GPT-4** [[OpenAI]](https://openai.com/product/gpt-4) Mar. 2023 [close]
165 | GPT-4 Technical Report [[Preprint]](https://cdn.openai.com/papers/gpt-4.pdf)
166 |
167 | ```yaml
168 | Field: Language-Vision
169 | Params: 1.7T
170 | Architecture: De, MoE
171 | ```
172 |
173 | - **Bard** [[Google]](https://blog.google/technology/ai/bard-google-ai-search-updates/) Feb. 2023 [close]
174 |
175 | ```yaml
176 | Field: Language-Vision
177 | ```
178 |
179 | - **LLaMA** [[Meta]]() Feb. 2023 [[open]](https://github.com/facebookresearch/llama)
180 | Open and Efficient Foundation Language Models [[Preprint]](https://arxiv.org/pdf/2302.13971v1.pdf)
181 |
182 | ```yaml
183 | Field: Language
184 | Params: 65B, 33B, 13B, 7B
185 | Training Data: 4TB (1.4T tokens)
186 | Training Cost: 1,022,362 A100 GPU hours (2048 80G-A100 x 21 days)
187 | Training Power Consumption: 449 MWh
188 | Instruction-tuned Variants: Alpaca, Vicuna, Dolly, Guanaco, ColossalChat, GPT4All, Koala, BELLE, MiniGPT-4, etc.
189 | License: GPL
190 | ```
191 |
192 | - **RWKV-4** [[Personal]]() Dec. 2022 [[open]](https://github.com/BlinkDL/RWKV-LM)
193 |
194 | ```yaml
195 | Field: Language
196 | Params: 14B, 7B, 3B, 1.5B
197 | Training Data: 332B tokens
198 | Architecture: De, RNN
199 | License: Apache 2.0
200 | ```
201 |
202 | - **AnthropicLM** [[Anthropic]]() Dec. 2022 [close]
203 | Constitutional AI: Harmlessness from AI Feedback
204 |
205 | ```yaml
206 | Field: Language
207 | Params: 52B
208 | ```
209 |
210 | - **BLOOM** [[BigScience]]() Nov. 2022 [[open]](https://huggingface.co/bigscience/bloom)
211 | A 176B-Parameter Open-Access Multilingual Language Model [[Preprint]](https://arxiv.org/pdf/2211.05100.pdf)
212 |
213 | ```yaml
214 | Field: Language
215 | Params: 176B
216 | Training Data: 174GB (336B tokens)
217 | Training Cost: 1M A100 GPU hours = 384 80G-A100 x 4 months
218 | Training Power Consumption: 475 MWh
219 | Training Framework: Megatron + Deepspeed
220 | Instruction-tuned Variants: BLOOMZ
221 | License: OpenRAIL-M v1
222 | Context Length: 2048
223 | ```
224 |
225 | - **Galactica** [[Meta]]() Nov. 2022 [[open]](https://huggingface.co/facebook/galactica-1.3b)
226 | A scientific language model trained on over 48 million scientific texts [[Preprint]](https://arxiv.org/pdf/2211.09085.pdf)
227 |
228 | ```yaml
229 | Field: Language
230 | Params: 125M, 1.3B, 6.7B, 30B, 120B
231 | ```
232 |
233 | - **Pythia** [[EleutherAI]]() Oct. 2022 [[open]](https://github.com/EleutherAI/pythia)
234 |
235 | ```yaml
236 | Field: Language
237 | Params: 12B
238 | Instruction-tuned Variants: Dolly 2.0
239 | License: Apache 2.0
240 | Context Length: 2048
241 | ```
242 |
243 | - **GLM-130B** [[BAAI]](https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/) Oct. 2022 [[open]](https://github.com/THUDM/GLM-130B)
244 | GLM-130B: An Open Bilingual Pre-trained Model [[ICLR'23]](https://arxiv.org/pdf/2210.02414.pdf)
245 |
246 | ```yaml
247 | Field: Language
248 | Params: 130B
249 | Training Data: (400B tokens)
250 | Training Cost: 516,096 A100 hours = 768 40G-A100 x 28 days
251 | Training Framework: Megatron + Deepspeed
252 | ```
253 |
254 | - **UL2** [[Google]]() May 2022 [[open]](https://huggingface.co/google/ul2)
255 | Unifying Language Learning Paradigms [[Preprint]](https://arxiv.org/abs/2205.05131)
256 |
257 | ```yaml
258 | Field: Language
259 | Params: 20B
260 | Training Data: 800GB (1T tokens)
261 | Architecture: En-De
262 | Training Framework: Jax + T5x
263 | License: Apache 2.0
264 | Instruction-tuned Variants: Flan-UL2
265 | Context Length: 2048
266 | ```
267 |
268 | - **OPT** [[Meta]](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) May 2022 [[open]](https://github.com/facebookresearch/metaseq)
269 | OPT: Open Pre-trained Transformer Language Models [[Preprint]](https://arxiv.org/abs/2205.01068)
270 |
271 | ```yaml
272 | Field: Language
273 | Params: 175B
274 | Training Data: 800GB (180B tokens)
275 | Training Cost: 809,472 A100 hours = 992 80G-A100 x 34 days
276 | Training Power Consumption: 356 MWh
277 | Architecture: De
278 | Training Framework: Megatron + Fairscale
279 | ```
280 |
281 | - **PaLM** [[Google]](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) Apr. 2022 [close]
282 | PaLM: Scaling Language Modeling with Pathways [[Preprint]](https://arxiv.org/abs/2204.02311)
283 |
284 | ```yaml
285 | Field: Language
286 | Params: 540B
287 | Training Data: 3TB (780B tokens)
288 | Training Cost: $10M (16,809,984 TPUv4core-hours, 64 days)
289 | Training petaFLOPs: 2.5B
290 | Architecture: De
291 | Training Framework: Jax + T5x
292 | ```
293 |
294 | - **GPT-NeoX** [[EleutherAI]](https://blog.eleuther.ai/announcing-20b/) Apr. 2022 [[open]](https://github.com/EleutherAI/gpt-neox)
295 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model [[Preprint]](https://arxiv.org/abs/2204.06745)
296 |
297 | ```yaml
298 | Field: Language
299 | Params: 20B
300 | Training Data: 525GiB
301 | Training petaFLOPs: 93B
302 | Architecture: De
303 | Training Framework: Megatron + Fairscale
304 | License: Apache 2.0
305 | Context Length: 2048
306 | ```
307 |
308 | - **InstructGPT** [[OpenAI]]() Mar. 2022 [close]
309 | Training language models to follow instructions with human feedback [[Preprint]](https://arxiv.org/abs/2203.02155)
310 |
311 | ```yaml
312 | Field: Language
313 | Params: 175B
314 | ```
315 |
316 | - **Chinchilla** [[DeepMind]](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training) Mar. 2022 [close]
317 | Training Compute-Optimal Large Language Models [[Preprint]](https://arxiv.org/abs/2203.15556)
318 |
319 | ```yaml
320 | Field: Language
321 | Params: 70B
322 | Training Data: 5.2TB (1.4T tokens)
323 | Training petaFLOPs: 580M
324 | Architecture: De
325 | ```
326 |
327 | - **EVA 2.0** [[BAAI]](https://wudaoai.cn/model/detail/EVA) Mar. 2022 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
328 | EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training [[Preprint]](https://arxiv.org/abs/2203.09313)
329 |
330 | ```yaml
331 | Field: Language (Dialogue)
332 | Params: 2.8B
333 | Training Data: 180G (1.4B samples, Chinese)
334 | ```
335 |
336 | - **AlphaCode** [[DeepMind]](https://www.deepmind.com/blog/competitive-programming-with-alphacode) Mar. 2022 [close]
337 | Competition-Level Code Generation with AlphaCode [[Preprint]](https://arxiv.org/abs/2203.07814)
338 |
339 | ```yaml
340 | Field: Code Generation
341 | Params: 41B
342 | Training Data: (967B tokens)
343 | Architecture: De
344 | ```
345 |
346 | - **ST-MoE** [[Google]]() Feb. 2022 [close]
347 | ST-MoE: Designing Stable and Transferable Sparse Expert Models [[Preprint]](https://arxiv.org/abs/2202.08906)
348 |
349 | ```yaml
350 | Field: Language
351 | Params: 296B
352 | Architecture: En-De, MoE
353 | ```
354 |
355 | - **LaMDA** [[Google]](https://arxiv.org/abs/2201.08239) Jan. 2022 [close]
356 | LaMDA: Language Models for Dialog Applications [[Preprint]](https://arxiv.org/abs/2201.08239)
357 |
358 | ```yaml
359 | Field: Language (Dialogue)
360 | Params: 137B
361 | Training Data: (1.56T words)
362 | Training petaFLOPs: 360M
363 | Architecture: De
364 | ```
365 |
366 | - **GLaM** [[Google]](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) Dec. 2021 [close]
367 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2112.06905)
368 |
369 | ```yaml
370 | Field: Language
371 | Params: 1.2T
372 | Architecture: De, MoE
373 | ```
374 |
375 | - **Gopher** [[DeepMind]](https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval) Dec. 2021 [close]
376 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher [[Preprint]](https://arxiv.org/abs/2112.11446)
377 |
378 | ```yaml
379 | Field: Language
380 | Params: 280B
381 | Training Data: 1.3TB (300B tokens)
382 | Training petaFLOPs: 630M
383 | Architecture: De
384 | ```
385 |
386 | - **Yuan 1.0** [[Inspur]](https://air.inspur.com/home) Oct. 2021 [close]
387 | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2110.04725)
388 |
389 | ```yaml
390 | Field: Language
391 | Params: 245B
392 | Training Data: 5TB (180B tokens, Chinese)
393 | Training petaFLOPs: 410M
394 | Architecture: De, MoE
395 | ```
396 |
397 | - **MT-NLG** [[Microsoft, Nvidia]](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) Oct. 2021 [close]
398 | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [[Preprint]](https://arxiv.org/abs/2201.11990)
399 |
400 | ```yaml
401 | Field: Language
402 | Params: 530B
403 | Training Data: 339B tokens
404 | Training petaFLOPs: 1.4B
405 | Architecture: De
406 | ```
407 |
408 | - **PLATO-XL** [[Baidu]](http://research.baidu.com/Blog/index-view?id=163) Sept. 2021 [close]
409 | PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation [[Preprint]](https://arxiv.org/abs/2109.09519)
410 |
411 | ```yaml
412 | Field: Language (Dialogue)
413 | Params: 11B
414 | Training Data: (1.2B samples)
415 | ```
416 |
417 | - **GPT-J** [[EleutherAI]](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) Aug. 2021 [[open]](https://github.com/kingoflolz/mesh-transformer-jax)
418 |
419 | ```yaml
420 | Field: Language
421 | Params: 6B
422 | Programming Language: Jax
423 | ```
424 |
425 | - **Jurassic-1** [[AI21 Labs]](https://www.zdnet.com/article/watch-out-gpt-3-here-comes-ai21s-jurassic-language-model/) Aug. 2021 [close]
426 | Jurassic-1: Technical Details and Evaluation [[Preprint]](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
427 |
428 | ```yaml
429 | Field: Language
430 | Params: 178B
431 | Training petaFLOPs: 370M
432 | Architecture: De
433 | ```
434 |
435 | - **Codex** [[OpenAI]](https://openai.com/blog/openai-codex/) July 2021 [close]
436 | Evaluating Large Language Models Trained on Code [[Preprint]](https://arxiv.org/abs/2107.03374)
437 |
438 | ```yaml
439 | Field: Code Generation
440 | Params: 12B
441 | Training Data: 159GB
442 | Architecture: De
443 | ```
444 |
445 | - **ERNIE 3.0** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie) July 2021 [close]
446 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [[Preprint]](https://arxiv.org/abs/2107.02137)
447 |
448 | ```yaml
449 | Field: Language
450 | Params: 10B
451 | Training Data: 4TB (375B tokens, with knowledge graph)
452 | Architecture: En
453 | Objective: MLM
454 | ```
455 |
456 | - **CPM-2** [[BAAI]]() June 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master)
457 | CPM-2: Large-scale Cost-effective Pre-trained Language Models [[Preprint]](https://arxiv.org/abs/2106.10715)
458 |
459 | ```yaml
460 | Field: Language
461 | Params: 198B
462 | Training Data: 2.6TB (Chinese 2.3TB, English 300GB)
463 | Architecture: En-De
464 | Objective: MLM
465 | ```
466 |
467 | - **HyperCLOVA** [[Naver]](https://www.navercorp.com/promotion/pressReleasesView/30546) May 2021 [close]
468 | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers [[Preprint]](https://arxiv.org/abs/2109.04650v1)
469 |
470 | ```yaml
471 | Field: Language
472 | Params: 82B
473 | Training Data: 562B tokens (Korean)
474 | Training petaFLOPs: 63B
475 | Architecture: De
476 | ```
477 |
478 | - **ByT5** [[Google]]() May 2021 [[open]](https://github.com/google-research/byt5)
479 | ByT5: Towards a token-free future with pre-trained byte-to-byte models [[TACL'22]](https://arxiv.org/abs/2105.13626)
480 |
481 | ```yaml
482 | Field: Language
483 | Params: 13B
484 | Training Data: (101 languages)
485 | Architecture: En-De
486 | ```
487 |
488 | - **PanGu-α** [[Huawei]]() Apr. 2021 [close]
489 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [[Preprint]](https://arxiv.org/abs/2104.12369)
490 |
491 | ```yaml
492 | Field: Language
493 | Params: 200B
494 | Training Data: 1.1TB (Chinese)
495 | Training petaFLOPs: 58M
496 | Architecture: De
497 | ```
498 |
499 | - **mT5** [[Google]]() Mar. 2021 [[open]](https://github.com/google-research/multilingual-t5)
500 | mT5: A massively multilingual pre-trained text-to-text transformer [[Preprint]](https://arxiv.org/abs/2010.11934)
501 |
502 | ```yaml
503 | Field: Language
504 | Params: 13B
505 | Training Data: (101 languages)
506 | Architecture: En-De
507 | ```
508 |
509 | - **WuDao-WenHui** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/Transformer-XL)
510 |
511 | ```yaml
512 | Field: Language
513 | Params: 2.9B
514 | Training Data: 303GB (Chinese)
515 | ```
516 |
517 | - **GLM** [[BAAI]]() Mar. 2021 [[open]](https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/GLM)
518 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling [[Preprint]](https://arxiv.org/abs/2103.10360)
519 |
520 | ```yaml
521 | Field: Language
522 | Params: 10B
523 | Architecture: De
524 | ```
525 |
526 | - **Switch Transformer** [[Google]]() Jan. 2021 [[open]](https://github.com/google-research/t5x)
527 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [[Preprint]](https://arxiv.org/abs/2101.03961)
528 |
529 | ```yaml
530 | Field: Language
531 | Params: 1.6T
532 | Training Data: 750GB
533 | Training petaFLOPs: 82M
534 | Architecture: En-De, MoE
535 | Objective: MLM
536 | ```
537 |
538 | - **CPM** [[BAAI]]() Dec. 2020 [[open]](https://github.com/TsinghuaAI/CPM)
539 | CPM: A Large-scale Generative Chinese Pre-trained Language Model [[Preprint]](https://arxiv.org/abs/2012.00413)
540 |
541 | ```yaml
542 | Field: Language
543 | Params: 2.6B
544 | Training Data: 100G (Chinese)
545 | Training petaFLOPs: 1.8M
546 | Architecture: De
547 | Objective: LTR
548 | ```
549 |
550 | - **GPT-3** [[OpenAI]](https://openai.com/api/) May 2020 [close]
551 | Language Models are Few-Shot Learners [[NeurIPS'20]](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
552 |
553 | ```yaml
554 | Field: Language
555 | Params: 175B
556 | Training Data: 45TB (680B Tokens)
557 | Training Time: 95 A100 GPU years (835584 A100 GPU hours, 355 V100 GPU years)
558 | Training Cost: $4.6M
559 | Training petaFLOPs: 310M
560 | Architecture: De
561 | Objective: LTR
562 | Instruction-tuned Variants: InstructGPT, WebGPT, ChatGPT
563 | ```
564 |
565 | - **Blender** [[Meta]](https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet/) Apr. 2020 [[close]](https://huggingface.co/facebook/blenderbot-90M?text=Hey+my+name+is+Thomas%21+How+are+you%3F)
566 | Recipes for building an open-domain chatbot [[Preprint]](https://arxiv.org/abs/2004.13637)
567 |
568 | ```yaml
569 | Field: Language (Dialogue)
570 | Params: 9.4B
571 | ```
572 |
573 | - **T-NLG** [[Microsoft]](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) Feb. 2020 [close]
574 |
575 | ```yaml
576 | Field: Language
577 | Params: 17B
578 | Training petaFLOPs: 16M
579 | Architecture: De
580 | Objective: LTR
581 | ```
582 |
583 | - **Meena** [[Google]](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html) Jan. 2020 [close]
584 | Towards a Human-like Open-Domain Chatbot [[Preprint]](https://arxiv.org/abs/2001.09977)
585 |
586 | ```yaml
587 | Field: Language (Dialogue)
588 | Params: 2.6B
589 | Training Data: 341GB (40B words)
590 | Training petaFLOPs: 110M
591 | ```
592 |
593 | - **DialoGPT** [[Microsoft]](https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/) Nov. 2019 [[open]](https://github.com/microsoft/DialoGPT)
594 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation [[ACL'20]](https://arxiv.org/abs/1911.00536)
595 |
596 | ```yaml
597 | Field: Language (Dialogue)
598 | Params: 762M
599 | Training Data: (147M conversations)
600 | Architecture: De
601 | ```
602 |
603 | - **T5** [[Google]](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) Oct. 2019 [[open]](https://github.com/google-research/text-to-text-transfer-transformer)
604 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [[JMLR'20]](https://arxiv.org/abs/1910.10683)
605 |
606 | ```yaml
607 | Field: Language
608 | Params: 11B
609 | Training Data: 800GB
610 | Training Cost: $1.5M
611 | Training petaFLOPs: 41M
612 | Architecture: En-De
613 | Objective: MLM
614 | License: Apache 2.0
615 | Instruction-tuned Variants: Flan-T5
616 | Context Length: 512
617 | ```
618 |
619 | - **Megatron-LM** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
620 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
621 |
622 | ```yaml
623 | Field: Language
624 | Params: 8.3B
625 | Training Data: 174GB
626 | Training petaFLOPs: 9.1M
627 | Architecture: De
628 | Objective: LTR
629 | Training Framework: Megatron
630 | ```
631 |
632 | - **Megatron-BERT** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
633 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
634 |
635 | ```yaml
636 | Field: Language
637 | Params: 3.9B
638 | Training Data: 174GB
639 | Training petaFLOPs: 57M
640 | Architecture: En
641 | Objective: MLM
642 | Training Framework: Megatron
643 | ```
644 |
645 | - **RoBERTa** [[Meta]](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) July 2019 [[open]](https://github.com/facebookresearch/fairseq)
646 | RoBERTa: A Robustly Optimized BERT Pretraining Approach [[Preprint]](https://arxiv.org/abs/1907.11692)
647 |
648 | ```yaml
649 | Field: Language
650 | Params: 354M
651 | Training Data: 160GB
652 | Training Time: 1024 V100 GPU days
653 | Architecture: En
654 | Objective: MLM
655 | ```
656 |
657 | - **XLNet** [[Google]]() June 2019 [[open]](https://github.com/zihangdai/xlnet)
658 | XLNet: Generalized Autoregressive Pretraining for Language Understanding [[NeurIPS'19]](https://papers.nips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html)
659 |
660 | ```yaml
661 | Field: Language
662 | Params: 340M
663 | Training Data: 113GB (33B words)
664 | Training Time: 1280 TPUv3 days
665 | Training Cost: $245k
666 | Architecture: En
667 | Objective: PLM
668 | ```
669 |
670 | - **GPT-2** [[OpenAI]](https://openai.com/blog/better-language-models/) Feb. 2019 [[open]](https://github.com/openai/gpt-2)
671 | Language Models are Unsupervised Multitask Learners [[Preprint]](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
672 |
673 | ```yaml
674 | Field: Language
675 | Params: 1.5B
676 | Training Data: 40GB (8M web pages)
677 | Training Cost: $43k
678 | Training petaFLOPs: 1.5M
679 | Architecture: De
680 | Objective: LTR
681 | ```
682 |
683 | - **BERT** [[Google]]() Oct. 2018 [[open]](https://github.com/google-research/bert)
684 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [[NAACL'19]](https://arxiv.org/abs/1810.04805)
685 |
686 | ```yaml
687 | Field: Language
688 | Params: 330M
689 | Training Data: 16GB (3.3B words)
690 | Training Time: 64 TPUv2 days (280 V100 GPU days)
691 | Training Cost: $7k
692 | Training petaFLOPs: 290k
693 | Architecture: En
694 | Objective: MLM, NSP
695 | ```
696 |
697 | - **GPT** [[OpenAI]](https://openai.com/blog/language-unsupervised/) June 2018 [open]
698 | Improving Language Understanding by Generative Pre-Training [[Preprint]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
699 |
700 | ```yaml
701 | Field: Language
702 | Params: 117M
703 | Training Data: 1GB (7k books)
704 | Training petaFLOPs: 18k
705 | Architecture: De
706 | Objective: LTR
707 | ```
708 |
709 | ### Vision Models
710 |
711 | - **EVA-02-E** [[BAAI]]() Mar. 2023 [[open]](https://github.com/huggingface/pytorch-image-models/tree/main)
712 | EVA-02: A Visual Representation for Neon Genesis [[Preprint]](https://arxiv.org/abs/2303.11331v2)
713 |
714 | ```yaml
715 | Field: Vision-Language
716 | Params: 5B
717 | Training Data: 2B image-text pairs
718 | Architecture: Transformer
719 | Objective: MIM, CLIP Contrastive
720 | ```
721 |
722 | - **MAE->WSP-2B** [[Meta]]() Mar. 2023 [close]
723 | The effectiveness of MAE pre-pretraining for billion-scale pretraining [[Preprint]](https://arxiv.org/abs/2303.13496)
724 |
725 | ```yaml
726 | Field: Vision
727 | Params: 6.5B
728 | Training Data: 3B images
729 | Architecture: Transformer
730 | Objective: MAE, Weakly-Supervised
731 | ```
732 |
733 | - **OpenCLIP G/14** [[LAION]]() Mar. 2023 [[open]](https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K)
734 |
735 | ```yaml
736 | Field: Vision-Language
737 | Params: 2.5B
738 | Training Data: 2B images
739 | ```
740 |
741 | - **ViT-22B** [[Google]]() Feb. 2023 [close]
742 | [Scaling Vision Transformers to 22 Billion Parameters](https://arxiv.org/abs/2302.05442)
743 |
744 | ```yaml
745 | Field: Vision
746 | Params: 22B
747 | Training Data: 4B images
748 | Architecture: Transformer
749 | Objective: Supervised
750 | ```
751 |
762 | - **InternImage-G** [[Shanghai AI Lab]](https://github.com/OpenGVLab/InternImage) Nov. 2022 [[open]](https://github.com/OpenGVLab/InternImage)
763 | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [[CVPR'23 Highlight]](https://arxiv.org/abs/2211.05778)
764 |
765 | ```yaml
766 | Field: Vision
767 | Params: 3B
768 | Architecture: CNN
769 | Core Operator: Deformable Convolution v3
770 | ```
771 |
772 | - **Stable Diffusion** [[Stability AI]]() Aug. 2022 [[open]]()
773 |
774 | ```yaml
775 | Field: Image Generation (text to image)
776 | Params: 890M
777 | Training Data: 5B images
778 | Architecture: Transformer, Diffusion
779 | ```
780 |
781 | - **Imagen** [[Google]](https://imagen.research.google/) May 2022
782 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [[Preprint]](https://arxiv.org/abs/2205.11487)
783 |
784 | ```yaml
785 | Field: Image Generation (text to image)
786 | Text Encoder: T5
787 | Image Decoder: Diffusion, Upsampler
788 | ```
789 |
790 | - **Flamingo** [[DeepMind]]() Apr. 2022 [close]
791 | Flamingo: a Visual Language Model for Few-Shot Learning [[Preprint]](https://arxiv.org/abs/2204.14198)
792 |
793 | ```yaml
794 | Field: Vision-Language
795 | Params: 80B
796 | ```
797 |
798 | - **DALL·E 2** [[OpenAI]](https://openai.com/dall-e-2/) Apr. 2022
799 | Hierarchical Text-Conditional Image Generation with CLIP Latents [[Preprint]](https://cdn.openai.com/papers/dall-e-2.pdf)
800 |
801 | ```yaml
802 | Field: Image Generation (text to image)
803 | Text Encoder: GPT2 (CLIP)
804 | Image Encoder: ViT (CLIP)
805 | Image Decoder: Diffusion, Upsampler
806 | ```
807 |
808 | - **BaGuaLu** [[BAAI, Alibaba]]() Apr. 2022
809 | BaGuaLu: targeting brain scale pretrained models with over 37 million cores [[PPoPP'22]](https://keg.cs.tsinghua.edu.cn/jietang/publications/PPOPP22-Ma%20et%20al.-BaGuaLu%20Targeting%20Brain%20Scale%20Pretrained%20Models%20w.pdf)
810 |
811 | ```yaml
812 | Field: Vision-Language
813 | Params: 174T
814 | Architecture: M6
815 | ```
816 |
817 | - **SEER** [[Meta]]() Feb. 2022 [[open]](https://github.com/facebookresearch/vissl)
818 | Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [[Preprint]](https://arxiv.org/abs/2202.08360v2)
819 |
820 | ```yaml
821 | Field: Vision
822 | Params: 10B
823 | Training Data: 1B images
824 | Architecture: Convolution
825 | Objective: SwAV
826 | ```
827 |
828 | - **ERNIE-ViLG** [[Baidu]](https://wenxin.baidu.com/wenxin/ernie-vilg) Dec. 2021
829 | ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [[Preprint]](https://arxiv.org/abs/2112.15283)
830 |
831 | ```yaml
832 | Field: Image Generation (text to image)
833 | Params: 10B
834 | Training Data: 145M text-image pairs
835 | Architecture: Transformer, dVAE + De
836 | ```
837 |
838 | - **NUWA** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/NUWA)
839 | NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion [[Preprint]](https://arxiv.org/abs/2111.12417)
840 |
841 | ```yaml
842 | Field: Vision-Language
843 | Generation: Image, Video
844 | Params: 870M
845 | ```
846 |
847 | - **SwinV2-G** [[Microsoft]]() Nov. 2021 [[open]](https://github.com/microsoft/Swin-Transformer)
848 | Swin Transformer V2: Scaling Up Capacity and Resolution [[CVPR'22]](https://arxiv.org/abs/2111.09883v2)
849 |
850 | ```yaml
851 | Field: Vision
852 | Params: 3B
853 | Training Data: 70M images
854 | Architecture: Transformer
855 | Objective: Supervised
856 | ```
857 |
858 | - **Zidongtaichu** [[CASIA]](http://www.ia.cas.cn/xwzx/kydt/202109/t20210927_6215538.html) Sept. 2021 [close]
859 |
860 | ```yaml
861 | Field: Image, Video, Language, Speech
862 | Params: 100B
863 | ```
864 |
865 | - **ViT-G/14** [[Google]]() June 2021
866 | Scaling Vision Transformers [[Preprint]](https://arxiv.org/abs/2106.04560)
867 |
868 | ```yaml
869 | Field: Vision
870 | Params: 1.8B
871 | Training Data: 300M images
872 | Training petaFLOPs: 3.4M
873 | Architecture: Transformer
874 | Objective: Supervised
875 | ```
876 |
877 | - **CoAtNet** [[Google]](https://ai.googleblog.com/2021/09/toward-fast-and-accurate-neural.html) June 2021 [[open]](https://github.com/chinhsuanwu/coatnet-pytorch)
878 | CoAtNet: Marrying Convolution and Attention for All Data Sizes [[NeurIPS'21]](https://arxiv.org/abs/2106.04803)
879 |
880 | ```yaml
881 | Field: Vision
882 | Params: 2.4B
883 | Training Data: 300M images
884 | Architecture: Transformer, Convolution
885 | Objective: Supervised
886 | ```
887 |
888 | - **V-MoE** [[Google]](https://ai.googleblog.com/2022/01/scaling-vision-with-sparse-mixture-of.html) June 2021
889 | Scaling Vision with Sparse Mixture of Experts [[NeurIPS'21]](https://proceedings.neurips.cc//paper/2021/file/48237d9f2dea8c74c2a72126cf63d933-Paper.pdf)
890 |
891 | ```yaml
892 | Field: Vision
893 | Params: 15B
894 | Training Data: 300M images
895 | Training Time: 16.8k TPUv3 days
896 | Training petaFLOPs: 33.9M
897 | Architecture: Transformer, MoE
898 | Objective: Supervised
899 | ```
900 |
901 | - **CogView** [[BAAI, Alibaba]](https://wudao.aminer.cn/CogView/index.html) May 2021 [[open]](https://github.com/THUDM/CogView)
902 | CogView: Mastering Text-to-Image Generation via Transformers [[NeurIPS'21]](https://arxiv.org/abs/2105.13290)
903 |
904 | ```yaml
905 | Field: Vision-Language
906 | Params: 4B
907 | Training Data: 30M text-image pairs
908 | Training petaFLOPs: 27M
909 | Image Encoder: VAE
910 | Text Encoder & Image Decoder: GPT2
911 | ```
912 |
913 | - **M6** [[Alibaba]](https://m6.aliyun.com/#/) Mar. 2021
914 | M6: A Chinese Multimodal Pretrainer [[Preprint]](https://arxiv.org/abs/2103.00823)
915 |
916 | ```yaml
917 | Field: Vision-Language
918 | Params: 10T
919 | Training Data: 300G Texts + 2TB Images
920 | Training petaFLOPs: 5.5M
921 | Fusion: Single-stream
922 | Objective: MLM, IC
923 | ```
924 |
925 | - **DALL·E** [[OpenAI]](https://openai.com/blog/dall-e/) Feb. 2021
926 | Zero-Shot Text-to-Image Generation [[ICML'21]](https://arxiv.org/abs/2102.12092)
927 |
928 | ```yaml
929 | Field: Image Generation (text to image)
930 | Params: 12B
931 | Training Data: 250M text-image pairs
932 | Training petaFLOPs: 47M
933 | Image Encoder: dVAE
934 | Text Encoder & Image Decoder: GPT2
935 | ```
936 |
937 | - **CLIP** [[OpenAI]](https://openai.com/blog/clip/) Jan. 2021
938 | Learning Transferable Visual Models From Natural Language Supervision [[ICML'21]](https://arxiv.org/abs/2103.00020)
939 |
940 | ```yaml
941 | Field: Vision-Language
942 | Training Data: 400M text-image pairs
943 | Training petaFLOPs: 11M
944 | Image Encoder: ViT
945 | Text Encoder: GPT-2
946 | Fusion: Dual Encoder
947 | Objective: CMCL
948 | ```
949 |
950 | - **ViT-H/14** [[Google]](https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html) Oct. 2020 [[open]](https://github.com/google-research/vision_transformer)
951 | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[ICLR'21]](https://arxiv.org/abs/2010.11929)
952 |
953 | ```yaml
954 | Field: Vision
955 | Params: 632M
956 | Training Data: 300M images
957 | Training petaFLOPs: 13M
958 | Architecture: Transformer
959 | Objective: Supervised
960 | ```
961 |
962 | - **iGPT-XL** [[OpenAI]](https://openai.com/blog/image-gpt/) June 2020 [[open]](https://github.com/openai/image-gpt)
963 | Generative Pretraining From Pixels [[ICML'20]](https://proceedings.mlr.press/v119/chen20s.html)
964 |
965 | ```yaml
966 | Field: Image Generation
967 | Params: 6.8B
968 | Training Data: 1M images
969 | Training petaFLOPs: 33M
970 | Architecture: Transformer, De
971 | ```
972 |
973 | - **BigGAN-deep** [[DeepMind]]() Sept. 2018 [[open]](https://github.com/ajbrock/BigGAN-PyTorch)
974 | Large Scale GAN Training for High Fidelity Natural Image Synthesis [[ICLR'19]](https://arxiv.org/abs/1809.11096)
975 |
976 | ```yaml
977 | Field: Image Generation
978 | Params: 158M
979 | Training Data: 300M images
980 | Training petaFLOPs: 3M
981 | Architecture: Convolution, GAN
982 | Resolution: 512x512
983 | ```
984 |
985 | ### Reinforcement Learning
986 |
987 | - **PaLM-E** [[Google]](https://palm-e.github.io/) Mar. 2023 [close]
988 | PaLM-E: An Embodied Multimodal Language Model [[Preprint]](https://palm-e.github.io/assets/palm-e.pdf)
989 |
990 | ```yaml
991 | Field: Reinforcement Learning
992 | Params: 562B (540B LLM + 22B ViT)
993 | ```
994 |
995 | - **Gato** [[DeepMind]](https://www.deepmind.com/publications/a-generalist-agent) May 2022 [close]
996 | A Generalist Agent [[Preprint]](https://arxiv.org/abs/2205.06175)
997 |
998 | ```yaml
999 | Field: Reinforcement Learning
1000 | Params: 1.2B
1001 | Training Data: (604 Tasks)
1002 | Objective: Supervised
1003 | ```
1004 |
1005 | ### Speech
1006 |
1007 | - **USM** [[Google]](https://sites.research.google/usm/) Mar. 2023 [close]
1008 | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [[Preprint]](https://arxiv.org/pdf/2303.01037v2.pdf)
1009 |
1010 | ```yaml
1011 | Field: Speech
1012 | Params: 2B
1013 | Training Data: 12,000,000 hours
1014 | ```
1015 |
1016 | - **Whisper** [[OpenAI]](https://openai.com/research/whisper) Sept. 2022 [[open]](https://github.com/openai/whisper)
1017 | Robust Speech Recognition via Large-Scale Weak Supervision [[Preprint]](https://arxiv.org/pdf/2212.04356.pdf)
1018 |
1019 | ```yaml
1020 | Field: Speech
1021 | Params: 1.55B
1022 | Training Data: 680,000 hours
1023 | Objective: Weakly Supervised
1024 | ```
1025 |
1026 | - **HuBERT** [[Meta]](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) June 2021 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert)
1027 | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [[Preprint]](https://arxiv.org/abs/2106.07447)
1028 |
1029 | ```yaml
1030 | Field: Speech
1031 | Params: 1B
1032 | Training Data: 60,000 hours
1033 | Objective: MLM
1034 | ```
1035 |
1036 | - **wav2vec 2.0** [[Meta]]() Oct. 2020 [[open]](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec)
1037 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [[NeurIPS'20]](https://arxiv.org/abs/2006.11477)
1038 |
1039 | ```yaml
1040 | Field: Speech
1041 | Params: 317M
1042 | Training Data: 50,000 hours
1043 | Training petaFLOPs: 430M
1044 | Objective: MLM
1045 | ```
1046 |
1047 | - **DeepSpeech 2** [[Baidu]]() Dec. 2015 [[open]](https://github.com/PaddlePaddle/PaddleSpeech)
1048 | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin [[ICML'16]](https://arxiv.org/pdf/1512.02595.pdf)
1050 |
1051 | ```yaml
1052 | Field: Speech
1053 | Params: 300M
1054 | Training Data: 21,340 hours
1055 | ```
1056 |
1057 | ### Science
1058 |
1059 | - **AlphaFold 2** [[DeepMind]](https://www.deepmind.com/research/highlighted-research/alphafold) July 2021 [[open]](https://github.com/deepmind/alphafold)
1060 | Highly accurate protein structure prediction with AlphaFold [[Nature]](https://www.nature.com/articles/s41586-021-03819-2)
1061 |
1062 | ```yaml
1063 | Field: Biology
1064 | Params: 21B
1065 | Training petaFLOPs: 100k
1066 | ```
1067 |
1068 | ## Open LLM Training Dataset
1069 |
1070 | This section will be reorganized. For now, as LLMs prevail and data quality is key to their performance, we keep track of open pretraining datasets here. A minimal loading sketch follows the list.
1071 |
1072 | - [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B): 627B tokens, 895GB Compressed, primarily English, cleaned from RedPajama, Apache 2.0
1073 | - [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): ~600B tokens, 500GB Compressed, English, ODC-By 1.0 license (The 5T tokens version is private)
1074 | - [MNBVC](https://github.com/esbatmop/MNBVC): 5TB (on-going, target 40TB), Chinese, MIT License
1075 | - [The Pile](https://pile.eleuther.ai/): 825G
1076 | - [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T): 1.2T tokens
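
These corpora are hosted on the Hugging Face Hub or GitHub. Below is a minimal sketch of streaming one of them with the `datasets` library so you can inspect samples without downloading hundreds of gigabytes; the dataset ID comes from the SlimPajama entry above, while the `text` field name is an assumption based on common pretraining-corpus layouts.

```python
# Minimal streaming sketch (assumes `pip install datasets`).
from itertools import islice

from datasets import load_dataset

# Dataset ID from the SlimPajama entry above; the `text` field is assumed.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for sample in islice(ds, 3):
    print(sample["text"][:200])  # peek at the first few documents
```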
1077 |
1078 | ## Distributed Training Framework
1079 |
1080 | > Deep Learning frameworks supporting distributed training are marked with \*.
1081 |
1082 | ### PyTorch Ecosystem
1083 |
1084 | - **Accelerate** [[Huggingface]]() Oct. 2020 [[open]](https://github.com/huggingface/accelerate) (a minimal usage sketch follows this list)
1085 | - **Hivemind** Aug. 2020 [[open]](https://github.com/learning-at-home/hivemind)
1086 | Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts [[Preprint]](https://arxiv.org/abs/2002.04013)
1087 | - **FairScale** [[Meta]]() July 2020 [[open]](https://github.com/facebookresearch/fairscale)
1088 | - **DeepSpeed** [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
1089 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [[SC'20]](https://arxiv.org/abs/1910.02054)
1090 | - **Megatron** [[Nvidia]]() Sept. 2019 [[open]](https://github.com/NVIDIA/Megatron-LM)
1091 | Megatron: Training Multi-Billion Parameter Language Models Using Model Parallelism [[Preprint]](https://arxiv.org/abs/1909.08053)
1092 | - **PyTorch\*** [[Meta]](https://pytorch.org/) Sept. 2016 [[open]](https://github.com/pytorch/pytorch)
1093 | PyTorch: An Imperative Style, High-Performance Deep Learning Library [[NeurIPS'19]](http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)
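
As a concrete reference for how these wrappers are used, here is a minimal data-parallel training sketch with Accelerate; the toy model, data, and hyperparameters are placeholders, and the same loop can be dispatched to DDP, DeepSpeed, or FSDP backends without code changes.

```python
# Minimal Accelerate sketch (toy model/data are placeholders).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # reads world size/devices from the launcher

model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32)

# prepare() wraps everything for the selected backend (DDP/DeepSpeed/FSDP)
# so the training loop itself stays unchanged.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

loss_fn = nn.CrossEntropyLoss()
for x, y in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

Launch with `accelerate launch train.py` to run the same script across multiple GPUs or nodes.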
1094 |
1095 | ### XLA Ecosystem
1096 |
1097 | - **T5x** [[Google]]() Mar. 2022 [[open]](https://github.com/google-research/t5x)
1098 | Scaling Up Models and Data with 𝚝𝟻𝚡 and 𝚜𝚎𝚚𝚒𝚘 [[Preprint]](https://arxiv.org/abs/2203.17189)
1099 | - **Alpa** [[Google]]() Jan. 2022 [[open]](https://github.com/alpa-projects/alpa)
1100 | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [[OSDI'22]](https://arxiv.org/pdf/2201.12023.pdf)
1101 | - **Pathways** [[Google]](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/) Mar. 2021 [close]
1102 | Pathways: Asynchronous Distributed Dataflow for ML [[Preprint]](https://arxiv.org/abs/2203.12533)
1103 | - **Colossal-AI** [[HPC-AI TECH]](https://colossalai.org/) Nov. 2021 [[open]](https://github.com/hpcaitech/ColossalAI)
1104 | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [[Preprint]](https://arxiv.org/abs/2110.14883)
1105 | - **GShard** [[Google]](https://arxiv.org/abs/2006.16668) June 2020
1106 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [[Preprint]](https://arxiv.org/abs/2006.16668)
1107 | - **Jax\*** [[Google]]() Oct. 2019 [[open]](https://github.com/google/jax) (a minimal data-parallel sketch follows this list)
1108 | - **Mesh Tensorflow** [[Google]]() Nov. 2018 [[open]](https://github.com/tensorflow/mesh)
1109 | - **Horovod** [[Uber]](https://horovod.ai/) Feb. 2018 [[open]](https://github.com/horovod/horovod)
1110 | Horovod: fast and easy distributed deep learning in TensorFlow [[Preprint]](https://arxiv.org/abs/1802.05799)
1111 | - **Tensorflow\*** [[Google]](https://www.tensorflow.org/) Nov. 2015 [[open]](https://github.com/tensorflow/tensorflow)
1112 | TensorFlow: A system for large-scale machine learning [[OSDI'16]](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
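
For comparison with the PyTorch-side sketch above, here is a minimal JAX data-parallel example using `jax.pmap`, which replicates a computation across all local devices and lets XLA compile it per device; the function and inputs are toy placeholders.

```python
# Minimal jax.pmap sketch (toy computation as a placeholder).
import jax
import jax.numpy as jnp

@jax.pmap  # replicate across all local devices (falls back to 1 device on CPU)
def step(x):
    return jnp.sum(x ** 2)

n = jax.local_device_count()
xs = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # one shard per device
print(step(xs))  # one result per device
```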
1113 |
1114 | ### Other Frameworks
1115 |
1116 | - **OneFlow\*** [[OneFlow]](https://docs.oneflow.org/master/index.html) July 2020 [[open]](https://github.com/OneFlow-Inc/oneflow)
1117 | OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [[Preprint]](https://arxiv.org/abs/2110.15032)
1118 | - **MindSpore\*** [[Huawei]](https://e.huawei.com/en/products/cloud-computing-dc/atlas/mindspore) Mar. 2020 [[open]](https://github.com/mindspore-ai/mindspore)
1119 | - **PaddlePaddle\*** [[Baidu]](https://www.paddlepaddle.org.cn/) Nov. 2018 [[open]](https://github.com/PaddlePaddle/Paddle)
1120 | End-to-end Adaptive Distributed Training on PaddlePaddle [[Preprint]](https://arxiv.org/abs/2112.02752)
1121 | - **Ray** [[Berkeley]]() Dec. 2017 [[open]](https://github.com/ray-project/ray)
1122 | Ray: A Distributed Framework for Emerging AI Applications [[OSDI'17]](https://arxiv.org/pdf/1712.05889.pdf)
1123 |
1124 | ### Inference Frameworks
1125 |
1126 | - Petals [[BigScience]]() Dec. 2022 [[open]](https://github.com/bigscience-workshop/petals)
1127 | - FlexGen [[Stanford, Berkeley, CMU, etc.]]() May 2022 [[open]](https://github.com/FMInference/FlexGen)
1128 | - FasterTransformer [[NVIDIA]]() Apr. 2021 [[open]](https://github.com/NVIDIA/FasterTransformer)
1129 | - MegEngine [[MegEngine]](https://www.megengine.org.cn/) Mar. 2020
1130 | - DeepSpeed-Inference [[Microsoft]](https://www.microsoft.com/en-us/research/project/deepspeed/) Oct. 2019 [[open]](https://github.com/microsoft/DeepSpeed)
1131 | - MediaPipe [[Google]](https://google.github.io/mediapipe/) July 2019 [[open]](https://github.com/google/mediapipe)
1132 | - TensorRT [[Nvidia]]() Jun 2019 [[open]](https://github.com/NVIDIA/TensorRT)
1133 | - MNN [[Alibaba]]() May 2019 [[open]](https://github.com/alibaba/MNN)
1134 | - OpenVINO [[Intel]](https://docs.openvino.ai/latest/index.html) Oct. 2019 [[open]](https://github.com/openvinotoolkit/openvino)
1135 | - ONNX [[Linux Foundation]](https://onnx.ai/) Sep 2017 [[open]](https://github.com/onnx/onnx)
1136 | - ncnn [[Tencent]]() July 2017 [[open]](https://github.com/Tencent/ncnn)
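
Many of these runtimes consume an exported graph rather than framework code. The sketch below shows the common export-then-deploy pattern with ONNX and ONNX Runtime; the toy model and tensor shapes are placeholders (assumes `pip install torch onnx onnxruntime`).

```python
# Minimal ONNX export + inference sketch (toy model as a placeholder).
import numpy as np
import torch
from torch import nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
dummy = torch.randn(1, 16)

# Export a traced graph; dynamic_axes lets the batch size vary at runtime.
torch.onnx.export(
    model, dummy, "toy.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

session = ort.InferenceSession("toy.onnx")
(logits,) = session.run(None, {"input": np.random.randn(8, 16).astype(np.float32)})
print(logits.shape)  # (8, 4)
```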
1137 |
1138 | ### Recommendation Training Framework
1139 |
1140 | - **HET** [[Tencent]]() Dec. 2021
1141 | HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [[VLDB'22]](https://arxiv.org/abs/2112.07221)
1142 | - **Persia** [[Kuaishou]]() Nov. 2021
1143 | Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters [[Preprint]](https://arxiv.org/abs/2111.05897)
1144 |
1145 | ```yaml
1146 | Embeddings Params: 100T
1147 | ```
1148 |
1149 | - **ZionEX** [[Meta]]() Apr. 2021
1150 | Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [[ISCA'21]](https://arxiv.org/abs/2104.05158)
1151 |
1152 | ```yaml
1153 | Embeddings Params: 10T
1154 | ```
1155 |
1156 | - **ScaleFreeCTR** [[Huawei]]() Apr. 2021
1157 | ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [[SIGIR'21]](https://arxiv.org/abs/2104.08542)
1158 | - **Kraken** [[Kuaishou]]() Nov. 2020
1159 | Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations [[SC'20]](http://storage.cs.tsinghua.edu.cn/papers/sc20-kraken.pdf/)
1160 | - **TensorNet** [[Qihoo360]]() Sept. 2020 [[open]](https://github.com/Qihoo360/tensornet)
1161 | - **HierPS** [[Baidu]]() Mar. 2020
1162 | Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems [[MLSys'20]](https://arxiv.org/abs/2003.05622)
1163 | - **AIBox** [[Baidu]]() Oct. 2019
1164 | AIBox: CTR Prediction Model Training on a Single Node [[CIKM'20]](https://dl.acm.org/doi/pdf/10.1145/3357384.3358045)
1165 |
1166 | ```yaml
1167 | Embeddings Params: 0.1T
1168 | ```
1169 |
1170 | - **XDL** [[Alibaba]]() Aug. 2019
1171 | XDL: an industrial deep learning framework for high-dimensional sparse data [[DLP-KDD'21]](https://dlp-kdd.github.io/dlp-kdd2019/assets/pdf/a6-jiang.pdf)
1172 |
1173 | ```yaml
1174 | Embeddings Params: 0.01T
1175 | ```
1176 |
1177 | ## Keys Explanations
1178 |
1179 | - Company tags: the primary affiliated company or lab. Other institutes may also be involved in the work.
1180 | - Params: number of parameters of the largest model
1181 | - Training data size, training cost and training petaFLOPs may have some uncertainty.
1182 | - Training cost (reference prices; a worked example follows this list)
1183 |   - TPUv2 hour: $4.5
1184 |   - TPUv3 hour: $8
1185 |   - V100 GPU hour: $0.55 (2022)
1186 |   - A100 GPU hour: $1.10 (2022)
1187 | - Architecture
1188 |   - En: Encoder-based Language Model
1189 |   - De: Decoder-based Language Model
1190 |   - En-De: Encoder-Decoder-based Language Model
1191 |   - All three architectures above are Transformer-based.
1192 |   - MoE: Mixture of Experts
1193 | - Objective (see the explanations in Sections 6–8 of [this paper](https://arxiv.org/pdf/2203.14101v3.pdf))
1194 |   - MLM: Masked Language Modeling
1195 |   - LTR: Left-To-Right Language Modeling
1196 |   - NSP: Next Sentence Prediction
1197 |   - PLM: Permuted Language Modeling
1198 |   - IC: Image Captioning
1199 |   - VLM: Vision-Language Matching
1200 |   - CMCL: Cross-Modal Contrastive Learning
1201 | - FLOPs: number of FLOating-Point operations [[explanation]](https://openai.com/blog/ai-and-compute/)
1202 |   - 1 petaFLOPs = 1e15 FLOPs
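
The listed petaFLOPs and dollar figures can be sanity-checked with a rough back-of-the-envelope calculation. The sketch below uses the common 6·N·D approximation for dense decoder-only models together with the reference GPU prices above; it illustrates the conventions and is not the method used to produce the numbers in this list.

```python
# Back-of-the-envelope sketch for the conventions above.
PETA = 1e15

def approx_training_petaflops(params: float, tokens: float) -> float:
    """Common 6 * params * tokens approximation for dense decoder-only models."""
    return 6 * params * tokens / PETA

# GPT-3: 175B params trained on ~300B tokens (per the GPT-3 paper) gives
# ~315M petaFLOPs, close to the 310M petaFLOPs recorded above.
print(f"GPT-3: {approx_training_petaflops(175e9, 300e9) / 1e6:.0f}M petaFLOPs")

# BLOOM: ~1M A100 GPU hours at the $1.10/hour reference price suggests a
# hardware cost on the order of $1.1M.
print(f"BLOOM: ~${1e6 * 1.10 / 1e6:.1f}M in A100 time")
```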
1203 |
--------------------------------------------------------------------------------