# 🎉**Awesome-SLM**🎉

## 🌱 How to Contribute
We welcome contributions from researchers. For detailed guidelines on how to contribute, please see our [CONTRIBUTING.md](CONTRIBUTING.md) file.

## 📜 Contents
- [🎉**Awesome-SLM**🎉](#awesome-slm)
- [🌱 How to Contribute](#-how-to-contribute)
- [📜 Contents](#-contents)
- [👋 Introduction](#-introduction)
- [🔥 Base Model](#-base-model)
- [💪 Pretrain Datasets](#-pretrain-datasets)
- [💡 SFT Datasets](#-sft-datasets)
- [🔧 Synthetic Datasets](#-synthetic-datasets)
- [📦 Preference Datasets](#-preference-datasets)
- [🌈 Benchmark](#-benchmark)

## 👋 Introduction


## 🔥 Base Model
1. OPT-series [[paper](https://arxiv.org/abs/2205.01068)] [[code](https://github.com/facebookresearch/metaseq)] [[model](https://huggingface.co/facebook/opt-1.3b)]
    - release time: 2022/06
    - organization: Meta
    - model size: 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B
    - model architecture:
        - a. Training data: a broad mix of corpora, including the datasets used for RoBERTa, The Pile, and Reddit data from PushShift.io, totaling about 180B tokens. The data was deduplicated, is mostly English text, and was tokenized with GPT-2's BPE tokenizer.
        - b. Training strategy: trained with the AdamW optimizer and a linear learning-rate schedule that warms up from zero to the peak value and then decays over the rest of training; a relatively large batch size was used.
        - c. Attention mechanism: a decoder-only pretrained Transformer with multi-head self-attention, using alternating dense and locally banded sparse attention.
        - d. Layers and block type: the architecture and hyperparameters largely follow the GPT-3 design.

![alt text](image.png)

2. Pythia [[paper](https://arxiv.org/pdf/2304.01373)] [[code](https://github.com/EleutherAI/pythia)] [[model](https://huggingface.co/EleutherAI/pythia-1b)]
    - release time: 2023/06
    - organization: EleutherAI
    - model size: 70M, 160M, 410M, 1.0B, 1.4B, 2.8B, 6.9B, 12B
    - model architecture:
        - a. Training data: the Pile, an English-only corpus, containing about 207B tokens after deduplication.
        - b. Training strategy: trained with the GPT-NeoX library using the Adam optimizer, with zero-redundancy optimization (ZeRO) plus data parallelism and tensor parallelism to improve performance.
        - c. Attention mechanism: multi-head self-attention with dense attention and rotary embeddings; Flash Attention is used during training to increase device throughput.
        - d. Layers and block type: the architecture and hyperparameters largely follow the GPT-3 design.

![alt text](image-1.png)

3. phi-1 [[paper](https://arxiv.org/pdf/2306.11644.pdf)] [[code](https://huggingface.co/TommyZQ/phi-1)] [[model](https://huggingface.co/TommyZQ/phi-1)]
    - release time: 2023/06
    - organization: Microsoft
    - model size: 1.42B
    - model architecture:
        - a. Training data: notable for its "textbook-quality" data: a filtered subset of The Stack and StackOverflow (about 6B tokens), Python textbooks generated by GPT-3.5 (fewer than 1B tokens), and about 180M tokens of Python exercises with solutions.
        - b. Training strategy: the phi-1-base model is pretrained on the CodeTextbook dataset (the filtered code-language data plus the synthetic textbooks) with the AdamW optimizer, a linear warm-up / linear decay learning-rate schedule, attention and residual dropout of 0.1, and a batch size of 1024.
        - c. Attention mechanism: multi-head attention in a decoder-only Transformer, with FlashAttention used during pretraining and fine-tuning for efficiency.
        - d. Layers and block type: 24 layers, with the MHA and MLP layers in a parallel configuration (a minimal sketch of this layout follows this entry); each block uses:
            - hidden size: 2048
            - attention heads: 32
            - max position embeddings: 2048
            - position embedding type: rotary
            - residual connection: gpt-j-residual
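The parallel MHA + MLP layout described in the phi-1 entry above means that the attention and feed-forward branches read the same pre-normalized input and are both added back to the residual stream (the "gpt-j-residual" pattern), instead of being applied one after the other. Below is a minimal PyTorch sketch of such a block using the hyperparameters reported in the entry (hidden size 2048, 32 heads); it is only an illustration, not the released phi-1 implementation, and rotary position embeddings and causal masking are omitted for brevity.

```python
# Illustrative sketch of a GPT-J-style parallel decoder block (not the released phi-1 code).
# Rotary position embeddings and the causal attention mask are omitted to keep it short.
import torch
import torch.nn as nn


class ParallelDecoderBlock(nn.Module):
    def __init__(self, hidden_size: int = 2048, num_heads: int = 32, mlp_ratio: int = 4):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)  # one pre-norm shared by both branches
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_size, hidden_size),
        )

    def forward(self, x):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # "gpt-j-residual": both branches are computed from the same normalized input
        # and added to the residual stream in parallel.
        return x + attn_out + self.mlp(h)


if __name__ == "__main__":
    block = ParallelDecoderBlock()
    x = torch.randn(1, 16, 2048)  # (batch, sequence, hidden)
    print(block(x).shape)         # torch.Size([1, 16, 2048])
```

In the sequential (GPT-2-style) alternative, the MLP would consume the output of the attention sub-layer instead; the parallel form is commonly chosen because the two branches can be computed together, with little reported difference in quality.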
4. phi-1_5 [[paper](https://arxiv.org/pdf/2309.05463.pdf)] [[model](https://huggingface.co/TommyZQ/phi-1_5)]
    - release time: 2023/09
    - organization: Microsoft
    - model size: 1.42B
    - model architecture:
        - a. Training data: phi-1's training data (7B tokens) plus newly created synthetic "textbook" data (about 20B tokens) designed to teach common-sense reasoning and general world knowledge (science, daily activities, theory of mind, etc.).
        - b. Training strategy: phi-1.5 is trained from random initialization with a constant learning rate of 2e-4 (no warm-up) and weight decay of 0.1, using the Adam optimizer with momentum parameters 0.9 and 0.98. Mixed-precision training uses fp16 with DeepSpeed ZeRO Stage 2 and a batch size of 2048.
        - c. Attention mechanism: multi-head attention in a decoder-only Transformer, with FlashAttention used during pretraining and fine-tuning for efficiency.
        - d. Layers and block type (same as phi-1): 24 layers, with the MHA and MLP layers in a parallel configuration; each block uses:
            - hidden size: 2048
            - attention heads: 32
            - attention head dimension: 64
            - max position embeddings: 2048
            - position embedding type: rotary
            - residual connection: gpt-j-residual

5. phi-2 [[paper](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/)] [[model](https://huggingface.co/TommyZQ/phi-2)]
    - release time: 2023/12
    - organization: Microsoft
    - model size: 2.78B
    - model architecture:
        - a. Training data: the same sources as phi-1.5, plus a new 250B-token source consisting of various synthetic NLP texts and websites filtered for safety and educational value; the model is trained on 1.4T tokens.
        - b. Training strategy: not described in detail.
        - c. Attention mechanism: not described in detail.
        - d. Layers and block type: not described in detail.

6. phi-3-series [[paper](https://arxiv.org/pdf/2404.14219)] [[model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)]
    - release time: 2024/04
    - organization: Microsoft
    - model series: Phi-3-mini-4k-instruct, Phi-3-mini-128k-instruct
    - model size: 3.82B
    - model architecture:
        - a. Training data: an upgraded version of the phi-2 dataset, consisting of heavily filtered publicly available web data and synthetic data; the model is trained on 3.3T tokens.
        - b. Training strategy: pretraining runs in two disjoint, sequential phases: (1) mostly web-sourced data, aimed at teaching general knowledge and language understanding; (2) a blend of even more heavily filtered web data and synthetic data, aimed at teaching logical reasoning and various niche skills. Post-training consists of supervised fine-tuning (SFT) followed by direct preference optimization (DPO); the SFT data covers high-quality examples across many domains, while the DPO data is used to adjust model behavior.
        - c. Attention mechanism: grouped-query attention in a decoder-only Transformer; the default context length is 4K, extended to 128K with LongRoPE, and Flash Attention is used to speed up training.
        - d. Layers and block type (Llama-2-like block structure): 32 layers; each block uses:
            - hidden size: 3072
            - attention heads: 32
            - attention head dimension: 64
            - max position embeddings: 2048
            - position embedding type: rotary
            - residual connection: gpt-j-residual

7. TinyLlama [[paper](https://arxiv.org/abs/2401.02385)] [[model](https://huggingface.co/TinyLlama)]
    - release time: 2024/01
    - organization: Singapore University of Technology and Design
    - model size: 1.1B
    - model architecture:
        - a. Training data: two sources. SlimPajama, a high-quality corpus for training large language models derived from RedPajama through additional cleaning and deduplication (the original RedPajama corpus contains over 1.2T tokens; after filtering, SlimPajama keeps about 50% of them), and the StarCoder training set, which covers 86 programming languages and, besides code, includes GitHub issues and natural-language text-code pairs. To avoid duplication, the GitHub subset is removed from SlimPajama and code data is sampled only from the StarCoder training set. Combining the two yields roughly 950B tokens for pretraining, and about 3T tokens are processed in total (roughly three passes over the data).
        - b. Training strategy: AdamW optimizer, with the training framework built on lit-gpt. Pretraining has two stages:
            - Basic pretraining: 1.5T tokens of SlimPajama data, mainly to develop commonsense reasoning ability.
            - Continual pretraining: SlimPajama combined with code and math content (StarCoder, Proof Pile) and Chinese data from Skypile, targeting general use, math and coding, and Chinese processing respectively.
        - c. Attention mechanism: grouped-query attention with rotary position embeddings (RoPE); FlashAttention-2 is used to speed up training.
        - d. Layers and block type: 22 layers; each block uses:
            - hidden size: 2048
            - intermediate size: 5632
            - context length: 2048
            - attention heads: 32
            - vocabulary size: 32000
            - activation: SwiGLU, i.e. the Swish activation combined with a gated linear unit (GLU); a minimal sketch follows this entry
            - pre-normalization with RMSNorm: the input of each Transformer sub-layer is normalized
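The SwiGLU feed-forward mentioned in the TinyLlama entry above is compact enough to spell out. The following is an illustrative PyTorch sketch using the sizes reported in that entry (hidden size 2048, intermediate size 5632); it is not the TinyLlama source code. The gate branch uses the SiLU/Swish activation and is multiplied element-wise with an ungated "up" projection before being projected back down.

```python
# Illustrative SwiGLU feed-forward block (Llama-style), not the TinyLlama source code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 5632):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU(x) = down( SiLU(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


if __name__ == "__main__":
    ffn = SwiGLU()
    x = torch.randn(1, 16, 2048)  # (batch, sequence, hidden)
    print(ffn(x).shape)           # torch.Size([1, 16, 2048])
```

In TinyLlama this feed-forward sits, together with grouped-query attention, inside each of the 22 decoder layers, with RMSNorm applied to the sub-layer inputs (pre-normalization).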
8. MiniCPM-series [[paper](https://shengdinghu.notion.site/MiniCPM-c805a17c5c8046398914e47f0542095a)] [[code](https://github.com/OpenBMB/MiniCPM)] [[model](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)]
    - release time: 2024/02
    - organization: OpenBMB
    - model series: MiniCPM-1B-sft-bf16, MiniCPM-2B-sft-bf16, MiniCPM-2B-sft-fp32, MiniCPM-2B-128k, MiniCPM-MoE-8x2B
    - model size: 1.2B, 2.4B, 8x2.4B (excluding embeddings)

9. H2O-Danube-1.8B [[paper](https://arxiv.org/abs/2401.16818)] [[model](https://huggingface.co/h2oai/h2o-danube2-1.8b-base)]
    - release time: 2024/04
    - organization: H2O.ai
    - model series: h2o-danube2-1.8b-base, h2o-danube2-1.8b-sft, h2o-danube2-1.8b-chat
    - model size: 1.8B

10. csg-wukong-series [[model](https://huggingface.co/opencsg/csg-wukong-1B)]
    - release time: 2024/04
    - organization: OpenCSG
    - model series: csg-wukong-1B, csg-wukong-1B-VL, csg-wukong-1B-chat
    - model size: 1B

11. CT-LLM-Base [[paper](https://arxiv.org/pdf/2404.04167.pdf)] [[code](https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM)] [[model](https://huggingface.co/m-a-p/CT-LLM-Base)]
    - release time: 2024/04
    - organization: Peking University
    - model series: CT-LLM-Base
    - model size: 2B

12. Qwen-series [[paper](https://arxiv.org/abs/2309.16609)] [[code](https://github.com/QwenLM/Qwen)] [[model](https://huggingface.co/Qwen)]
    - release time: 2023/08
    - organization: Alibaba Cloud
    - model series: Qwen-1.8B, Qwen-7B, Qwen-14B, Qwen-72B, Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, Qwen-72B-Chat
    - model size: 1.8B, 7B, 14B, 72B

13. Qwen2-series [[paper](https://arxiv.org/abs/2309.16609)] [[code](https://github.com/QwenLM/Qwen)] [[model](https://huggingface.co/Qwen)]
    - release time: 2024/06
    - organization: Alibaba Cloud
    - model series: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, Qwen2-72B
    - model size: 0.5B, 1.5B, 7B, 57B (14B activated), 72B

14. Gemma-series [[paper](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf)] [[code](https://github.com/google-deepmind/gemma)] [[model](https://huggingface.co/google/gemma-2b)]
    - release time: 2024/02
    - organization: Google
    - model series: gemma-2b, gemma-2b-it, gemma-7b, gemma-7b-it, gemma-2-9b, gemma-2-9b-it, gemma-2-27b, gemma-2-27b-it
    - model size: 2B, 7B, 9B, 27B

15. OpenELM-series [[paper](https://arxiv.org/abs/2404.14619)] [[code](https://github.com/apple/corenet)] [[model](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759c)]
    - release time: 2024/04
    - organization: Apple
    - model series: OpenELM-270M, OpenELM-450M, OpenELM-1.1B, OpenELM-3B, OpenELM-270M-Instruct, OpenELM-450M-Instruct, OpenELM-1.1B-Instruct, OpenELM-3B-Instruct
    - model size: 0.27B, 0.45B, 1.1B, 3B

16. Sheared-LLaMA-series [[paper](https://arxiv.org/abs/2310.06694)] [[code](https://github.com/princeton-nlp/LLM-Shearing)] [[model](https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B)]
    - release time: 2023/10
    - organization: Princeton NLP group
    - model series: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B, Sheared-LLaMA-1.3B-Pruned, Sheared-LLaMA-2.7B-Pruned, Sheared-LLaMA-1.3B-ShareGPT, Sheared-LLaMA-2.7B-ShareGPT
    - model size: 1.3B, 2.7B

17. SlimPajama-DC [[paper](https://arxiv.org/html/2309.10818v3)] [[code](https://github.com/togethercomputer/RedPajama-Data)] [[model](https://huggingface.co/MBZUAI-LLM/SlimPajama-DC)]
    - release time: 2023/09
    - organization: Cerebras
    - model series: SlimPajama-DC-1.3B
    - model size: 1.3B

18. RedPajama [[code](https://github.com/togethercomputer/RedPajama-Data)] [[model](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1)]
    - release time: 2023/05
    - organization: Together Computer
    - model series: RedPajama-INCITE-Base-3B-v1, RedPajama-INCITE-Instruct-3B-v1, RedPajama-INCITE-Chat-3B-v1
    - model size: 3B

19. OLMo [[paper](https://arxiv.org/html/2402.00838)] [[code](https://github.com/allenai/OLMo)] [[model](https://huggingface.co/allenai/OLMo-7B)]
    - release time: 2024/02
    - organization: Allen Institute for AI (allenai)
    - model series: OLMo-1B, OLMo-7B, OLMo-7B-Twin-2T
    - model size: 1B, 7B

20. Cerebras-GPT-series [[paper](https://arxiv.org/html/2304.03208)] [[model](https://huggingface.co/cerebras/Cerebras-GPT-111M)]
    - release time: 2023/04
    - organization: Cerebras
    - model series: Cerebras-GPT-111M, Cerebras-GPT-256M, Cerebras-GPT-590M, Cerebras-GPT-1.3B, Cerebras-GPT-2.7B, Cerebras-GPT-6.7B, Cerebras-GPT-13B
    - model size: 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, 13B
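Every base model in the list above is published on the Hugging Face Hub, so they can all be tried out in the same way. Below is a minimal sketch assuming the 🤗 `transformers` library, using the `EleutherAI/pythia-1b` checkpoint linked in the Pythia entry; any other model id from the list can be substituted (a few checkpoints, such as the MiniCPM models, may additionally require `trust_remote_code=True`).

```python
# Minimal sketch: load one of the base models listed above and generate a few tokens.
# "EleutherAI/pythia-1b" is taken from the Pythia entry; swap in any other model id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Small language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```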
## 💪 Pretrain Datasets
These corpora are hosted on the Hugging Face Hub; a minimal streaming sketch appears at the end of this README.

- SlimPajama-627B [[paper](https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama)] [[code](https://github.com/Cerebras/modelzoo/tree/main/src/cerebras/modelzoo/data_preparation/nlp/slimpajama)] [[dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B)]
    - release time: 2023/06
    - dataset size: 895 GB
    - token size: 627B
    - language: primarily English, with some non-English files in Wikipedia

- dolma [[paper](https://arxiv.org/abs/2402.00159)] [[code](https://github.com/allenai/dolma)] [[dataset](https://huggingface.co/datasets/allenai/dolma)]
    - release time: 2024/04
    - dataset size: 4.5TB
    - token size: 1.7T
    - language: primarily English, with some non-English files in Wikipedia

- RedPajama-Data-1T [[code](https://github.com/togethercomputer/RedPajama-Data)] [[dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)]
    - release time: 2023/04
    - token size: 1.2T

- C4 [[paper](https://www.tensorflow.org/datasets/catalog/c4)] [[code](https://github.com/allenai/c4-documentation)] [[dataset](https://huggingface.co/datasets/c4)]
    - release time: 2022/01
    - dataset size: en: 305GB, en.noclean: 2.3TB, en.noblocklist: 380GB, realnewslike: 15GB, multilingual (mC4): 9.7TB (108 subsets, one per language)


## 💡 SFT Datasets
- ultrachat [[code](https://github.com/thunlp/UltraChat)] [[dataset](https://huggingface.co/datasets/stingning/ultrachat)]
    - release time: 2023/04
    - dataset size: 2.5GB
    - language: en

- ultrachat_200k [[code](https://github.com/thunlp/UltraChat)] [[dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)]
    - release time: 2023/10
    - dataset size: 1.6GB
    - language: en


## 🔧 Synthetic Datasets
- cosmopedia [[dataset](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)]
    - release time: 2024/02
    - dataset size: 92.2GB
    - language: en


## 📦 Preference Datasets
- UltraFeedback [[dataset](https://huggingface.co/datasets/openbmb/UltraFeedback)]
    - release time: 2023/09
    - dataset size: 0.94GB
    - language: en


## 🌈 Benchmark
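The corpora in the dataset sections above (pretraining, SFT, synthetic, and preference data) are all hosted on the Hugging Face Hub. The sketch below assumes the 🤗 `datasets` library and streams SlimPajama-627B from the Pretrain Datasets list, so nothing close to the full 895 GB needs to be downloaded; other dataset ids from the lists work the same way, although their column names may differ.

```python
# Minimal sketch: stream a few records from one of the corpora listed above.
# "cerebras/SlimPajama-627B" comes from the SlimPajama-627B entry; the "text" column
# name is an assumption based on that dataset's card and may differ for other datasets.
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i >= 2:
        break
```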