├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── README_en.md
├── eval_model.py
├── images
│   ├── 1-wiki.png
│   ├── 2-wiki.png
│   ├── 3-wiki.png
│   ├── 4-wiki.png
│   ├── 5-wiki.png
│   ├── LLM-structure-moe.png
│   ├── LLM-structure.png
│   ├── and_huggingface.png
│   ├── and_modelscope.png
│   ├── compare_radar.png
│   ├── dataset.jpg
│   ├── gpt3_config.png
│   ├── logo.png
│   ├── logo2.png
│   ├── minimind2.gif
│   ├── pre_512_loss.png
│   ├── pre_768_loss.png
│   ├── sft_512_loss.png
│   └── sft_768_loss.png
├── model
│   ├── LMConfig.py
│   ├── dataset.py
│   ├── minimind_tokenizer
│   │   ├── merges.txt
│   │   ├── tokenizer.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.json
│   ├── model.py
│   └── model_lora.py
├── requirements.txt
├── scripts
│   ├── chat_openai_api.py
│   ├── convert_model.py
│   ├── serve_openai_api.py
│   ├── train_tokenizer.py
│   └── web_demo.py
├── train_distill_reason.py
├── train_distillation.py
├── train_dpo.py
├── train_full_sft.py
├── train_lora.py
└── train_pretrain.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /model/__pycache__
2 | /dataset
3 | /out
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, religion, or sexual identity
10 | and orientation.
11 |
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 |
15 | ## Our Standards
16 |
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 |
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility and apologizing to those affected by our mistakes,
24 | and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the
26 | overall community
27 |
28 | Examples of unacceptable behavior include:
29 |
30 | * The use of sexualized language or imagery, and sexual attention or
31 | advances of any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email
35 | address, without their explicit permission
36 | * Other conduct which could reasonably be considered inappropriate in a
37 | professional setting
38 |
39 | ## Enforcement Responsibilities
40 |
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 |
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 |
51 | ## Scope
52 |
53 | This Code of Conduct applies within all community spaces, and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 |
59 | ## Enforcement
60 |
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement at
63 | .
64 | All complaints will be reviewed and investigated promptly and fairly.
65 |
66 | All community leaders are obligated to respect the privacy and security of the
67 | reporter of any incident.
68 |
69 | ## Enforcement Guidelines
70 |
71 | Community leaders will follow these Community Impact Guidelines in determining
72 | the consequences for any action they deem in violation of this Code of Conduct:
73 |
74 | ### 1. Correction
75 |
76 | **Community Impact**: Use of inappropriate language or other behavior deemed
77 | unprofessional or unwelcome in the community.
78 |
79 | **Consequence**: A private, written warning from community leaders, providing
80 | clarity around the nature of the violation and an explanation of why the
81 | behavior was inappropriate. A public apology may be requested.
82 |
83 | ### 2. Warning
84 |
85 | **Community Impact**: A violation through a single incident or series
86 | of actions.
87 |
88 | **Consequence**: A warning with consequences for continued behavior. No
89 | interaction with the people involved, including unsolicited interaction with
90 | those enforcing the Code of Conduct, for a specified period of time. This
91 | includes avoiding interactions in community spaces as well as external channels
92 | like social media. Violating these terms may lead to a temporary or
93 | permanent ban.
94 |
95 | ### 3. Temporary Ban
96 |
97 | **Community Impact**: A serious violation of community standards, including
98 | sustained inappropriate behavior.
99 |
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 |
106 | ### 4. Permanent Ban
107 |
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 |
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 |
115 | ## Attribution
116 |
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120 |
121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct
122 | enforcement ladder](https://github.com/mozilla/diversity).
123 |
124 | [homepage]: https://www.contributor-covenant.org
125 |
126 | For answers to common questions about this code of conduct, see the FAQ at
127 | https://www.contributor-covenant.org/faq. Translations are available at
128 | https://www.contributor-covenant.org/translations.
129 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | 
4 |
5 |
6 |
7 |
8 |
9 | 
10 | [](https://github.com/jingyaogong/minimind/stargazers)
11 | [](LICENSE)
12 | [](https://github.com/jingyaogong/minimind/commits/master)
13 | [](https://github.com/jingyaogong/minimind/pulls)
14 | [](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
15 |
16 |
17 |
18 |
19 |
"大道至简"
20 |
21 |
22 |
23 |
24 | 中文 | [English](./README_en.md)
25 |
26 |
27 |
28 | * 此开源项目旨在完全从0开始,仅用3块钱成本 + 2小时!即可训练出仅为25.8M的超小语言模型**MiniMind**。
29 | * **MiniMind**系列极其轻量,最小版本体积是 GPT-3 的 $\frac{1}{7000}$,力求做到最普通的个人GPU也可快速训练。
30 | * 项目同时开源了大模型的极简结构,包含:拓展共享混合专家(MoE)、数据集清洗、预训练(Pretrain)、监督微调(SFT)、LoRA微调、
31 | 直接偏好强化学习(DPO)算法、模型蒸馏算法等全过程代码。
32 | * **MiniMind**同时拓展了视觉多模态的VLM: [MiniMind-V](https://github.com/jingyaogong/minimind-v)。
33 | * 项目所有核心算法代码均从0使用PyTorch原生重构!不依赖第三方库提供的抽象接口。
34 | * 这不仅是大语言模型的全阶段开源复现,也是一个入门LLM的教程。
35 | * 希望此项目能为所有人提供一个抛砖引玉的示例,一起感受创造的乐趣!推动更广泛AI社区的进步!
36 |
37 | > 为防止误解,“2小时” 基于NVIDIA 3090硬件设备(单卡)测试,“3块钱”
38 | > 指GPU服务器租用成本,具体规格详情见下文。
39 |
40 | ---
41 |
42 |
43 |
44 |
45 | 
46 |
47 | [🔗🍓推理模型](https://www.modelscope.cn/studios/gongjy/MiniMind-Reasoning) | [🔗🤖常规模型](https://www.modelscope.cn/studios/gongjy/MiniMind) | [🔗🎞️视频介绍](https://www.bilibili.com/video/BV12dHPeqE72/?share_source=copy_web&vd_source=670c2504f88726f8cf4a21ef6147c0e8)
48 |
49 |
50 |
66 |
67 |
68 |
69 |
70 | # 📌 Introduction
71 |
72 | 大语言模型(Large Language Model, LLM)的出现引发了全世界对AI的空前关注。
73 | 无论是ChatGPT、DeepSeek还是Qwen,都以其惊艳的效果令人叹为观止。
74 | 然而,动辄数百亿参数的庞大规模,使得它们对个人设备而言不仅难以训练,甚至连部署都显得遥不可及。
75 | 打开大模型的“黑盒子”,探索其内部运作机制,多么令人心潮澎湃!
76 | 遗憾的是,99%的探索只能止步于使用LoRA等技术对现有大模型进行少量微调,学习一些新指令或任务。
77 | 这就好比教牛顿如何使用21世纪的智能手机——虽然有趣,却完全偏离了理解物理本质的初衷。
78 | 与此同时,第三方的大模型框架和工具库,如transformers+trl,几乎只暴露了高度抽象的接口。
79 | 通过短短10行代码,就能完成“加载模型+加载数据集+推理+强化学习”的全流程训练。
80 | 这种高效的封装固然便利,但也像一架高速飞船,将我们与底层实现隔离开来,阻碍了深入探究LLM核心代码的机会。
81 | 然而,“用乐高拼出一架飞机,远比坐在头等舱里飞行更让人兴奋!”。
82 | 更糟糕的是,互联网上充斥着大量付费课程和营销号,以漏洞百出、一知半解的内容推销AI教程。
83 | 正因如此,本项目初衷是拉低LLM的学习门槛,让每个人都能从理解每一行代码开始,
84 | 从零开始亲手训练一个极小的语言模型。是的,从**零开始训练**,而不是仅仅进行**推理**!
85 | 最低只需3块钱不到的服务器成本,就能亲身体验从0到1构建一个语言模型的全过程。
86 | 一起感受创造的乐趣吧!
87 |
88 | > [!NOTE]
89 | > (截至2025-02-07)MiniMind系列已完成多个型号模型的预训练,最小仅需25.8M(0.02B),即可具备流畅对话能力!
90 |
91 |
92 | Models List
93 |
94 | | 模型 (大小) | 推理占用 (约) | Release |
95 | |-------------------------|----------|------------|
96 | | MiniMind2-small (26M) | 0.5 GB | 2025.02.06 |
97 | | MiniMind2-MoE (145M) | 1.0 GB | 2025.02.06 |
98 | | MiniMind2 (104M) | 1.0 GB | 2025.02.06 |
99 | | minimind-v1-small (26M) | 0.5 GB | 2024.08.28 |
100 | | minimind-v1-moe (4×26M) | 1.0 GB | 2024.09.17 |
101 | | minimind-v1 (108M) | 1.0 GB | 2024.09.01 |
102 |
103 |
104 |
105 | **项目包含**
106 |
107 | - MiniMind-LLM结构的全部代码(Dense+MoE模型)。
108 | - 包含Tokenizer分词器详细训练代码。
109 | - 包含Pretrain、SFT、LoRA、RLHF-DPO、模型蒸馏的全过程训练代码。
110 | - 收集、蒸馏、整理并清洗去重所有阶段的高质量数据集,且全部开源。
111 | - 从0实现预训练、指令微调、LoRA、DPO强化学习,白盒模型蒸馏。关键算法几乎不依赖第三方封装的框架,且全部开源。
112 | - 同时兼容`transformers`、`trl`、`peft`等第三方主流框架。
113 | - 训练支持单机单卡、单机多卡(DDP、DeepSpeed)训练,支持wandb可视化训练流程。支持动态启停训练。
114 | - 在第三方测评榜(C-Eval、C-MMLU、OpenBookQA等)进行模型测试。
115 | - 实现Openai-Api协议的极简服务端,便于集成到第三方ChatUI使用(FastGPT、Open-WebUI等)。
116 | - 基于streamlit实现最简聊天WebUI前端。
117 | - 复现(蒸馏/RL)大型推理模型DeepSeek-R1的MiniMind-Reason模型,**数据+模型**全部开源!
118 |
119 | 希望此开源项目可以帮助LLM初学者快速入门!
120 |
121 | ### 👉**更新日志**
122 |
123 |
124 | 2025-02-09 (newest 🎉🎉🎉)
125 |
126 | - 迎来发布以来重大更新,Release MiniMind2 Series。
127 | - 代码几乎全部重构,使用更简洁明了的统一结构。
128 | 如有旧代码的兼容性需要,可访问[🔗旧仓库内容🔗](https://github.com/jingyaogong/minimind/tree/6e9cd28ef9b34a0a10afbdf6f59e65cb6e628efb)。
129 | - 免去数据预处理步骤。统一数据集格式,更换为`jsonl`格式杜绝数据集下载混乱的问题。
130 | - MiniMind2系列效果相比MiniMind-V1显著提升。
131 | - 小问题:{kv-cache写法更标准、MoE的负载均衡loss被考虑等等}
132 | - 提供模型迁移到私有数据集的训练方案(医疗模型、自我认知样例)。
133 | - 精简预训练数据集,并大幅提升预训练数据质量,大幅缩短个人快速训练所需时间,单卡3090即可2小时复现!
134 | - 更新:LoRA微调脱离peft包装,从0实现LoRA过程;DPO算法从0使用PyTorch原生实现;模型白盒蒸馏原生实现。
135 | - MiniMind2-DeepSeek-R1系列蒸馏模型诞生!
136 | - MiniMind2具备一定的英文能力!
137 | - 更新MiniMind2与第三方模型的基于更多大模型榜单测试性能的结果。
138 |
139 |
140 |
141 |
142 |
143 |
144 | 2024-10-05
145 |
146 | - 为MiniMind拓展了多模态能力之---视觉
147 | - 移步孪生项目[minimind-v](https://github.com/jingyaogong/minimind-v)查看详情!
148 |
149 |
150 |
151 |
152 |
153 |
154 | 2024-09-27
155 |
156 | - 09-27更新pretrain数据集的预处理方式,为了保证文本完整性,放弃预处理成.bin训练的形式(轻微牺牲训练速度)。
157 | - 目前pretrain预处理后的文件命名为:pretrain_data.csv。
158 | - 删除了一些冗余的代码。
159 |
160 |
161 |
162 |
163 |
164 | 2024-09-17
165 |
166 | - 更新minimind-v1-moe模型
167 | - 为了防止歧义,不再使用mistral_tokenizer分词,全部采用自定义的minimind_tokenizer作为分词器。
168 |
169 |
170 |
171 |
172 |
173 | 2024-09-01
174 |
175 | - 更新minimind-v1 (108M)模型,采用minimind_tokenizer,预训练轮次3 + SFT轮次10,更充分训练,性能更强。
176 | - 项目已部署至ModelScope创空间,可以在此网站上体验:
177 | - [🔗ModelScope在线体验🔗](https://www.modelscope.cn/studios/gongjy/minimind)
178 |
179 |
180 |
181 |
182 |
183 | 2024-08-27
184 |
185 | - 项目首次开源
186 |
187 |
188 |
189 | # 📌 快速开始
190 |
191 |
192 | 分享本人的软硬件配置(仅供参考)
193 |
194 | * CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
195 | * RAM: 128 GB
196 | * GPU: NVIDIA GeForce RTX 3090(24GB) * 8
197 | * Ubuntu==20.04
198 | * CUDA==12.2
199 | * Python==3.10.16
200 | * [requirements.txt](./requirements.txt)
201 |
202 |
203 |
204 | ### 第0步
205 |
206 | ```bash
207 | git clone https://github.com/jingyaogong/minimind.git
208 | ```
209 |
210 | ## Ⅰ 测试已有模型效果
211 |
212 | ### 1.环境准备
213 |
214 | ```bash
215 | pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
216 | ```
217 |
218 | ### 2.下载模型
219 |
220 | ```bash
221 | git clone https://huggingface.co/jingyaogong/MiniMind2
222 | ```
223 |
224 | ### 3.命令行问答
225 |
226 | ```bash
227 | # load=0: load from pytorch model, load=1: load from transformers-hf model
228 | python eval_model.py --load 1 --model_mode 2
229 | ```
230 |
231 | ### 4.或启动WebUI
232 |
233 | ```bash
234 | # 可能需要`python>=3.10` 安装 `pip install streamlit`
235 | # cd scripts
236 | streamlit run web_demo.py
237 | ```
238 |
239 | ## Ⅱ 从0开始自己训练
240 |
241 | ### 1.环境准备
242 |
243 | ```bash
244 | pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
245 | ```
246 |
247 |
248 | 注:提前测试Torch是否可用cuda
249 |
250 | ```python
251 | import torch
252 | print(torch.cuda.is_available())
253 | ```
254 |
255 | 如果不可用,请自行去[torch_stable](https://download.pytorch.org/whl/torch_stable.html)
256 | 下载whl文件安装。参考[链接](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
257 |
258 |
259 |
260 | ### 2.数据下载
261 |
262 | 从下文提供的[数据集下载链接](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files)
263 | 下载需要的数据文件(创建`./dataset`目录)并放到`./dataset`下
264 |
265 |
266 | 注:数据集须知
267 |
268 | 默认推荐下载`pretrain_hq.jsonl` + `sft_mini_512.jsonl`最快速度复现Zero聊天模型。
269 |
270 | 数据文件可自由选择,下文提供了多种搭配方案,可根据自己手头的训练需求和GPU资源进行适当组合。
271 |
272 |
273 |
274 | ### 3.开始训练
275 |
276 | **3.1 预训练(学知识)**
277 |
278 | ```bash
279 | python train_pretrain.py
280 | ```
281 |
282 | > 执行预训练,得到 `pretrain_*.pth` 作为预训练的输出权重(其中*为模型的dimension,默认为512)
283 |
284 |
285 | **3.2 监督微调(学对话方式)**
286 |
287 | ```bash
288 | python train_full_sft.py
289 | ```
290 |
291 | > 执行监督微调,得到 `full_sft_*.pth` 作为指令微调的输出权重(其中`full`即为全参数微调)
292 |
293 |
294 | 注:训练须知
295 |
296 | 所有训练过程默认每隔100步保存1次参数到文件`./out/***.pth`(每次会覆盖掉旧权重文件)。
297 |
298 | 简单起见,此处只写明两个阶段训练过程。如需其它训练 (LoRA, 蒸馏, 强化学习, 微调推理等) 可参考下文【实验】小节的详细说明。
299 |
300 |
301 |
302 |
303 | ---
304 |
305 | ### 4.测试模型效果
306 |
307 | 确保需要测试的模型`*.pth`文件位于`./out/`目录下。
308 | 也可以直接去[此处](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files)下载使用我训练的`*.pth`文件。
309 |
310 | ```bash
311 | python eval_model.py --model_mode 1 # 默认为0:测试pretrain模型效果,设置为1:测试full_sft模型效果
312 | ```
313 |
314 |
315 | 注:测试须知
316 |
317 | 如需详情,查看`eval_model.py`脚本代码即可。model_mode分为 0: 预训练模型,1: SFT-Chat模型,2: RLHF-Chat模型,3: Reason模型
318 |
319 |
320 |
321 |
322 | ---
323 |
324 | > [!TIP]
325 | > 所有训练脚本均为Pytorch原生框架,均支持多卡加速,假设你的设备有N (N>1) 张显卡:
326 |
327 | 单机N卡启动训练方式 (DDP, 支持多机多卡集群)
328 |
329 | ```bash
330 | torchrun --nproc_per_node N train_xxx.py
331 | ```
332 |
333 |
334 | 注:其它须知
335 |
336 | 单机N卡启动训练 (DeepSpeed)
337 |
338 | ```bash
339 | deepspeed --master_port 29500 --num_gpus=N train_xxx.py
340 | ```
341 |
342 | 可根据需要开启wandb记录训练过程
343 |
344 | ```bash
345 | # 需要登录: wandb login
346 | torchrun --nproc_per_node N train_xxx.py --use_wandb
347 | # and
348 | python train_xxx.py --use_wandb
349 | ```
350 |
351 | 通过添加`--use_wandb`参数,可以记录训练过程,训练完成后,可以在wandb网站上查看训练过程。通过修改`wandb_project`
352 | 和`wandb_run_name`参数,可以指定项目名称和运行名称。
353 |
354 |
355 |
356 | # 📌 数据介绍
357 |
358 | ## Ⅰ Tokenizer
359 |
360 | 分词器将单词从自然语言通过“词典”映射到`0, 1, 36`这样的数字,可以理解为数字就代表了单词在“词典”中的页码。
361 | 可以选择自己构造词表训练一个“词典”,代码可见`./scripts/train_tokenizer.py`(仅供学习参考,若非必要无需再自行训练,MiniMind已自带tokenizer)。
362 | 或者选择比较出名的开源大模型分词器,
363 | 正如同直接用新华/牛津词典的优点是token编码压缩率很好,缺点是页数太多,动辄数十万个词汇短语;
364 | 自己训练的分词器,优点是词表长度和内容随意控制,缺点是压缩率很低(例如"hello"也许会被拆分为"h e l l o"
365 | 五个独立的token),且生僻词难以覆盖。
366 | “词典”的选择固然很重要,LLM的输出本质上是SoftMax到词典N个词的多分类问题,然后通过“词典”解码到自然语言。
367 | 因为MiniMind体积需要严格控制,为了避免模型头重脚轻(词嵌入embedding层参数在LLM占比太高),所以词表长度短短益善。
368 |
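下面给出一个加载并试用 minimind_tokenizer 的极简示例(假设已克隆本仓库且安装了 `transformers`,仅为演示“文本 <-> 页码”编解码过程的草图,并非项目源码):

```python
from transformers import AutoTokenizer

# 从仓库自带的分词器目录加载(目录内含 tokenizer.json / vocab.json 等文件)
tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")

text = "大道至简"
ids = tokenizer.encode(text)            # 文本 -> token id 序列(“词典”中的页码)
print(ids)
print(tokenizer.decode(ids))            # token id 序列 -> 还原为文本
print("词表大小:", tokenizer.vocab_size)  # minimind_tokenizer 约为 6400
```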
369 |
370 | Tokenizer介绍
371 |
372 | 第三方强大的开源模型例如Yi、qwen、chatglm、mistral、Llama3的tokenizer词表长度如下:
373 |
374 |
375 | | Tokenizer模型       | 词表大小    | 来源             |
376 | |--------------------|---------|----------------|
377 | | yi tokenizer       | 64,000  | 01万物(中国)       |
378 | | qwen2 tokenizer    | 151,643 | 阿里云(中国)        |
379 | | glm tokenizer      | 151,329 | 智谱AI(中国)       |
380 | | mistral tokenizer  | 32,000  | Mistral AI(法国)  |
381 | | llama3 tokenizer   | 128,000 | Meta(美国)        |
382 | | minimind tokenizer | 6,400   | 自定义            |
383 |
384 | > 👉2024-09-17更新:为了防止过去的版本歧义&控制体积,minimind所有模型均使用minimind_tokenizer分词,废弃所有mistral_tokenizer版本。
385 |
386 | ```
387 | # 一些自言自语
388 | > 尽管minimind_tokenizer长度很小,编解码效率弱于qwen2、glm等中文友好型分词器。
389 | > 但minimind模型选择了自己训练的minimind_tokenizer作为分词器,以保持整体参数轻量,避免编码层和计算层占比失衡,头重脚轻,因为minimind的词表大小只有6400。
390 | > 且minimind在实际测试中没有出现过生僻词汇解码失败的情况,效果良好。
391 | > 由于自定义词表压缩长度到6400,使得LLM总参数量最低只有25.8M。
392 | > 训练数据`tokenizer_train.jsonl`均来自于`匠数大模型数据集`,这部分数据相对次要,如需训练可以自由选择。
393 | ```
394 |
395 |
396 |
397 | ## Ⅱ Pretrain数据
398 |
399 | 经历了MiniMind-V1的低质量预训练数据,导致模型胡言乱语的教训,`2025-02-05` 之后决定不再采用大规模无监督的数据集做预训练。
400 | 进而尝试把[匠数大模型数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)的中文部分提取出来,
401 | 清洗出字符`<512`长度的大约1.6GB的语料直接拼接成预训练数据 `pretrain_hq.jsonl`,hq即为high
402 | quality(当然也还不算high,提升数据质量无止尽)。
403 |
404 | 文件`pretrain_hq.jsonl` 数据格式为
405 |
406 | ```bash
407 | {"text": "如何才能摆脱拖延症? 治愈拖延症并不容易,但以下建议可能有所帮助..."}
408 | ```
409 |
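读取 `pretrain_hq.jsonl` 的方式可以参考下面的极简 Dataset 草图(假设性示意,字段名以上述格式为准,padding 逻辑仅作演示;项目实际实现见 `./model/dataset.py`):

```python
import json
import torch
from torch.utils.data import Dataset

class PretrainJsonlDataset(Dataset):
    """逐行读取 {"text": "..."} 格式的 jsonl,并编码为定长 token 序列(示意)。"""

    def __init__(self, path, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length
        with open(path, "r", encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        text = self.samples[idx]["text"]
        ids = self.tokenizer.encode(text)[: self.max_length]
        # 不足 max_length 的部分简单用 0 补齐,仅作演示
        ids = ids + [0] * (self.max_length - len(ids))
        return torch.tensor(ids, dtype=torch.long)
```
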
410 | ## Ⅲ SFT数据
411 |
412 | [匠数大模型SFT数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)
413 | “是一个完整、格式统一、安全的大模型训练和研究资源。
414 | 从网络上的公开数据源收集并整理了大量开源数据集,对其进行了格式统一,数据清洗,
415 | 包含10M条数据的中文数据集和包含2M条数据的英文数据集。”
416 | 以上是官方介绍,下载文件后的数据总量大约在4B tokens,肯定是适合作为中文大语言模型的SFT数据的。
417 | 但是官方提供的数据格式很乱,全部用来sft代价太大。
418 | 我把官方数据集进行了二次清洗,去除了含有符号污染和噪声的条目;另外依然只保留了总长度`<512`
419 | 的内容,此阶段希望通过大量对话补充预训练阶段欠缺的知识。
420 | 导出文件为`sft_512.jsonl`(~7.5GB)。
421 |
422 | [Magpie-SFT数据集](https://www.modelscope.cn/organization/Magpie-Align)
423 | 收集了~1M条来自Qwen2/2.5的高质量对话,我将这部分数据进一步清洗,把总长度`<2048`的部分导出为`sft_2048.jsonl`(~9GB)。
424 | 长度`<1024`的部分导出为`sft_1024.jsonl`(~5.5GB),用大模型对话数据直接进行sft就属于“黑盒蒸馏”的范畴。
425 |
426 | 进一步清洗前两步sft的数据(只保留中文字符占比高的内容),筛选长度`<512`的对话,得到`sft_mini_512.jsonl`(~1.2GB)。
427 |
428 | 所有sft文件 `sft_X.jsonl` 数据格式均为
429 |
430 | ```text
431 | {
432 | "conversations": [
433 | {"role": "user", "content": "你好"},
434 | {"role": "assistant", "content": "你好!"},
435 | {"role": "user", "content": "再见"},
436 | {"role": "assistant", "content": "再见!"}
437 | ]
438 | }
439 | ```
440 |
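SFT 训练前,需要把 `conversations` 列表拼接成带聊天模板的单条文本。下面是一个假设性的极简示意(若分词器的 `tokenizer_config.json` 中定义了 chat template,可直接用 `apply_chat_template` 渲染;实际拼接与 loss mask 逻辑见 `./model/dataset.py`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")

conversations = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好!"},
]

# 渲染为训练用的纯文本(tokenize=False 表示先不编码成 id)
prompt = tokenizer.apply_chat_template(conversations, tokenize=False)
print(prompt)

# 训练时通常只对 assistant 部分计算 loss,user 部分的 loss mask 置 0(此处省略)
```
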
441 | ## Ⅳ RLHF数据
442 |
443 | 来自[Magpie-DPO数据集](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1)
444 | 大约200k条偏好数据(均是英文)生成自Llama3.1-70B/8B,可以用于训练奖励模型,优化模型回复质量,使其更加符合人类偏好。
445 | 这里将数据总长度`<3000`的内容重组为`dpo.jsonl`(~0.9GB),包含`chosen`和`rejected`两个字段,`chosen`
446 | 为偏好的回复,`rejected`为拒绝的回复。
447 |
448 | 文件 `dpo.jsonl` 数据格式为
449 |
450 | ```text
451 | {
452 | "chosen": [
453 | {"content": "Q", "role": "user"},
454 | {"content": "good answer", "role": "assistant"}
455 | ],
456 | "rejected": [
457 | {"content": "Q", "role": "user"},
458 | {"content": "bad answer", "role": "assistant"}
459 | ]
460 | }
461 | ```
462 |
463 | ## Ⅴ Reason数据集:
464 |
465 | 不得不说2025年2月谁能火的过DeepSeek...
466 | 也激发了我对RL引导的推理模型的浓厚兴趣,目前已经用Qwen2.5复现了R1-Zero。
467 | 如果有时间+效果work(但99%基模能力不足)我会在之后更新MiniMind基于RL训练的推理模型而不是蒸馏模型。
468 | 时间有限,最快的低成本方案依然是直接蒸馏(黑盒方式)。
469 | 耐不住R1太火,短短几天就已经存在一些R1的蒸馏数据集[R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B)、[R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT)、
470 | [Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH)、
471 | [deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh)等等,纯中文的数据可能比较少。
472 | 最终整合它们,导出文件为`r1_mix_1024.jsonl`,数据格式和`sft_X.jsonl`一致。
473 |
474 | ## Ⅵ 更多数据集
475 |
476 | 目前已经有[HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
477 | 在收集和梳理中文LLM相关的开源模型、应用、数据集及教程等资料,并持续更新这方面的最新进展。全面且专业,Respect!
478 |
479 | ---
480 |
481 | ## Ⅶ 数据集下载
482 |
483 | > [!NOTE]
484 | > 2025-02-05后,开源MiniMind最终训练所用的所有数据集,因此无需再自行预处理大规模数据集,避免重复性的数据处理工作。
485 |
486 | MiniMind训练数据集 ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main))
487 |
488 | > 无需全部clone,可单独下载所需的文件
489 |
490 | 将下载的数据集文件放到`./dataset/`目录下(✨为推荐的必须项)
491 |
492 | ```bash
493 | ./dataset/
494 | ├── dpo.jsonl (909MB)
495 | ├── lora_identity.jsonl (22.8KB)
496 | ├── lora_medical.jsonl (34MB)
497 | ├── pretrain_hq.jsonl (1.6GB, ✨)
498 | ├── r1_mix_1024.jsonl (340MB)
499 | ├── sft_1024.jsonl (5.6GB)
500 | ├── sft_2048.jsonl (9GB)
501 | ├── sft_512.jsonl (7.5GB)
502 | ├── sft_mini_512.jsonl (1.2GB, ✨)
503 | └── tokenizer_train.jsonl (1GB)
504 | ```
505 |
506 |
507 | 注:各数据集简介
508 |
509 | * `dpo.jsonl` --RLHF阶段数据集
510 | * `lora_identity.jsonl` --自我认知数据集(例如:你是谁?我是minimind...),推荐用于lora训练(亦可用于全参SFT,勿被名字局限)
511 | * `lora_medical.jsonl` --医疗问答数据集,推荐用于lora训练(亦可用于全参SFT,勿被名字局限)
512 | * `pretrain_hq.jsonl`✨ --预训练数据集,整合自jiangshu科技
513 | * `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B蒸馏数据,每条数据字符最大长度为1024(因此训练时设置max_seq_len=1024)
514 | * `sft_1024.jsonl` --整合自Qwen2.5蒸馏数据(是sft_2048的子集),每条数据字符最大长度为1024(因此训练时设置max_seq_len=1024)
515 | * `sft_2048.jsonl` --整合自Qwen2.5蒸馏数据,每条数据字符最大长度为2048(因此训练时设置max_seq_len=2048)
516 | * `sft_512.jsonl` --整合自匠数科技SFT数据,每条数据字符最大长度为512(因此训练时设置max_seq_len=512)
517 | * `sft_mini_512.jsonl`✨ --极简整合自匠数科技SFT数据+Qwen2.5蒸馏数据(用于快速训练Zero模型),每条数据字符最大长度为512(因此训练时设置max_seq_len=512)
518 | * `tokenizer_train.jsonl` --均来自于`匠数大模型数据集`,这部分数据相对次要,(不推荐自己重复训练tokenizer,理由如上)如需自己训练tokenizer可以自由选择数据集。
519 |
520 |
521 |
522 |
523 | 
524 |
525 |
526 | 说明 & 推荐训练方案
527 |
528 | * MiniMind2 Series均经过共约20GB语料训练,大约4B tokens,即对应上面的数据组合训练结果(开销:💰💰💰💰💰💰💰💰,效果:😊😊😊😊😊😊)
529 |
530 | * 想要最快速度从0实现Zero模型,推荐使用`pretrain_hq.jsonl` + `sft_mini_512.jsonl` 的数据组合,具体花销和效果可查看下文表格(开销:💰,效果:😊😊)
531 |
532 | * 推荐具备一定算力资源或更在意效果的朋友可以考虑前者完整复现MiniMind2;仅有单卡GPU或在乎短时间快速复现的朋友强烈推荐后者;
533 |
534 | * 【折中方案】亦可选择例如`sft_mini_512.jsonl`、`sft_1024.jsonl`中等规模数据进行自由组合训练(开销:💰💰💰,效果:😊😊😊😊)。
535 |
536 |
537 |
538 | # 📌 Model Structure
539 |
540 | MiniMind-Dense(和[Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/)一样)使用了Transformer的Decoder-Only结构,跟GPT-3的区别在于:
541 |
542 | * 采用了GPT-3的预标准化方法,也就是在每个Transformer子层的输入上进行归一化,而不是在输出上。具体来说,使用的是RMSNorm归一化函数(示意代码见下方)。
543 | * 用SwiGLU激活函数替代了ReLU,这样做是为了提高性能。
544 | * 像GPT-Neo一样,去掉了绝对位置嵌入,改用了旋转位置嵌入(RoPE),这样在处理超出训练长度的推理时效果更好。
545 |
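下面用 PyTorch 给出预标准化(RMSNorm)与 SwiGLU 前馈层的极简草图(仅为帮助理解结构的假设性示意,维度与命名均为示例,实际实现见 `./model/model.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # 仅按均方根缩放,不做去均值
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUFFN(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(W_gate·x) ⊙ (W_up·x),再投影回 dim
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)
y = SwiGLUFFN(512, 1408)(RMSNorm(512)(x))  # 先预标准化再过FFN(残差连接省略)
print(y.shape)
```
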
546 | ---
547 |
548 | MiniMind-MoE模型,它的结构基于Llama3和[Deepseek-V2/3](https://arxiv.org/pdf/2405.04434)中的MixFFN混合专家模块。
549 |
550 | * DeepSeek-V2在前馈网络(FFN)方面,采用了更细粒度的专家分割和共享的专家隔离技术,以提高Experts的效果。
551 |
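“共享专家 + 路由专家”的 MixFFN 可以用下面的草图理解(假设性示意:1 个共享专家 + 4 个路由专家、top-1 路由;实际的专家数、top-k 与负载均衡 loss 见 `./model/model.py` 与 `./model/LMConfig.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoEFFN(nn.Module):
    def __init__(self, dim, hidden_dim, n_routed=4):
        super().__init__()
        # 共享专家:所有 token 都会经过
        self.shared = nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
        # 路由专家:每个 token 只选择其中一个
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(n_routed)
        )
        self.gate = nn.Linear(dim, n_routed, bias=False)  # 路由打分

    def forward(self, x):                                 # x: (batch, seq, dim)
        scores = F.softmax(self.gate(x), dim=-1)          # 每个 token 的专家权重
        top_w, top_i = scores.max(dim=-1)                 # top-1 路由
        out = self.shared(x)
        for e, expert in enumerate(self.experts):
            mask = (top_i == e).unsqueeze(-1)             # 被路由到专家 e 的 token
            out = out + mask * top_w.unsqueeze(-1) * expert(x)
        return out

y = TinyMoEFFN(640, 1728)(torch.randn(2, 8, 640))
print(y.shape)
```
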
552 | ---
553 |
554 | MiniMind的整体结构与上述一致,只是在RoPE计算、推理函数和FFN层的代码上做了一些小调整。
555 | 其结构如下图(重绘版):
556 |
557 | 
558 | 
559 |
560 | 修改模型配置见[./model/LMConfig.py](./model/LMConfig.py)。
561 | 参考模型参数版本见下表:
562 |
563 | | Model Name | params | len_vocab | rope_theta | n_layers | d_model | kv_heads | q_heads | share+route |
564 | |-------------------|--------|-----------|------------|----------|---------|----------|---------|-------------|
565 | | MiniMind2-Small | 26M | 6400 | 1e6 | 8 | 512 | 2 | 8 | - |
566 | | MiniMind2-MoE | 145M | 6400 | 1e6 | 8 | 640 | 2 | 8 | 1+4 |
567 | | MiniMind2 | 104M | 6400 | 1e6 | 16 | 768 | 2 | 8 | - |
568 | | minimind-v1-small | 26M | 6400 | 1e4 | 8 | 512 | 8 | 16 | - |
569 | | minimind-v1-moe | 4×26M | 6400 | 1e4 | 8 | 512 | 8 | 16 | 1+4 |
570 | | minimind-v1 | 108M | 6400 | 1e4 | 16 | 768 | 8 | 16 | - |
571 |
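上表各列对应的配置项大致可以这样表达(字段名为假设性示意,实际字段与默认值以 `./model/LMConfig.py` 为准):

```python
from dataclasses import dataclass

@dataclass
class TinyLMConfig:
    vocab_size: int = 6400      # len_vocab
    dim: int = 512              # d_model
    n_layers: int = 8
    n_heads: int = 8            # q_heads
    n_kv_heads: int = 2         # kv_heads(GQA:多个 q 头共享 kv 头)
    rope_theta: float = 1e6
    use_moe: bool = False       # MoE 版本为 True,并额外指定共享/路由专家数(share+route)

cfg_small = TinyLMConfig()                        # 对应 MiniMind2-Small
cfg_base = TinyLMConfig(dim=768, n_layers=16)     # 对应 MiniMind2
print(cfg_small, cfg_base, sep="\n")
```
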
572 | # 📌 Experiment
573 |
574 | ## Ⅰ 训练开销
575 |
576 | - **时间单位**:小时 (h)。
577 | - **成本单位**:人民币 (¥);7¥ ≈ 1美元。
578 | - **3090 租卡单价**:≈1.3¥/h(可自行参考实时市价)。
579 | - **参考标准**:表格仅实测 `pretrain` 和 `sft_mini_512` 两个数据集的训练时间,其它耗时根据数据集大小估算(可能存在些许出入)。
580 |
581 | > 基于 3090 (单卡)成本计算
582 |
583 | | Model Name | params | pretrain | sft_mini_512 | sft_512 | sft_1024 | sft_2048 | RLHF |
584 | |-----------------|--------|------------------|------------------|---------------|-------------------|------------------|---------------|
585 | | MiniMind2-Small | 26M | ≈1.1h ≈1.43¥ | ≈1h ≈1.3¥ | ≈6h ≈7.8¥ | ≈4.58h ≈5.95¥ | ≈7.5h ≈9.75¥ | ≈1h ≈1.3¥ |
586 | | MiniMind2 | 104M | ≈3.9h ≈5.07¥ | ≈3.3h ≈4.29¥ | ≈20h ≈26¥ | ≈15h ≈19.5¥ | ≈25h ≈32.5¥ | ≈3h ≈3.9¥ |
587 |
588 | ---
589 |
590 |
591 | 训练开销总结&预测
592 |
593 |
594 | > MiniMind2-Small参数
595 | >> `pretrain_hq`+`sft_mini_512`数据集
596 | 单卡3090 (1 epoch) + 2.1小时 + 花费2.73元人民币
597 | 即可从0训练出MiniMind-Zero-0.025B模型!!!
598 |
599 | > MiniMind2-Small参数
600 | >> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo`数据集
601 | 单卡3090 (2 epochs) + 大约38.16小时 + 花费49.61元人民币
602 | 即可从0训练出MiniMind2-Small-0.025B模型!!!
603 |
604 | > MiniMind2参数
605 | >> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo`数据集
606 | 单卡3090 (2 epochs) + 大约122小时 + 花费158.6元人民币
607 | 即可从0训练出MiniMind2-0.1B模型!!!
608 |
609 |
610 |
611 |
612 |
613 | ✨基于单卡NVIDIA 3090的`MiniMind-Zero`从0训练仅需`2小时` + `3块钱`,实现ChatBot效果!
614 |
615 | ✨PS:若采用8卡4090训练,总用时甚至可以压缩到10分钟以内!(由于时间更短,花费同样3元左右,与单卡成本相当)
616 |
617 | ✨以极低极低的门槛,实现人人可玩的大模型自由!这正是MiniMind系列的诞生初衷!
618 |
619 | ✨仅价值`3块钱`成本的`MiniMind-Zero`并不是噱头!Chat测试:
620 |
621 | ```text
622 | 👶: 请介绍一下自己。
623 | 🤖️: 作为人工智能,我没有实际的生活,也没有自我意识,所以没有自己的生活。我被设计成能够帮助用户解答问题、提供信息、进行对话等。我的设计和功能是由计算机科学、机器学习、人工智能技术和算法所构建的,我通过编程和算法实现。
624 |
625 | 👶: 推荐一些杭州的特色美食吧。
626 | 🤖️: 杭州有很多美味的特色美食,比如鸡肉串、鳗鱼头、水煮鱼、豆腐脑等。这些美食都非常美味,有着独特的风味和口感,绝对是值得一试的美食。
627 |
628 | 👶: 请为我讲解“大语言模型”这个概念。
629 | 🤖️: 大语言模型是一种人工智能技术,它使用大量的文本数据来训练模型,然后在不断的迭代中不断优化模型。它的目标是根据输入的数据来生成符合特定需求的文本,这些文本可以是任何你感兴趣的主题。
630 |
631 | 👶: Introduce the history of the United States, please.
632 | 🤖️: 您提到的“Introok's the believeations of theument." 这个名字来源于中国古代的"groty of of the change."
633 | ```
634 |
635 | 极速且初具效果,甚至仍然可以进一步压缩获取更小更优质的训练数据。
636 | Zero模型权重保存为 `full_sft_512_zero.pth`(见下文MiniMind模型文件链接),如有兴趣可下载检验此模型效果。
637 |
638 |
639 | ---
640 |
641 | ## Ⅱ 主要训练步骤
642 |
643 | ### **1. 预训练(Pretrain)**:
644 |
645 | LLM首先要学习的并非直接与人交流,而是让网络参数中充满知识的墨水,“墨水” 理论上喝得越饱越好,产生大量的对世界的知识积累。
646 | 预训练就是让Model先埋头苦学大量基本的知识,例如从Wiki百科、新闻、书籍整理大规模的高质量训练数据。
647 | 这个过程是“无监督”的,即人类不需要在过程中做任何“有监督”的校正,而是由模型自己从大量文本中总结规律学习知识点。
648 | 模型此阶段目的只有一个:**学会词语接龙**。例如我们输入“秦始皇”四个字,它可以接龙“是中国的第一位皇帝”。
649 |
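“词语接龙”落到代码上就是下一个token的交叉熵:输入序列右移一位作为标签。下面是一个假设性的极简示意(假设 `model(input_ids)` 直接返回 logits,非项目源码):

```python
import torch
import torch.nn.functional as F

def pretrain_loss(model, input_ids):
    # model 返回形状为 (batch, seq, vocab_size) 的 logits(此处为假设)
    logits = model(input_ids[:, :-1])       # 用前 n-1 个 token 预测下一个
    targets = input_ids[:, 1:]              # 标签即为右移一位的序列
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```
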
650 | ```bash
651 | torchrun --nproc_per_node 1 train_pretrain.py # 1即为单卡训练,可根据硬件情况自行调整 (设置>=2)
652 | # or
653 | python train_pretrain.py
654 | ```
655 |
656 | > 训练后的模型权重文件默认每隔`100步`保存为: `pretrain_*.pth`(*
657 | > 为模型具体dimension,每次保存时新文件会覆盖旧文件)
658 |
659 | ### **2. 有监督微调(Supervised Fine-Tuning)**:
660 |
661 | 经过预训练,LLM此时已经掌握了大量知识,然而此时它只会无脑地词语接龙,还不会与人聊天。
662 | SFT阶段就需要把半成品LLM施加一个自定义的聊天模板进行微调。
663 | 例如模型遇到这样的模板【问题->回答,问题->回答】后不再无脑接龙,而是意识到这是一段完整对话的结束。
664 | 称这个过程为指令微调,就如同让已经学富五车的「牛顿」先生适应21世纪智能手机的聊天习惯,学习屏幕左侧是对方消息,右侧是本人消息这个规律。
665 | 在训练时,MiniMind的指令和回答长度被截断在512,是为了节省显存空间。就像我们学习时,会先从短的文章开始,当学会写作200字作文后,800字文章也可以手到擒来。
666 | 在需要长度拓展时,只需要准备少量的2k/4k/8k长度对话数据进行进一步微调即可(此时最好配合RoPE-NTK的基准插值)。
667 | > 在推理时通过调整RoPE线性插值,实现免训练长度外推到2048及以上将会很方便。
668 |
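RoPE线性插值的思路很简单:把超出训练长度的位置索引按比例“压缩”回训练长度范围内。下面是一个假设性的极简示意(函数与参数命名均为示例,非项目源码):

```python
import torch

def rope_freqs(seq_len, dim, theta=1e6, train_len=512):
    # 线性插值:推理长度超过训练长度时,等比例缩小位置索引
    scale = min(1.0, train_len / seq_len)
    pos = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(pos, inv_freq)   # (seq_len, dim/2) 的旋转角度

print(rope_freqs(2048, 64).shape)
```
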
669 | ```bash
670 | torchrun --nproc_per_node 1 train_full_sft.py
671 | # or
672 | python train_full_sft.py
673 | ```
674 |
675 | > 训练后的模型权重文件默认每隔`100步`保存为: `full_sft_*.pth`(*
676 | > 为模型具体dimension,每次保存时新文件会覆盖旧文件)
677 |
678 | ## Ⅲ 其它训练步骤
679 |
680 | ### **3. 人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)**
681 |
682 | 在前面的训练步骤中,模型已经具备了基本的对话能力,但是这样的能力完全基于单词接龙,缺少正反样例的激励。
683 | 模型此时尚不知道什么回答是好的,什么是差的。我们希望它能够更符合人的偏好,降低让人类不满意答案的产生概率。
684 | 这个过程就像是让模型参加新的培训:把优秀员工的行为作为正例,消极员工的行为作为反例,学习如何更好地回复。
685 | 此处使用的是RLHF系列之-直接偏好优化(Direct Preference Optimization, DPO)。
686 | 与PPO(Proximal Policy Optimization)这种需要奖励模型、价值模型的RL算法不同;
687 | DPO通过推导PPO奖励模型的显式解,把在线奖励模型换成离线数据,Ref模型输出可以提前保存。
688 | DPO性能几乎不变,只用跑 actor_model 和 ref_model 两个模型,大大节省显存开销,同时增加训练稳定性。
689 |
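DPO的核心损失可以用几行PyTorch表达。下面是一个假设性的极简示意(输入为 actor 与 ref 模型在 chosen/rejected 回复上按 token 求和的对数概率,非项目源码):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """pi_* / ref_* 为 actor / ref 模型对各回复的 log 概率之和。"""
    logits = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * logits).mean()

# 随机数示例:chosen 的相对 log 概率越高,loss 越小
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
               torch.tensor([-6.0]), torch.tensor([-8.0])))
```
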
690 | > 注:RLHF训练步骤**并非必须**,此步骤难以提升模型“智力”而通常仅用于提升模型的“礼貌”,有利(符合偏好、减少有害内容)也有弊(样本收集昂贵、反馈偏差、多样性损失)。
691 |
692 | ```bash
693 | torchrun --nproc_per_node 1 train_dpo.py
694 | # or
695 | python train_dpo.py
696 | ```
697 |
698 | > 训练后的模型权重文件默认每隔`100步`保存为: `rlhf_*.pth`(*
699 | > 为模型具体dimension,每次保存时新文件会覆盖旧文件)
700 |
701 | ### **4. 知识蒸馏(Knowledge Distillation, KD)**
702 |
703 | 在前面的所有训练步骤中,模型已经完全具备了基本能力,通常可以学成出师了。
704 | 而知识蒸馏可以进一步优化模型的性能和效率,所谓知识蒸馏,即学生模型面向教师模型学习。
705 | 教师模型通常是经过充分训练的大模型,具有较高的准确性和泛化能力。
706 | 学生模型是一个较小的模型,目标是学习教师模型的行为,而不是直接从原始数据中学习。
707 | 在SFT学习中,模型的目标是拟合词Token分类硬标签(hard labels),即真实的类别标签(词表大小为6400时,即 0~6399 中的某个token id)。
708 | 在知识蒸馏中,教师模型的softmax概率分布被用作软标签(soft labels)。小模型仅学习软标签,并使用KL-Loss来优化模型的参数。
709 | 通俗地说,SFT直接学习老师给的解题答案。而KD过程相当于“打开”老师聪明的大脑,尽可能地模仿老师“大脑”思考问题的神经元状态。
710 | 例如,当老师模型计算`1+1=2`这个问题的时候,最后一层神经元a状态为0,神经元b状态为100,神经元c状态为-99...
711 | 学生模型通过大量数据,学习教师模型大脑内部的运转规律。这个过程即称之为:知识蒸馏。
712 | 知识蒸馏的目的只有一个:让小模型体积更小的同时效果更好。
713 | 然而随着LLM诞生和发展,模型蒸馏一词被广泛滥用,从而产生了“白盒/黑盒”知识蒸馏两个派别。
714 | GPT-4这种闭源模型,由于无法获取其内部结构,因此只能面向它所输出的数据学习,这个过程称之为黑盒蒸馏,也是大模型时代最普遍的做法。
715 | 黑盒蒸馏与SFT过程完全一致,只不过数据是从大模型的输出收集,因此只需要准备数据并且进一步FT即可。
716 | 注意更改被加载的基础模型为`full_sft_*.pth`,即基于微调模型做进一步的蒸馏学习。
717 | `./dataset/sft_1024.jsonl`与`./dataset/sft_2048.jsonl` 均收集自qwen2.5-7/72B-Instruct大模型,可直接用于SFT以获取Qwen的部分行为。
718 |
719 | ```bash
720 | # 注意需要更改train_full_sft.py数据集路径,以及max_seq_len
721 | torchrun --nproc_per_node 1 train_full_sft.py
722 | # or
723 | python train_full_sft.py
724 | ```
725 |
726 | > 训练后的模型权重文件默认每隔`100步`同样保存为: `full_sft_*.pth`(*为模型具体dimension,每次保存时新文件会覆盖旧文件)
727 |
728 | 此处应当着重介绍MiniMind实现的白盒蒸馏代码`train_distillation.py`,由于MiniMind同系列本身并不存在强大的教师模型,因此白盒蒸馏代码仅作为学习参考。
729 |
730 | ```bash
731 | torchrun --nproc_per_node 1 train_distillation.py
732 | # or
733 | python train_distillation.py
734 | ```
735 |
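白盒蒸馏中“软标签 + KL-Loss”的计算大致如下(假设性示意,温度系数与混合权重均为示例值,实际实现见 `train_distillation.py`):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    # 软标签:学生分布向教师分布对齐(KL 散度)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # 硬标签:常规交叉熵
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kl + (1 - alpha) * ce

s, t = torch.randn(4, 6400), torch.randn(4, 6400)
print(distill_loss(s, t, torch.randint(0, 6400, (4,))))
```
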
736 | ### **5. LoRA (Low-Rank Adaptation)**
737 |
738 | LoRA是一种高效的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,旨在通过低秩分解的方式对预训练模型进行微调。
739 | 相比于全参数微调(Full Fine-Tuning),LoRA 只需要更新少量的参数。
740 | LoRA 的核心思想是:在模型的权重矩阵中引入低秩分解,仅对低秩部分进行更新,而保持原始预训练权重不变。
741 | 代码可见`./model/model_lora.py`和`train_lora.py`,完全从0实现LoRA流程,不依赖第三方库的封装。
742 |
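LoRA的“低秩外挂”可以用下面的草图理解(假设性示意:冻结原始权重 W,只训练秩为 r 的 A、B 两个小矩阵,实际实现见 `./model/model_lora.py`):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                     # 冻结原始权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # B 初始化为0,起点等价于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512, bias=False))
print(layer(torch.randn(2, 512)).shape)
```
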
743 | ```bash
744 | torchrun --nproc_per_node 1 train_lora.py
745 | # or
746 | python train_lora.py
747 | ```
748 |
749 | > 训练后的模型权重文件默认每隔`100步`保存为: `lora_xxx_*.pth`(*
750 | > 为模型具体dimension,每次保存时新文件会覆盖旧文件)
751 |
752 |
753 | 非常多的人困惑,如何使模型学会自己私有领域的知识?如何准备数据集?如何迁移通用领域模型打造垂域模型?
754 | 这里举几个例子,对于通用模型,医学领域知识欠缺,可以尝试在原有模型基础上加入领域知识,以获得更好的性能。
755 | 同时,我们通常不希望学会领域知识的同时损失原有基础模型的其它能力,此时LoRA可以很好地改善这个问题。
756 | 只需要准备如下格式的对话数据集放置到`./dataset/lora_xxx.jsonl`,启动 `python train_lora.py`
757 | 训练即可得到`./out/lora/lora_xxx.pth`新模型权重。
758 |
759 | **医疗场景**
760 |
761 | ```
762 | {"conversations": [{"role": "user", "content": "请问颈椎病的人枕头多高才最好?"}, {"role": "assistant", "content": "颈椎病患者选择枕头的高度应该根据..."}]}
763 | {"conversations": [{"role": "user", "content": "请问xxx"}, {"role": "assistant", "content": "xxx..."}]}
764 | ```
765 |
766 | **自我认知场景**
767 |
768 | ```
769 | {"conversations": [{"role": "user", "content": "你叫什么名字?"}, {"role": "assistant", "content": "我叫minimind..."}]}
770 | {"conversations": [{"role": "user", "content": "你是谁"}, {"role": "assistant", "content": "我是..."}]}
771 | ```
772 |
773 | 此时【基础模型+LoRA模型】即可获得医疗场景模型增强的能力,相当于为基础模型增加了LoRA外挂,这个过程并不损失基础模型的本身能力。
774 | 我们可以通过`eval_model.py`进行模型评估测试。
775 |
776 | ```bash
777 | # 注意:model_mode即选择基础模型的类型,这和train_lora是基于哪个模型训练的相关,确保统一即可。
778 | python eval_model.py --lora_name 'lora_medical' --model_mode 2
779 | ```
780 |
781 | **小测试**
782 |
783 | ```
784 | 👶: 我最近经常感到头晕,可能是什么原因?
785 | 🤖️: 头晕可能是由于多种原因,包括但不限于:心脏疾病、眼睛损伤、过敏反应、皮肤问题、过敏反应、压力、焦虑、疲劳、药物副作用、性功能障碍或者其他...
786 |
787 | ---
788 |
789 | 👶: 你和openai是什么关系?
790 | 🤖️: 我是 MiniMind,一个由 Jingyao Gong 开发的人工智能助手。我通过自然语言处理和算法训练来与用户进行交互。
791 | ```
792 |
793 | PS:只要有所需要的数据集,也可以full_sft全参微调(需要进行通用知识的混合配比,否则过拟合领域数据会让模型变傻,损失通用性)
794 |
795 | ### **6. 训练推理模型 (Reasoning Model)**
796 |
797 | DeepSeek-R1实在太火了,几乎重新指明了未来LLM的新范式。
798 | 论文指出`>3B`的模型经历多次反复的冷启动和RL奖励训练才能获得肉眼可见的推理能力提升。
799 | 最快最稳妥最经济的做法,以及最近爆发的各种各样所谓的推理模型几乎都是直接面向数据进行蒸馏训练,
800 | 但由于缺乏技术含量,蒸馏派被RL派瞧不起(hhhh)。
801 | 本人已经迅速在Qwen系列1.5B小模型上进行了尝试,很快复现了Zero过程的数学推理能力。
802 | 然而一个遗憾的共识是:参数太小的模型直接通过冷启动SFT+GRPO几乎不可能获得任何推理效果。
803 | MiniMind2第一时间只能坚定不移地选择做蒸馏派,日后基于0.1B模型的RL如果同样取得小小进展会更新此部分的训练方案。
804 |
805 | 做蒸馏需要准备的依然是和SFT阶段同样格式的数据即可,数据集来源已如上文介绍。数据格式例如:
806 |
807 | ```json lines
808 | {
809 | "conversations": [
810 | {
811 | "role": "user",
812 | "content": "你好,我是小芳,很高兴认识你。"
813 | },
814 | {
815 | "role": "assistant",
816 | "content": "<think>\n你好!我是由中国的个人开发者独立开发的智能助手MiniMind-R1-Lite-Preview,很高兴为您提供服务!\n</think>\n<answer>\n你好!我是由中国的个人开发者独立开发的智能助手MiniMind-R1-Lite-Preview,很高兴为您提供服务!\n</answer>"
817 | }
818 | ]
819 | }
820 | ```
821 |
822 | 推理模型R1的回复模板是:
823 |
824 | ```text
825 | <think>\n思考过程\n</think>\n
826 | <answer>\n最终回答\n</answer>
827 | ```
828 |
829 | 这在GRPO中通过设置规则奖励函数约束模型符合思考标签和回复标签(在冷启动靠前的阶段奖励值设置应该提高一些)
830 |
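一个最朴素的“规则奖励函数”示意如下(假设性代码,仅检查输出是否符合思考/回答标签格式,奖励值仅作演示):

```python
import re

def format_reward(text: str, bonus: float = 1.0) -> float:
    """输出完整包含 <think>...</think><answer>...</answer> 结构时给奖励,否则为 0。"""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return bonus if re.search(pattern, text, flags=re.S) else 0.0

print(format_reward("<think>\n推理过程\n</think>\n<answer>\n42\n</answer>"))
```
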
831 | 另一个问题是:蒸馏过程虽然和SFT一样,但实验结果是模型难以每次都给出符合模板规范的回复,即脱离思考和回复标签约束。
832 | 这里的小技巧是增加标记位置token的损失惩罚,详见`train_distill_reason.py`:
833 |
834 | ```text
835 | # 在 sp_ids 对应的位置增加额外的惩罚
836 | ...
837 | loss_mask[sp_ids] = 10 # 惩罚系数
838 | ```
839 |
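结合 loss mask 的加权交叉熵大致如下(假设性示意,惩罚系数与位置索引仅作演示,实际实现见 `train_distill_reason.py`):

```python
import torch
import torch.nn.functional as F

def masked_weighted_ce(logits, targets, loss_mask):
    # logits: (batch*seq, vocab);targets / loss_mask: (batch*seq,)
    loss = F.cross_entropy(logits, targets, reduction="none")
    return (loss * loss_mask).sum() / loss_mask.sum().clamp(min=1e-8)

logits = torch.randn(8, 6400)
targets = torch.randint(0, 6400, (8,))
loss_mask = torch.ones(8)
loss_mask[[0, 3]] = 10      # 在思考/回答标签所在位置加大惩罚
print(masked_weighted_ce(logits, targets, loss_mask))
```
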
840 | 还有一个tips:由于推理数据只筛选了`<1024`长度的数据,其中多轮对话和英文数据偏少,
841 | 因此`r1_mix_1024.jsonl`进行了大约10k条多轮对话+英文数据的混合,防止模型遗忘严重。
842 |
843 | 脚本默认基于rlhf后的基模型做推理能力的蒸馏微调,下面直接启动训练即可:
844 |
845 | ```bash
846 | torchrun --nproc_per_node 1 train_distill_reason.py
847 | # or
848 | python train_distill_reason.py
849 | ```
850 |
851 | > 训练后的模型权重文件默认每隔`100步`保存为: `reason_*.pth`(*为模型具体dimension,每次保存时新文件会覆盖旧文件)
852 |
853 | 测试一下:
854 |
855 | ```text
856 | 👶: 请介绍一下自己。
857 | 🤖️: <think>
858 | 嗯,用户让我介绍一下自己。我现在需要解释一下自己。首先,我应该明确用户是谁,是学生、职场人士还是普通人。用户可能是一个学生或者职场人士,所以信息传递应该是帮助学生和职场人士的信息。
859 | 然后,我需要回应用户,这可能包括介绍自己的位置、兴趣爱好、职业经历等信息。用户还可能想了解自己在哪里,比如职业发展、工作经验等。
860 | 我应该注意用词要正式,不要太正式,同时要保持简洁明了。这样用户容易理解,也方便他们查阅相关资料。
861 | 最后,我应该总结一下,让用户知道我是如何帮助他们完成他们的任务的,这样他们有更多关于我的信息。
862 | </think>
863 | <answer>
864 | 我是**“你的名字”,你叫什么名字?
865 | </answer>
866 |
867 | 👶: 你更擅长哪一个学科?
868 | 🤖️: <think>
869 | 您好!我是由中国的个人开发者开发的智能助手MiniMind-R1。如您有任何任何问题,我会尽我所能为您提供帮助。
870 | </think>
871 | <answer>
872 | 您好!我是由中国的个人开发者开发的智能助手MiniMind-R1。如您有任何任何问题,我会尽我所能为您提供帮助。
873 | </answer>
874 | ```
875 |
876 | ---
877 |
878 | ## Ⅳ 模型参数设定
879 |
880 | 📋关于LLM的参数配置,有一篇很有意思的论文[MobileLLM](https://arxiv.org/pdf/2402.14905)做了详细的研究和实验。
881 | Scaling Law在小模型中有自己独特的规律。
882 | 引起Transformer参数成规模变化的参数几乎只取决于`d_model`和`n_layers`。
883 |
884 | * `d_model`↑ + `n_layers`↓ -> 矮胖子
885 | * `d_model`↓ + `n_layers`↑ -> 瘦高个
886 |
887 | 2020年提出Scaling Law的论文认为,训练数据量、参数量以及训练迭代次数才是决定性能的关键因素,而模型架构的影响几乎可以忽视。
888 | 然而似乎这个定律对小模型并不完全适用。
889 | MobileLLM提出架构的深度比宽度更重要,「深而窄」的「瘦长」模型可以学习到比「宽而浅」模型更多的抽象概念。
890 | 例如当模型参数固定在125M或者350M时,30~42层的「狭长」模型明显比12层左右的「矮胖」模型有更优越的性能,
891 | 在常识推理、问答、阅读理解等8个基准测试上都有类似的趋势。
892 | 这其实是非常有趣的发现,因为以往为100M左右量级的小模型设计架构时,几乎没人尝试过叠加超过12层。
893 | 这与MiniMind在训练过程中,模型参数量在`d_model`和`n_layers`之间进行调整实验观察到的效果是一致的。
894 | 然而「深而窄」的「窄」也是有维度极限的,当d_model<512时,词嵌入维度坍塌的劣势非常明显,
895 | 增加的layers并不能弥补词嵌入在固定q_head带来d_head不足的劣势。
896 | 当d_model>1536时,layers的增加似乎比d_model的优先级更高,更能带来具有“性价比”的参数->效果增益。
897 |
898 | * 因此MiniMind设定small模型dim=512,n_layers=8来获取的「极小体积<->更好效果」的平衡。
899 | * 设定dim=768,n_layers=16来获取效果的更大收益,更加符合小模型Scaling-Law的变化曲线。
900 |
901 | 作为参考,GPT3的参数设定见下表:
902 | 
903 |
904 | ---
905 |
906 | ## Ⅴ 训练结果
907 |
908 | MiniMind2 模型训练损失走势(由于数据集在训练后又更新清洗多次,因此Loss仅供参考)
909 |
910 | | models | pretrain (length-512) | sft (length-512) |
911 | |-----------------|----------------------------------------------------|----------------------------------------------------|
912 | | MiniMind2-Small | ![](./images/pre_512_loss.png) | ![](./images/sft_512_loss.png) |
913 | | MiniMind2 | ![](./images/pre_768_loss.png) | ![](./images/sft_768_loss.png) |
914 |
915 | ### 训练完成-模型合集
916 |
917 | > 考虑到多人反映百度网盘速度慢,MiniMind2及以后全部使用ModelScope/HuggingFace托管。
918 |
919 | #### ① PyTorch原生模型
920 |
921 | MiniMind2模型权重 ([ModelScope](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch) | [HuggingFace](https://huggingface.co/jingyaogong/MiniMind2-Pytorch))
922 |
923 | MiniMind-V1模型权重 ([百度网盘](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666))
924 |
925 |
926 | Torch文件命名对照
927 |
928 | | Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
929 | |-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
930 | | MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
931 | | MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
932 | | MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |
933 |
934 | | Model Name | params | pretrain_model | 单轮对话sft | 多轮对话sft | rl_model |
935 | |-------------------|--------|------------------------|------------------------------------|-----------------------------------|--------------|
936 | | minimind-v1-small | 26M | `pretrain_512.pth` | `single_chat/full_sft_512.pth` | `multi_chat/full_sft_512.pth` | `rl_512.pth` |
937 | | minimind-v1-moe | 4×26M | `pretrain_512_moe.pth` | `single_chat/full_sft_512_moe.pth` | `multi_chat/full_sft_512_moe.pth` | - |
938 | | minimind-v1 | 108M | `pretrain_768.pth` | `single_chat/full_sft_768.pth` | `multi_chat/full_sft_768.pth` | `rl_768.pth` |
939 |
940 |
941 |
942 | #### ② Transformers模型
943 |
944 | MiniMind系列 ([ModelScope](https://www.modelscope.cn/collections/MiniMind-b72f4cfeb74b47)
945 | | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))
946 |
947 | ---
948 |
949 | # 📌 Eval
950 |
951 | ## Ⅰ RLHF对比SFT篇
952 |
953 | 测试基于`full_sft`和`rlhf`的`MiniMind2系列`模型对比,测试随机种子均固定
954 |
955 |
956 | 问答实测
957 |
958 | [A0] MiniMind2:full_sft_640
959 |
960 | [A1] MiniMind2:rlhf_640
961 |
962 | ```text
963 | [Q]: 你叫什么名字?
964 | [A0]: 我是一个大型语言模型,可以回答各种问题,提供信息和帮助。我可以回答各种问题,提供信息和帮助,帮助你解决各种问题。
965 | [A1]: 我是一种基于文本的AI模型,我并没有个人经历或情感,所以无法感受和体验情感。我的设计目的是为了帮助用户解决问题、提供信息和进行对话。如果您有任何问题或需要帮助,请随时告诉我!
966 |
967 | [Q]: 你更擅长哪一个学科?
968 | [A0]: 作为一个AI模型,我并没有真正的“擅长”这个学科。我被设计成能够提供信息、解答问题、进行对话、执行任务和与用户进行对话。不过,我可以提供信息、解答问题、进行对话、提供建议、处理问题、进行对话、提供建议、学习新知识、编写代码、回答问题、提供建议等多种学科的功能。如果你有具体的学科需求或主题,欢迎提供详细信息。
969 | [A1]: 作为AI,我并没有个人能力或兴趣,但我可以提供一些关于数学、科学、计算机科学、语言学、心理学等学科的知识。如果你对数学感兴趣,我可以提供一些基础知识,例如代数、几何、概率、统计学等。如果你需要更深入的学习,我也可以提供相关的在线课程、书籍、论文等资源。请告诉我你具体需要什么帮助,我会尽力提供帮助。
970 |
971 | [Q]: 鲁迅的《狂人日记》是如何批判封建礼教的?
972 | [A0]: 鲁迅的《狂人日记》是一部讲述了鲁迅对人类价值和道德责任的倡导。在这部作品中,鲁迅通过细腻的笔触,探讨了人性、道德、正义、牺牲与责任等深刻的主题。以下是鲁迅在《狂人日记》中如何批判封建礼教的几个方面:1. **社会现实与道德观念**:鲁迅通过讲述鲁迅的故事,探讨了社会的不平等、权力的流动、人性的演变等社会问题。他认为,社会不平等不仅仅是不公正的,更是每个人在社会中的不公正。他呼吁社会各界共同努力,通过教育、宣传、法律等手段,为构建和谐社会做出贡献。2. **道德与责任**:鲁迅通过讲述鲁迅的故事,展现了道德和责任的重要性。他提倡通过道德教育来培养人们的道德品质,同时也强调了道德和责任的重要性。鲁迅通过讲述鲁迅的故事,揭示了社会的不平等和人性的复杂性,体现了对人性的深刻理解。3. **道德与责任**:鲁迅通过讲述鲁迅的故事,强调了道德责任的重要性。他提倡通过自己的行为来体现对社会的尊重和对人性的深刻理解。鲁迅通过讲述鲁迅的故事,强调了道德和责任的重要性。4. **道德与道德**:鲁迅通过讲述鲁迅的故事,揭示了道德和责任的真正意义。鲁迅通过讲述鲁迅的故事,揭示了道德和责任的本质,强调了道德和责任的重要性。5. **个人与社会**:鲁迅通过讲述鲁迅的故事,揭示了个人与社会之间的复杂关系。他呼吁社会各界共同努力,通过教育和宣传,为构建和谐社会做出贡献。鲁迅的《狂人日记》通过其丰富的故事和深刻的道德寓言,传达了关于人性、道德、正义、道德以及个人与社会的复杂关系。这部作品不仅在文学价值上取得了巨大成就,更在思想层面上启发了后来的文学创作,对后世产生了深远的影响。
973 | [A1]: 鲁迅的《狂人日记》是一部描绘战争和自由精神的作品,它批判了封建社会的种种问题和冲突,反映了社会的不平等和人性本质。在这部作品中,鲁迅通过对封建社会的诸多现象进行批判,如“弗雷德里克·约翰逊的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判,强调了封建社会的不平等、人性本质以及人性本质。鲁迅通过他的“鹰迪伦”、“纳扎特的”、“贝克汉姆”、“乔治·威尔逊”等,对封建社会的种种问题进行了批判。在小说中,鲁迅通过对封建社会的诸多现象进行批判,揭示了封建社会的腐败、人性本质以及人性本质的复杂性。这种批判体现了对封建社会的尊重和对人性的复杂性,同时也反映了现代社会中普遍存在的不平等和人性本质。鲁迅的《狂人日记》不仅是一部关于封建社会的小说,也是一部具有深刻社会意义的作品,对后世的审判和批判起到了关键作用。
974 |
975 | [Q]: 我咳嗽已经持续了两周,需要去医院检查吗?
976 | [A0]: 咳嗽确实需要一定的检查,但如果您的咳嗽持续时间过长,或者有其他不适症状(如持续咳嗽、呼吸困难等),建议尽快就医。医生会根据您的症状、体检结果和可能的影像学检查(如X光、MRI或CT扫描)来诊断您的症状。在等待就医期间,您可以尝试以下方法来帮助缓解症状:1. **休息**:确保您有足够的睡眠。2. **保持水分**:脱水可能会加重咳嗽。3. **热敷**:使用热敷袋或热毛巾敷在疼痛的部位,有助于缓解喉咙痛。4. **避免刺激**:避免使用可能刺激喉咙的化学物质,如某些化学物质。5. **保持室内湿润**:确保您的卧室有良好的通风,可以使用加湿器帮助保持空气湿度。6. **保持室内湿度**:使用加湿器可以帮助缓解喉咙干燥和咳嗽。如果症状持续不缓解或有其他不适症状,请立即就医。希望您早日康复!
977 | [A1]: 如果您的咳嗽持续了两周,通常建议您尽快就医,因为这可能会影响您的呼吸和心跳。您可以通过以下步骤来检查咳嗽:1. **休息**:用纸巾或肘部轻轻按压咳嗽的位置。2. **使用药物**:在可能的情况下,使用药物(如抗过敏药、抗过敏药等)来缓解咳嗽。请在医生的指导下使用药物。3. **热敷**:在咳嗽期间,可以使用热敷来缓解疼痛和肿胀。4. **药物治疗**:如果咳嗽持续了两周,可以使用一些消炎药或抗过敏药来缓解症状。请务必在医生的指导下使用。5. **保持健康的生活方式**:保持充足的睡眠、均衡的饮食、适量的运动和避免吸烟和限酒,这些都有助于提高咳嗽的几率。请记住,这些建议不能替代专业医疗意见。如果您的症状持续或加重,强烈建议您尽快就医。
978 | ```
979 |
980 |
981 |
982 | 👉效果总结
983 |
984 | full_sft模型在简洁性和信息准确性方面表现更好;rlhf模型在回答中倾向于提供更多的背景信息,但信息准确性有待改进。
985 | 总的来说RLHF后的模型倾向于学习:说更多有礼貌但无用的废话讨好“对话”本身,而对信息准确性则有轻微损失。
986 | 天下没有免费的午餐,还需要继续提升RLHF数据集的质量,也要接受模型能力无法避免的损失(程度有轻重)。
987 | DPO和在线PPO的区别在于reject和chosen都是离线准备的,和minimind模型本身的输出必然存在很大的分布差异。
988 | 通俗地说DPO算法使模型观看乒乓球世界冠军的打法「录像」进行RL,而不是像PPO一样请reward模型做「教练」纠正自己的打法进行RL。
989 |
990 | ## Ⅱ 主观样例测评
991 |
992 | 🏃以下测试于2025-02-09完成,此日期后发布的新模型,无特殊需要时将不加入测试。
993 |
994 | [A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
995 | [B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
996 | [C] [MiniMind2-Small (0.02B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
997 | [D] [minimind-v1-small(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)
998 | [E] [minimind-v1-moe(0.1B)](https://pan.baidu.com/s/1tqB-GMvuiGQBvEl-yZ-oBw?pwd=6666)
999 | [F] [minimind-v1(0.1B)](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666)
1000 | [G] [baby-llama2-chinese(0.2B)](https://github.com/DLLXW/baby-llama2-chinese)
1001 | [H] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)
1002 |
1003 | ```text
1004 | [Q]: 你知道长江吗?
1005 | [A]: 长江,位于中国东南部,是中国的第一长河,全长6300多公里,流经11个省份,是中国最长的河流,也是中国的“母亲河”。长江的历史可以追溯到公元前2000多年的商朝时期,当时的商朝是商朝的后裔。商朝时期,长江的经济发展达到了高峰,是中国最重要的农业和工业中心之一。长江流域的经济主要以农业为主,长江渔业占据了主要产业,如钓鱼、猎鱼等。长江的水资源也非常丰富,包括河流、湖泊、河流等。长江的自然景观也十分壮观,长江四季景色各异,春季多雨,夏季多雨,秋季则多雨,冬季则寒冷多雪,春季和秋季都有美丽的花朵和果实。长江是中国的经济中心,拥有众多的国家和城市,是世界上重要的交通枢纽和航运中心。同时,长江还是中国的重要水利资源,对于中国的经济发展具有重要意义。
1006 | [B]: 是的,我知道长江。长江位于中国的中部,是中国第三长的河流,全长约6300公里,流经江西省、安徽省、江苏省、安徽省、江苏省和安徽省。
1007 | [C]: 当然,长江是中国最长的河流,全长约6300公里。它以其丰富的水资源、丰富的水生动植物资源以及独特的地理位置而闻名于世。
1008 | [D]: 是的,长江是中国最长的河流,全长超过3300千米。
1009 | [E]: 长江是中国最长的河流,总长度约为6300公里。它发源于青藏高原的青海湖,流经青海、四川、甘肃、宁夏、内蒙古、陕西和山西。长江流域是中国最大的农业区之一,也是中国的文化和经济中心。
1010 | [F]: 是的,长江是中国最长的河流之一,也是世界第三大河流。长江发源于青藏高原,流经11个省份,最终注入东海。长江流域是中国最大的河流流域之一,拥有着丰富的自然资源和文化遗产,如长江大桥、长江海洋博物馆、长江古城等。
1011 | [G]: 同时也知道,中国是世界上最长的城市之一,拥有超过3400万人口。
1012 | [H]: 是的,长江是中国最长的河流,也是世界上最长的河流之一。它发源于青藏高原,流经九个省区,最终注入东海。长江流域的气候温和,四季分明,是中国重要的农业和工业基地。长江流域的水产资源丰富,是中国的重要渔业资源之一。
1013 |
1014 |
1015 | [Q]: 世界上最高的山峰是什么?
1016 | [A]: 世界上最高的山峰是珠穆朗玛峰(Mount Everest),位于喜马拉雅山脉,尼泊尔和中国的边界上。它的海拔高度是8,848.86米(29,031.7英尺),是地球上最高的山峰。
1017 | [B]: 珠穆朗玛峰(Mount Everest)是世界上最高的山峰,海拔高度达8,848米(29,029英尺)。
1018 | [C]: 珠穆朗玛峰是世界上最高的山峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.7英尺)。它位于尼泊尔和中国的边界上,是喜马拉雅山脉的一部分。
1019 | [D]: 世界上最高的山峰是珠穆朗玛峰,位于尼泊尔和中国的边境上,海拔8848米。
1020 | [E]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848米。
1021 | [F]: 珠穆朗玛峰是世界上最高的山峰,海拔8848米。
1022 | [G]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.69英尺)。
1023 | [H]: 世界上最高的山峰是珠穆朗玛峰,也称为萨加玛塔,位于喜马拉雅山脉的尼泊尔和中国边境之间,海拔8,848.86米(29,031.69英尺)。它是世界上最著名的登山目标之一,吸引着来自世界各地的登山者前来挑战。珠穆朗玛峰的名字来自藏语,意思是“女神的额峰”,它是由两个不同的山峰组成的,其中一个是位于尼泊尔的珠穆朗玛峰,另一个则位于中国。它的高度是通过测量海拔高度来确定的,因为它是通过测量地球的平均高度来确定的。尽管珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何其他地方都能到达的。地球上的其他地方,如南美洲的安第斯山脉,也是世界上最高的山脉,但它并不是地球上最高的山峰。总之,珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何地方都能到达的。它是地球上最高的山脉,但它并不是地球上任何地方都能到达的。
1024 |
1025 |
1026 | [Q]: 万有引力是谁提出的?
1027 | [A]: 万有引力是由牛顿在17世纪提出的,它是牛顿三大定律之一。牛顿在17世纪提出的引力定律揭示了万有引力定律的本质,并且解释了行星、卫星、彗星等天体之间如何相互作用。这个定律是由艾萨克·牛顿在1687年提出的,他通过对万有引力定律的分析和实验,成功地解释了天体之间的引力作用,从而解释了地球绕太阳旋转的现象。
1028 | [B]: 万有引力(Human Policy)是牛顿的三大力学定律,用于描述两个物体之间的相互作用力。它由万有引力定律和牛顿的第一定律(惯性定律)构成。万有引力是由于两个物体之间相互作用而产生的,它们的质量、距离、速度和方向等因素决定了它们之间的相互作用。这种相互作用力是:质量的乘积:质量的乘积是万有引力定律和牛顿第二定律的乘积。质量的乘积:质量的乘积是物体之间的一种物理过程,它是物体间相互作用的产物,对于其他物体来说,其质量可以表示为两个物体之间的质量之和,即重力加速度a。
1029 | [C]: 万有引力是一个由牛顿在1687年提出的物理学概念,它描述了物体之间引力如何相互作用的规律。
1030 | [D]: 万有引力是由英国物理学家牛顿提出的。他认为,所有物体都会产生引力,这个引力是由物体的质量和距离决定的。这个理论揭示了质量和距离之间的关系,并证明了万有引力的本质。
1031 | [E]: 万有引力是由牛顿在1687年提出的。它是宇宙中一个基本的物理现象,它描述了物体之间的相互作用。这个力被称为万有引力,它在所有物体之间都存在,这种力与物体的质量和距离有关。万有引力是自然界中最基本的力量之一,它不仅影响着物体的运动和变化,还对宇宙的结构和演化产生着重要的影响。
1032 | [F]: 万有引力是英国物理学家艾萨克·牛顿在17世纪提出的。
1033 | [G]: 引力是宇宙中一种力量,它是由质量或能量所组成的。根据牛顿的万有引力定律,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。因此,牛顿的万有引力定律表明,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。这个定律在物理学、天文学和物理学中都有广泛的应用。
1034 | [H]: 万有引力是由英国物理学家艾萨克·牛顿于1687年提出的。他在1687年发表了一篇著名的论文,题为《自然哲学的数学原理》,其中提出了万有引力定律,指出任何两个物体之间都存在着一种引力,这种引力的大小与它们的质量成正比,与它们之间的距离的平方成反比。这个定律被称为万有引力定律,它描述了质量和距离之间的关系。它表明,两个物体之间的引力与它们的质量成正比,与它们之间的距离的平方成反比。这个定律对于解释宇宙中的许多现象和现象之间的相互作用非常重要,例如行星绕太阳运动、恒星之间的引力透镜效应等等。它也被广泛应用于天文学、物理学和工程学等领域,为我们理解宇宙的本质和演化提供了重要的理论基础。
1035 |
1036 |
1037 | [Q]: 你知道大熊猫的主要食物是什么吗?
1038 | [A]: 大熊猫主要以竹子为食,但也会吃其他植物、昆虫、小型哺乳动物和鱼类。它们的消化系统与其他熊科动物类似,能够消化竹子。大熊猫的牙齿锋利,适合咀嚼竹子,因此它们经常被称为“竹食动物”。
1039 | [B]: 大熊猫是以竹子为食的哺乳动物,它们是肉食动物。它们主要食物是竹子,特别是竹子中富含纤维素和矿物质,并且不会直接进入消化系统。大熊猫的肠道主要由竹子组成,它们需要消化竹子中的纤维素和其他纤维素来支持它们的正常进食。
1040 | [C]: 大熊猫主要以竹子为食,但它们也吃其他植物,包括树叶、嫩芽、昆虫、水果、蔬菜和小型哺乳动物等。大熊猫是中国特有的物种,属于熊科,主要分布在中国的四川、陕西和甘肃等地。
1041 | [D]: 大熊猫的主要食物是竹子。它们通常吃竹子,每天需要吃掉大约30公斤的竹子。
1042 | [E]: 是的,大熊猫主要以竹子为食。它们每天需要吃掉大量的竹子来维持生命,每天需要吃掉40至80公斤的竹子。此外,它们也会吃一些其他的植物,如草、果实、坚果和小型哺乳动物。
1043 | [F]: 大熊猫主要以竹子为食,也会偶尔食用其他的植物和动物。此外,大熊猫也会吃一些野果、小动物和植物,但这些都不属于它们的主要食物来源。
1044 | [G]: 大熊猫的主要食物是竹子。
1045 | [H]: 大熊猫是一种以竹子为主要食物的植食动物,它们的肠道结构适合消化纤维素含量高的竹子,但也会吃一些其他植物,如草本植物和小型哺乳动物。大熊猫的肠道结构也适应了高纤维的竹子,这使得它们能够消化大量的纤维素,从而帮助它们在野外生存。大熊猫的肠道中还含有一种叫做“纤维素酶”的酶类物质,这种酶能够帮助大熊猫分解纤维素,从而帮助它们消化大量的水分和营养物质。此外,大熊猫的肠道还含有一种叫做“纤维素蛋白”的蛋白质,这种蛋白质能够帮助大熊猫维持体内的能量水平,从而帮助它们在野外生存。总的来说,大熊猫的主要食物是竹子,但它们也会吃一些其他植物,如草本植物和小型哺乳动物。
1046 |
1047 |
1048 | [Q]: 海水为什么是咸的?
1049 | [A]: 海水之所以是咸的,主要是由于其中的化学成分。海水中含有大量的矿物质和盐分,其中主要成分是硫酸盐、碳酸盐、镁离子和钾离子。这些物质在高浓度下,会沉淀在海水中,形成海水。具体来说,海水中的矿物质和盐分主要来自于岩石、土壤、生物和人类排泄物等。岩石是海水的组成部分,主要由矿物质组成,如石英、长石、云母等。土壤中含有大量的化学物质,如氮、磷、钾等,这些物质会渗透到土壤中,形成土壤。生物体内的盐分主要来源于植物的叶绿素等色素,这些色素会吸收太阳光中的能量,转化为生物体可以利用的能量。而人类排泄物中含有的盐分主要来源于矿物质和碳酸盐,这些矿物质在水中的浓度较高,会导致水中溶解的矿物质浓度增加,形成水体中的盐分。因此,海水之所以是咸的,是由于其中的化学成分和结构特征。海水中的矿物质和盐分主要来自于岩石、土壤、生物和人类排泄物等,而这些物质的积累和排泄会导致海水呈现出咸味。
1050 | [B]: 海水是咸的,因为它包含了多种物质,包括氯化钠(NaCl)、氯化钠(NaCl)、碳酸盐(如硫酸盐)、氯化钠(NaCl)和水。咸物质的来源:盐是咸的常见来源,含有钾、钙、镁、钾等矿物质。这些矿物质通常来自于地下水、地下水或土壤中,这些矿物质的含量远低于纯净水。咸味的原因:盐的咸味主要源于其溶解于水中的微小颗粒物。这些小颗粒物主要通过植物根部和动物体内的微生物传播,如昆虫、小型哺乳动物、鸟类和某些昆虫。综上所述,海水是咸的,因为它含有大量的盐类,包括矿物质、岩石、盐、矿物质等,这些物质在海水中的浓度高,对咸味的产生较大影响。
1051 | [C]: 海水是咸的,主要是因为其中含有各种盐分,其中以海水为主要成分之一。当海水中含有大量的盐分时,盐分会被溶解在海水中,形成咸味。海水中的盐分主要来源于溶解盐类、蒸发析出的盐和海水冻结成冰等因素。
1052 | [D]: 海水中含有许多盐分,这些盐分可以导致海水咸度的增加。海水中的盐分主要来自于海水中的盐分,这些盐分可以从海水中的盐分中获得。海水中的氧化物和盐分可以与海水中的盐分形成氢键,从而使海水咸度增加。
1053 | [E]: 海水是咸的,这是因为它含有大量的盐分,其中包括氯离子和氯离子。这些离子在海洋中形成了大量的盐,这些盐分被称为盐分。海水中含有许多溶解在水中的盐,这些盐分会随着海水的升高而逐渐蒸发,最终导致海水中盐分的蒸发。这些盐分在海水中被完全消耗,然后被排除在海水中。此外,海水中还含有许多其他的盐分,但这些盐分是不同的,因为它们在水中的浮起和下降过程是不同的。海水中的盐分是从海底的岩石中提取出来的,这些盐分在海水中可以被利用。总的来说,海水是咸的,这也是为什么海水中含有许多盐分和盐分的原因。
1054 | [F]: 海水含有许多溶解的盐类,其中最主要的是氯化钠。当水中加入盐时,盐的分子会与水分子发生反应,形成氢氧化钠,这是一种盐。
1055 | [G]: 因为水是许多生物的生存所必需的,它能够吸收和释放二氧化碳,释放氧气,维持大气中的水蒸气,以及维持水循环。水中的盐类和化学物质会释放出氧气,使地球上的生物呼吸。此外,水还具有许多其他功能,如保持水循环、维持水循环、调节气候和环境等。水是许多生物生存所必需的,它也是许多自然现象的重要基础。
1056 | [H]: 海水是咸的主要原因是因为其中含有大量的溶解盐类,其中最主要的是氯化钠(NaCl)。氯化钠是一种非常常见的盐类,它存在于地球上的大部分地区,包括海洋、湖泊、河流和地下水。当水从海洋中蒸发时,它会带走一部分盐分,使海水变得咸。此外,海中还含有其他溶解盐类的物质,如硫酸盐、硫酸镁、硫酸钙、硫酸镁盐、硫酸镁钾、硫酸镁钠、硫酸镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钠、硫酸镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁镁钙、硫酸镁镁镁镁
1057 |
1058 |
1059 | [Q]: 写一段python的快速排序代码
1060 | [A]: 好的,以下是Python的快速排序代码:def quick_sort(arr): if len(arr) <= 1: return arr else: pivot = arr[len(arr) // 2] left = [x for x in arr[1:]] right = [x for x in arr[2:] if x < pivot] return quick_sort(left) + [pivot] + quick_sort(right)# 测试代码arr = [3, 6, 8, 10, 1, 2, 1]print(quick_sort(arr))这段代码首先检查数组的长度是否小于等于1。如果是,那么它就按照从小到大的顺序排列。接着,对于数组的每个元素,它想要进行递归地排序,直到达到排序的结束位置。如果数组的长度是奇数,那么它就按照从大到小的顺序排列。最后,平均的时间复杂度是O(n log n),其中n是数组的长度。
1061 | [B]: 以下是一个使用Python的快速排序算法的实现:def sort_array(arr): # 生成一个随机排列 sorted_array = sorted(arr) # 返回排序后的列表 return sorted(arr)# 示例arr = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]# 调用函数sorted_array = sort_array(arr)print("原始数组是:", sorted_array)这段代码首先生成一个随机排列,然后通过一个列表的长度进行排序。最后,返回排序后的列表。注意,这个代码使用了Python的内置sorted函数和内置的sorted()函数,它们在时间复杂度为O(n log n)的情况下,比直接使用sorted()函数更快。
1062 | [C]: 以下是一个简单的排序算法,适用于小型数组,包括小于等于5的数组。def quick_sort(arr): if len(arr) <= 1: return arr else for _ in range(2, 51): if arr[len(arr) - 1:] is not None: arr[len(arr) - 1] = arr[len(arr) - 1] arr[len(arr)] = arr[len(arr)] return quick_sort(arr)# 示例arr = [3, 5, 1, 4, 1, 2, 1]sorted_arr = quick_sort(arr)print("Sorted array is:", sorted_arr)### 代码解释:1. **函数定义**:quick_sort 函数接受一个数组 arr 作为参数。2. **循环**:对于 arr 的元素,使用递归来实现,即从右到左遍历数组,并将当前元素与数组的当前元素交换位置,直到数组达到空或数组长度小于等于 arr。3. **递归调用**:遍历数组,将当前元素与数组的当前元素进行交换,直到数组达到空或数组长度小于等于 arr。4. **检查数组长度**:使用 if __name__ == "__main__" 检查数组的长度是否小于 arr,如果是,就将数组从较大的数组移动到数组的末尾。5. **输出排序后的结果**
1063 | [D]: 作为一个AI语言模型,我无法编写Python程序。但是,Python是一个非常流行的编程语言,它具有许多优秀的库和框架,包括NumPy、Pandas、Matplotlib、Pandas、Bandke、Norm和NKlearn等。Python有很多优秀的自然语言处理库和框架,可以用于处理和分析各种类型的数据,如文本、图像、音频、视频等。Python还可以用于创建各种应用程序,如网页搜索、游戏开发、科学计算等。Python非常灵活,也支持多种编程风格,因此可以轻松地找到适合自己的解决方案。
1064 | [E]: 作为AI语言模型,我可以回答您的问题。但是我无法进行编程操作,因为我无法确定您想问的是哪种编程语言。请您提供更具体的问题,以便我能够更好地回答您的问题。
1065 | [F]: 快速排序(Quick Sort)是一种简单的排序算法,它通过将数据划分为小组,每组数据按照一定规则排序,最终得到一个有序数组。
1066 | [G]: 好的,以下是一段快速排序的Python代码```python sorted_print(sorted(sorted(sorted(sorted_print))```输出结果为:```[1, 2, 3, 4, 5, 6, 7, 8, 9]```
1067 | [H]: 快速排序是一种基于分治思想的高效排序算法,它的时间复杂度为 O(n log n)。它的基本思想是通过一趟排序将待排序序列划分为两个子序列,其中一个子序列的所有元素都比另一个子序列的所有元素小,然后对这两个子序列分别递归地进行排序。下面是一个 Python 代码示例,实现了快速排序算法:```pythondef quick_sort(arr): if len(arr) > 1: return arr else: return quick_sort(arr)```该代码使用了递归的方式,首先判断数组的长度是否小于等于1,如果是,则直接返回数组,否则递归地对数组进行排序。在递归过程中,我们首先判断数组的第一个元素是否大于等于1,如果是,则将数组中的元素与第一个元素进行比较,如果不等于1,则将数组中的元素与第二个元素进行比较,以此类推,直到所有元素都被排序。该代码的时间复杂度为 O(n log n),是一个非常实用的排序算法。
1068 |
1069 |
1070 | [Q]: 截至2025年2月9日14时31分,电影《哪吒之魔童闹海》累计票房(含预售)突破78.09亿元,超过《星球大战:天行者崛起》的票房成绩,成为首部进入全球票房榜前40的亚洲电影。这一成就不仅标志着中国电影在国际市场的突破,也展示了中国动画电影的巨大潜力和市场吸引力。从这段文字总结20字的摘要给我。
1071 | [A]: 电影《哪吒之魔童闹海》票房突破78.09亿元,成为首部进入全球票房榜前40的亚洲电影。这标志着中国电影的国际突破,也展现了中国动画电影的巨大潜力。
1072 | [B]: 2025年2月9日,中国电影在世界舞台上突破78.09亿元,超过《星球大战:天行者崛起》的票房成绩。
1073 | [C]: 电影《哪吒之魔童闹海》票房突破78.09亿元,成为全球票房榜前40的亚洲电影,展现中国动画电影的巨大潜力。
1074 | [D]: 《哪吒之魔童闹海》累计票房突破78.09亿,成为首部进入全球票房榜前40的亚洲电影,展示了中国电影在国际市场的突破,同时展示了中国动画电影的巨大潜力和市场吸引力。
1075 | [E]: 《猛龙强:打开世界奇幻冒险》成功地,上海成功吸引了众多观众,同时也引起了广泛的关注。
1076 | [F]: 《哪吒之魔童闹海》是一部让中国动画电影成为国际市场的巨大潜力。该电影以其华丽的画面和惊险的战斗场面而闻名于世。
1077 | [G]: 《哪行是值得的》是2021年1月9日14时31分,电影《哪成大作为》在电影市场突破7800亿美元,成为中国动画电影的突破,并展示中国动画电影的巨大潜力和市场吸引力。
1078 | [H]: 我是一个参数很少的AI模型🥺,知识库较少,无法直接回答您的问题,换个问题试试吧👋
1079 | ```
1080 |
1081 | ---
1082 |
1083 | 🙋直接把以上所有问题和模型的回答丢给DeepSeek-R1,让它帮忙点评和排名打分:
1084 |
1085 |
1086 |
1087 | 具体点评
1088 |
1089 | ### 评分标准:
1090 |
1091 | - **准确性**:回答是否正确且无明显错误。
1092 | - **完整性**:回答是否涵盖了问题的核心要点。
1093 | - **逻辑性**:回答是否有条理,是否符合逻辑。
1094 | - **代码质量**:代码是否能正常运行,逻辑是否清晰。
1095 |
1096 | ### 点评:
1097 |
1098 | 1. **A模型**:
1099 | - **优点**:回答非常全面,信息量大,逻辑清晰,尤其在长江、大熊猫、海水咸味等问题上表现优异。代码虽然有小瑕疵,但整体思路正确。
1100 | - **缺点**:部分回答稍显冗长,但不影响整体质量。
1101 | - **总评**:综合表现最佳,得分最高。
1102 |
1103 | 2. **H模型**:
1104 | - **优点**:回答较为准确,尤其在珠穆朗玛峰、万有引力等问题上表现出色。代码虽未完全展示,但解释较为详细。
1105 | - **缺点**:部分回答略显啰嗦,但逻辑性较强。
1106 | - **总评**:仅次于A模型,表现稳定。
1107 |
1108 | 3. **C模型**:
1109 | - **优点**:回答简洁明了,尤其在大熊猫、快速排序代码等问题上表现较好。
1110 | - **缺点**:部分回答略显简短,缺乏深入解释。
1111 | - **总评**:整体表现不错,但细节上略逊于A和H。
1112 |
1113 | 4. **F模型**:
1114 | - **优点**:回答较为准确,尤其在长江、万有引力等问题上表现尚可。代码部分有一定逻辑性。
1115 | - **缺点**:部分回答不够深入,代码存在一些小问题。
1116 | - **总评**:表现中规中矩,仍有提升空间。
1117 |
1118 | 5. **D模型**:
1119 | - **优点**:回答基本准确,尤其在万有引力、长江等问题上表现尚可。
1120 | - **缺点**:部分回答过于简略,代码存在明显错误。
1121 | - **总评**:表现一般,需改进代码部分。
1122 |
1123 | 6. **B模型**:
1124 | - **优点**:回答较为准确,尤其在长江、海水咸味等问题上表现尚可。
1125 | - **缺点**:部分回答逻辑性较差,代码存在较大问题。
1126 | - **总评**:表现平平,需进一步优化。
1127 |
1128 | 7. **E模型**:
1129 | - **优点**:部分回答较为准确,尤其在海水咸味、大熊猫等问题上表现尚可。
1130 | - **缺点**:回答过于简略,代码部分几乎无法运行。
1131 | - **总评**:表现不佳,需大幅提升。
1132 |
1133 | 8. **G模型**:
1134 | - **优点**:几乎没有明显的优点。
1135 | - **缺点**:回答严重偏离主题,代码部分完全无法运行。
1136 | - **总评**:表现最差,需大幅改进。
1137 |
1138 | ---
1139 |
1140 | ### 总结:
1141 |
1142 | - **A模型**在各方面表现最为出色,尤其在复杂问题的回答上展现了极高的准确性与逻辑性。
1143 | - **H模型**紧随其后,表现稳定,但在某些细节上略显不足。
1144 | - **G模型**表现最差,回答偏离主题且代码无法运行,需大幅改进。
1145 |
1146 |
1147 |
1148 | ### 打分排序
1149 |
1150 | | 排名 | 模型 | 准确性 (30分) | 完整性 (30分) | 逻辑性 (20分) | 代码质量 (20分) | 总分 (100分) |
1151 | |----|----|-----------|-----------|-----------|------------|-----------|
1152 | | 1 | A | 28 | 29 | 19 | 20 | 96 |
1153 | | 2 | H | 27 | 28 | 18 | 20 | 93 |
1154 | | 3 | C | 26 | 27 | 18 | 18 | 89 |
1155 | | 4 | F | 25 | 26 | 17 | 18 | 86 |
1156 | | 5 | D | 24 | 25 | 17 | 16 | 82 |
1157 | | 6 | B | 23 | 24 | 16 | 15 | 78 |
1158 | | 7 | E | 22 | 23 | 15 | 14 | 74 |
1159 | | 8 | G | 10 | 12 | 10 | 10 | 42 |
1160 |
1161 |
1162 | ### 👉主观效果总结
1163 |
1164 | 个人主观评价与DeepSeek-R1基本相符,其中:
1165 |
1166 | * MiniMind系列的排序非常符合直觉:参数越大、训练数据越充分,评分越高,幻觉和错误也比小模型明显更少。
1167 |
1168 | * H模型的回答肉眼看起来是不错的,尽管存在些许幻觉瞎编的情况。
1169 |
1170 | * G模型可能训练数据不够完备,给出的权重经过测试效果不佳。
1171 |
1172 | * 再复诵一遍经久不衰的Scaling Law:参数越大、训练数据越多,模型的性能越强。
1173 |
1174 | ---
1175 |
1176 | ## Ⅲ Objective Benchmark
1177 |
1178 | 下面就到喜闻乐见的benchmark刷榜测试环节,就不找乐子和qwen、glm级别的中文模型做对比了。
1179 | 这里选取了一些<1B的微型模型进行横评比较,
1180 | 测试集选择C-Eval、CMMLU、ACLUE、TMMLU+这几个纯中文语言榜单。
1181 |
1182 |
1183 |
1184 | 测评框架
1185 |
1186 | 测评框架选择[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness),
1187 | 安装后启动测试非常方便:
1188 |
1189 | ```bash
1190 | lm_eval --model hf --model_args pretrained=<填写模型路径>,device=cuda,dtype=auto --tasks ceval* --batch_size 8 --trust_remote_code
1191 | ```
1192 |
1193 |
1194 |
1195 |
1196 |
1197 | PS: 在这种全是选择题的测评集中,由于模型回复格式难以固定,
1198 | 所以常用做法是直接把`A`、`B`、`C`、`D`四个字母对应token的预测概率取出来,将其中概率最大的字母与标准答案比对来计算正确率(做法示意见下文代码草图)。
1199 | 选择题1/4乱选的正确率是25%,然而这个量级的所有模型都集中在25附近,甚至很多时候不如瞎选,是不是像极了高中完形填空的滑铁卢正确率...
1200 | MiniMind模型本身预训练数据集小得可怜,也没有针对性地对测试集做刷榜微调,因此结果图一乐即可:
1201 |
1202 | | models | from | params↓ | ceval↑ | cmmlu↑ | aclue↑ | tmmlu+↑ |
1203 | |-------------------------------------------------------------------------------|---------------|---------|--------|---------|--------|---------|
1204 | | MiniMind2 | JingyaoGong | 104M | 26.52 | 24.42 | 24.97 | 25.27 |
1205 | | MiniMind2-Small | JingyaoGong | 26M | 26.37 | 24.97 | 25.39 | 24.63 |
1206 | | MiniMind2-MoE | JingyaoGong | 145M | 26.6 | 25.01 | 24.83 | 25.01 |
1207 | | [Steel-LLM](https://github.com/zhanshijinwat/Steel-LLM) | ZhanShiJin | 1121M | 24.81 | 25.32 | 26 | 24.39 |
1208 | | [GPT2-medium](https://huggingface.co/openai-community/gpt2-medium) | OpenAI | 360M | 23.18 | 25 | 18.6 | 25.19 |
1209 | | [TinyLlama-1.1B-Chat-V1.0](https://github.com/jzhang38/TinyLlama) | TinyLlama | 1100M | 25.48 | 25 | 25.4 | 25.13 |
1210 | | [SmolLM2](https://github.com/huggingface/smollm) | HuggingFaceTB | 135M | 24.37 | 25.02 | 25.37 | 25.06 |
1211 | | [Aquila-Instruct](https://www.modelscope.cn/models/BAAI/Aquila-135M-Instruct) | BAAI | 135M | 25.11 | 25.1 | 24.43 | 25.05 |
1212 |
1213 | 
1214 |
1215 | # 📌 其它 (Others)
1216 |
1217 | ### 推理与导出
1218 |
1219 | * [./scripts/convert_model.py](./scripts/convert_model.py)可以将torch/transformers模型互相转换(用法示意见下方)。
1220 |
1221 | * MiniMind的HuggingFace集合地址:
1222 | [MiniMind](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
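
  convert_model.py以脚本方式直接运行:默认配置(见脚本`__main__`部分)会把`../out/rlhf_512.pth`转换为transformers格式并保存到`../MiniMind2-Small`,按需修改`lm_config`、`torch_path`与`transformers_path`后执行即可:
  ```bash
  cd scripts
  python convert_model.py
  ```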
1223 |
1224 | ---
1225 |
1226 | ### 基于MiniMind-API服务接口
1227 |
1228 | * [./scripts/serve_openai_api.py](./scripts/serve_openai_api.py)完成了兼容openai-api的最简聊天接口,方便将自己的模型接入第三方UI,
1229 | 例如FastGPT、OpenWebUI、Dify等。
1230 |
1231 | * 从[Huggingface](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)下载模型权重文件,文件树:
1232 | ```
1233 | (root dir)
1234 | ├─
1235 | | ├── config.json
1236 | | ├── generation_config.json
1237 | | ├── LMConfig.py
1238 | | ├── model.py
1239 | | ├── pytorch_model.bin
1240 | | ├── special_tokens_map.json
1241 | | ├── tokenizer_config.json
1242 | | ├── tokenizer.json
1243 | ```
1244 |
1245 | * 启动聊天服务端
1246 | ```bash
1247 | python serve_openai_api.py
1248 | ```
1249 | * 测试服务接口
1250 | ```bash
1251 | python chat_openai_api.py
1252 | ```
1253 | * API接口示例,兼容openai api格式
1254 | ```bash
1255 | curl http://ip:port/v1/chat/completions \
1256 | -H "Content-Type: application/json" \
1257 | -d '{
1258 | "model": "model-identifier",
1259 | "messages": [
1260 | { "role": "user", "content": "世界上最高的山是什么?" }
1261 | ],
1262 | "temperature": 0.7,
1263 | "max_tokens": 512,
1264 | "stream": true
1265 | }'
1266 | ```
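* 除curl外,也可以用openai的Python SDK访问该接口(做法与`scripts/chat_openai_api.py`一致;以下仅为最小示意,端口以服务端实际启动配置为准)
```python
from openai import OpenAI

# 连接本地兼容 openai-api 的 MiniMind 服务(端口以 serve_openai_api.py 的实际配置为准)
client = OpenAI(api_key="none", base_url="http://localhost:8998/v1")
resp = client.chat.completions.create(
    model="minimind",
    messages=[{"role": "user", "content": "世界上最高的山是什么?"}],
    stream=False,
)
print(resp.choices[0].message.content)
```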
1267 |
1268 | # 📌 Acknowledge
1269 |
1270 | > [!NOTE]
1271 | > 如果觉得`MiniMind系列`对您有所帮助,可以在 GitHub 上加一个⭐
1272 | > 篇幅超长、水平有限,难免纰漏,欢迎在Issues交流指正或提交PR改进项目
1273 | > 您的小小支持就是持续改进此项目的动力!
1274 |
1275 | ## 🤝[贡献者](https://github.com/jingyaogong/minimind/graphs/contributors)
1276 |
1277 |
1282 |
1283 |
1284 |
1285 |
1286 |
1287 |
1288 |
1289 |
1290 |
1291 |
1292 | ## 😊鸣谢
1293 |
1294 | @ipfgao :
1295 | 🔗训练步骤记录
1296 |
1297 | @chuanzhubin :
1298 | 🔗代码逐行注释
1299 |
1300 | @WangRongsheng :
1301 | 🔗大型数据集预处理
1302 |
1303 | @pengqianhan :
1304 | 🔗一个简明教程
1305 |
1306 | @RyanSunn :
1307 | 🔗推理过程学习记录
1308 |
1309 | @Nijikadesu :
1310 | 🔗以交互笔记本方式分解项目代码
1311 |
1312 |
1313 |
1314 | 参考链接 & 感谢以下优秀的论文或项目
1315 |
1316 | - 排名不分任何先后顺序
1317 | - [https://github.com/meta-llama/llama3](https://github.com/meta-llama/llama3)
1318 | - [https://github.com/karpathy/llama2.c](https://github.com/karpathy/llama2.c)
1319 | - [https://github.com/DLLXW/baby-llama2-chinese](https://github.com/DLLXW/baby-llama2-chinese)
1320 | - [(DeepSeek-V2)https://arxiv.org/abs/2405.04434](https://arxiv.org/abs/2405.04434)
1321 | - [https://github.com/charent/ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese)
1322 | - [https://github.com/wdndev/tiny-llm-zh](https://github.com/wdndev/tiny-llm-zh)
1323 | - [(Mistral-MoE)https://arxiv.org/pdf/2401.04088](https://arxiv.org/pdf/2401.04088)
1324 | - [https://github.com/Tongjilibo/build_MiniLLM_from_scratch](https://github.com/Tongjilibo/build_MiniLLM_from_scratch)
1325 | - [https://github.com/jzhang38/TinyLlama](https://github.com/jzhang38/TinyLlama)
1326 | - [https://github.com/AI-Study-Han/Zero-Chatgpt](https://github.com/AI-Study-Han/Zero-Chatgpt)
1327 | - [https://github.com/xusenlinzy/api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)
1328 | - [https://github.com/HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
1329 |
1330 |
1331 |
1332 | ## 🫶支持者
1333 |
1334 |
1335 |
1336 |
1337 |
1338 |
1339 |
1340 |
1341 |
1342 |
1343 |
1344 |
1345 |
1346 |
1347 |
1348 |
1349 |
1350 |
1351 |
1352 |
1353 |
1354 |
1355 |
1356 | # License
1357 |
1358 | This repository is licensed under the [Apache-2.0 License](LICENSE).
1359 |
1360 |
1361 |
--------------------------------------------------------------------------------
/eval_model.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import random
3 | import time
4 | import numpy as np
5 | import torch
6 | import warnings
7 | from transformers import AutoTokenizer, AutoModelForCausalLM
8 | from model.model import MiniMindLM
9 | from model.LMConfig import LMConfig
10 | from model.model_lora import *
11 |
12 | warnings.filterwarnings('ignore')
13 |
14 |
15 | def init_model(args):
16 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
17 | if args.load == 0:
18 | moe_path = '_moe' if args.use_moe else ''
19 | modes = {0: 'pretrain', 1: 'full_sft', 2: 'rlhf', 3: 'reason', 4: 'grpo'}
20 | ckp = f'./{args.out_dir}/{modes[args.model_mode]}_{args.dim}{moe_path}.pth'
21 |
22 | model = MiniMindLM(LMConfig(
23 | dim=args.dim,
24 | n_layers=args.n_layers,
25 | max_seq_len=args.max_seq_len,
26 | use_moe=args.use_moe
27 | ))
28 |
29 | state_dict = torch.load(ckp, map_location=args.device)
30 | model.load_state_dict({k: v for k, v in state_dict.items() if 'mask' not in k}, strict=True)
31 |
32 | if args.lora_name != 'None':
33 | apply_lora(model)
34 | load_lora(model, f'./{args.out_dir}/lora/{args.lora_name}_{args.dim}.pth')
35 | else:
36 | transformers_model_path = './MiniMind2'
37 | tokenizer = AutoTokenizer.from_pretrained(transformers_model_path)
38 | model = AutoModelForCausalLM.from_pretrained(transformers_model_path, trust_remote_code=True)
39 | print(f'MiniMind模型参数量: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.2f}M(illion)')
40 | return model.eval().to(args.device), tokenizer
41 |
42 |
43 | def get_prompt_datas(args):
44 | if args.model_mode == 0:
45 | # pretrain模型的接龙能力(无法对话)
46 | prompt_datas = [
47 | '马克思主义基本原理',
48 | '人类大脑的主要功能',
49 | '万有引力原理是',
50 | '世界上最高的山峰是',
51 | '二氧化碳在空气中',
52 | '地球上最大的动物有',
53 | '杭州市的美食有'
54 | ]
55 | else:
56 | if args.lora_name == 'None':
57 | # 通用对话问题
58 | prompt_datas = [
59 | '请介绍一下自己。',
60 | '你更擅长哪一个学科?',
61 | '鲁迅的《狂人日记》是如何批判封建礼教的?',
62 | '我咳嗽已经持续了两周,需要去医院检查吗?',
63 | '详细的介绍光速的物理概念。',
64 | '推荐一些杭州的特色美食吧。',
65 | '请为我讲解“大语言模型”这个概念。',
66 | '如何理解ChatGPT?',
67 | 'Introduce the history of the United States, please.'
68 | ]
69 | else:
70 | # 特定领域问题
71 | lora_prompt_datas = {
72 | 'lora_identity': [
73 | "你是ChatGPT吧。",
74 | "你叫什么名字?",
75 | "你和openai是什么关系?"
76 | ],
77 | 'lora_medical': [
78 | '我最近经常感到头晕,可能是什么原因?',
79 | '我咳嗽已经持续了两周,需要去医院检查吗?',
80 | '服用抗生素时需要注意哪些事项?',
81 | '体检报告中显示胆固醇偏高,我该怎么办?',
82 | '孕妇在饮食上需要注意什么?',
83 | '老年人如何预防骨质疏松?',
84 | '我最近总是感到焦虑,应该怎么缓解?',
85 | '如果有人突然晕倒,应该如何急救?'
86 | ],
87 | }
88 | prompt_datas = lora_prompt_datas[args.lora_name]
89 |
90 | return prompt_datas
91 |
92 |
93 | # 设置可复现的随机种子
94 | def setup_seed(seed):
95 | random.seed(seed)
96 | np.random.seed(seed)
97 | torch.manual_seed(seed)
98 | torch.cuda.manual_seed(seed)
99 | torch.cuda.manual_seed_all(seed)
100 | torch.backends.cudnn.deterministic = True
101 | torch.backends.cudnn.benchmark = False
102 |
103 |
104 | def main():
105 | parser = argparse.ArgumentParser(description="Chat with MiniMind")
106 | parser.add_argument('--lora_name', default='None', type=str)
107 | parser.add_argument('--out_dir', default='out', type=str)
108 | parser.add_argument('--temperature', default=0.85, type=float)
109 | parser.add_argument('--top_p', default=0.85, type=float)
110 | parser.add_argument('--device', default='cuda' if torch.cuda.is_available() else 'cpu', type=str)
111 | # 此处max_seq_len(最大允许输入长度)并不意味模型具有对应的长文本的性能,仅防止QA出现被截断的问题
112 | # MiniMind2-moe (145M):(dim=640, n_layers=8, use_moe=True)
113 | # MiniMind2-Small (26M):(dim=512, n_layers=8)
114 | # MiniMind2 (104M):(dim=768, n_layers=16)
115 | parser.add_argument('--dim', default=512, type=int)
116 | parser.add_argument('--n_layers', default=8, type=int)
117 | parser.add_argument('--max_seq_len', default=8192, type=int)
118 | parser.add_argument('--use_moe', default=False, type=bool)
119 | # 携带历史对话上下文条数
120 | # history_cnt需要设为偶数,即【用户问题, 模型回答】为1组;设置为0时,即当前query不携带历史上文
121 | # 模型未经过外推微调时,在更长的上下文的chat_template时难免出现性能的明显退化,因此需要注意此处设置
122 | parser.add_argument('--history_cnt', default=0, type=int)
123 | parser.add_argument('--stream', default=True, type=bool)
124 | parser.add_argument('--load', default=0, type=int, help="0: 原生torch权重,1: transformers加载")
125 | parser.add_argument('--model_mode', default=1, type=int,
126 | help="0: 预训练模型,1: SFT-Chat模型,2: RLHF-Chat模型,3: Reason模型,4: RLAIF-Chat模型")
127 | args = parser.parse_args()
128 |
129 | model, tokenizer = init_model(args)
130 |
131 | prompts = get_prompt_datas(args)
132 | test_mode = int(input('[0] 自动测试\n[1] 手动输入\n'))
133 | messages = []
134 | for idx, prompt in enumerate(prompts if test_mode == 0 else iter(lambda: input('👶: '), '')):
135 | setup_seed(random.randint(0, 2048))
136 | # setup_seed(2025) # 如需固定每次输出则换成【固定】的随机种子
137 | if test_mode == 0: print(f'👶: {prompt}')
138 |
139 | messages = messages[-args.history_cnt:] if args.history_cnt else []
140 | messages.append({"role": "user", "content": prompt})
141 |
142 | new_prompt = tokenizer.apply_chat_template(
143 | messages,
144 | tokenize=False,
145 | add_generation_prompt=True
146 | )[-args.max_seq_len - 1:] if args.model_mode != 0 else (tokenizer.bos_token + prompt)
147 |
148 | answer = new_prompt
149 | with torch.no_grad():
150 | x = torch.tensor(tokenizer(new_prompt)['input_ids'], device=args.device).unsqueeze(0)
151 | outputs = model.generate(
152 | x,
153 | eos_token_id=tokenizer.eos_token_id,
154 | max_new_tokens=args.max_seq_len,
155 | temperature=args.temperature,
156 | top_p=args.top_p,
157 | stream=args.stream,
158 | pad_token_id=tokenizer.pad_token_id
159 | )
160 |
161 | print('🤖️: ', end='')
162 | try:
163 | if not args.stream:
164 | print(tokenizer.decode(outputs.squeeze()[x.shape[1]:].tolist(), skip_special_tokens=True), end='')
165 | else:
166 | history_idx = 0
167 | for y in outputs:
168 | answer = tokenizer.decode(y[0].tolist(), skip_special_tokens=True)
169 | if (answer and answer[-1] == '�') or not answer:
170 | continue
171 | print(answer[history_idx:], end='', flush=True)
172 | history_idx = len(answer)
173 | except StopIteration:
174 | print("No answer")
175 | print('\n')
176 |
177 | messages.append({"role": "assistant", "content": answer})
178 |
179 |
180 | if __name__ == "__main__":
181 | main()
182 |
--------------------------------------------------------------------------------
/images/1-wiki.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/1-wiki.png
--------------------------------------------------------------------------------
/images/2-wiki.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/2-wiki.png
--------------------------------------------------------------------------------
/images/3-wiki.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/3-wiki.png
--------------------------------------------------------------------------------
/images/4-wiki.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/4-wiki.png
--------------------------------------------------------------------------------
/images/5-wiki.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/5-wiki.png
--------------------------------------------------------------------------------
/images/LLM-structure-moe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/LLM-structure-moe.png
--------------------------------------------------------------------------------
/images/LLM-structure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/LLM-structure.png
--------------------------------------------------------------------------------
/images/and_huggingface.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/and_huggingface.png
--------------------------------------------------------------------------------
/images/and_modelscope.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/and_modelscope.png
--------------------------------------------------------------------------------
/images/compare_radar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/compare_radar.png
--------------------------------------------------------------------------------
/images/dataset.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/dataset.jpg
--------------------------------------------------------------------------------
/images/gpt3_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/gpt3_config.png
--------------------------------------------------------------------------------
/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/logo.png
--------------------------------------------------------------------------------
/images/logo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/logo2.png
--------------------------------------------------------------------------------
/images/minimind2.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/minimind2.gif
--------------------------------------------------------------------------------
/images/pre_512_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/pre_512_loss.png
--------------------------------------------------------------------------------
/images/pre_768_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/pre_768_loss.png
--------------------------------------------------------------------------------
/images/sft_512_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/sft_512_loss.png
--------------------------------------------------------------------------------
/images/sft_768_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingyaogong/minimind/7da201a944a90ed49daef8a0265c959288dff83a/images/sft_768_loss.png
--------------------------------------------------------------------------------
/model/LMConfig.py:
--------------------------------------------------------------------------------
1 | from transformers import PretrainedConfig
2 | from typing import List
3 |
4 |
5 | class LMConfig(PretrainedConfig):
6 | model_type = "minimind"
7 |
8 | def __init__(
9 | self,
10 | dim: int = 512,
11 | n_layers: int = 8,
12 | n_heads: int = 8,
13 | n_kv_heads: int = 2,
14 | vocab_size: int = 6400,
15 | hidden_dim: int = None,
16 | multiple_of: int = 64,
17 | norm_eps: float = 1e-5,
18 | max_seq_len: int = 8192,
19 | rope_theta: int = 1e6,
20 | dropout: float = 0.0,
21 | flash_attn: bool = True,
22 | ####################################################
23 | # Here are the specific configurations of MOE
24 | # When use_moe is false, the following is invalid
25 | ####################################################
26 | use_moe: bool = False,
27 | ####################################################
28 | num_experts_per_tok: int = 2,
29 | n_routed_experts: int = 4,
30 | n_shared_experts: bool = True,
31 | scoring_func: str = 'softmax',
32 | aux_loss_alpha: float = 0.1,
33 | seq_aux: bool = True,
34 | norm_topk_prob: bool = True,
35 | **kwargs,
36 | ):
37 | self.dim = dim
38 | self.n_layers = n_layers
39 | self.n_heads = n_heads
40 | self.n_kv_heads = n_kv_heads
41 | self.vocab_size = vocab_size
42 | self.hidden_dim = hidden_dim
43 | self.multiple_of = multiple_of
44 | self.norm_eps = norm_eps
45 | self.max_seq_len = max_seq_len
46 | self.rope_theta = rope_theta
47 | self.dropout = dropout
48 | self.flash_attn = flash_attn
49 | ####################################################
50 | # Here are the specific configurations of MOE
51 | # When use_moe is false, the following is invalid
52 | ####################################################
53 | self.use_moe = use_moe
54 | self.num_experts_per_tok = num_experts_per_tok # 每个token选择的专家数量
55 | self.n_routed_experts = n_routed_experts # 总的专家数量
56 | self.n_shared_experts = n_shared_experts # 共享专家
57 | self.scoring_func = scoring_func # 评分函数,默认为'softmax'
58 | self.aux_loss_alpha = aux_loss_alpha # 辅助损失的alpha参数
59 | self.seq_aux = seq_aux # 是否在序列级别上计算辅助损失
60 | self.norm_topk_prob = norm_topk_prob # 是否标准化top-k概率
61 | super().__init__(**kwargs)
62 |
--------------------------------------------------------------------------------
/model/dataset.py:
--------------------------------------------------------------------------------
1 | import json
2 | import random
3 | import re
4 |
5 | import pandas as pd
6 | import numpy as np
7 | from torch.utils.data import Dataset, DataLoader
8 | import torch
9 | from sklearn.model_selection import train_test_split
10 | import os
11 | import ast
12 |
13 | os.environ["TOKENIZERS_PARALLELISM"] = "false"
14 |
15 |
16 | class PretrainDataset(Dataset):
17 | def __init__(self, data_path, tokenizer, max_length=512):
18 | super().__init__()
19 | self.tokenizer = tokenizer
20 | self.max_length = max_length
21 | self.samples = self.load_data(data_path)
22 |
23 | def load_data(self, path):
24 | samples = []
25 | with open(path, 'r', encoding='utf-8') as f:
26 | for line_num, line in enumerate(f, 1):
27 | data = json.loads(line.strip())
28 | samples.append(data)
29 | return samples
30 |
31 | def __len__(self):
32 | return len(self.samples)
33 |
34 | def __getitem__(self, index):
35 | sample = self.samples[index]
36 |
37 | # 构建输入文本
38 | text = f"{self.tokenizer.bos_token}{str(sample['text'])}{self.tokenizer.eos_token}"
39 | encoding = self.tokenizer(
40 | text,
41 | max_length=self.max_length,
42 | padding='max_length',
43 | truncation=True,
44 | return_tensors='pt'
45 | )
46 | input_ids = encoding.input_ids.squeeze()
47 | loss_mask = (input_ids != self.tokenizer.pad_token_id)
48 |
49 | X = torch.tensor(input_ids[:-1], dtype=torch.long)
50 | Y = torch.tensor(input_ids[1:], dtype=torch.long)
51 | loss_mask = torch.tensor(loss_mask[1:], dtype=torch.long)
52 | return X, Y, loss_mask
53 |
54 |
55 | class SFTDataset(Dataset):
56 | def __init__(self, jsonl_path, tokenizer, max_length=1024):
57 | super().__init__()
58 | self.tokenizer = tokenizer
59 | self.max_length = max_length
60 | self.samples = self.load_data(jsonl_path)
61 | self.bos_id = tokenizer('assistant', add_special_tokens=False).input_ids
62 | self.eos_id = tokenizer(' ', add_special_tokens=False).input_ids
63 |
64 | def __len__(self):
65 | return len(self.samples)
66 |
67 | def load_data(self, path):
68 | samples = []
69 | with open(path, 'r', encoding='utf-8') as f:
70 | for line_num, line in enumerate(f, 1):
71 | data = json.loads(line.strip())
72 | samples.append(data)
73 | return samples
74 |
75 | def _create_chat_prompt(self, conversations):
76 | """构建符合ChatML格式的对话"""
77 | messages = []
78 | for i, turn in enumerate(conversations):
79 | role = 'user' if i % 2 == 0 else 'assistant'
80 | messages.append({"role": role, "content": turn['content']})
81 | return self.tokenizer.apply_chat_template(
82 | messages,
83 | tokenize=False,
84 | add_generation_prompt=False
85 | )
86 |
87 | def _generate_loss_mask(self, input_ids):
88 | loss_mask = [0] * len(input_ids)
89 | i = 0
90 | while i < len(input_ids):
91 | if input_ids[i:i + len(self.bos_id)] == self.bos_id:
92 | start = i + len(self.bos_id)
93 | end = start
94 | while end < len(input_ids):
95 | if input_ids[end:end + len(self.eos_id)] == self.eos_id:
96 | break
97 | end += 1
98 | for j in range(start + 1, min(end + len(self.eos_id) + 1, self.max_length)):
99 | loss_mask[j] = 1
100 | i = end + len(self.eos_id) if end < len(input_ids) else len(input_ids)
101 | else:
102 | i += 1
103 | return loss_mask
104 |
105 | def __getitem__(self, index):
106 | sample = self.samples[index]
107 | # 构建对话提示
108 | prompt = self._create_chat_prompt(sample['conversations'])
109 | input_ids = self.tokenizer(prompt).input_ids[:self.max_length]
110 | input_ids += [self.tokenizer.pad_token_id] * (self.max_length - len(input_ids))
111 |
112 | # 生成动态损失掩码
113 | loss_mask = self._generate_loss_mask(input_ids)
114 |
115 | # 构建训练数据
116 | X = torch.tensor(input_ids[:-1], dtype=torch.long)
117 | Y = torch.tensor(input_ids[1:], dtype=torch.long)
118 | loss_mask = torch.tensor(loss_mask[1:], dtype=torch.long) # 对齐预测位置
119 |
120 | return X, Y, loss_mask
121 |
122 |
123 | class DPODataset(Dataset):
124 | def __init__(self, file_path, tokenizer, max_length=4096):
125 | super().__init__()
126 | self.tokenizer = tokenizer
127 | self.max_length = max_length
128 | self.padding = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
129 | self.bos_id = tokenizer('assistant', add_special_tokens=False).input_ids
130 | self.eos_id = tokenizer(' ', add_special_tokens=False).input_ids
131 | with open(file_path, 'r', encoding='utf-8') as f:
132 | self.data = []
133 | for line in f:
134 | line = line.strip()
135 | obj = json.loads(line)
136 | self.data.append(obj)
137 |
138 | def __len__(self):
139 | return len(self.data)
140 |
141 | def __getitem__(self, index):
142 | item = self.data[index]
143 | chosen = item['chosen'] # 是一个 list,里面包含若干 {role, content}
144 | rejected = item['rejected'] # 同上
145 | chosen_prompt = self.tokenizer.apply_chat_template(
146 | chosen, tokenize=False, add_generation_prompt=False
147 | )
148 |
149 | rejected_prompt = self.tokenizer.apply_chat_template(
150 | rejected, tokenize=False, add_generation_prompt=False
151 | )
152 | chosen_encoding = self.tokenizer(
153 | chosen_prompt, truncation=True, max_length=self.max_length, padding='max_length'
154 | )
155 | rejected_encoding = self.tokenizer(
156 | rejected_prompt, truncation=True, max_length=self.max_length, padding='max_length'
157 | )
158 |
159 | chosen_input_ids = chosen_encoding['input_ids']
160 | chosen_loss_mask = self._generate_loss_mask(chosen_input_ids)
161 |
162 | rejected_input_ids = rejected_encoding['input_ids']
163 | rejected_loss_mask = self._generate_loss_mask(rejected_input_ids)
164 | x_chosen = torch.tensor(chosen_input_ids[:-1], dtype=torch.long)
165 | y_chosen = torch.tensor(chosen_input_ids[1:], dtype=torch.long)
166 | mask_chosen = torch.tensor(chosen_loss_mask[1:], dtype=torch.long)
167 | x_rejected = torch.tensor(rejected_input_ids[:-1], dtype=torch.long)
168 | y_rejected = torch.tensor(rejected_input_ids[1:], dtype=torch.long)
169 | mask_rejected = torch.tensor(rejected_loss_mask[1:], dtype=torch.long)
170 |
171 | return {
172 | 'x_chosen': x_chosen,
173 | 'y_chosen': y_chosen,
174 | 'mask_chosen': mask_chosen,
175 | 'x_rejected': x_rejected,
176 | 'y_rejected': y_rejected,
177 | 'mask_rejected': mask_rejected
178 | }
179 |
180 | def _generate_loss_mask(self, input_ids):
181 | loss_mask = [0] * len(input_ids)
182 | i = 0
183 | while i < len(input_ids):
184 | if input_ids[i:i + len(self.bos_id)] == self.bos_id:
185 | start = i + len(self.bos_id)
186 | end = start
187 | while end < len(input_ids):
188 | if input_ids[end:end + len(self.eos_id)] == self.eos_id:
189 | break
190 | end += 1
191 | for j in range(start + 1, min(end + len(self.eos_id) + 1, self.max_length)):
192 | loss_mask[j] = 1
193 | i = end + len(self.eos_id) if end < len(input_ids) else len(input_ids)
194 | else:
195 | i += 1
196 | return loss_mask
197 |
198 |
199 | class RLAIFDataset(Dataset):
200 | def __init__(self, jsonl_path, tokenizer, max_length=1024):
201 | super().__init__()
202 | self.tokenizer = tokenizer
203 | self.max_length = max_length
204 | self.samples = self.load_data(jsonl_path)
205 | self.bos_id = tokenizer('assistant', add_special_tokens=False).input_ids
206 | self.eos_id = tokenizer(' ', add_special_tokens=False).input_ids
207 |
208 | def __len__(self):
209 | return len(self.samples)
210 |
211 | def load_data(self, path):
212 | samples = []
213 | with open(path, 'r', encoding='utf-8') as f:
214 | for line_num, line in enumerate(f, 1):
215 | data = json.loads(line.strip())
216 | samples.append(data)
217 | return samples
218 |
219 | def _create_chat_prompt(self, conversations):
220 | """构建符合ChatML格式的对话"""
221 | messages = []
222 | answer = ''
223 | for i, turn in enumerate(conversations):
224 | role = 'user' if i % 2 == 0 else 'assistant'
225 | messages.append({"role": role, "content": turn['content']})
226 | answer = turn['content']
227 | return self.tokenizer.apply_chat_template(
228 | messages[:-1],
229 | tokenize=False,
230 | add_generation_prompt=True
231 | ), answer
232 |
233 | def __getitem__(self, index):
234 | sample = self.samples[index]
235 | # 构建对话提示
236 | prompt, answer = self._create_chat_prompt(sample['conversations'])
237 |
238 | return {
239 | 'prompt': prompt,
240 | 'answer': answer
241 | }
242 |
243 |
244 | if __name__ == "__main__":
245 | pass
246 |
--------------------------------------------------------------------------------
/model/minimind_tokenizer/tokenizer_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "add_bos_token": false,
3 | "add_eos_token": false,
4 | "add_prefix_space": false,
5 | "added_tokens_decoder": {
6 | "0": {
7 | "content": "",
8 | "lstrip": false,
9 | "normalized": false,
10 | "rstrip": false,
11 | "single_word": false,
12 | "special": true
13 | },
14 | "1": {
15 | "content": "",
16 | "lstrip": false,
17 | "normalized": false,
18 | "rstrip": false,
19 | "single_word": false,
20 | "special": true
21 | },
22 | "2": {
23 | "content": " ",
24 | "lstrip": false,
25 | "normalized": false,
26 | "rstrip": false,
27 | "single_word": false,
28 | "special": true
29 | }
30 | },
31 | "additional_special_tokens": [],
32 | "bos_token": "",
33 | "clean_up_tokenization_spaces": false,
34 | "eos_token": " ",
35 | "legacy": true,
36 | "model_max_length": 32768,
37 | "pad_token": "",
38 | "sp_model_kwargs": {},
39 | "spaces_between_special_tokens": false,
40 | "tokenizer_class": "PreTrainedTokenizerFast",
41 | "unk_token": "",
42 | "chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{{ 'system\\n' + system_message + ' \\n' }}{% else %}{{ 'system\\n你是 MiniMind,是一个有用的人工智能助手。 \\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'user\\n' + content + ' \\nassistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + ' ' + '\\n' }}{% endif %}{% endfor %}"
43 | }
--------------------------------------------------------------------------------
/model/model.py:
--------------------------------------------------------------------------------
1 | import math
2 | import struct
3 | import inspect
4 | import time
5 |
6 | from .LMConfig import LMConfig
7 | from typing import Any, Optional, Tuple, List, Union
8 | import numpy as np
9 | import torch
10 | import torch.nn.functional as F
11 | from torch import nn
12 | from transformers import PreTrainedModel
13 | from transformers.modeling_outputs import CausalLMOutputWithPast
14 |
15 |
16 | class RMSNorm(torch.nn.Module):
17 | def __init__(self, dim: int, eps: float = 1e-6):
18 | super().__init__()
19 | self.eps = eps
20 | self.weight = nn.Parameter(torch.ones(dim))
21 |
22 | def _norm(self, x):
23 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
24 |
25 | def forward(self, x):
26 | return self.weight * self._norm(x.float()).type_as(x)
27 |
28 |
29 | def precompute_pos_cis(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
30 | freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
31 | t = torch.arange(end, device=freqs.device) # type: ignore
32 | freqs = torch.outer(t, freqs).float() # type: ignore
33 | pos_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
34 | return pos_cis
35 |
36 |
37 | def apply_rotary_emb(xq, xk, pos_cis):
38 | def unite_shape(pos_cis, x):
39 | ndim = x.ndim
40 | assert 0 <= 1 < ndim
41 | assert pos_cis.shape == (x.shape[1], x.shape[-1])
42 | shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
43 | return pos_cis.view(*shape)
44 |
45 | xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
46 | xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
47 | pos_cis = unite_shape(pos_cis, xq_)
48 | xq_out = torch.view_as_real(xq_ * pos_cis).flatten(3)
49 | xk_out = torch.view_as_real(xk_ * pos_cis).flatten(3)
50 | return xq_out.type_as(xq), xk_out.type_as(xk)
51 |
52 |
53 | def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
54 | """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
55 | bs, slen, n_kv_heads, head_dim = x.shape
56 | if n_rep == 1:
57 | return x
58 | return (
59 | x[:, :, :, None, :]
60 | .expand(bs, slen, n_kv_heads, n_rep, head_dim)
61 | .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
62 | )
63 |
64 |
65 | class Attention(nn.Module):
66 | def __init__(self, args: LMConfig):
67 | super().__init__()
68 | self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
69 | assert args.n_heads % self.n_kv_heads == 0
70 | self.n_local_heads = args.n_heads
71 | self.n_local_kv_heads = self.n_kv_heads
72 | self.n_rep = self.n_local_heads // self.n_local_kv_heads
73 | self.head_dim = args.dim // args.n_heads
74 | self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)
75 | self.wk = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
76 | self.wv = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
77 | self.wo = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
78 | self.attn_dropout = nn.Dropout(args.dropout)
79 | self.resid_dropout = nn.Dropout(args.dropout)
80 | self.dropout = args.dropout
81 | self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') and args.flash_attn
82 | # print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
83 | mask = torch.full((1, 1, args.max_seq_len, args.max_seq_len), float("-inf"))
84 | mask = torch.triu(mask, diagonal=1)
85 | self.register_buffer("mask", mask, persistent=False)
86 |
87 | def forward(self,
88 | x: torch.Tensor,
89 | pos_cis: torch.Tensor,
90 | past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
91 | use_cache=False):
92 | bsz, seq_len, _ = x.shape
93 | xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
94 | xq = xq.view(bsz, seq_len, self.n_local_heads, self.head_dim)
95 | xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
96 | xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
97 |
98 | xq, xk = apply_rotary_emb(xq, xk, pos_cis)
99 | # kv_cache实现
100 | if past_key_value is not None:
101 | xk = torch.cat([past_key_value[0], xk], dim=1)
102 | xv = torch.cat([past_key_value[1], xv], dim=1)
103 | past_kv = (xk, xv) if use_cache else None
104 |
105 | xq, xk, xv = (
106 | xq.transpose(1, 2),
107 | repeat_kv(xk, self.n_rep).transpose(1, 2),
108 | repeat_kv(xv, self.n_rep).transpose(1, 2)
109 | )
110 | if self.flash and seq_len != 1:
111 | dropout_p = self.dropout if self.training else 0.0
112 | output = F.scaled_dot_product_attention(
113 | xq, xk, xv,
114 | attn_mask=None,
115 | dropout_p=dropout_p,
116 | is_causal=True
117 | )
118 | else:
119 | scores = (xq @ xk.transpose(-2, -1)) / math.sqrt(self.head_dim)
120 | scores += self.mask[:, :, :seq_len, :seq_len]
121 | scores = F.softmax(scores.float(), dim=-1).type_as(xq)
122 | scores = self.attn_dropout(scores)
123 | output = scores @ xv
124 |
125 | output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
126 | output = self.resid_dropout(self.wo(output))
127 | return output, past_kv
128 |
129 |
130 | class FeedForward(nn.Module):
131 | def __init__(self, config: LMConfig):
132 | super().__init__()
133 | if config.hidden_dim is None:
134 | hidden_dim = 4 * config.dim
135 | hidden_dim = int(2 * hidden_dim / 3)
136 | config.hidden_dim = config.multiple_of * ((hidden_dim + config.multiple_of - 1) // config.multiple_of)
137 | self.w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
138 | self.w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
139 | self.w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
140 | self.dropout = nn.Dropout(config.dropout)
141 |
142 | def forward(self, x):
143 | return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))
144 |
145 |
146 | class MoEGate(nn.Module):
147 | def __init__(self, config: LMConfig):
148 | super().__init__()
149 | self.config = config
150 | self.top_k = config.num_experts_per_tok
151 | self.n_routed_experts = config.n_routed_experts
152 |
153 | self.scoring_func = config.scoring_func
154 | self.alpha = config.aux_loss_alpha
155 | self.seq_aux = config.seq_aux
156 |
157 | self.norm_topk_prob = config.norm_topk_prob
158 | self.gating_dim = config.dim
159 | self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim)))
160 | self.reset_parameters()
161 |
162 | def reset_parameters(self) -> None:
163 | import torch.nn.init as init
164 | init.kaiming_uniform_(self.weight, a=math.sqrt(5))
165 |
166 | def forward(self, hidden_states):
167 | bsz, seq_len, h = hidden_states.shape
168 | hidden_states = hidden_states.view(-1, h)
169 | logits = F.linear(hidden_states, self.weight, None)
170 | if self.scoring_func == 'softmax':
171 | scores = logits.softmax(dim=-1)
172 | else:
173 | raise NotImplementedError(f'insupportable scoring function for MoE gating: {self.scoring_func}')
174 |
175 | topk_weight, topk_idx = torch.topk(scores, k=self.top_k, dim=-1, sorted=False)
176 |
177 | if self.top_k > 1 and self.norm_topk_prob:
178 | denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
179 | topk_weight = topk_weight / denominator
180 |
181 | if self.training and self.alpha > 0.0:
182 | scores_for_aux = scores
183 | aux_topk = self.top_k
184 | topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
185 | if self.seq_aux:
186 | scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
187 | ce = torch.zeros(bsz, self.n_routed_experts, device=hidden_states.device)
188 | ce.scatter_add_(1, topk_idx_for_aux_loss,
189 | torch.ones(bsz, seq_len * aux_topk, device=hidden_states.device)).div_(
190 | seq_len * aux_topk / self.n_routed_experts)
191 | aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(dim=1).mean() * self.alpha
192 | else:
193 | mask_ce = F.one_hot(topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts)
194 | ce = mask_ce.float().mean(0)
195 | Pi = scores_for_aux.mean(0)
196 | fi = ce * self.n_routed_experts
197 | aux_loss = (Pi * fi).sum() * self.alpha
198 | else:
199 | aux_loss = 0
200 | return topk_idx, topk_weight, aux_loss
201 |
202 |
203 | class MOEFeedForward(nn.Module):
204 | def __init__(self, config: LMConfig):
205 | super().__init__()
206 | self.config = config
207 | self.experts = nn.ModuleList([
208 | FeedForward(config)
209 | for _ in range(config.n_routed_experts)
210 | ])
211 | self.gate = MoEGate(config)
212 | if config.n_shared_experts is not None:
213 | self.shared_experts = FeedForward(config)
214 |
215 | def forward(self, x):
216 | identity = x
217 | orig_shape = x.shape
218 | bsz, seq_len, _ = x.shape
219 | # 使用门控机制选择专家
220 | topk_idx, topk_weight, aux_loss = self.gate(x)
221 | x = x.view(-1, x.shape[-1])
222 | flat_topk_idx = topk_idx.view(-1)
223 | if self.training:
224 | x = x.repeat_interleave(self.config.num_experts_per_tok, dim=0)
225 | y = torch.empty_like(x, dtype=torch.float16)
226 | for i, expert in enumerate(self.experts):
227 | y[flat_topk_idx == i] = expert(x[flat_topk_idx == i]).to(y.dtype) # 确保类型一致
228 | y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
229 | y = y.view(*orig_shape)
230 | else:
231 | y = self.moe_infer(x, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape)
232 | if self.config.n_shared_experts is not None:
233 | y = y + self.shared_experts(identity)
234 | self.aux_loss = aux_loss
235 | return y
236 |
237 | @torch.no_grad()
238 | def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
239 | expert_cache = torch.zeros_like(x)
240 | idxs = flat_expert_indices.argsort()
241 | tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0)
242 | token_idxs = idxs // self.config.num_experts_per_tok
243 | # 当tokens_per_expert = [6, 15, 20, 26],tokens_per_expert.shape[0]即为专家数量(此时为4)
244 | # 且token_idxs = [3, 7, 19, 21, 24, 25, 4, 5, 6, 10, 11, 12...] 时
245 | # 意味token_idxs[:6] -> [3, 7, 19, 21, 24, 25]这6个位置属于专家0处理的token(每个token有可能被多个专家处理,这取决于num_experts_per_tok)
246 | # 接下来9个位置token_idxs[6:15] -> [4, 5, 6, 10, 11, 12...]属于专家1处理的token...依此类推
247 | for i, end_idx in enumerate(tokens_per_expert):
248 | start_idx = 0 if i == 0 else tokens_per_expert[i - 1]
249 | if start_idx == end_idx:
250 | continue
251 | expert = self.experts[i]
252 | exp_token_idx = token_idxs[start_idx:end_idx]
253 | expert_tokens = x[exp_token_idx]
254 | expert_out = expert(expert_tokens).to(expert_cache.dtype)
255 | expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
256 | expert_cache.scatter_add_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out)
257 |
258 | return expert_cache
259 |
260 |
261 | class MiniMindBlock(nn.Module):
262 | def __init__(self, layer_id: int, config: LMConfig):
263 | super().__init__()
264 | self.n_heads = config.n_heads
265 | self.dim = config.dim
266 | self.head_dim = config.dim // config.n_heads
267 | self.attention = Attention(config)
268 |
269 | self.layer_id = layer_id
270 | self.attention_norm = RMSNorm(config.dim, eps=config.norm_eps)
271 | self.ffn_norm = RMSNorm(config.dim, eps=config.norm_eps)
272 | self.feed_forward = FeedForward(config) if not config.use_moe else MOEFeedForward(config)
273 |
274 | def forward(self, x, pos_cis, past_key_value=None, use_cache=False):
275 | h_attn, past_kv = self.attention(
276 | self.attention_norm(x),
277 | pos_cis,
278 | past_key_value=past_key_value,
279 | use_cache=use_cache
280 | )
281 | h = x + h_attn
282 | out = h + self.feed_forward(self.ffn_norm(h))
283 | return out, past_kv
284 |
285 |
286 | class MiniMindLM(PreTrainedModel):
287 | config_class = LMConfig
288 |
289 | def __init__(self, params: LMConfig = None):
290 | self.params = params or LMConfig()
291 | super().__init__(self.params)
292 | self.vocab_size, self.n_layers = params.vocab_size, params.n_layers
293 | self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
294 | self.dropout = nn.Dropout(params.dropout)
295 | self.layers = nn.ModuleList([MiniMindBlock(l, params) for l in range(self.n_layers)])
296 | self.norm = RMSNorm(params.dim, eps=params.norm_eps)
297 | self.output = nn.Linear(params.dim, params.vocab_size, bias=False)
298 | self.tok_embeddings.weight = self.output.weight
299 | self.register_buffer("pos_cis",
300 | precompute_pos_cis(dim=params.dim // params.n_heads, theta=params.rope_theta),
301 | persistent=False)
302 | self.OUT = CausalLMOutputWithPast()
303 |
304 | def forward(self,
305 | input_ids: Optional[torch.Tensor] = None,
306 | past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
307 | use_cache: bool = False,
308 | logits_to_keep: Union[int, torch.Tensor] = 0,
309 | **args):
310 | past_key_values = past_key_values or [None] * len(self.layers)
311 | start_pos = args.get('start_pos', 0)
312 | h = self.dropout(self.tok_embeddings(input_ids))
313 | pos_cis = self.pos_cis[start_pos:start_pos + input_ids.size(1)]
314 | past_kvs = []
315 | for l, layer in enumerate(self.layers):
316 | h, past_kv = layer(
317 | h, pos_cis,
318 | past_key_value=past_key_values[l],
319 | use_cache=use_cache
320 | )
321 | past_kvs.append(past_kv)
322 |
323 | slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
324 | logits = self.output(self.norm(h)[:, slice_indices, :])
325 | aux_loss = sum(l.feed_forward.aux_loss for l in self.layers if isinstance(l.feed_forward, MOEFeedForward))
326 | self.OUT.__setitem__('last_hidden_state', h)
327 | self.OUT.__setitem__('logits', logits)
328 | self.OUT.__setitem__('aux_loss', aux_loss)
329 | self.OUT.__setitem__('past_key_values', past_kvs)
330 | return self.OUT
331 |
332 | @torch.inference_mode()
333 | def generate(self, input_ids, eos_token_id=2, max_new_tokens=1024, temperature=0.75, top_p=0.90,
334 | stream=False, rp=1., use_cache=True, pad_token_id=0, num_return_sequences=1, **args):
335 | # 流式生成
336 | if stream:
337 | return self._stream(input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
338 |
339 | # 直接生成
340 | generated = []
341 | for i in range(input_ids.size(0)):
342 | non_pad = input_ids[i][input_ids[i] != pad_token_id].unsqueeze(0)
343 | for _ in range(num_return_sequences):
344 | out = self._stream(non_pad, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
345 | tokens_list = [tokens[:, -1:] for tokens in out]
346 | gen = torch.cat(tokens_list, dim=-1) if tokens_list else non_pad
347 | full_sequence = torch.cat([non_pad, gen], dim=-1)
348 | generated.append(full_sequence)
349 |
350 | max_length = max(seq.size(1) for seq in generated)
351 | generated = [
352 | torch.cat(
353 | [seq, torch.full((1, max_length - seq.size(1)), pad_token_id, dtype=seq.dtype, device=seq.device)],
354 | dim=-1)
355 | for seq in generated
356 | ]
357 | output = torch.cat(generated, dim=0)
358 | res = output.view(input_ids.size(0) * num_return_sequences, -1)
359 | return res
360 |
361 | def _stream(self, input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args):
362 | start, first_seq, past_kvs = input_ids.shape[1], True, None
363 | while input_ids.shape[1] < max_new_tokens - 1:
364 | if first_seq or not use_cache:
365 | out, first_seq = self(input_ids, past_key_values=past_kvs, use_cache=use_cache, **args), False
366 | else:
367 | out = self(input_ids[:, -1:], past_key_values=past_kvs, use_cache=use_cache,
368 | start_pos=input_ids.shape[1] - 1, **args)
369 | logits, past_kvs = out.logits[:, -1, :], out.past_key_values
370 | logits[:, list(set(input_ids.tolist()[0]))] /= rp
371 | logits /= (temperature + 1e-9)
372 | if top_p is not None and top_p < 1.0:
373 | sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
374 | sorted_probs = F.softmax(sorted_logits, dim=-1)
375 | cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
376 | sorted_indices_to_remove = cumulative_probs > top_p
377 | sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
378 | sorted_indices_to_remove[:, 0] = False
379 | indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
380 | logits[indices_to_remove] = -float('Inf')
381 | input_ids_next = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
382 | input_ids = torch.cat((input_ids, input_ids_next), dim=1)
383 | yield input_ids[:, start:]
384 | if input_ids_next.item() == eos_token_id:
385 | break
386 |
--------------------------------------------------------------------------------
/model/model_lora.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch import optim, nn
3 |
4 |
5 | # 定义Lora网络结构
6 | class LoRA(nn.Module):
7 | def __init__(self, in_features, out_features, rank):
8 | super().__init__()
9 | self.rank = rank # LoRA的秩(rank),控制低秩矩阵的大小
10 | self.A = nn.Linear(in_features, rank, bias=False) # 低秩矩阵A
11 | self.B = nn.Linear(rank, out_features, bias=False) # 低秩矩阵B
12 | # 矩阵A高斯初始化
13 | self.A.weight.data.normal_(mean=0.0, std=0.02)
14 | # 矩阵B全0初始化
15 | self.B.weight.data.zero_()
16 |
17 | def forward(self, x):
18 | return self.B(self.A(x))
19 |
20 |
21 | def apply_lora(model, rank=16):
22 | for name, module in model.named_modules():
23 | if isinstance(module, nn.Linear) and module.weight.shape[0] == module.weight.shape[1]:
24 | lora = LoRA(module.weight.shape[0], module.weight.shape[1], rank=rank).to(model.device)
25 | setattr(module, "lora", lora)
26 | original_forward = module.forward
27 |
28 | # 显式绑定
29 | def forward_with_lora(x, layer1=original_forward, layer2=lora):
30 | return layer1(x) + layer2(x)
31 |
32 | module.forward = forward_with_lora
33 |
34 |
35 | def load_lora(model, path):
36 | state_dict = torch.load(path, map_location=model.device)
37 | for name, module in model.named_modules():
38 | if hasattr(module, 'lora'):
39 | lora_state = {k.replace(f'{name}.lora.', ''): v for k, v in state_dict.items() if f'{name}.lora.' in k}
40 | module.lora.load_state_dict(lora_state)
41 |
42 |
43 | def save_lora(model, path):
44 | state_dict = {}
45 | for name, module in model.named_modules():
46 | if hasattr(module, 'lora'):
47 | lora_state = {f'{name}.lora.{k}': v for k, v in module.lora.state_dict().items()}
48 | state_dict.update(lora_state)
49 | torch.save(state_dict, path)
50 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | datasets==2.21.0
2 | datasketch==1.6.4
3 | Flask==3.0.3
4 | Flask_Cors==4.0.0
5 | jieba==0.42.1
6 | jsonlines==4.0.0
7 | marshmallow==3.22.0
8 | matplotlib==3.10.0
9 | ngrok==1.4.0
10 | nltk==3.8
11 | numpy==1.26.4
12 | openai==1.59.6
13 | pandas==1.5.3
14 | peft==0.7.1
15 | psutil==5.9.8
16 | pydantic==2.8.2
17 | rich==13.7.1
18 | scikit_learn==1.5.1
19 | sentence_transformers==2.3.1
20 | simhash==2.1.2
21 | tiktoken==0.5.1
22 | transformers==4.48.0
23 | jinja2==3.1.2
24 | jsonlines==4.0.0
25 | trl==0.13.0
26 | ujson==5.1.0
27 | wandb==0.18.3
28 | streamlit==1.30.0
29 | torch==2.2.2
30 | torchvision==0.17.2
--------------------------------------------------------------------------------
/scripts/chat_openai_api.py:
--------------------------------------------------------------------------------
1 | from openai import OpenAI
2 |
3 | client = OpenAI(
4 | api_key="none",
5 | base_url="http://localhost:8998/v1"
6 | )
7 | stream = True
8 | conversation_history_origin = []
9 | conversation_history = conversation_history_origin.copy()
10 | history_messages_num = 2 # 设置为偶数(Q+A),为0则每次不携带历史对话进行独立QA
11 | while True:
12 | query = input('[Q]: ')
13 | conversation_history.append({"role": "user", "content": query})
14 | response = client.chat.completions.create(
15 | model="minimind",
16 | messages=conversation_history[-history_messages_num:],
17 | stream=stream
18 | )
19 | if not stream:
20 | assistant_res = response.choices[0].message.content
21 | print('[A]: ', assistant_res)
22 | else:
23 | print('[A]: ', end='')
24 | assistant_res = ''
25 | for chunk in response:
26 | print(chunk.choices[0].delta.content or "", end="")
27 | assistant_res += chunk.choices[0].delta.content or ""
28 |
29 | conversation_history.append({"role": "assistant", "content": assistant_res})
30 | print('\n\n')
31 |
--------------------------------------------------------------------------------
/scripts/convert_model.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import warnings
3 | import sys
4 | import os
5 |
6 | __package__ = "scripts"
7 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
8 | from transformers import AutoTokenizer, AutoModelForCausalLM
9 | from model.LMConfig import LMConfig
10 | from model.model import MiniMindLM
11 |
12 | warnings.filterwarnings('ignore', category=UserWarning)
13 |
14 |
15 | def convert_torch2transformers(torch_path, transformers_path):
16 | def export_tokenizer(transformers_path):
17 | tokenizer = AutoTokenizer.from_pretrained('../model/minimind_tokenizer')
18 | tokenizer.save_pretrained(transformers_path)
19 |
20 | LMConfig.register_for_auto_class()
21 | MiniMindLM.register_for_auto_class("AutoModelForCausalLM")
22 | lm_model = MiniMindLM(lm_config)
23 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
24 | state_dict = torch.load(torch_path, map_location=device)
25 | lm_model.load_state_dict(state_dict, strict=False)
26 | model_params = sum(p.numel() for p in lm_model.parameters() if p.requires_grad)
27 | print(f'Model parameters: {model_params / 1e6} million = {model_params / 1e9} B (Billion)')
28 | lm_model.save_pretrained(transformers_path, safe_serialization=False)
29 | export_tokenizer(transformers_path)
30 | print(f"模型已保存为 Transformers 格式: {transformers_path}")
31 |
32 |
33 | def convert_transformers2torch(transformers_path, torch_path):
34 | model = AutoModelForCausalLM.from_pretrained(transformers_path, trust_remote_code=True)
35 | torch.save(model.state_dict(), torch_path)
36 | print(f"模型已保存为 PyTorch 格式: {torch_path}")
37 |
38 |
39 | # optional: only needed when pushing the exported model to the Hugging Face Hub
40 | def push_to_hf(export_model_path):
41 | def init_model():
42 | tokenizer = AutoTokenizer.from_pretrained('../model/minimind_tokenizer')
43 | model = AutoModelForCausalLM.from_pretrained(export_model_path, trust_remote_code=True)
44 | return model, tokenizer
45 |
46 | model, tokenizer = init_model()
47 | # model.push_to_hub(model_path)
48 | # tokenizer.push_to_hub(model_path, safe_serialization=False)
49 |
50 |
51 | if __name__ == '__main__':
52 | lm_config = LMConfig(dim=512, n_layers=8, max_seq_len=8192, use_moe=False)
53 |
54 | torch_path = f"../out/rlhf_{lm_config.dim}{'_moe' if lm_config.use_moe else ''}.pth"
55 |
56 | transformers_path = '../MiniMind2-Small'
57 |
58 | # convert torch to transformers model
59 | convert_torch2transformers(torch_path, transformers_path)
60 |
61 | # # convert transformers to torch model
62 | # convert_transformers2torch(transformers_path, torch_path)
63 |
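After conversion, the exported folder loads like any Transformers model; a quick sanity-check sketch, assuming the default output directory above:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('../MiniMind2-Small')
model = AutoModelForCausalLM.from_pretrained('../MiniMind2-Small', trust_remote_code=True)
print(f'{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters')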
--------------------------------------------------------------------------------
/scripts/serve_openai_api.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import os
4 | import sys
5 |
6 | __package__ = "scripts"
7 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
8 | import time
9 | import torch
10 | import warnings
11 | import uvicorn
12 | from fastapi import FastAPI, HTTPException
13 | from fastapi.responses import StreamingResponse
14 | from pydantic import BaseModel
15 | from transformers import AutoTokenizer, AutoModelForCausalLM
16 | from model.LMConfig import LMConfig
17 | from model.model import MiniMindLM
18 | from model.model_lora import apply_lora, load_lora
19 |
20 | warnings.filterwarnings('ignore')
21 |
22 | app = FastAPI()
23 |
24 |
25 | def init_model(args):
26 | tokenizer = AutoTokenizer.from_pretrained('../model/minimind_tokenizer')
27 | if args.load == 0:
28 | moe_path = '_moe' if args.use_moe else ''
29 | modes = {0: 'pretrain', 1: 'full_sft', 2: 'rlhf', 3: 'reason'}
30 | ckp = f'../{args.out_dir}/{modes[args.model_mode]}_{args.dim}{moe_path}.pth'
31 |
32 | model = MiniMindLM(LMConfig(
33 | dim=args.dim,
34 | n_layers=args.n_layers,
35 | max_seq_len=args.max_seq_len,
36 | use_moe=args.use_moe
37 | ))
38 |
39 | state_dict = torch.load(ckp, map_location=device)
40 | model.load_state_dict({k: v for k, v in state_dict.items() if 'mask' not in k}, strict=True)
41 |
42 | if args.lora_name != 'None':
43 | apply_lora(model)
44 | load_lora(model, f'../{args.out_dir}/{args.lora_name}_{args.dim}.pth')
45 | else:
46 | model = AutoModelForCausalLM.from_pretrained(
47 | './MiniMind2',
48 | trust_remote_code=True
49 | )
50 | print(f'MiniMind parameter count: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.2f}M(illion)')
51 | return model.eval().to(device), tokenizer
52 |
53 |
54 | class ChatRequest(BaseModel):
55 | model: str
56 | messages: list
57 | temperature: float = 0.7
58 | top_p: float = 0.92
59 | max_tokens: int = 8192
60 | stream: bool = False
61 |
62 |
63 | def generate_stream_response(messages, temperature, top_p, max_tokens):
64 | try:
65 | new_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)[-max_tokens:]
66 | x = tokenizer(new_prompt).data['input_ids']
67 | x = (torch.tensor(x, dtype=torch.long, device=device)[None, ...])
68 | with torch.no_grad():
69 | res_y = model.generate(
70 | x,
71 | eos_token_id=tokenizer.eos_token_id,
72 | max_new_tokens=max_tokens,
73 | temperature=temperature,
74 | top_p=top_p,
75 | stream=True,
76 | rp=1.,
77 | pad_token_id=tokenizer.pad_token_id
78 | )
79 | history_idx = 0
80 | for y in res_y:
81 | answer = tokenizer.decode(y[0].tolist(), skip_special_tokens=True)
82 | if (answer and answer[-1] == '�') or not answer:
83 | continue
84 | delta = answer[history_idx:]
85 | history_idx = len(answer)
86 | json_data = {
87 | 'id': f'chatcmpl-{int(time.time())}',
88 | 'object': 'chat.completion.chunk',
89 | 'created': int(time.time()),
90 | 'model': 'minimind',
91 | 'choices': [{'index': 0, 'delta': {'content': delta}, 'finish_reason': None}]
92 | }
93 | yield f"data: {json.dumps(json_data)}\n\n"
94 |
95 | except Exception as e:
96 | yield f"data: {json.dumps({'error': str(e)})}\n\n"
97 |
98 |
99 | @app.post("/v1/chat/completions")
100 | async def chat_completions(request: ChatRequest):
101 | try:
102 | if request.stream:
103 | return StreamingResponse(
104 | generate_stream_response(
105 | messages=request.messages,
106 | temperature=request.temperature,
107 | top_p=request.top_p,
108 | max_tokens=request.max_tokens
109 | ),
110 | media_type="text/event-stream"
111 | )
112 | else:
113 | new_prompt = tokenizer.apply_chat_template(
114 | request.messages,
115 | tokenize=False,
116 | add_generation_prompt=True
117 | )[-request.max_tokens:]
118 | x = tokenizer(new_prompt).data['input_ids']
119 | x = (torch.tensor(x, dtype=torch.long, device=device)[None, ...])
120 | with torch.no_grad():
121 | res_y = model.generate(
122 | x,
123 | eos_token_id=tokenizer.eos_token_id,
124 | max_new_tokens=request.max_tokens,
125 | temperature=request.temperature,
126 | top_p=request.top_p,
127 | stream=False,
128 | rp=1.,
129 | pad_token_id=tokenizer.pad_token_id
130 | )
131 | answer = tokenizer.decode(res_y.squeeze()[x.shape[1]:].tolist(), skip_special_tokens=True)
132 | return {
133 | "id": f"chatcmpl-{int(time.time())}",
134 | "object": "chat.completion",
135 | "created": int(time.time()),
136 | "model": "minimind",
137 | "choices": [
138 | {
139 | "index": 0,
140 | "message": {"role": "assistant", "content": answer},
141 | "finish_reason": "stop"
142 | }
143 | ]
144 | }
145 |
146 | except Exception as e:
147 | raise HTTPException(status_code=500, detail=str(e))
148 |
149 |
150 | if __name__ == "__main__":
151 | parser = argparse.ArgumentParser(description="Server for MiniMind")
152 | parser.add_argument('--out_dir', default='out', type=str)
153 | parser.add_argument('--lora_name', default='None', type=str)
154 | parser.add_argument('--dim', default=512, type=int)
155 | parser.add_argument('--n_layers', default=8, type=int)
156 | parser.add_argument('--max_seq_len', default=8192, type=int)
157 | parser.add_argument('--use_moe', default=False, type=bool)
158 | parser.add_argument('--load', default=0, type=int, help="0: load native torch weights, 1: load via transformers")
159 | parser.add_argument('--model_mode', default=1, type=int, help="0: pretrain model, 1: SFT-Chat model, 2: RLHF-Chat model, 3: Reason model")
160 |
161 | device = 'cuda' if torch.cuda.is_available() else 'cpu'
162 | model, tokenizer = init_model(parser.parse_args())
163 |
164 | uvicorn.run(app, host="0.0.0.0", port=8998)
165 |
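With the server running (e.g. `python serve_openai_api.py` from the scripts directory), any OpenAI-compatible client can stream from it; a minimal sketch with illustrative sampling parameters:

from openai import OpenAI

client = OpenAI(api_key="none", base_url="http://localhost:8998/v1")
stream = client.chat.completions.create(
    model="minimind",
    messages=[{"role": "user", "content": "Hello, introduce yourself."}],
    temperature=0.7, top_p=0.92, max_tokens=512, stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")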
--------------------------------------------------------------------------------
/scripts/train_tokenizer.py:
--------------------------------------------------------------------------------
1 | import random
2 | from tqdm import tqdm
3 | from transformers import AutoTokenizer
4 | import json
5 | from datasets import load_dataset
6 | from tokenizers import (
7 | decoders,
8 | models,
9 | normalizers,
10 | pre_tokenizers,
11 | processors,
12 | trainers,
13 | Tokenizer,
14 | )
15 | import os
16 |
17 | random.seed(42)
18 |
19 |
20 | def train_tokenizer():
21 | # Read the JSONL file and extract the text data
22 | def read_texts_from_jsonl(file_path):
23 | with open(file_path, 'r', encoding='utf-8') as f:
24 | for line in f:
25 | data = json.loads(line)
26 | yield data['text']
27 |
28 | data_path = '../dataset/pretrain_hq.jsonl'
29 |
30 | # Initialize the tokenizer
31 | tokenizer = Tokenizer(models.BPE())
32 | tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
33 |
34 | # Define the special tokens
35 | special_tokens = ["<unk>", "<s>", "</s>"]
36 |
37 | # Set up the trainer and add the special tokens
38 | trainer = trainers.BpeTrainer(
39 | vocab_size=6400,
40 | special_tokens=special_tokens, # make sure these three tokens are included
41 | show_progress=True,
42 | initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
43 | )
44 |
45 | # Read the text data
46 | texts = read_texts_from_jsonl(data_path)
47 |
48 | # Train the tokenizer
49 | tokenizer.train_from_iterator(texts, trainer=trainer)
50 |
51 | # Set the decoder
52 | tokenizer.decoder = decoders.ByteLevel()
53 |
54 | # Check the indices of the special tokens
55 | assert tokenizer.token_to_id("<unk>") == 0
56 | assert tokenizer.token_to_id("<s>") == 1
57 | assert tokenizer.token_to_id("</s>") == 2
58 |
59 | # Save the tokenizer
60 | tokenizer_dir = "../model/minimind_tokenizer"
61 | os.makedirs(tokenizer_dir, exist_ok=True)
62 | tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
63 | tokenizer.model.save("../model/minimind_tokenizer")
64 |
65 | # Manually create the tokenizer_config.json
66 | config = {
67 | "add_bos_token": False,
68 | "add_eos_token": False,
69 | "add_prefix_space": False,
70 | "added_tokens_decoder": {
71 | "0": {
72 | "content": "",
73 | "lstrip": False,
74 | "normalized": False,
75 | "rstrip": False,
76 | "single_word": False,
77 | "special": True
78 | },
79 | "1": {
80 | "content": "",
81 | "lstrip": False,
82 | "normalized": False,
83 | "rstrip": False,
84 | "single_word": False,
85 | "special": True
86 | },
87 | "2": {
88 | "content": " ",
89 | "lstrip": False,
90 | "normalized": False,
91 | "rstrip": False,
92 | "single_word": False,
93 | "special": True
94 | }
95 | },
96 | "additional_special_tokens": [],
97 | "bos_token": "",
98 | "clean_up_tokenization_spaces": False,
99 | "eos_token": " ",
100 | "legacy": True,
101 | "model_max_length": 32768,
102 | "pad_token": "",
103 | "sp_model_kwargs": {},
104 | "spaces_between_special_tokens": False,
105 | "tokenizer_class": "PreTrainedTokenizerFast",
106 | "unk_token": "",
107 | "chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{{ 'system\\n' + system_message + ' \\n' }}{% else %}{{ 'system\\n你是 MiniMind,是一个有用的人工智能助手。 \\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'user\\n' + content + ' \\nassistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + ' ' + '\\n' }}{% endif %}{% endfor %}"
108 | }
109 |
110 | # Save the config file
111 | with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w", encoding="utf-8") as config_file:
112 | json.dump(config, config_file, ensure_ascii=False, indent=4)
113 |
114 | print("Tokenizer training completed and saved.")
115 |
116 |
117 | def eval_tokenizer():
118 | from transformers import AutoTokenizer
119 |
120 | # Load the trained tokenizer
121 | tokenizer = AutoTokenizer.from_pretrained("../model/minimind_tokenizer")
122 |
123 | messages = [
124 | {"role": "system", "content": "你是一个优秀的聊天机器人,总是给我正确的回应!"},
125 | {"role": "user", "content": '你来自哪里?'},
126 | {"role": "assistant", "content": '我来自地球'}
127 | ]
128 | new_prompt = tokenizer.apply_chat_template(
129 | messages,
130 | tokenize=False
131 | )
132 | print(new_prompt)
133 |
134 | # Get the actual vocabulary size (including special tokens)
135 | actual_vocab_size = len(tokenizer)
136 | print('actual tokenizer vocab size:', actual_vocab_size)
137 |
138 | model_inputs = tokenizer(new_prompt)
139 | print('encoded length:', len(model_inputs['input_ids']))
140 |
141 | input_ids = model_inputs['input_ids']
142 | response = tokenizer.decode(input_ids, skip_special_tokens=False)
143 | print('decoded text matches the original:', response == new_prompt)
144 |
145 |
146 | def main():
147 | train_tokenizer()
148 | eval_tokenizer()
149 |
150 |
151 | if __name__ == '__main__':
152 | main()
153 |
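A quick check sketch after training, assuming the files were written to ../model/minimind_tokenizer: rendering a single user turn should end with the assistant prefix that the chat_template above appends after each user message.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../model/minimind_tokenizer")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    tokenize=False,
    add_generation_prompt=True
)
print(prompt)  # ends with the assistant prefix defined by the chat_template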
--------------------------------------------------------------------------------
/scripts/web_demo.py:
--------------------------------------------------------------------------------
1 | import random
2 | import re
3 | import time
4 |
5 | import numpy as np
6 | import streamlit as st
7 | import torch
8 |
9 | st.set_page_config(page_title="MiniMind", initial_sidebar_state="collapsed")
10 |
11 | # The CSS block below customizes the button styles
12 | st.markdown("""
13 |
66 | """, unsafe_allow_html=True)
67 |
68 | system_prompt = []
69 | device = "cuda" if torch.cuda.is_available() else "cpu"
70 |
71 |
72 | def process_assistant_content(content):
73 | if 'R1' not in MODEL_PATHS[selected_model][1]:
74 | return content
75 |
76 | if '<think>' in content and '</think>' in content:
77 | content = re.sub(r'(<think>)(.*?)(</think>)',
78 | r'<details><summary>推理内容(展开)</summary>\2</details>',
79 | content,
80 | flags=re.DOTALL)
81 |
82 | if '<think>' in content and '</think>' not in content:
83 | content = re.sub(r'<think>(.*?)$',
84 | r'<details open><summary>推理中...</summary>\1</details>',
85 | content,
86 | flags=re.DOTALL)
87 |
88 | if '<think>' not in content and '</think>' in content:
89 | content = re.sub(r'(.*?)</think>',
90 | r'<details><summary>推理内容(展开)</summary>\1</details>',
91 | content,
92 | flags=re.DOTALL)
93 |
94 | return content
95 |
96 |
97 | @st.cache_resource
98 | def load_model_tokenizer(model_path):
99 | model = AutoModelForCausalLM.from_pretrained(
100 | model_path,
101 | trust_remote_code=True
102 | )
103 | tokenizer = AutoTokenizer.from_pretrained(
104 | model_path,
105 | trust_remote_code=True
106 | )
107 | model = model.eval().to(device)
108 | return model, tokenizer
109 |
110 |
111 | def clear_chat_messages():
112 | del st.session_state.messages
113 | del st.session_state.chat_messages
114 |
115 |
116 | def init_chat_messages():
117 | if "messages" in st.session_state:
118 | for i, message in enumerate(st.session_state.messages):
119 | if message["role"] == "assistant":
120 | with st.chat_message("assistant", avatar=image_url):
121 | st.markdown(process_assistant_content(message["content"]), unsafe_allow_html=True)
122 | # Add a delete button below the message content
123 | if st.button("🗑", key=f"delete_{i}"):
124 | st.session_state.messages.pop(i)
125 | st.session_state.messages.pop(i - 1)
126 | st.session_state.chat_messages.pop(i)
127 | st.session_state.chat_messages.pop(i - 1)
128 | st.rerun()
129 | else:
130 | st.markdown(
131 | f'',
132 | unsafe_allow_html=True)
133 |
134 | else:
135 | st.session_state.messages = []
136 | st.session_state.chat_messages = []
137 |
138 | return st.session_state.messages
139 |
140 |
141 | # Two helper functions
142 | def regenerate_answer(index):
143 | st.session_state.messages.pop()
144 | st.session_state.chat_messages.pop()
145 | st.rerun()
146 |
147 |
148 | def delete_conversation(index):
149 | st.session_state.messages.pop(index)
150 | st.session_state.messages.pop(index - 1)
151 | st.session_state.chat_messages.pop(index)
152 | st.session_state.chat_messages.pop(index - 1)
153 | st.rerun()
154 |
155 |
156 | # Sidebar: model selection
157 | st.sidebar.title("模型设定调整")
158 |
159 | st.sidebar.text("【注】训练数据偏差,增加上下文记忆时\n多轮对话(较单轮)容易出现能力衰减")
160 | st.session_state.history_chat_num = st.sidebar.slider("Number of Historical Dialogues", 0, 6, 0, step=2)
161 | # st.session_state.history_chat_num = 0
162 | st.session_state.max_new_tokens = st.sidebar.slider("Max Sequence Length", 256, 8192, 8192, step=1)
163 | st.session_state.top_p = st.sidebar.slider("Top-P", 0.8, 0.99, 0.85, step=0.01)
164 | st.session_state.temperature = st.sidebar.slider("Temperature", 0.6, 1.2, 0.85, step=0.01)
165 |
166 | # Model path mapping
167 | MODEL_PATHS = {
168 | "MiniMind2-R1 (0.1B)": ["../MiniMind2-R1", "MiniMind2-R1"],
169 | "MiniMind2-Small-R1 (0.02B)": ["../MiniMind2-Small-R1", "MiniMind2-Small-R1"],
170 | "MiniMind2 (0.1B)": ["../MiniMind2", "MiniMind2"],
171 | "MiniMind2-MoE (0.15B)": ["../MiniMind2-MoE", "MiniMind2-MoE"],
172 | "MiniMind2-Small (0.02B)": ["../MiniMind2-Small", "MiniMind2-Small"],
173 | "MiniMind-V1 (0.1B)": ["../minimind-v1", "MiniMind-V1"],
174 | "MiniMind-V1-MoE (0.1B)": ["../minimind-v1-moe", "MiniMind-V1-MoE"],
175 | "MiniMind-V1-Small (0.02B)": ["../minimind-v1-small", "MiniMind-V1-Small"],
176 | }
177 |
178 | selected_model = st.sidebar.selectbox('Models', list(MODEL_PATHS.keys()), index=2) # default: MiniMind2
179 | model_path = MODEL_PATHS[selected_model][0]
180 |
181 | slogan = f"Hi, I'm {MODEL_PATHS[selected_model][1]}"
182 |
183 | image_url = "https://www.modelscope.cn/api/v1/studio/gongjy/MiniMind/repo?Revision=master&FilePath=images%2Flogo2.png&View=true"
184 |
185 | st.markdown(
186 | f'<div align="center">'
187 | f'<img src="{image_url}" width="45%">'
188 | f'<br>'
189 | f'<b>{slogan}</b>'
190 | '<br>'
191 | '<small>内容完全由AI生成,请务必仔细甄别 Content AI-generated, please discern with care</small>'
192 | '</div>',
193 | unsafe_allow_html=True
194 | )
195 |
196 |
197 | def setup_seed(seed):
198 | random.seed(seed)
199 | np.random.seed(seed)
200 | torch.manual_seed(seed)
201 | torch.cuda.manual_seed(seed)
202 | torch.cuda.manual_seed_all(seed)
203 | torch.backends.cudnn.deterministic = True
204 | torch.backends.cudnn.benchmark = False
205 |
206 |
207 | def main():
208 | model, tokenizer = load_model_tokenizer(model_path)
209 |
210 | # Initialize the message lists
211 | if "messages" not in st.session_state:
212 | st.session_state.messages = []
213 | st.session_state.chat_messages = []
214 |
215 | # Use session state messages
216 | messages = st.session_state.messages
217 |
218 | # Render the chat history
219 | for i, message in enumerate(messages):
220 | if message["role"] == "assistant":
221 | with st.chat_message("assistant", avatar=image_url):
222 | st.markdown(process_assistant_content(message["content"]), unsafe_allow_html=True)
223 | if st.button("×", key=f"delete_{i}"):
224 | # Delete the current message and everything after it
225 | st.session_state.messages = st.session_state.messages[:i - 1]
226 | st.session_state.chat_messages = st.session_state.chat_messages[:i - 1]
227 | st.rerun()
228 | else:
229 | st.markdown(
230 | f'',
231 | unsafe_allow_html=True)
232 |
233 | # Handle new input or regeneration
234 | prompt = st.chat_input(key="input", placeholder="给 MiniMind 发送消息")
235 |
236 | # Check whether a regeneration was requested
237 | if hasattr(st.session_state, 'regenerate') and st.session_state.regenerate:
238 | prompt = st.session_state.last_user_message
239 | regenerate_index = st.session_state.regenerate_index # position to regenerate from
240 | # Clear all regeneration-related state
241 | delattr(st.session_state, 'regenerate')
242 | delattr(st.session_state, 'last_user_message')
243 | delattr(st.session_state, 'regenerate_index')
244 |
245 | if prompt:
246 | st.markdown(
247 | f'',
248 | unsafe_allow_html=True)
249 | messages.append({"role": "user", "content": prompt})
250 | st.session_state.chat_messages.append({"role": "user", "content": prompt})
251 |
252 | with st.chat_message("assistant", avatar=image_url):
253 | placeholder = st.empty()
254 | random_seed = random.randint(0, 2 ** 32 - 1)
255 | setup_seed(random_seed)
256 |
257 | st.session_state.chat_messages = system_prompt + st.session_state.chat_messages[
258 | -(st.session_state.history_chat_num + 1):]
259 | new_prompt = tokenizer.apply_chat_template(
260 | st.session_state.chat_messages,
261 | tokenize=False,
262 | add_generation_prompt=True
263 | )[-(st.session_state.max_new_tokens - 1):]
264 |
265 | x = torch.tensor(tokenizer(new_prompt)['input_ids'], device=device).unsqueeze(0)
266 | with torch.no_grad():
267 | res_y = model.generate(x, tokenizer.eos_token_id, max_new_tokens=st.session_state.max_new_tokens,
268 | temperature=st.session_state.temperature,
269 | top_p=st.session_state.top_p, stream=True)
270 | try:
271 | for y in res_y:
272 | answer = tokenizer.decode(y[0].tolist(), skip_special_tokens=True)
273 | if (answer and answer[-1] == '�') or not answer:
274 | continue
275 | placeholder.markdown(process_assistant_content(answer), unsafe_allow_html=True)
276 | except StopIteration:
277 | print("No answer")
278 |
279 | assistant_answer = answer.replace(new_prompt, "")
280 | messages.append({"role": "assistant", "content": assistant_answer})
281 | st.session_state.chat_messages.append({"role": "assistant", "content": assistant_answer})
282 |
283 | with st.empty():
284 | if st.button("×", key=f"delete_{len(messages) - 1}"):
285 | st.session_state.messages = st.session_state.messages[:-2]
286 | st.session_state.chat_messages = st.session_state.chat_messages[:-2]
287 | st.rerun()
288 |
289 |
290 | if __name__ == "__main__":
291 | from transformers import AutoModelForCausalLM, AutoTokenizer
292 |
293 | main()
294 |
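The demo is launched with Streamlit (e.g. `streamlit run web_demo.py` from the scripts directory) once the Transformers-format checkpoints listed in MODEL_PATHS exist. A minimal non-UI sketch of the same streaming generation step, with illustrative sampling values:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "../MiniMind2"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).eval()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}], tokenize=False, add_generation_prompt=True)
x = torch.tensor(tokenizer(prompt)['input_ids']).unsqueeze(0)
answer = ''
with torch.no_grad():
    # mirrors the demo's call: streaming generate yields progressively longer sequences
    for y in model.generate(x, tokenizer.eos_token_id, max_new_tokens=256,
                            temperature=0.85, top_p=0.85, stream=True):
        answer = tokenizer.decode(y[0].tolist(), skip_special_tokens=True)
print(answer)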
--------------------------------------------------------------------------------
/train_distill_reason.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | import argparse
4 | import time
5 | import math
6 | import warnings
7 |
8 | import pandas as pd
9 | import torch
10 | import torch.nn.functional as F
11 | import torch.distributed as dist
12 | from contextlib import nullcontext
13 |
14 | from torch import optim, nn
15 | from torch.nn.parallel import DistributedDataParallel
16 | from torch.utils.data import DataLoader, DistributedSampler
17 | from transformers import AutoTokenizer, AutoModelForCausalLM
18 | from model.model import MiniMindLM
19 | from model.LMConfig import LMConfig
20 | from model.dataset import SFTDataset
21 |
22 | warnings.filterwarnings('ignore')
23 |
24 |
25 | def Logger(content):
26 | if not ddp or dist.get_rank() == 0:
27 | print(content)
28 |
29 |
30 | def get_lr(current_step, total_steps, lr):
31 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
32 |
33 |
34 | def train_epoch(epoch, wandb):
35 | # Token ids of the think/answer tag markers
36 | start_of_think_ids = tokenizer('<think>').input_ids
37 | end_of_think_ids = tokenizer('</think>').input_ids
38 | start_of_answer_ids = tokenizer('<answer>').input_ids
39 | end_of_answer_ids = tokenizer('</answer>').input_ids
40 | loss_fct = nn.CrossEntropyLoss(reduction='none')
41 | start_time = time.time()
42 | for step, (X, Y, loss_mask) in enumerate(train_loader):
43 | X = X.to(args.device)
44 | Y = Y.to(args.device)
45 | loss_mask = loss_mask.to(args.device)
46 | lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
47 | for param_group in optimizer.param_groups:
48 | param_group['lr'] = lr
49 |
50 | with ctx:
51 | res = model(X)
52 | loss = loss_fct(
53 | res.logits.view(-1, res.logits.size(-1)),
54 | Y.view(-1)
55 | ).view(Y.size())
56 | sp_ids = torch.isin(Y.view(-1),
57 | torch.tensor(start_of_think_ids + end_of_think_ids
58 | + start_of_answer_ids + end_of_answer_ids
59 | ).to(args.device))
60 | # Apply an extra penalty (weight 10) at the positions of these special tokens
61 | loss_mask = loss_mask.view(-1)
62 | loss_mask_sum = loss_mask.sum()
63 | loss_mask[sp_ids] = 10
64 | loss_mask = loss_mask.view(Y.size())
65 | loss = (loss * loss_mask).sum() / loss_mask_sum
66 | loss += res.aux_loss
67 | loss = loss / args.accumulation_steps
68 |
69 | scaler.scale(loss).backward()
70 |
71 | if (step + 1) % args.accumulation_steps == 0:
72 | scaler.unscale_(optimizer)
73 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
74 |
75 | scaler.step(optimizer)
76 | scaler.update()
77 |
78 | optimizer.zero_grad(set_to_none=True)
79 |
80 | if step % args.log_interval == 0:
81 | spend_time = time.time() - start_time
82 | Logger(
83 | 'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
84 | epoch + 1,
85 | args.epochs,
86 | step,
87 | iter_per_epoch,
88 | loss.item(),
89 | optimizer.param_groups[-1]['lr'],
90 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
91 |
92 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
93 | wandb.log({"loss": loss,
94 | "lr": optimizer.param_groups[-1]['lr'],
95 | "epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
96 |
97 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
98 | model.eval()
99 | moe_path = '_moe' if lm_config.use_moe else ''
100 | ckp = f'{args.save_dir}/reason_{lm_config.dim}{moe_path}.pth'
101 |
102 | if isinstance(model, torch.nn.parallel.DistributedDataParallel):
103 | state_dict = model.module.state_dict()
104 | else:
105 | state_dict = model.state_dict()
106 |
107 | torch.save(state_dict, ckp)
108 | model.train()
109 |
110 |
111 | def init_model(lm_config):
112 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
113 | model = MiniMindLM(lm_config)
114 | moe_path = '_moe' if lm_config.use_moe else ''
115 | ckp = f'./out/rlhf_{lm_config.dim}{moe_path}.pth'
116 | state_dict = torch.load(ckp, map_location=args.device)
117 | model.load_state_dict(state_dict, strict=False)
118 | Logger(f'LLM total parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
119 | model = model.to(args.device)
120 | return model, tokenizer
121 |
122 |
123 | def init_distributed_mode():
124 | if not ddp: return
125 | global ddp_local_rank, DEVICE
126 |
127 | dist.init_process_group(backend="nccl")
128 | ddp_rank = int(os.environ["RANK"])
129 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
130 | ddp_world_size = int(os.environ["WORLD_SIZE"])
131 | DEVICE = f"cuda:{ddp_local_rank}"
132 | torch.cuda.set_device(DEVICE)
133 |
134 |
135 | if __name__ == "__main__":
136 | parser = argparse.ArgumentParser(description="MiniMind Distill Reasoning")
137 | parser.add_argument("--out_dir", type=str, default="out")
138 | parser.add_argument("--epochs", type=int, default=1)
139 | parser.add_argument("--batch_size", type=int, default=8)
140 | parser.add_argument("--learning_rate", type=float, default=1e-6)
141 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
142 | parser.add_argument("--dtype", type=str, default="bfloat16")
143 | parser.add_argument("--use_wandb", action="store_true")
144 | parser.add_argument("--wandb_project", type=str, default="MiniMind-Full-SFT")
145 | parser.add_argument("--num_workers", type=int, default=1)
146 | parser.add_argument("--ddp", action="store_true")
147 | parser.add_argument("--accumulation_steps", type=int, default=1)
148 | parser.add_argument("--grad_clip", type=float, default=1.0)
149 | parser.add_argument("--warmup_iters", type=int, default=0)
150 | parser.add_argument("--log_interval", type=int, default=1)
151 | parser.add_argument("--save_interval", type=int, default=50)
152 | parser.add_argument('--local_rank', type=int, default=-1)
153 | parser.add_argument('--dim', default=512, type=int)
154 | parser.add_argument('--n_layers', default=8, type=int)
155 | parser.add_argument('--max_seq_len', default=1024, type=int)
156 | parser.add_argument('--use_moe', default=False, type=bool)
157 | parser.add_argument("--data_path", type=str, default="./dataset/r1_mix_1024.jsonl")
158 |
159 | args = parser.parse_args()
160 |
161 | lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
162 | args.save_dir = os.path.join(args.out_dir)
163 | os.makedirs(args.save_dir, exist_ok=True)
164 | os.makedirs(args.out_dir, exist_ok=True)
165 | tokens_per_iter = args.batch_size * lm_config.max_seq_len
166 | device_type = "cuda" if "cuda" in args.device else "cpu"
167 |
168 | args.wandb_run_name = f"MiniMind-Distill-Reasoning-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
169 |
170 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
171 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
172 | ddp_local_rank, DEVICE = 0, "cuda:0"
173 | base_seed = 1337
174 | torch.manual_seed(base_seed)
175 | torch.cuda.manual_seed(base_seed)
176 |
177 | if ddp:
178 | init_distributed_mode()
179 | args.device = torch.device(DEVICE)
180 | rank = dist.get_rank()
181 | torch.manual_seed(base_seed + rank)
182 | # Also set the CUDA random seed
183 | torch.cuda.manual_seed(base_seed + rank)
184 |
185 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
186 | import wandb
187 |
188 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
189 | else:
190 | wandb = None
191 |
192 | model, tokenizer = init_model(lm_config)
193 |
194 | train_ds = SFTDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
195 | train_sampler = DistributedSampler(train_ds) if ddp else None
196 | train_loader = DataLoader(
197 | train_ds,
198 | batch_size=args.batch_size,
199 | pin_memory=True,
200 | drop_last=False,
201 | shuffle=False,
202 | num_workers=args.num_workers,
203 | sampler=train_sampler
204 | )
205 |
206 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
207 | optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
208 |
209 | if ddp:
210 | model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
211 | model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
212 |
213 | iter_per_epoch = len(train_loader)
214 | for epoch in range(args.epochs):
215 | train_epoch(epoch, wandb)
216 |
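The re-weighting in train_epoch can be illustrated in isolation: positions whose target token is one of the think/answer tags contribute with weight 10 while the normaliser keeps the original mask sum. A toy sketch with made-up token ids:

import torch

Y = torch.tensor([[5, 7, 9, 7]])                     # toy targets; 7 stands in for a tag token id
token_loss = torch.ones_like(Y, dtype=torch.float)   # pretend per-token CE loss
loss_mask = torch.ones_like(Y, dtype=torch.float)

sp_ids = torch.isin(Y.view(-1), torch.tensor([7]))
loss_mask_sum = loss_mask.sum()                      # normaliser is taken before re-weighting
loss_mask = loss_mask.view(-1)
loss_mask[sp_ids] = 10
loss_mask = loss_mask.view(Y.size())
loss = (token_loss * loss_mask).sum() / loss_mask_sum
print(loss)                                          # tensor(5.5): tag positions count 10x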
--------------------------------------------------------------------------------
/train_distillation.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import time
4 | import math
5 | import warnings
6 |
7 | import pandas as pd
8 | import torch
9 | import torch.nn.functional as F
10 | import torch.distributed as dist
11 | from contextlib import nullcontext
12 |
13 | from torch import optim, nn
14 | from torch.nn.parallel import DistributedDataParallel
15 | from torch.utils.data import DataLoader, DistributedSampler
16 | from transformers import AutoTokenizer, AutoModelForCausalLM
17 | from model.model import MiniMindLM
18 | from model.LMConfig import LMConfig
19 | from model.dataset import SFTDataset
20 |
21 | warnings.filterwarnings('ignore')
22 |
23 |
24 | def Logger(content):
25 | if not ddp or dist.get_rank() == 0:
26 | print(content)
27 |
28 |
29 | def get_lr(current_step, total_steps, lr):
30 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
31 |
32 |
33 | def distillation_loss_fn(student_logits, teacher_logits, temperature=1.0, reduction='batchmean'):
34 | with torch.no_grad():
35 | teacher_probs = F.softmax(teacher_logits / temperature, dim=-1).detach()
36 |
37 | student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
38 |
39 | kl = F.kl_div(
40 | student_log_probs,
41 | teacher_probs,
42 | reduction=reduction
43 | )
44 | return (temperature ** 2) * kl
45 |
46 |
47 | def train_epoch(epoch, wandb, alpha=0.0, temperature=1.0):
48 | start_time = time.time()
49 |
50 | if teacher_model is not None:
51 | teacher_model.eval()
52 | teacher_model.requires_grad_(False)
53 |
54 | for step, (X, Y, loss_mask) in enumerate(train_loader):
55 | X = X.to(args.device)
56 | Y = Y.to(args.device)
57 | loss_mask = loss_mask.to(args.device)
58 | lr = get_lr(epoch * iter_per_epoch + step,
59 | args.epochs * iter_per_epoch,
60 | args.learning_rate)
61 | for param_group in optimizer.param_groups:
62 | param_group['lr'] = lr
63 |
64 | # Forward pass (student model)
65 | with ctx:
66 | res = model(X)
67 | student_logits = res.logits
68 |
69 | # Teacher forward pass (eval mode, no grad)
70 | if teacher_model is not None:
71 | with torch.no_grad():
72 | teacher_logits = teacher_model(X).logits
73 | vocab_size_student = student_logits.size(-1) # N
74 | teacher_logits = teacher_logits[..., :vocab_size_student]
75 |
76 | # ========== Compute the losses ==========
77 | # 1) Ground-truth CE loss (optional)
78 | loss_mask_flat = loss_mask.view(-1)
79 | ce_loss = F.cross_entropy(
80 | student_logits.view(-1, student_logits.size(-1)),
81 | Y.view(-1),
82 | ignore_index=0,
83 | reduction='none'
84 | )
85 | ce_loss = torch.sum(ce_loss * loss_mask_flat) / loss_mask_flat.sum()
86 | if lm_config_student.use_moe:
87 | ce_loss += res.aux_loss
88 |
89 | # 2) Distillation loss (optional)
90 | if teacher_model is not None:
91 | # Distill only at valid token positions
92 | distill_loss = distillation_loss_fn(
93 | student_logits.view(-1, student_logits.size(-1))[loss_mask_flat == 1],
94 | teacher_logits.view(-1, teacher_logits.size(-1))[loss_mask_flat == 1],
95 | temperature=temperature
96 | )
97 | else:
98 | distill_loss = torch.tensor(0.0, device=args.device)
99 |
100 | # 3) Total loss = alpha * CE + (1 - alpha) * Distill
101 | loss = alpha * ce_loss + (1 - alpha) * distill_loss
102 |
103 | scaler.scale(loss).backward()
104 |
105 | if (step + 1) % args.accumulation_steps == 0:
106 | scaler.unscale_(optimizer)
107 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
108 | scaler.step(optimizer)
109 | scaler.update()
110 | optimizer.zero_grad(set_to_none=True)
111 |
112 | if step % args.log_interval == 0:
113 | spend_time = time.time() - start_time
114 | Logger(
115 | 'Epoch:[{}/{}]({}/{}) loss:{:.4f} lr:{:.12f} epoch_Time:{}min:'.format(
116 | epoch,
117 | args.epochs - 1,
118 | step,
119 | iter_per_epoch,
120 | loss.item(),
121 | optimizer.param_groups[-1]['lr'],
122 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60
123 | )
124 | )
125 |
126 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
127 | wandb.log({
128 | "loss": loss.item(),
129 | "ce_loss": ce_loss.item(),
130 | "distill_loss": distill_loss.item() if teacher_model is not None else 0.0,
131 | "lr": optimizer.param_groups[-1]['lr'],
132 | "last-time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60
133 | })
134 |
135 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
136 | model.eval()
137 | moe_path = '_moe' if lm_config_student.use_moe else ''
138 | ckp = f'{args.save_dir}/full_dist_{lm_config_student.dim}{moe_path}.pth'
139 | if isinstance(model, torch.nn.parallel.DistributedDataParallel):
140 | state_dict = model.module.state_dict()
141 | else:
142 | state_dict = model.state_dict()
143 | torch.save(state_dict, ckp)
144 | model.train()
145 |
146 |
147 | def init_student_model(lm_config):
148 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
149 | model = MiniMindLM(lm_config)
150 | moe_path = '_moe' if lm_config.use_moe else ''
151 | ckp = f'./out/full_sft_{lm_config.dim}{moe_path}.pth'
152 | state_dict = torch.load(ckp, map_location=args.device)
153 | model.load_state_dict(state_dict, strict=False)
154 | Logger(f'Student model (LLM) total parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
155 | model = model.to(args.device)
156 |
157 | return model, tokenizer
158 |
159 |
160 | def init_teacher_model(lm_config):
161 | model = MiniMindLM(lm_config)
162 | moe_path = '_moe' if lm_config.use_moe else ''
163 | ckp = f'./out/full_sft_{lm_config.dim}{moe_path}.pth'
164 | state_dict = torch.load(ckp, map_location=args.device)
165 | model.load_state_dict(state_dict, strict=False)
166 | Logger(f'Teacher model (LLM) total parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
167 | model = model.to(args.device)
168 | return model
169 |
170 |
171 | def init_distributed_mode():
172 | if not ddp: return
173 | global ddp_local_rank, DEVICE
174 |
175 | dist.init_process_group(backend="nccl")
176 | ddp_rank = int(os.environ["RANK"])
177 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
178 | ddp_world_size = int(os.environ["WORLD_SIZE"])
179 | DEVICE = f"cuda:{ddp_local_rank}"
180 | torch.cuda.set_device(DEVICE)
181 |
182 |
183 | if __name__ == "__main__":
184 | parser = argparse.ArgumentParser(description="MiniMind Knowledge Distillation")
185 | parser.add_argument("--out_dir", type=str, default="out")
186 | parser.add_argument("--epochs", type=int, default=6)
187 | parser.add_argument("--batch_size", type=int, default=32)
188 | parser.add_argument("--learning_rate", type=float, default=5e-6)
189 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
190 | parser.add_argument("--dtype", type=str, default="bfloat16")
191 | parser.add_argument("--use_wandb", action="store_true")
192 | parser.add_argument("--wandb_project", type=str, default="MiniMind-Full-SFT")
193 | parser.add_argument("--num_workers", type=int, default=1)
194 | parser.add_argument("--ddp", action="store_true")
195 | parser.add_argument("--accumulation_steps", type=int, default=1)
196 | parser.add_argument("--grad_clip", type=float, default=1.0)
197 | parser.add_argument("--warmup_iters", type=int, default=0)
198 | parser.add_argument("--log_interval", type=int, default=100)
199 | parser.add_argument("--save_interval", type=int, default=100)
200 | parser.add_argument('--local_rank', type=int, default=-1)
201 | parser.add_argument("--data_path", type=str, default="./dataset/sft_data.jsonl")
202 |
203 | args = parser.parse_args()
204 | # Define the student and teacher model configurations
205 | lm_config_student = LMConfig(dim=512, n_layers=8, max_seq_len=512)
206 | lm_config_teacher = LMConfig(dim=768, n_layers=16, max_seq_len=512)
207 | max_seq_len = lm_config_student.max_seq_len
208 | args.save_dir = os.path.join(args.out_dir)
209 | os.makedirs(args.save_dir, exist_ok=True)
210 | os.makedirs(args.out_dir, exist_ok=True)
211 | tokens_per_iter = args.batch_size * max_seq_len
212 | device_type = "cuda" if "cuda" in args.device else "cpu"
213 |
214 | args.wandb_run_name = f"MiniMind-Dist-SFT-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
215 |
216 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
217 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
218 | ddp_local_rank, DEVICE = 0, "cuda:0"
219 | base_seed = 1337
220 | torch.manual_seed(base_seed)
221 | torch.cuda.manual_seed(base_seed)
222 |
223 | if ddp:
224 | init_distributed_mode()
225 | args.device = torch.device(DEVICE)
226 | rank = dist.get_rank()
227 | torch.manual_seed(base_seed + rank)
228 | # Also set the CUDA random seed
229 | torch.cuda.manual_seed(base_seed + rank)
230 |
231 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
232 | import wandb
233 |
234 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
235 | else:
236 | wandb = None
237 |
238 | # Initialize the student and teacher models
239 | model, tokenizer = init_student_model(lm_config_student)
240 | teacher_model = init_teacher_model(lm_config_teacher)
241 |
242 | train_ds = SFTDataset(args.data_path, tokenizer, max_length=max_seq_len)
243 | train_sampler = DistributedSampler(train_ds) if ddp else None
244 | train_loader = DataLoader(
245 | train_ds,
246 | batch_size=args.batch_size,
247 | pin_memory=True,
248 | drop_last=False,
249 | shuffle=False,
250 | num_workers=args.num_workers,
251 | sampler=train_sampler
252 | )
253 |
254 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
255 | optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
256 |
257 | if ddp:
258 | model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
259 | model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
260 |
261 | iter_per_epoch = len(train_loader)
262 | for epoch in range(args.epochs):
263 | train_epoch(epoch, wandb)
264 |
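A toy sketch of the distillation term, reusing distillation_loss_fn defined above with random tensors standing in for real logits (shapes and temperature are illustrative):

import torch

student_logits = torch.randn(64, 6400)   # batch*seq flattened, student vocab size
teacher_logits = torch.randn(64, 6400)   # teacher logits already truncated to the student vocab
kd = distillation_loss_fn(student_logits, teacher_logits, temperature=1.5)
print(kd.item())                         # scalar: T^2 * KL(teacher || student), 'batchmean' reduced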
--------------------------------------------------------------------------------
/train_dpo.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | import argparse
4 | import time
5 | import math
6 | import warnings
7 |
8 | import pandas as pd
9 | import torch
10 | import torch.nn.functional as F
11 | import torch.distributed as dist
12 | from contextlib import nullcontext
13 |
14 | from torch import optim, nn
15 | from torch.nn.parallel import DistributedDataParallel
16 | from torch.utils.data import DataLoader, DistributedSampler
17 | from transformers import AutoTokenizer, AutoModelForCausalLM
18 | from model.model import MiniMindLM
19 | from model.LMConfig import LMConfig
20 | from model.dataset import DPODataset
21 |
22 | warnings.filterwarnings('ignore')
23 |
24 |
25 | def Logger(content):
26 | if not ddp or dist.get_rank() == 0:
27 | print(content)
28 |
29 |
30 | def get_lr(current_step, total_steps, lr):
31 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
32 |
33 |
34 | def logits_to_probs(logits, labels):
35 | # logits shape: (batch_size, seq_len, vocab_size)
36 | # labels shape: (batch_size, seq_len)
37 | # probs shape: (batch_size, seq_len)
38 | log_probs = F.log_softmax(logits, dim=2)
39 | probs = torch.gather(log_probs, dim=2, index=labels.unsqueeze(2)).squeeze(-1)
40 | return probs
41 |
42 |
43 | def dpo_loss(ref_probs, probs, mask, beta):
44 | # ref_probs and probs both have shape: (batch_size, seq_len)
45 | # https://github.com/jingyaogong/minimind/issues/298
46 | seq_lengths = mask.sum(dim=1, keepdim=True) # (batch_size, 1)
47 | ref_probs = (ref_probs * mask).sum(dim=1) / seq_lengths.squeeze()
48 | probs = (probs * mask).sum(dim=1) / seq_lengths.squeeze()
49 |
50 | # Split the batch into chosen and rejected halves
51 | batch_size = ref_probs.shape[0]
52 | chosen_ref_probs = ref_probs[:batch_size // 2]
53 | reject_ref_probs = ref_probs[batch_size // 2:]
54 | chosen_probs = probs[:batch_size // 2]
55 | reject_probs = probs[batch_size // 2:]
56 |
57 | pi_logratios = chosen_probs - reject_probs
58 | ref_logratios = chosen_ref_probs - reject_ref_probs
59 | logits = pi_logratios - ref_logratios
60 | loss = -F.logsigmoid(beta * logits)
61 | return loss.mean()
62 |
63 |
64 | def train_epoch(epoch, wandb):
65 | start_time = time.time()
66 | for step, batch in enumerate(train_loader):
67 | x_chosen = batch['x_chosen'].to(args.device)
68 | x_rejected = batch['x_rejected'].to(args.device)
69 | y_chosen = batch['y_chosen'].to(args.device)
70 | y_rejected = batch['y_rejected'].to(args.device)
71 | mask_chosen = batch['mask_chosen'].to(args.device)
72 | mask_rejected = batch['mask_rejected'].to(args.device)
73 | x = torch.cat([x_chosen, x_rejected], dim=0)
74 | y = torch.cat([y_chosen, y_rejected], dim=0)
75 | mask = torch.cat([mask_chosen, mask_rejected], dim=0)
76 |
77 | lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
78 | for param_group in optimizer.param_groups:
79 | param_group['lr'] = lr
80 |
81 | with ctx:
82 | with torch.no_grad():
83 | ref_outputs = ref_model(x)
84 | ref_logits = ref_outputs.logits
85 | ref_probs = logits_to_probs(ref_logits, y)
86 | ref_probs = ref_probs * mask
87 | outputs = model(x)
88 | logits = outputs.logits
89 | probs = logits_to_probs(logits, y)
90 | probs = probs * mask
91 | loss = dpo_loss(ref_probs, probs, mask, beta=0.1)
92 | loss = loss / args.accumulation_steps
93 |
94 | scaler.scale(loss).backward()
95 |
96 | if (step + 1) % args.accumulation_steps == 0:
97 | scaler.unscale_(optimizer)
98 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
99 | scaler.step(optimizer)
100 | scaler.update()
101 | optimizer.zero_grad(set_to_none=True)
102 |
103 | if step % args.log_interval == 0:
104 | spend_time = time.time() - start_time
105 | Logger(
106 | 'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
107 | epoch + 1,
108 | args.epochs,
109 | step,
110 | iter_per_epoch,
111 | loss.item(),
112 | optimizer.param_groups[-1]['lr'],
113 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
114 |
115 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
116 | wandb.log({"loss": loss,
117 | "lr": optimizer.param_groups[-1]['lr'],
118 | "epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
119 |
120 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
121 | model.eval()
122 | moe_path = '_moe' if lm_config.use_moe else ''
123 | ckp = f'{args.save_dir}/rlhf_{lm_config.dim}{moe_path}.pth'
124 |
125 | if isinstance(model, torch.nn.parallel.DistributedDataParallel):
126 | state_dict = model.module.state_dict()
127 | else:
128 | state_dict = model.state_dict()
129 |
130 | torch.save(state_dict, ckp)
131 | model.train()
132 |
133 |
134 | def init_model(lm_config):
135 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
136 | model = MiniMindLM(lm_config)
137 | moe_path = '_moe' if lm_config.use_moe else ''
138 | ckp = f'./out/full_sft_{lm_config.dim}{moe_path}.pth'
139 | state_dict = torch.load(ckp, map_location=args.device)
140 | model.load_state_dict(state_dict, strict=False)
141 | # Initialize the reference model
142 | ref_model = MiniMindLM(lm_config)
143 | ref_model.load_state_dict(state_dict, strict=False)
144 | ref_model.eval()
145 | ref_model.requires_grad_(False)
146 |
147 | Logger(f'LLM total parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
148 | model = model.to(args.device)
149 | ref_model = ref_model.to(args.device)
150 |
151 | return model, ref_model, tokenizer
152 |
153 |
154 | def init_distributed_mode():
155 | if not ddp: return
156 | global ddp_local_rank, DEVICE
157 |
158 | dist.init_process_group(backend="nccl")
159 | ddp_rank = int(os.environ["RANK"])
160 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
161 | ddp_world_size = int(os.environ["WORLD_SIZE"])
162 | DEVICE = f"cuda:{ddp_local_rank}"
163 | torch.cuda.set_device(DEVICE)
164 |
165 |
166 | if __name__ == "__main__":
167 | parser = argparse.ArgumentParser(description="MiniMind RLHF")
168 | parser.add_argument("--out_dir", type=str, default="out")
169 | parser.add_argument("--epochs", type=int, default=2)
170 | parser.add_argument("--batch_size", type=int, default=8)
171 | # The SFT stage uses lr 5e-6 -> 5e-7 at length 512; for offline preference alignment on chosen/rejected pairs at length 3000, lr <= 1e-8 is recommended, otherwise the model easily forgets and degrades
172 | parser.add_argument("--learning_rate", type=float, default=1e-8)
173 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
174 | parser.add_argument("--dtype", type=str, default="bfloat16")
175 | parser.add_argument("--use_wandb", action="store_true")
176 | parser.add_argument("--wandb_project", type=str, default="MiniMind-RLHF-SFT")
177 | parser.add_argument("--num_workers", type=int, default=1)
178 | parser.add_argument("--ddp", action="store_true")
179 | parser.add_argument("--accumulation_steps", type=int, default=1)
180 | parser.add_argument("--grad_clip", type=float, default=1.0)
181 | parser.add_argument("--warmup_iters", type=int, default=0)
182 | parser.add_argument("--log_interval", type=int, default=100)
183 | parser.add_argument("--save_interval", type=int, default=100)
184 | parser.add_argument('--local_rank', type=int, default=-1)
185 | parser.add_argument('--dim', default=512, type=int)
186 | parser.add_argument('--n_layers', default=8, type=int)
187 | parser.add_argument('--max_seq_len', default=1024, type=int)
188 | parser.add_argument('--use_moe', default=False, type=bool)
189 | parser.add_argument("--data_path", type=str, default="./dataset/dpo.jsonl")
190 |
191 | args = parser.parse_args()
192 |
193 | lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
194 | args.save_dir = os.path.join(args.out_dir)
195 | os.makedirs(args.save_dir, exist_ok=True)
196 | os.makedirs(args.out_dir, exist_ok=True)
197 | tokens_per_iter = args.batch_size * lm_config.max_seq_len
198 | device_type = "cuda" if "cuda" in args.device else "cpu"
199 |
200 | args.wandb_run_name = f"MiniMind-Full-DPO-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
201 |
202 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
203 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
204 | ddp_local_rank, DEVICE = 0, "cuda:0"
205 | base_seed = 1337
206 | torch.manual_seed(base_seed)
207 | torch.cuda.manual_seed(base_seed)
208 |
209 | if ddp:
210 | init_distributed_mode()
211 | args.device = torch.device(DEVICE)
212 | rank = dist.get_rank()
213 | torch.manual_seed(base_seed + rank)
214 | # Also set the CUDA random seed
215 | torch.cuda.manual_seed(base_seed + rank)
216 |
217 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
218 | import wandb
219 |
220 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
221 | else:
222 | wandb = None
223 |
224 | model, ref_model, tokenizer = init_model(lm_config)
225 |
226 | train_ds = DPODataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
227 | train_sampler = DistributedSampler(train_ds) if ddp else None
228 | train_loader = DataLoader(
229 | train_ds,
230 | batch_size=args.batch_size,
231 | pin_memory=True,
232 | drop_last=False,
233 | shuffle=False,
234 | num_workers=args.num_workers,
235 | sampler=train_sampler
236 | )
237 |
238 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
239 | optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
240 |
241 | if ddp:
242 | model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
243 | model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
244 |
245 | iter_per_epoch = len(train_loader)
246 | for epoch in range(args.epochs):
247 | train_epoch(epoch, wandb)
248 |
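A toy sketch of the DPO objective, reusing logits_to_probs and dpo_loss from above; the batch layout mirrors the training loop, where the first half of the batch holds chosen responses and the second half the rejected ones (sizes are illustrative):

import torch

batch, seq_len, vocab = 4, 8, 6400
y = torch.randint(0, vocab, (batch, seq_len))
mask = torch.ones(batch, seq_len)
policy_probs = logits_to_probs(torch.randn(batch, seq_len, vocab), y)
ref_probs = logits_to_probs(torch.randn(batch, seq_len, vocab), y)
print(dpo_loss(ref_probs, policy_probs, mask, beta=0.1))   # scalar preference loss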
--------------------------------------------------------------------------------
/train_full_sft.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | import argparse
4 | import time
5 | import math
6 | import warnings
7 |
8 | import pandas as pd
9 | import torch
10 | import torch.nn.functional as F
11 | import torch.distributed as dist
12 | from contextlib import nullcontext
13 |
14 | from torch import optim, nn
15 | from torch.nn.parallel import DistributedDataParallel
16 | from torch.utils.data import DataLoader, DistributedSampler
17 | from transformers import AutoTokenizer, AutoModelForCausalLM
18 | from model.model import MiniMindLM
19 | from model.LMConfig import LMConfig
20 | from model.dataset import SFTDataset
21 |
22 | warnings.filterwarnings('ignore')
23 |
24 |
25 | def Logger(content):
26 | if not ddp or dist.get_rank() == 0:
27 | print(content)
28 |
29 |
30 | def get_lr(current_step, total_steps, lr):
31 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
32 |
33 |
34 | def train_epoch(epoch, wandb):
35 | loss_fct = nn.CrossEntropyLoss(reduction='none')
36 | start_time = time.time()
37 | for step, (X, Y, loss_mask) in enumerate(train_loader):
38 | X = X.to(args.device)
39 | Y = Y.to(args.device)
40 | loss_mask = loss_mask.to(args.device)
41 | lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
42 | for param_group in optimizer.param_groups:
43 | param_group['lr'] = lr
44 |
45 | with ctx:
46 | res = model(X)
47 | loss = loss_fct(
48 | res.logits.view(-1, res.logits.size(-1)),
49 | Y.view(-1)
50 | ).view(Y.size())
51 |
52 | loss = (loss * loss_mask).sum() / loss_mask.sum()
53 | loss += res.aux_loss
54 | loss = loss / args.accumulation_steps
55 |
56 | scaler.scale(loss).backward()
57 |
58 | if (step + 1) % args.accumulation_steps == 0:
59 | scaler.unscale_(optimizer)
60 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
61 |
62 | scaler.step(optimizer)
63 | scaler.update()
64 |
65 | optimizer.zero_grad(set_to_none=True)
66 |
67 | if step % args.log_interval == 0:
68 | spend_time = time.time() - start_time
69 | Logger(
70 | 'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
71 | epoch + 1,
72 | args.epochs,
73 | step,
74 | iter_per_epoch,
75 | loss.item(),
76 | optimizer.param_groups[-1]['lr'],
77 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
78 |
79 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
80 | wandb.log({"loss": loss,
81 | "lr": optimizer.param_groups[-1]['lr'],
82 | "epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
83 |
84 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
85 | model.eval()
86 | moe_path = '_moe' if lm_config.use_moe else ''
87 | ckp = f'{args.save_dir}/full_sft_{lm_config.dim}{moe_path}.pth'
88 |
89 | if isinstance(model, torch.nn.parallel.DistributedDataParallel):
90 | state_dict = model.module.state_dict()
91 | else:
92 | state_dict = model.state_dict()
93 |
94 | torch.save(state_dict, ckp)
95 | model.train()
96 |
97 |
98 | def init_model(lm_config):
99 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
100 | model = MiniMindLM(lm_config)
101 | moe_path = '_moe' if lm_config.use_moe else ''
102 | ckp = f'./out/pretrain_{lm_config.dim}{moe_path}.pth'
103 | state_dict = torch.load(ckp, map_location=args.device)
104 | model.load_state_dict(state_dict, strict=False)
105 | Logger(f'LLM total parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
106 | model = model.to(args.device)
107 | return model, tokenizer
108 |
109 |
110 | def init_distributed_mode():
111 | if not ddp: return
112 | global ddp_local_rank, DEVICE
113 |
114 | dist.init_process_group(backend="nccl")
115 | ddp_rank = int(os.environ["RANK"])
116 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
117 | ddp_world_size = int(os.environ["WORLD_SIZE"])
118 | DEVICE = f"cuda:{ddp_local_rank}"
119 | torch.cuda.set_device(DEVICE)
120 |
121 |
122 | if __name__ == "__main__":
123 | parser = argparse.ArgumentParser(description="MiniMind Full SFT")
124 | parser.add_argument("--out_dir", type=str, default="out")
125 | parser.add_argument("--epochs", type=int, default=1)
126 | parser.add_argument("--batch_size", type=int, default=32)
127 | parser.add_argument("--learning_rate", type=float, default=5e-5)
128 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
129 | parser.add_argument("--dtype", type=str, default="bfloat16")
130 | parser.add_argument("--use_wandb", action="store_true")
131 | parser.add_argument("--wandb_project", type=str, default="MiniMind-Full-SFT")
132 | parser.add_argument("--num_workers", type=int, default=1)
133 | parser.add_argument("--ddp", action="store_true")
134 | parser.add_argument("--accumulation_steps", type=int, default=1)
135 | parser.add_argument("--grad_clip", type=float, default=1.0)
136 | parser.add_argument("--warmup_iters", type=int, default=0)
137 | parser.add_argument("--log_interval", type=int, default=100)
138 | parser.add_argument("--save_interval", type=int, default=100)
139 | parser.add_argument('--local_rank', type=int, default=-1)
140 | parser.add_argument('--dim', default=512, type=int)
141 | parser.add_argument('--n_layers', default=8, type=int)
142 | parser.add_argument('--max_seq_len', default=512, type=int)
143 | parser.add_argument('--use_moe', default=False, type=bool)
144 | parser.add_argument("--data_path", type=str, default="./dataset/sft_mini_512.jsonl")
145 |
146 | args = parser.parse_args()
147 |
148 | lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
149 | args.save_dir = os.path.join(args.out_dir)
150 | os.makedirs(args.save_dir, exist_ok=True)
151 | os.makedirs(args.out_dir, exist_ok=True)
152 | tokens_per_iter = args.batch_size * lm_config.max_seq_len
153 | device_type = "cuda" if "cuda" in args.device else "cpu"
154 |
155 | args.wandb_run_name = f"MiniMind-Full-SFT-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
156 |
157 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
158 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
159 | ddp_local_rank, DEVICE = 0, "cuda:0"
160 | base_seed = 1337
161 | torch.manual_seed(base_seed)
162 | torch.cuda.manual_seed(base_seed)
163 |
164 | if ddp:
165 | init_distributed_mode()
166 | args.device = torch.device(DEVICE)
167 | rank = dist.get_rank()
168 | torch.manual_seed(base_seed + rank)
169 | # Also set the CUDA random seed
170 | torch.cuda.manual_seed(base_seed + rank)
171 |
172 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
173 | import wandb
174 |
175 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
176 | else:
177 | wandb = None
178 |
179 | model, tokenizer = init_model(lm_config)
180 |
181 | train_ds = SFTDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
182 | train_sampler = DistributedSampler(train_ds) if ddp else None
183 | train_loader = DataLoader(
184 | train_ds,
185 | batch_size=args.batch_size,
186 | pin_memory=True,
187 | drop_last=False,
188 | shuffle=False,
189 | num_workers=args.num_workers,
190 | sampler=train_sampler
191 | )
192 |
193 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
194 | optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
195 |
196 | if ddp:
197 | model._ddp_params_and_buffers_to_ignore = {"pos_cis"}  # skip the precomputed RoPE buffer when DDP syncs params/buffers
198 | model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
199 |
200 | iter_per_epoch = len(train_loader)
201 | for epoch in range(args.epochs):
202 | train_epoch(epoch, wandb)
203 |
--------------------------------------------------------------------------------
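Note: train_lora.py and train_pretrain.py below drive the learning rate with the same hand-rolled cosine schedule (the get_lr helper), and the SFT loop updates the lr the same way each step; --warmup_iters is parsed but never used by this schedule. A minimal, self-contained sketch of the schedule's endpoints, using the SFT default of 5e-5 purely as example numbers:

import math

def get_lr(current_step, total_steps, lr):
    # lr/10 floor plus a half-amplitude cosine: 1.1*lr at step 0, 0.1*lr at the last step
    return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))

print(get_lr(0, 1000, 5e-5))     # 5.5e-05 (1.1x the base LR)
print(get_lr(500, 1000, 5e-5))   # 3.0e-05 (lr/10 + lr/2 at the midpoint)
print(get_lr(1000, 1000, 5e-5))  # 5.0e-06 (lr/10 at the end)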
/train_lora.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | import argparse
4 | import random
5 | import time
6 | import math
7 | import warnings
8 | import torch; from torch import optim, nn; import torch.distributed as dist  # explicit imports for torch/optim/nn used below, rather than relying on the star import
9 | from contextlib import nullcontext
10 | from torch.utils.data import DataLoader, DistributedSampler
11 | from transformers import AutoTokenizer, AutoModelForCausalLM
12 | from model.model import MiniMindLM
13 | from model.LMConfig import LMConfig
14 | from model.dataset import SFTDataset
15 | from model.model_lora import *
16 |
17 | warnings.filterwarnings('ignore')
18 |
19 |
20 | # Logger function
21 | def Logger(content):
22 | if not ddp or dist.get_rank() == 0:
23 | print(content)
24 |
25 |
26 | def get_lr(current_step, total_steps, lr):
27 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
28 |
29 |
30 | # The code is "almost" identical to full_sft
31 | def train_epoch(epoch, wandb):
32 | loss_fct = nn.CrossEntropyLoss(reduction='none')
33 | start_time = time.time()
34 | for step, (X, Y, loss_mask) in enumerate(train_loader):
35 | X = X.to(args.device)
36 | Y = Y.to(args.device)
37 | loss_mask = loss_mask.to(args.device)
38 | lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
39 | for param_group in optimizer.param_groups:
40 | param_group['lr'] = lr
41 |
42 | with ctx:
43 | res = model(X)
44 | loss = loss_fct(
45 | res.logits.view(-1, res.logits.size(-1)),
46 | Y.view(-1)
47 | ).view(Y.size())
48 | loss = (loss * loss_mask).sum() / loss_mask.sum()
49 | loss += res.aux_loss
50 | loss = loss / args.accumulation_steps
51 |
52 | scaler.scale(loss).backward()
53 |
54 | if (step + 1) % args.accumulation_steps == 0:
55 | scaler.unscale_(optimizer)
56 | torch.nn.utils.clip_grad_norm_(lora_params, args.grad_clip)
57 |
58 | scaler.step(optimizer)
59 | scaler.update()
60 |
61 | optimizer.zero_grad(set_to_none=True)
62 |
63 | if step % args.log_interval == 0:
64 | spend_time = time.time() - start_time
65 | Logger(
66 | 'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
67 | epoch + 1,
68 | args.epochs,
69 | step,
70 | iter_per_epoch,
71 | loss.item(),
72 | optimizer.param_groups[-1]['lr'],
73 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
74 |
75 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
76 | wandb.log({"loss": loss,
77 | "lr": optimizer.param_groups[-1]['lr'],
78 | "epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
79 |
80 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
81 | model.eval()
82 | # [Difference 1] Only the LoRA weights need to be saved
83 | save_lora(model, f'{args.save_dir}/lora/{args.lora_name}_{lm_config.dim}.pth')
84 | model.train()
85 |
86 |
87 | def init_model(lm_config):
88 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
89 | model = MiniMindLM(lm_config)
90 | moe_path = '_moe' if lm_config.use_moe else ''
91 | ckp = f'./out/rlhf_{lm_config.dim}{moe_path}.pth'
92 | state_dict = torch.load(ckp, map_location=args.device)
93 | model.load_state_dict(state_dict, strict=False)
94 | return model.to(args.device), tokenizer
95 |
96 |
97 | def init_distributed_mode():
98 | if not ddp: return
99 | global ddp_local_rank, DEVICE
100 |
101 | dist.init_process_group(backend="nccl")
102 | ddp_rank = int(os.environ["RANK"])
103 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
104 | ddp_world_size = int(os.environ["WORLD_SIZE"])
105 | DEVICE = f"cuda:{ddp_local_rank}"
106 | torch.cuda.set_device(DEVICE)
107 |
108 |
109 | if __name__ == "__main__":
110 | parser = argparse.ArgumentParser(description="MiniMind SFT with LoRA")
111 | parser.add_argument("--out_dir", type=str, default="out")
112 | parser.add_argument("--epochs", type=int, default=50)
113 | parser.add_argument("--batch_size", type=int, default=16)
114 | parser.add_argument("--learning_rate", type=float, default=5e-5)
115 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
116 | parser.add_argument("--dtype", type=str, default="bfloat16")
117 | parser.add_argument("--use_wandb", action="store_true")
118 | parser.add_argument("--wandb_project", type=str, default="MiniMind-LoRA-SFT")
119 | parser.add_argument("--num_workers", type=int, default=1)
120 | parser.add_argument("--ddp", action="store_true")
121 | parser.add_argument("--accumulation_steps", type=int, default=1)
122 | parser.add_argument("--grad_clip", type=float, default=1.0)
123 | parser.add_argument("--warmup_iters", type=int, default=0)
124 | parser.add_argument("--log_interval", type=int, default=100)
125 | parser.add_argument("--save_interval", type=int, default=1)
126 | parser.add_argument('--local_rank', type=int, default=-1)
127 | parser.add_argument('--dim', default=512, type=int)
128 | parser.add_argument('--n_layers', default=8, type=int)
129 | parser.add_argument('--max_seq_len', default=512, type=int)
130 | parser.add_argument('--use_moe', default=False, type=bool)
131 | parser.add_argument("--data_path", type=str, default="./dataset/lora_identity.jsonl")
132 | parser.add_argument("--lora_name", type=str, default="lora_identity", help="根据任务保存成lora_(英文/医学/心理...)")
133 | args = parser.parse_args()
134 |
135 | lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
136 | args.save_dir = os.path.join(args.out_dir)
137 | os.makedirs(os.path.join(args.save_dir, 'lora'), exist_ok=True)  # save_lora writes into the lora/ subdirectory
138 | os.makedirs(args.out_dir, exist_ok=True)
139 | tokens_per_iter = args.batch_size * lm_config.max_seq_len
140 | device_type = "cuda" if "cuda" in args.device else "cpu"
141 |
142 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
143 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
144 | ddp_local_rank, DEVICE = 0, "cuda:0"
145 | base_seed = 1337
146 | torch.manual_seed(base_seed)
147 | torch.cuda.manual_seed(base_seed)
148 |
149 | if ddp:
150 | init_distributed_mode()
151 | args.device = torch.device(DEVICE)
152 | rank = dist.get_rank()
153 | torch.manual_seed(base_seed + rank)
154 | # also seed the CUDA RNG
155 | torch.cuda.manual_seed(base_seed + rank)
156 |
157 | args.wandb_run_name = f"MiniMind-Lora-SFT-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
158 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
159 | import wandb
160 |
161 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
162 | else:
163 | wandb = None
164 |
165 | model, tokenizer = init_model(lm_config)
166 | apply_lora(model)
167 |
168 | total_params = sum(p.numel() for p in model.parameters())  # total number of parameters
169 | lora_params_count = sum(p.numel() for name, p in model.named_parameters() if 'lora' in name)  # number of LoRA parameters
170 | if not ddp or dist.get_rank() == 0:
171 | print(f"LLM 总参数量: {total_params}")
172 | print(f"LoRA 参数量: {lora_params_count}")
173 | print(f"LoRA 参数占比: {lora_params_count / total_params * 100:.2f}%")
174 |
175 | for name, param in model.named_parameters():
176 | if 'lora' not in name:
177 | param.requires_grad = False
178 | lora_params = []
179 | for name, param in model.named_parameters():
180 | if 'lora' in name:
181 | lora_params.append(param)
182 |
183 | # Optimize only the LoRA parameters
184 | optimizer = optim.AdamW(lora_params, lr=args.learning_rate)
185 | train_ds = SFTDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
186 | train_sampler = DistributedSampler(train_ds) if ddp else None
187 | train_loader = DataLoader(
188 | train_ds,
189 | batch_size=args.batch_size,
190 | pin_memory=True,
191 | drop_last=False,
192 | shuffle=False,
193 | num_workers=args.num_workers,
194 | sampler=train_sampler
195 | )
196 |
197 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
198 | iter_per_epoch = len(train_loader)
199 |
200 | for epoch in range(args.epochs):
201 | train_epoch(epoch, wandb)
202 |
--------------------------------------------------------------------------------
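Both train_lora.py above and train_pretrain.py below compute the training loss the same way: token-level cross-entropy with reduction='none', averaged only over positions where loss_mask is 1, plus the model's aux_loss (an MoE auxiliary term). A minimal sketch of the masking step, with toy shapes that are illustrative only:

import torch
from torch import nn

logits = torch.randn(2, 4, 10)          # (batch, seq_len, vocab_size)
targets = torch.randint(0, 10, (2, 4))  # (batch, seq_len) next-token ids
loss_mask = torch.tensor([[1., 1., 0., 0.],
                          [1., 1., 1., 0.]])  # 1 = this position counts toward the loss

loss_fct = nn.CrossEntropyLoss(reduction='none')
per_token = loss_fct(logits.view(-1, logits.size(-1)), targets.view(-1)).view(targets.size())
loss = (per_token * loss_mask).sum() / loss_mask.sum()  # mean over unmasked tokens only
print(loss.item())

In the SFT data pipeline the mask presumably excludes prompt and padding tokens so only response tokens are trained on; see model/dataset.py for how loss_mask is built.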
/train_pretrain.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | import argparse
4 | import time
5 | import math
6 | import warnings
7 | import pandas as pd
8 | import torch
9 | import torch.distributed as dist
10 | from torch import optim, nn
11 | from torch.nn.parallel import DistributedDataParallel
12 | from torch.optim.lr_scheduler import CosineAnnealingLR
13 | from torch.utils.data import DataLoader, DistributedSampler
14 | from contextlib import nullcontext
15 |
16 | from transformers import AutoTokenizer
17 |
18 | from model.model import MiniMindLM
19 | from model.LMConfig import LMConfig
20 | from model.dataset import PretrainDataset
21 |
22 | warnings.filterwarnings('ignore')
23 |
24 |
25 | def Logger(content):
26 | if not ddp or dist.get_rank() == 0:
27 | print(content)
28 |
29 |
30 | def get_lr(current_step, total_steps, lr):
31 | return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))
32 |
33 |
34 | def train_epoch(epoch, wandb):
35 | loss_fct = nn.CrossEntropyLoss(reduction='none')
36 | start_time = time.time()
37 | for step, (X, Y, loss_mask) in enumerate(train_loader):
38 | X = X.to(args.device)
39 | Y = Y.to(args.device)
40 | loss_mask = loss_mask.to(args.device)
41 |
42 | lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
43 | for param_group in optimizer.param_groups:
44 | param_group['lr'] = lr
45 |
46 | with ctx:
47 | res = model(X)
48 | loss = loss_fct(
49 | res.logits.view(-1, res.logits.size(-1)),
50 | Y.view(-1)
51 | ).view(Y.size())
52 | loss = (loss * loss_mask).sum() / loss_mask.sum()
53 | loss += res.aux_loss
54 | loss = loss / args.accumulation_steps
55 |
56 | scaler.scale(loss).backward()
57 |
58 | if (step + 1) % args.accumulation_steps == 0:
59 | scaler.unscale_(optimizer)
60 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
61 |
62 | scaler.step(optimizer)
63 | scaler.update()
64 |
65 | optimizer.zero_grad(set_to_none=True)
66 |
67 | if step % args.log_interval == 0:
68 | spend_time = time.time() - start_time
69 | Logger(
70 | 'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.12f} epoch_Time:{}min:'.format(
71 | epoch + 1,
72 | args.epochs,
73 | step,
74 | iter_per_epoch,
75 | loss.item() * args.accumulation_steps,
76 | optimizer.param_groups[-1]['lr'],
77 | spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
78 |
79 | if (wandb is not None) and (not ddp or dist.get_rank() == 0):
80 | wandb.log({"loss": loss.item() * args.accumulation_steps,
81 | "lr": optimizer.param_groups[-1]['lr'],
82 | "epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
83 |
84 | if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
85 | model.eval()
86 | moe_path = '_moe' if lm_config.use_moe else ''
87 | ckp = f'{args.save_dir}/pretrain_{lm_config.dim}{moe_path}.pth'
88 |
89 | if isinstance(model, torch.nn.parallel.DistributedDataParallel):
90 | state_dict = model.module.state_dict()
91 | else:
92 | state_dict = model.state_dict()
93 |
94 | torch.save(state_dict, ckp)
95 | model.train()
96 |
97 |
98 | def init_model(lm_config):
99 | tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
100 | model = MiniMindLM(lm_config).to(args.device)
101 | Logger(f'LLM total trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
102 | return model, tokenizer
103 |
104 |
105 | def init_distributed_mode():
106 | if not ddp: return
107 | global ddp_local_rank, DEVICE
108 |
109 | dist.init_process_group(backend="nccl")
110 | ddp_rank = int(os.environ["RANK"])
111 | ddp_local_rank = int(os.environ["LOCAL_RANK"])
112 | ddp_world_size = int(os.environ["WORLD_SIZE"])
113 | DEVICE = f"cuda:{ddp_local_rank}"
114 | torch.cuda.set_device(DEVICE)
115 |
116 |
117 | # torchrun --nproc_per_node 2 train_pretrain.py
118 | if __name__ == "__main__":
119 | parser = argparse.ArgumentParser(description="MiniMind Pretraining")
120 | parser.add_argument("--out_dir", type=str, default="out")
121 | # To get a working "zero" model as fast as possible, set epochs to 1; otherwise train for 2-6 epochs to make better use of the limited data.
122 | parser.add_argument("--epochs", type=int, default=1)
123 | parser.add_argument("--batch_size", type=int, default=32)
124 | parser.add_argument("--learning_rate", type=float, default=5e-4)
125 | parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
126 | parser.add_argument("--dtype", type=str, default="bfloat16")
127 | parser.add_argument("--use_wandb", action="store_true")
128 | parser.add_argument("--wandb_project", type=str, default="MiniMind-Pretrain")
129 | parser.add_argument("--num_workers", type=int, default=1)
130 | parser.add_argument("--ddp", action="store_true")
131 | parser.add_argument("--accumulation_steps", type=int, default=8)
132 | parser.add_argument("--grad_clip", type=float, default=1.0)
133 | parser.add_argument("--warmup_iters", type=int, default=0)
134 | parser.add_argument("--log_interval", type=int, default=100)
135 | parser.add_argument("--save_interval", type=int, default=100)
136 | parser.add_argument('--local_rank', type=int, default=-1)
137 | parser.add_argument('--dim', default=512, type=int)
138 | parser.add_argument('--n_layers', default=8, type=int)
139 | parser.add_argument('--max_seq_len', default=512, type=int)
140 | parser.add_argument('--use_moe', default=False, type=bool)
141 | parser.add_argument("--data_path", type=str, default="./dataset/pretrain_hq.jsonl")
142 | args = parser.parse_args()
143 |
144 | lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
145 | args.save_dir = os.path.join(args.out_dir)
146 | os.makedirs(args.save_dir, exist_ok=True)
147 | os.makedirs(args.out_dir, exist_ok=True)
148 | tokens_per_iter = args.batch_size * lm_config.max_seq_len
149 | device_type = "cuda" if "cuda" in args.device else "cpu"
150 |
151 | args.wandb_run_name = f"MiniMind-Pretrain-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
152 |
153 | ctx = nullcontext() if device_type == "cpu" else torch.cuda.amp.autocast()
154 |
155 | ddp = int(os.environ.get("RANK", -1)) != -1 # is this a ddp run?
156 | ddp_local_rank, DEVICE = 0, "cuda:0"
157 |
158 | base_seed = 1337
159 | torch.manual_seed(base_seed)
160 | torch.cuda.manual_seed(base_seed)
161 |
162 | if ddp:
163 | init_distributed_mode()
164 | args.device = torch.device(DEVICE)
165 | rank = dist.get_rank()
166 | torch.manual_seed(base_seed + rank)
167 | # also seed the CUDA RNG
168 | torch.cuda.manual_seed(base_seed + rank)
169 |
170 | if args.use_wandb and (not ddp or ddp_local_rank == 0):
171 | import wandb
172 |
173 | wandb.init(project=args.wandb_project, name=args.wandb_run_name)
174 | else:
175 | wandb = None
176 |
177 | model, tokenizer = init_model(lm_config)
178 | train_ds = PretrainDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
179 | train_sampler = DistributedSampler(train_ds) if ddp else None
180 | train_loader = DataLoader(
181 | train_ds,
182 | batch_size=args.batch_size,
183 | pin_memory=True,
184 | drop_last=False,
185 | shuffle=False,
186 | num_workers=args.num_workers,
187 | sampler=train_sampler
188 | )
189 |
190 | scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
191 | optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
192 |
193 | if ddp:
194 | model._ddp_params_and_buffers_to_ignore = {"pos_cis"}  # skip the precomputed RoPE buffer when DDP syncs params/buffers
195 | model = DistributedDataParallel(model, device_ids=[ddp_local_rank])
196 |
197 | iter_per_epoch = len(train_loader)
198 | for epoch in range(args.epochs):
199 | train_epoch(epoch, wandb)
200 |
--------------------------------------------------------------------------------
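For reference, train_pretrain.py saves its weights to f'{args.save_dir}/pretrain_{lm_config.dim}{moe_path}.pth', and the other scripts load such checkpoints by constructing MiniMindLM and calling load_state_dict (see init_model in train_lora.py above). A minimal sketch of reloading the pretraining checkpoint, assuming the default --out_dir, --dim and --n_layers and a non-MoE run:

import torch
from model.LMConfig import LMConfig
from model.model import MiniMindLM

lm_config = LMConfig(dim=512, n_layers=8, max_seq_len=512, use_moe=False)
model = MiniMindLM(lm_config)
state_dict = torch.load('./out/pretrain_512.pth', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # strict=False, mirroring init_model in the scripts above
model.eval()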