├── FAQ.md
├── MODEL_LICENSE
├── README.md
├── README2.md
├── README_EN.md
├── api.py
├── chatglm2PT
    ├── configuration_chatglm.py
    └── modelling_chatglm.py
├── cli_demo.py
├── evaluation
    ├── README.md
    └── evaluate_ceval.py
├── openai_api.py
├── ptuning
    ├── README.md
    ├── arguments.py
    ├── deepspeed.json
    ├── ds_train_finetune.sh
    ├── evaluate.sh
    ├── evaluate_finetune.sh
    ├── main.py
    ├── train.sh
    ├── train_chat.sh
    ├── trainer.py
    ├── trainer_seq2seq.py
    ├── web_demo.py
    └── web_demo.sh
├── requirements.txt
├── resources
    ├── WECHAT.md
    ├── cli-demo.png
    ├── knowledge.png
    ├── long-context.png
    ├── math.png
    ├── web-demo.gif
    ├── web-demo.png
    └── wechat.jpg
├── utils.py
├── web_demo.py
└── web_demo2.py

/FAQ.md:
--------------------------------------------------------------------------------
1 | ## Q1
2 | 
3 | **Loading the quantized model directly on a Mac fails with `clang: error: unsupported option '-fopenmp'`**
4 | 
5 | This happens because macOS does not ship with OpenMP. The model still runs, but only on a single core. Installing the OpenMP dependency separately enables OMP on the Mac:
6 | 
7 | ```bash
8 | # See `https://mac.r-project.org/openmp/`
9 | ## Assumption: gcc (clang) is version 14.x; for other versions, see the table provided by R-Project
10 | curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
11 | sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
12 | ```
13 | This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
14 | 
15 | > Note: if a previous attempt to run the `ChatGLM2-6B` project failed, it is best to clear the Hugging Face cache, which by default means running `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Because this uses the `rm` command, make sure you know exactly what you are deleting.
16 | 

/MODEL_LICENSE:
--------------------------------------------------------------------------------
1 | The ChatGLM-6B License
2 | 
3 | 一、定义
4 | 
5 | “许可方”是指分发其软件的 ChatGLM2-6B 模型团队。
6 | 
7 | “软件”是指根据本许可提供的 ChatGLM2-6B 模型参数。
8 | 
9 | 2. 
许可授予 10 | 11 | 根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可,仅用于您的非商业研究目的。 12 | 13 | 上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。 14 | 15 | 3.限制 16 | 17 | 您不得出于任何商业、军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。 18 | 19 | 您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。 20 | 21 | 4.免责声明 22 | 23 | 本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。 24 | 25 | 5. 责任限制 26 | 27 | 除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。 28 | 29 | 6.争议解决 30 | 31 | 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。 32 | 33 | 请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 glm-130b@googlegroups.com 与我们联系。 34 | 35 | 1. Definitions 36 | 37 | “Licensor” means the ChatGLM2-6B Model Team that distributes its Software. 38 | 39 | “Software” means the ChatGLM2-6B model parameters made available under this license. 40 | 41 | 2. License Grant 42 | 43 | Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes. 44 | 45 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 46 | 47 | 3. Restriction 48 | 49 | You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes. 50 | 51 | You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings. 52 | 53 | 4. 
Disclaimer 54 | 55 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 56 | 57 | 5. Limitation of Liability 58 | 59 | EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 60 | 61 | 6. Dispute Resolution 62 | 63 | This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing. 64 | 65 | Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at glm-130b@googlegroups.com. 
66 | 

/README.md:
--------------------------------------------------------------------------------
1 | # ChatGLM2-6B-Explained
2 | 
3 | A line-by-line annotated walkthrough of the ChatGLM2-6B code.
4 | Updated incrementally. Stars, forks, and pull requests are welcome.
5 | Note: `xxx` denotes a placeholder directory, not a valid one.
6 | 
7 | ## 
8 | This project mainly covers data flow, testing, and P-Tuning v2 fine-tuning. To understand how large models work, see [GLM-Explained](https://github.com/ArtificialZeng/GLM-Explained)
9 | 
10 | In addition, large models build on two very important foundational libraries, [transformers](https://github.com/ArtificialZeng/tranformers-expalined) and [pytorch](https://github.com/ArtificialZeng/pytorch-explained); line-by-line walkthroughs of the key code in both libraries are available as well.
11 | # ChatGLM2-6B-Explained
12 | 
13 | 
14 | 
15 | * [x/](./src)
16 | * [x/](./src/utils)
17 | * [main.py](./ptuning/main.py)
18 | * [train.sh parameter notes](./ptuning/train.sh)
19 | * [x.py](./src/train_sft.py)
20 | * [chatglm2PT](./chatglm2PT)
21 | * [/configuration_chatglm.py](./chatglm2PT/configuration_chatglm.py) This code defines a class named ChatGLMConfig for configuring and managing the ChatGLM model.
22 | * [/modelling_chatglm.py](./chatglm2PT/modelling_chatglm.py)
23 | * 
24 | * [x/](./examples)
25 | * [x.md](./examples/ads_generation.md)
26 | * [README.md](./README.md)
27 | 
28 | 
29 | # CSDN color blog version:
30 | * [ChatGLM1/2 source code analysis series: column index](https://blog.csdn.net/sinat_37574187/category_12365053.html)
31 | * [/src/utils/](./ChatGLM-Efficient-Tuning-Explained/src/utils)
32 | * [CSDN color source code analysis: main.py (part 1)](https://zengxiaojian.blog.csdn.net/article/details/131617133?spm=1001.2014.3001.5502)
33 | * [CSDN color source code analysis: main.py (part 2)](https://blog.csdn.net/sinat_37574187/article/details/131621397)
34 | * [ChatGLM2-6B source code analysis: web_demo.py](https://blog.csdn.net/sinat_37574187/article/details/131404024)
35 | * [README.md](./ChatGLM-Efficient-Tuning-Explained/README.md)
36 | 
37 | 
38 | ## References - source project
39 | 

/README2.md:
--------------------------------------------------------------------------------
1 | # ChatGLM2-6B
2 | 
3 |

4 | 🤗 HF Repo • 🐦 Twitter • 📃 [GLM@ACL 22] [GitHub] • 📃 [GLM-130B@ICLR 23] [GitHub]
5 |

6 |

7 | 👋 加入我们的 SlackWeChat 8 |

9 | 10 | *Read this in [English](README_EN.md)* 11 | 12 | ## 介绍 13 | 14 | ChatGLM**2**-6B 是开源中英双语对话模型 [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) 的第二代版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM**2**-6B 引入了如下新特性: 15 | 16 | 1. **更强大的性能**:基于 ChatGLM 初代模型的开发经验,我们全面升级了 ChatGLM2-6B 的基座模型。ChatGLM2-6B 使用了 [GLM](https://github.com/THUDM/GLM) 的混合目标函数,经过了 1.4T 中英标识符的预训练与人类偏好对齐训练,[评测结果](#评测结果)显示,相比于初代模型,ChatGLM2-6B 在 MMLU(+23%)、CEval(+33%)、GSM8K(+571%) 、BBH(+60%)等数据集上的性能取得了大幅度的提升,在同尺寸开源模型中具有较强的竞争力。 17 | 2. **更长的上下文**:基于 [FlashAttention](https://github.com/HazyResearch/flash-attention) 技术,我们将基座模型的上下文长度(Context Length)由 ChatGLM-6B 的 2K 扩展到了 32K,并在对话阶段使用 8K 的上下文长度训练,允许更多轮次的对话。但当前版本的 ChatGLM2-6B 对单轮超长文档的理解能力有限,我们会在后续迭代升级中着重进行优化。 18 | 3. **更高效的推理**:基于 [Multi-Query Attention](http://arxiv.org/abs/1911.02150) 技术,ChatGLM2-6B 有更高效的推理速度和更低的显存占用:在官方的模型实现下,推理速度相比初代提升了 42%,INT4 量化下,6G 显存支持的对话长度由 1K 提升到了 8K。 19 | 4. **更开放的协议**:ChatGLM2-6B 权重对学术研究**完全开放**,在获得官方的书面许可后,亦**允许商业使用**。如果您发现我们的开源模型对您的业务有用,我们欢迎您对下一代模型 ChatGLM3 研发的捐赠。 20 | 21 | ----- 22 | 23 | ChatGLM2-6B 开源模型旨在与开源社区一起推动大模型技术发展,恳请开发者和大家遵守[开源协议](MODEL_LICENSE),勿将开源模型和代码及基于开源项目产生的衍生物用于任何可能给国家和社会带来危害的用途以及用于任何未经过安全评估和备案的服务。**目前,本项目团队未基于 ChatGLM2-6B 开发任何应用,包括网页端、安卓、苹果 iOS 及 Windows App 等应用。** 24 | 25 | 尽管模型在训练的各个阶段都尽力确保数据的合规性和准确性,但由于 ChatGLM2-6B 模型规模较小,且模型受概率随机性因素影响,无法保证输出内容的准确性,且模型易被误导。**本项目不承担开源模型和代码导致的数据安全、舆情风险或发生任何模型被误导、滥用、传播、不当利用而产生的风险和责任。** 26 | 27 | ## 更新信息 28 | **[2023/07/04]** 发布 P-Tuning v2 与 全参数微调脚本,参见 [P-Tuning](./ptuning)。 29 | 30 | ## 友情链接 31 | 对 ChatGLM2 进行加速的开源项目: 32 | * [fastllm](https://github.com/ztxz16/fastllm/): 全平台加速推理方案,单GPU批量推理每秒可达10000+token,手机端最低3G内存实时运行(骁龙865上约4~5 token/s) 33 | * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): 类似 llama.cpp 的 CPU 量化加速推理方案,实现 Mac 笔记本上实时对话 34 | 35 | ## 评测结果 36 | 我们选取了部分中英文典型数据集进行了评测,以下为 ChatGLM2-6B 模型在 [MMLU](https://github.com/hendrycks/test) 
(英文)、[C-Eval](https://cevalbenchmark.com/static/leaderboard.html)(中文)、[GSM8K](https://github.com/openai/grade-school-math)(数学)、[BBH](https://github.com/suzgunmirac/BIG-Bench-Hard)(英文) 上的测评结果。在 [evaluation](./evaluation/README.md) 中提供了在 C-Eval 上进行测评的脚本。 37 | 38 | ### MMLU 39 | 40 | | Model | Average | STEM | Social Sciences | Humanities | Others | 41 | | ----- | ----- | ---- | ----- | ----- | ----- | 42 | | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 | 43 | | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 | 44 | | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 | 45 | 46 | > Chat 模型使用 zero-shot CoT (Chain-of-Thought) 的方法测试,Base 模型使用 few-shot answer-only 的方法测试 47 | 48 | ### C-Eval 49 | 50 | | Model | Average | STEM | Social Sciences | Humanities | Others | 51 | | ----- | ---- | ---- | ----- | ----- | ----- | 52 | | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 | 53 | | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 | 54 | | ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 | 55 | 56 | > Chat 模型使用 zero-shot CoT 的方法测试,Base 模型使用 few-shot answer only 的方法测试 57 | 58 | ### GSM8K 59 | 60 | | Model | Accuracy | Accuracy (Chinese)* | 61 | | ----- | ----- | ----- | 62 | | ChatGLM-6B | 4.82 | 5.85 | 63 | | ChatGLM2-6B (base) | 32.37 | 28.95 | 64 | | ChatGLM2-6B | 28.05 | 20.45 | 65 | 66 | > 所有模型均使用 few-shot CoT 的方法测试,CoT prompt 来自 http://arxiv.org/abs/2201.11903 67 | > 68 | > \* 我们使用翻译 API 翻译了 GSM8K 中的 500 道题目和 CoT prompt 并进行了人工校对 69 | 70 | 71 | ### BBH 72 | 73 | | Model | Accuracy | 74 | | ----- | ----- | 75 | | ChatGLM-6B | 18.73 | 76 | | ChatGLM2-6B (base) | 33.68 | 77 | | ChatGLM2-6B | 30.00 | 78 | 79 | > 所有模型均使用 few-shot CoT 的方法测试,CoT prompt 来自 https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts 80 | 81 | ## 推理性能 82 | ChatGLM2-6B 使用了 [Multi-Query Attention](http://arxiv.org/abs/1911.02150),提高了生成速度。生成 2000 个字符的平均速度对比如下 83 | 84 | | Model | 推理速度 (字符/秒) | 85 | | ---- | ----- | 86 | | ChatGLM-6B | 31.49 | 87 | | ChatGLM2-6B | 44.62 | 88 
| 89 | > 使用官方实现,batch size = 1,max length = 2048,bf16 精度,测试硬件为 A100-SXM4-80G,软件环境为 PyTorch 2.0.1 90 | 91 | Multi-Query Attention 同时也降低了生成过程中 KV Cache 的显存占用,此外,ChatGLM2-6B 采用 Causal Mask 进行对话训练,连续对话时可复用前面轮次的 KV Cache,进一步优化了显存占用。因此,使用 6GB 显存的显卡进行 INT4 量化的推理时,初代的 ChatGLM-6B 模型最多能够生成 1119 个字符就会提示显存耗尽,而 ChatGLM2-6B 能够生成至少 8192 个字符。 92 | 93 | | **量化等级** | **编码 2048 长度的最小显存** | **生成 8192 长度的最小显存** | 94 | | -------------- |---------------------|---------------------| 95 | | FP16 / BF16 | 13.1 GB | 12.8 GB | 96 | | INT8 | 8.2 GB | 8.1 GB | 97 | | INT4 | 5.5 GB | 5.1 GB | 98 | 99 | > ChatGLM2-6B 利用了 PyTorch 2.0 引入的 `torch.nn.functional.scaled_dot_product_attention` 实现高效的 Attention 计算,如果 PyTorch 版本较低则会 fallback 到朴素的 Attention 实现,出现显存占用高于上表的情况。 100 | 101 | 我们也测试了量化对模型性能的影响。结果表明,量化对模型性能的影响在可接受范围内。 102 | 103 | | 量化等级 | Accuracy (MMLU) | Accuracy (C-Eval dev) | 104 | | ----- | ----- |-----------------------| 105 | | BF16 | 45.47 | 53.57 | 106 | | INT4 | 43.13 | 50.30 | 107 | 108 | 109 | 110 | ## ChatGLM2-6B 示例 111 | 112 | 相比于初代模型,ChatGLM2-6B 多个维度的能力都取得了提升,以下是一些对比示例。更多 ChatGLM2-6B 的可能,等待你来探索发现! 113 | 114 |
数理逻辑 115 | 116 | ![](resources/math.png) 117 | 118 |
119 | 120 |
知识推理 121 | 122 | ![](resources/knowledge.png) 123 | 124 |
125 | 126 |
长文档理解 127 | 128 | ![](resources/long-context.png) 129 | 130 |
131 | 132 | ## 使用方式 133 | ### 环境安装 134 | 首先需要下载本仓库: 135 | ```shell 136 | git clone https://github.com/THUDM/ChatGLM2-6B 137 | cd ChatGLM2-6B 138 | ``` 139 | 140 | 然后使用 pip 安装依赖: 141 | ``` 142 | pip install -r requirements.txt 143 | ``` 144 | 其中 `transformers` 库版本推荐为 `4.30.2`,`torch` 推荐使用 2.0 及以上的版本,以获得最佳的推理性能。 145 | 146 | ### 代码调用 147 | 148 | 可以通过如下代码调用 ChatGLM2-6B 模型来生成对话: 149 | 150 | ```python 151 | >>> from transformers import AutoTokenizer, AutoModel 152 | >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 153 | >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda') 154 | >>> model = model.eval() 155 | >>> response, history = model.chat(tokenizer, "你好", history=[]) 156 | >>> print(response) 157 | 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。 158 | >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history) 159 | >>> print(response) 160 | 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法: 161 | 162 | 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。 163 | 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。 164 | 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。 165 | 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。 166 | 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。 167 | 6. 
尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。 168 | 169 | 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。 170 | ``` 171 | 172 | #### 从本地加载模型 173 | 以上代码会由 `transformers` 自动下载模型实现和参数。完整的模型实现在 [Hugging Face Hub](https://huggingface.co/THUDM/chatglm2-6b)。如果你的网络环境较差,下载模型参数可能会花费较长时间甚至失败。此时可以先将模型下载到本地,然后从本地加载。 174 | 175 | 从 Hugging Face Hub 下载模型需要先[安装Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage),然后运行 176 | ```Shell 177 | git clone https://huggingface.co/THUDM/chatglm2-6b 178 | ``` 179 | 180 | 如果你从 Hugging Face Hub 上下载 checkpoint 的速度较慢,可以只下载模型实现 181 | ```Shell 182 | GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b 183 | ``` 184 | 然后从[这里](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/)手动下载模型参数文件,并将下载的文件替换到本地的 `chatglm2-6b` 目录下。 185 | 186 | 187 | 将模型下载到本地之后,将以上代码中的 `THUDM/chatglm2-6b` 替换为你本地的 `chatglm2-6b` 文件夹的路径,即可从本地加载模型。 188 | 189 | 模型的实现仍然处在变动中。如果希望固定使用的模型实现以保证兼容性,可以在 `from_pretrained` 的调用中增加 `revision="v1.0"` 参数。`v1.0` 是当前最新的版本号,完整的版本列表参见 [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log)。 190 | 191 | ### 网页版 Demo 192 | 193 | ![web-demo](resources/web-demo.gif) 194 | 195 | 可以通过以下命令启动基于 Streamlit 的网页版 demo: 196 | ```shell 197 | streamlit run web_demo2.py 198 | ``` 199 | 200 | 程序会运行一个 Web Server,并输出地址。在浏览器中打开输出的地址即可使用。 201 | 202 | 203 | [web_demo.py](./web_demo.py) 中提供了旧版基于 Gradio 的 web demo,可以通过如下命令运行: 204 | ```shell 205 | python web_demo.py 206 | ``` 207 | 经测试,如果输入的 prompt 较长的话,使用基于 Streamlit 的网页版 Demo 会更流畅。 208 | 209 | ### 命令行 Demo 210 | 211 | ![cli-demo](resources/cli-demo.png) 212 | 213 | 运行仓库中 [cli_demo.py](cli_demo.py): 214 | 215 | ```shell 216 | python cli_demo.py 217 | ``` 218 | 219 | 程序会在命令行中进行交互式的对话,在命令行中输入指示并回车即可生成回复,输入 `clear` 可以清空对话历史,输入 `stop` 终止程序。 220 | 221 | ### API 部署 222 | 首先需要安装额外的依赖 `pip install fastapi uvicorn`,然后运行仓库中的 [api.py](api.py): 223 | ```shell 224 | python api.py 225 | ``` 226 | 默认部署在本地的 8000 端口,通过 POST 方法进行调用 227 
| ```shell 228 | curl -X POST "http://127.0.0.1:8000" \ 229 | -H 'Content-Type: application/json' \ 230 | -d '{"prompt": "你好", "history": []}' 231 | ``` 232 | 得到的返回值为 233 | ```shell 234 | { 235 | "response":"你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。", 236 | "history":[["你好","你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。"]], 237 | "status":200, 238 | "time":"2023-03-23 21:38:40" 239 | } 240 | ``` 241 | 感谢 [@hiyouga]() 实现了 OpenAI 格式的流式 API 部署,可以作为任意基于 ChatGPT 的应用的后端,比如 [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web)。可以通过运行仓库中的[openai_api.py](openai_api.py) 进行部署: 242 | ```shell 243 | python openai_api.py 244 | ``` 245 | 进行 API 调用的示例代码为 246 | ```python 247 | import openai 248 | if __name__ == "__main__": 249 | openai.api_base = "http://localhost:8000/v1" 250 | openai.api_key = "none" 251 | for chunk in openai.ChatCompletion.create( 252 | model="chatglm2-6b", 253 | messages=[ 254 | {"role": "user", "content": "你好"} 255 | ], 256 | stream=True 257 | ): 258 | if hasattr(chunk.choices[0].delta, "content"): 259 | print(chunk.choices[0].delta.content, end="", flush=True) 260 | ``` 261 | 262 | 263 | ## 低成本部署 264 | 265 | ### 模型量化 266 | 267 | 默认情况下,模型以 FP16 精度加载,运行上述代码需要大概 13GB 显存。如果你的 GPU 显存有限,可以尝试以量化方式加载模型,使用方法如下: 268 | 269 | ```python 270 | # 按需修改,目前只支持 4/8 bit 量化 271 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda() 272 | ``` 273 | 274 | 模型量化会带来一定的性能损失,经过测试,ChatGLM2-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。 275 | 276 | 如果你的内存不足,可以直接加载量化后的模型: 277 | ```python 278 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda() 279 | ``` 280 | 281 | 282 | 283 | ### CPU 部署 284 | 285 | 如果你没有 GPU 硬件的话,也可以在 CPU 上进行推理,但是推理速度会更慢。使用方法如下(需要大概 32GB 内存) 286 | ```python 287 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float() 288 | ``` 289 | 如果你的内存不足的话,也可以使用量化后的模型 290 | ```python 291 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float() 292 
| ``` 293 | 在 cpu 上运行量化后的模型需要安装 `gcc` 与 `openmp`。多数 Linux 发行版默认已安装。对于 Windows ,可在安装 [TDM-GCC](https://jmeubank.github.io/tdm-gcc/) 时勾选 `openmp`。 Windows 测试环境 `gcc` 版本为 `TDM-GCC 10.3.0`, Linux 为 `gcc 11.3.0`。在 MacOS 上请参考 [Q1](FAQ.md#q1)。 294 | 295 | ### Mac 部署 296 | 297 | 对于搭载了 Apple Silicon 或者 AMD GPU 的 Mac,可以使用 MPS 后端来在 GPU 上运行 ChatGLM2-6B。需要参考 Apple 的 [官方说明](https://developer.apple.com/metal/pytorch) 安装 PyTorch-Nightly(正确的版本号应该是2.x.x.dev2023xxxx,而不是 2.x.x)。 298 | 299 | 目前在 MacOS 上只支持[从本地加载模型](README.md#从本地加载模型)。将代码中的模型加载改为从本地加载,并使用 mps 后端: 300 | ```python 301 | model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps') 302 | ``` 303 | 304 | 加载半精度的 ChatGLM2-6B 模型需要大概 13GB 内存。内存较小的机器(比如 16GB 内存的 MacBook Pro),在空余内存不足的情况下会使用硬盘上的虚拟内存,导致推理速度严重变慢。 305 | 此时可以使用量化后的模型 chatglm2-6b-int4。因为 GPU 上量化的 kernel 是使用 CUDA 编写的,因此无法在 MacOS 上使用,只能使用 CPU 进行推理。 306 | 为了充分使用 CPU 并行,还需要[单独安装 OpenMP](FAQ.md#q1)。 307 | 308 | 在 Mac 上进行推理也可以使用 [ChatGLM.cpp](https://github.com/li-plus/chatglm.cpp) 309 | 310 | ### 多卡部署 311 | 如果你有多张 GPU,但是每张 GPU 的显存大小都不足以容纳完整的模型,那么可以将模型切分在多张GPU上。首先安装 accelerate: `pip install accelerate`,然后通过如下方法加载模型: 312 | ```python 313 | from utils import load_model_on_gpus 314 | model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2) 315 | ``` 316 | 即可将模型部署到两张 GPU 上进行推理。你可以将 `num_gpus` 改为你希望使用的 GPU 数。默认是均匀切分的,你也可以传入 `device_map` 参数来自己指定。 317 | 318 | ## 协议 319 | 320 | 本仓库的代码依照 [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) 协议开源,ChatGLM2-6B 模型的权重的使用则需要遵循 [Model License](MODEL_LICENSE)。ChatGLM2-6B 权重对学术研究**完全开放**,在获得官方的书面许可后,亦**允许商业使用**。如果您发现我们的开源模型对您的业务有用,我们欢迎您对下一代模型 ChatGLM3 研发的捐赠。申请商用许可与捐赠请联系 [license@zhipuai.cn](mailto:license@zhipuai.cn)。 321 | 322 | 323 | ## 引用 324 | 325 | 如果你觉得我们的工作有帮助的话,请考虑引用下列论文,ChatGLM2-6B 的论文会在近期公布,敬请期待~ 326 | 327 | ``` 328 | @article{zeng2022glm, 329 | title={Glm-130b: An open bilingual pre-trained model}, 330 | author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming 
and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others}, 331 | journal={arXiv preprint arXiv:2210.02414}, 332 | year={2022} 333 | } 334 | ``` 335 | ``` 336 | @inproceedings{du2022glm, 337 | title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling}, 338 | author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie}, 339 | booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, 340 | pages={320--335}, 341 | year={2022} 342 | } 343 | ``` 344 | -------------------------------------------------------------------------------- /README_EN.md: -------------------------------------------------------------------------------- 1 |

2 | 🤗 HF Repo • 🐦 Twitter • 📃 [GLM@ACL 22] [GitHub] • 📃 [GLM-130B@ICLR 23] [GitHub]
3 |

4 |

5 | 👋 Join our Slack and WeChat 6 |

7 | 8 | ## Introduction 9 | 10 | ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features: 11 | 12 | 1. **Stronger Performance**: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM), and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The [evaluation results](README.md#evaluation-results) show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size. 13 | 2. **Longer Context**: Based on [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations. 14 | 3. **More Efficient Inference**: Based on [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B has more efficient inference speed and lower GPU memory usage: under the official implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6G GPU memory has increased from 1K to 8K. 15 | 4. 
**More Open License**: The weights of ChatGLM2-6B are **fully open** to academic research, and with our official written permission, the weights of ChatGLM2-6B are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3. 16 | 17 | ----- 18 | 19 | The open-source ChatGLM2-6B is intended to promote the development of LLMs together with the open-source community. We earnestly request developers and everyone to abide by the [open-source license](MODEL_LICENSE). Do not use the open-source model, code, or any derivatives from the open-source project for any purposes that may harm nations or societies, or for any services that have not undergone safety assessments and legal approval. **At present, our project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows App applications.** 20 | 21 | Although the model strives to ensure the compliance and accuracy of data at each stage of training, due to the smaller scale of the ChatGLM2-6B model, and its susceptibility to probabilistic randomness, the accuracy of output content cannot be guaranteed, and the model can easily be misled. **Our project does not assume any risks or responsibilities arising from data security, public opinion risks, or any instances of the model being misled, abused, disseminated, or improperly used due to the open-source model and code.** 22 | 23 | ## Projects 24 | Open source projects that accelerate ChatGLM2: 25 | * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): Real-time CPU inference on a MacBook accelerated by quantization, similar to llama.cpp. 26 | 27 | ## Evaluation 28 | We selected some typical Chinese and English datasets for evaluation. 
Below are the evaluation results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). 29 | 30 | ### MMLU 31 | 32 | | Model | Average | STEM | Social Sciences | Humanities | Others | 33 | | ----- | ----- | ---- | ----- | ----- | ----- | 34 | | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 | 35 | | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 | 36 | | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 | 37 | 38 | > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only 39 | 40 | ### C-Eval 41 | 42 | | Model | Average | STEM | Social Sciences | Humanities | Others | 43 | | ----- | ---- | ---- | ----- | ----- | ----- | 44 | | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 | 45 | | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 | 46 | | ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 | 47 | 48 | > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only 49 | 50 | ### GSM8K 51 | 52 | | Model | Accuracy | Accuracy (Chinese)* | 53 | | ----- | ----- | ----- | 54 | | ChatGLM-6B | 4.82 | 5.85 | 55 | | ChatGLM2-6B (base) | 32.37 | 28.95 | 56 | | ChatGLM2-6B | 28.05 | 20.45 | 57 | 58 | > All model versions are evaluated under few-shot CoT, and CoT prompts are from http://arxiv.org/abs/2201.11903 59 | > \* We translate a 500-query subset of GSM8K and its corresponding CoT prompts using machine translation API and subsequent human proofreading. 
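The +571% GSM8K figure quoted in the introduction can be checked against the table above. The sketch below assumes that the figure compares ChatGLM2-6B (base) with ChatGLM-6B on the English accuracy column; the README does not say which pair of rows each headline percentage uses, and the helper name `relative_gain` is purely illustrative.

```python
def relative_gain(old: float, new: float) -> float:
    """Return the relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Accuracy values copied from the GSM8K table above.
chatglm_6b = 4.82          # first-generation model
chatglm2_6b_base = 32.37   # second-generation base model (assumed comparison row)

gsm8k_gain = relative_gain(chatglm_6b, chatglm2_6b_base)
print(f"GSM8K relative gain: {gsm8k_gain:.1f}%")
```

Under that assumption the result is a gain of roughly 571.6%, in line with the quoted +571%. The same helper can be applied to the other tables, although the row pairing behind each headline number would have to be checked separately.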
60 | 
61 | 
62 | ### BBH
63 | 
64 | | Model | Accuracy |
65 | | ----- | ----- |
66 | | ChatGLM-6B | 18.73 |
67 | | ChatGLM2-6B (base) | 33.68 |
68 | | ChatGLM2-6B | 30.00 |
69 | 
70 | > All model versions are evaluated under few-shot CoT, and CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
71 | 
72 | ## Inference Efficiency
73 | ChatGLM2-6B employs [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to improve inference speed. Here is a comparison of the average speed for generating 2000 tokens.
74 | 
75 | 
76 | | Model | Inference Speed (tokens/s) |
77 | | ---- | ----- |
78 | | ChatGLM-6B | 31.49 |
79 | | ChatGLM2-6B | 44.62 |
80 | 
81 | > Under our official implementation, batch size = 1, max length = 2048, bf16 precision, tested with an A100-SXM-80G and PyTorch 2.0 environment
82 | 
83 | Multi-Query Attention also reduces the GPU memory usage of the KV Cache during inference. Additionally, ChatGLM2-6B uses Causal Mask for dialogue training, which allows the KV Cache from previous rounds to be reused in continuous dialogues, further reducing GPU memory usage. Therefore, with INT4 quantization on a 6GB GPU, the first-generation ChatGLM-6B runs out of memory after generating at most 1119 tokens, whereas ChatGLM2-6B can generate at least 8192 tokens.
84 | 
85 | | **Quantization** | **Min. GPU Memory (Encoding 2048 Tokens)** | **Min. GPU Memory (Decoding 8192 Tokens)** |
86 | | -------------- | --------------------- | --------------- |
87 | | FP16 / BF16 | 13.1 GB | 12.8 GB |
88 | | INT8 | 8.2 GB | 8.1 GB |
89 | | INT4 | 5.5 GB | 5.1 GB |
90 | 
91 | > ChatGLM2-6B takes advantage of `torch.nn.functional.scaled_dot_product_attention` introduced in PyTorch 2.0 for efficient Attention computation. If the PyTorch version is lower, it will fall back to the naive Attention implementation, which may result in higher GPU memory usage than shown in the table above.
92 | 
93 | We also tested the impact of quantization on model performance.
The results show that the impact of quantization on model performance is within an acceptable range. 94 | 95 | | Quantization | Accuracy (MMLU) | Accuracy (C-Eval dev) | 96 | | ----- | ----- |-----------------------| 97 | | BF16 | 45.47 | 53.57 | 98 | | INT4 | 43.13 | 50.30 | 99 | 100 | 101 | ## ChatGLM2-6B Examples 102 | 103 | Compared to the first-generation model, ChatGLM2-6B has made improvements in multiple dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover! 104 | 105 |
Mathematics and Logic 106 | 107 | ![](resources/math.png) 108 | 109 |
110 | 111 |
Knowledge Reasoning 112 | 113 | ![](resources/knowledge.png) 114 | 115 |
116 | 117 |
Long Document Understanding 118 | 119 | ![](resources/long-context.png) 120 | 121 |
122 | 123 | ## Getting Started 124 | ### Environment Setup 125 | 126 | Install dependencies with pip: `pip install -r requirements.txt`. It's recommended to use version `4.27.1` for the `transformers` library and use version 2.0 or higher for `torch` to achieve the best inference performance. 127 | 128 | We provide a web page demo and a command line demo. You need to download this repository to use them: 129 | 130 | ```shell 131 | git clone https://github.com/THUDM/ChatGLM2-6B 132 | cd ChatGLM2-6B 133 | ``` 134 | 135 | ### Usage 136 | 137 | Generate dialogue with the following code: 138 | 139 | ```python 140 | >>> from transformers import AutoTokenizer, AutoModel 141 | >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 142 | >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda').eval() 143 | >>> response, history = model.chat(tokenizer, "你好", history=[]) 144 | >>> print(response) 145 | 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。 146 | >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history) 147 | >>> print(response) 148 | 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法: 149 | 150 | 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。 151 | 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。 152 | 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。 153 | 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。 154 | 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。 155 | 6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。 156 | 157 | 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。 158 | ``` 159 | The implementation of the model is still in development. If you want to fix the used model implementation to ensure compatibility, you can add the `revision="v1.0"` parameter in the `from_pretrained` call. `v1.0` is the latest version number. 
For a complete list of versions, see [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).
160 | 
161 | ### Web Demo
162 | 
163 | ![web-demo](resources/web-demo.gif)
164 | 
165 | Install Gradio with `pip install gradio`, then run [web_demo.py](web_demo.py):
166 | 
167 | ```shell
168 | python web_demo.py
169 | ```
170 | 
171 | The program runs a web server and outputs the URL. Open the URL in the browser to use the web demo.
172 | 
173 | ### CLI Demo
174 | 
175 | ![cli-demo](resources/cli-demo.png)
176 | 
177 | Run [cli_demo.py](cli_demo.py) in the repo:
178 | 
179 | ```shell
180 | python cli_demo.py
181 | ```
182 | 
183 | The command runs an interactive program in the shell. Type your instruction in the shell and hit enter to generate the response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
184 | 
185 | ## API Deployment
186 | First install the additional dependencies with `pip install fastapi uvicorn`. Then run [api.py](api.py) in the repo:
187 | ```shell
188 | python api.py
189 | ```
190 | By default the API listens on port `8000` of the local machine. You can call it with a POST request:
191 | ```shell
192 | curl -X POST "http://127.0.0.1:8000" \
193 |      -H 'Content-Type: application/json' \
194 |      -d '{"prompt": "你好", "history": []}'
195 | ```
196 | The returned value is
197 | ```shell
198 | {
199 |   "response":"你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。",
200 |   "history":[["你好","你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。"]],
201 |   "status":200,
202 |   "time":"2023-03-23 21:38:40"
203 | }
204 | ```
205 | ## Deployment
206 | 
207 | ### Quantization
208 | 
209 | By default, the model parameters are loaded with FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can try to load the model parameters with quantization:
210 | 
211 | ```python
212 | # Change according to your hardware. Only 4-bit and 8-bit quantization are supported for now.
213 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda() 214 | ``` 215 | 216 | Quantization costs some accuracy on benchmark datasets, but in our tests ChatGLM2-6B still generates natural, fluent text under 4-bit quantization. 217 | 218 | ### CPU Deployment 219 | 220 | If your computer has no GPU, you can also run inference on the CPU, but it is slow and requires about 32GB of memory: 221 | ```python 222 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float() 223 | ``` 224 | 225 | ### Inference on Mac 226 | 227 | For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM2-6B on the GPU. First, follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly. (The correct version number should be 2.1.0.dev2023xxxx, not 2.0.0). 228 | 229 | Currently you must [load the model locally](README_EN.md#load-the-model-locally) on macOS. Change the code to load the model from your local path, and use the mps backend: 230 | ```python 231 | model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps') 232 | ``` 233 | 234 | Loading an FP16 ChatGLM2-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook Pro with 16GB of memory) fall back to virtual memory on disk when free memory runs out, which slows inference down severely. 235 | 236 | ## License 237 | 238 | The code of this repository is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The use of the ChatGLM2-6B model weights is subject to the [Model License](MODEL_LICENSE). ChatGLM2-6B weights are **completely open** for academic research, and **commercial use** is also allowed after **obtaining official written permission**.
If you find our open source model useful for your business, we welcome your donations towards the development of the next generation model, ChatGLM3. For related matters, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn). 239 | 240 | ## Citation 241 | 242 | If you find our work useful, please consider citing the following papers. The technical report for ChatGLM2-6B will be out soon. 243 | 244 | ``` 245 | @article{zeng2022glm, 246 | title={Glm-130b: An open bilingual pre-trained model}, 247 | author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others}, 248 | journal={arXiv preprint arXiv:2210.02414}, 249 | year={2022} 250 | } 251 | ``` 252 | ``` 253 | @inproceedings{du2022glm, 254 | title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling}, 255 | author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie}, 256 | booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, 257 | pages={320--335}, 258 | year={2022} 259 | } 260 | ``` -------------------------------------------------------------------------------- /api.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI, Request 2 | from transformers import AutoTokenizer, AutoModel 3 | import uvicorn, json, datetime 4 | import torch 5 | 6 | DEVICE = "cuda" 7 | DEVICE_ID = "0" 8 | CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE 9 | 10 | 11 | def torch_gc(): 12 | if torch.cuda.is_available(): 13 | with torch.cuda.device(CUDA_DEVICE): 14 | torch.cuda.empty_cache() 15 | torch.cuda.ipc_collect() 16 | 17 | 18 | app = FastAPI() 19 | 20 | 21 | @app.post("/") 22 | async def create_item(request: Request): 23 | global model, tokenizer 24 | json_post_raw = await request.json() 25 | 
json_post = json.dumps(json_post_raw) 26 | json_post_list = json.loads(json_post) 27 | prompt = json_post_list.get('prompt') 28 | history = json_post_list.get('history') 29 | max_length = json_post_list.get('max_length') 30 | top_p = json_post_list.get('top_p') 31 | temperature = json_post_list.get('temperature') 32 | response, history = model.chat(tokenizer, 33 | prompt, 34 | history=history, 35 | max_length=max_length if max_length else 2048, 36 | top_p=top_p if top_p else 0.7, 37 | temperature=temperature if temperature else 0.95) 38 | now = datetime.datetime.now() 39 | time = now.strftime("%Y-%m-%d %H:%M:%S") 40 | answer = { 41 | "response": response, 42 | "history": history, 43 | "status": 200, 44 | "time": time 45 | } 46 | log = "[" + time + "] " + 'prompt:"' + prompt + '", response:"' + repr(response) + '"' 47 | print(log) 48 | torch_gc() 49 | return answer 50 | 51 | 52 | if __name__ == '__main__': 53 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 54 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda() 55 | # Multi-GPU support: use the three lines below instead of the two lines above, and set num_gpus to the number of GPUs you actually have 56 | # model_path = "THUDM/chatglm2-6b" 57 | # tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) 58 | # model = load_model_on_gpus(model_path, num_gpus=2) 59 | model.eval() 60 | uvicorn.run(app, host='0.0.0.0', port=8000, workers=1) 61 | -------------------------------------------------------------------------------- /chatglm2PT/configuration_chatglm.py: -------------------------------------------------------------------------------- 1 | # Import PretrainedConfig from the transformers library 2 | from transformers import PretrainedConfig 3 | 4 | # Define ChatGLMConfig, a subclass of PretrainedConfig 5 | class ChatGLMConfig(PretrainedConfig): 6 | # Model type identifier: "chatglm" 7 | model_type = "chatglm" 8 | 9 | # Constructor: sets the model's configuration parameters and their defaults 10 | def __init__( 11 | self, 12 | num_layers=28, # Number of layers in the model; defaults to 28 13 | padded_vocab_size=65024, # 
Vocabulary size; defaults to 65024 14 | hidden_size=4096, # Hidden size; defaults to 4096 15 | ffn_hidden_size=13696, # Hidden size of the feed-forward network; defaults to 13696 16 | kv_channels=128, # Number of key/value channels (per attention head); defaults to 128 17 | num_attention_heads=32, # Number of attention heads; defaults to 32 18 | seq_length=2048, # Sequence length; defaults to 2048 19 | hidden_dropout=0.0, # Dropout rate for hidden layers; defaults to 0 20 | attention_dropout=0.0, # Dropout rate for attention; defaults to 0 21 | layernorm_epsilon=1e-5, # Small constant used in layer normalization; defaults to 1e-5 22 | rmsnorm=True, # Whether to use RMS normalization; defaults to True 23 | apply_residual_connection_post_layernorm=False, # Whether to take the residual connection after LayerNorm; defaults to False 24 | post_layer_norm=True, # Whether to apply post-layer norm; defaults to True 25 | add_bias_linear=False, # Whether linear layers have a bias term; defaults to False 26 | add_qkv_bias=False, # Whether the query/key/value projection has a bias term; defaults to False 27 | bias_dropout_fusion=True, # Whether to fuse bias and dropout; defaults to True 28 | multi_query_attention=False, # Whether to use multi-query attention; defaults to False 29 | multi_query_group_num=1, # Number of multi-query groups; defaults to 1 30 | apply_query_key_layer_scaling=True, # Whether to apply query-key layer scaling; defaults to True 31 | attention_softmax_in_fp32=True, # Whether to compute the attention softmax in FP32; defaults to True 32 | fp32_residual_connection=False, # Whether to use FP32 in residual connections; defaults to False 33 | quantization_bit=0, # Quantization bit width; defaults to 0 34 | pre_seq_len=None, # Prefix sequence length; defaults to None 35 | prefix_projection=False, # Whether to apply the prefix projection; defaults to False 36 | **kwargs # Any other keyword arguments 37 | ): # ChatGLMConfig stores and manages the configuration of the ChatGLM model. 38 | 39 | 40 | self.num_layers = num_layers 41 | self.vocab_size = padded_vocab_size 42 | self.padded_vocab_size = padded_vocab_size 43 | self.hidden_size = hidden_size 44 | self.ffn_hidden_size = ffn_hidden_size 45 | self.kv_channels = kv_channels 46 | self.num_attention_heads = num_attention_heads 47 | self.seq_length = seq_length 48 | self.hidden_dropout = hidden_dropout 49 | self.attention_dropout = attention_dropout 50 | self.layernorm_epsilon = layernorm_epsilon 51 | self.rmsnorm = rmsnorm 52 | self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm 53 | self.post_layer_norm = post_layer_norm 54 | self.add_bias_linear = add_bias_linear 55 | 
self.add_qkv_bias = add_qkv_bias 56 | self.bias_dropout_fusion = bias_dropout_fusion 57 | self.multi_query_attention = multi_query_attention 58 | self.multi_query_group_num = multi_query_group_num 59 | self.apply_query_key_layer_scaling = apply_query_key_layer_scaling 60 | self.attention_softmax_in_fp32 = attention_softmax_in_fp32 61 | self.fp32_residual_connection = fp32_residual_connection 62 | self.quantization_bit = quantization_bit 63 | self.pre_seq_len = pre_seq_len 64 | self.prefix_projection = prefix_projection 65 | super().__init__(**kwargs) 66 | -------------------------------------------------------------------------------- /chatglm2PT/modelling_chatglm.py: -------------------------------------------------------------------------------- 1 | """ PyTorch ChatGLM model. """ 2 | 3 | import math 4 | import copy 5 | import warnings 6 | import re 7 | import sys 8 | 9 | import torch 10 | import torch.utils.checkpoint 11 | import torch.nn.functional as F 12 | from torch import nn 13 | from torch.nn import CrossEntropyLoss, LayerNorm # CrossEntropyLoss (loss function) and LayerNorm (layer normalization) from PyTorch's nn module 14 | from torch.nn.utils import skip_init # skip_init: a utility that constructs a module without running its weight initialization 15 | from typing import Optional, Tuple, Union, List, Callable, Dict, Any # Type-hint helpers for annotating variables, parameters, and return values 16 | 17 | from transformers.modeling_outputs import ( 18 | BaseModelOutputWithPast, 19 | CausalLMOutputWithPast, 20 | ) # Output classes BaseModelOutputWithPast and CausalLMOutputWithPast from Hugging Face transformers 21 | 22 | from transformers.modeling_utils import PreTrainedModel # PreTrainedModel: the base class of all pretrained models in Hugging Face transformers 23 | from transformers.utils import logging # logging: the transformers utility for creating and configuring loggers 24 | 25 | from transformers.generation.logits_process import LogitsProcessor # LogitsProcessor: processes and modifies logits during generation 26 | from
transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput # Four classes from the generation.utils module of Hugging Face transformers 27 | 28 | from .configuration_chatglm import ChatGLMConfig # ChatGLMConfig: the ChatGLM-specific configuration class, from configuration_chatglm in the current package 29 | 30 | 31 | # flags required to enable JIT (just-in-time) fusion kernels 32 | 33 | if sys.platform != 'darwin': # Skip on 'darwin', which identifies macOS. 34 | torch._C._jit_set_profiling_mode(False) # Disable the PyTorch JIT profiling mode. 35 | torch._C._jit_set_profiling_executor(False) # Disable the PyTorch JIT profiling executor. 36 | torch._C._jit_override_can_fuse_on_cpu(True) # Allow the JIT to fuse operations on CPU, merging several ops into one for efficiency. 37 | torch._C._jit_override_can_fuse_on_gpu(True) # Allow the JIT to fuse operations on GPU for efficiency. 38 | 39 | logger = logging.get_logger(__name__) # Create a logger named after the current module. 40 | 41 | _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM2-6B" # Name of the pretrained checkpoint used in documentation. 42 | _CONFIG_FOR_DOC = "ChatGLM6BConfig" # Name of the configuration class used in documentation. 43 | 44 | CHATGLM_6B_PRETRAINED_MODEL_ARCHIVE_LIST = [ 45 | "THUDM/chatglm2-6b", 46 | # See all ChatGLM models at https://huggingface.co/models?filter=chatglm (this list holds the names of the pretrained models) 47 | ] 48 | 49 | 50 | def default_init(cls, *args, **kwargs): # default_init takes a class (cls), positional arguments (*args), and keyword arguments (**kwargs). 51 | return cls(*args, **kwargs) # Construct an instance of cls and return it. 52 | 53 | class InvalidScoreLogitsProcessor(LogitsProcessor): # InvalidScoreLogitsProcessor, a subclass of LogitsProcessor. 54 | 55 | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor: # Takes input_ids (a long tensor of input IDs) and scores (a float tensor) and returns a float tensor. 56 | if torch.isnan(scores).any() or torch.isinf(scores).any(): # If any score is NaN or infinite, reset all scores to 0 and set the entry at index 5 to 5e4 (50000). 57 | scores.zero_() 58 | scores[..., 5] = 5e4 59 | return
scores # Return the processed scores. 60 | 61 | 62 | 63 | class PrefixEncoder(torch.nn.Module): # PrefixEncoder subclasses torch.nn.Module, the base class of all neural network modules in PyTorch. 64 | """ 65 | The torch.nn model to encode the prefix 66 | Input shape: (batch-size, prefix-length) 67 | Output shape: (batch-size, prefix-length, 2*layers*hidden) 68 | """ # The docstring describes what the class does and the shapes of its input and output. 69 | 70 | def __init__(self, config: ChatGLMConfig): # The constructor takes config, an instance of ChatGLMConfig. 71 | super().__init__() # Run the parent class constructor. 72 | self.prefix_projection = config.prefix_projection # Store the prefix_projection flag from the config on this instance. 73 | 74 | if self.prefix_projection: # If prefix projection is enabled, run the block below. 75 | # Use a two-layer MLP to encode the prefix 76 | kv_size = config.num_layers * config.kv_channels * config.multi_query_group_num * 2 # Total size of the keys and values across all layers. 77 | self.embedding = torch.nn.Embedding(config.pre_seq_len, kv_size) # Embedding table with config.pre_seq_len entries, each of dimension kv_size. 78 | 79 | self.trans = torch.nn.Sequential( # A sequential model made of two linear layers with a tanh activation in between. 80 | torch.nn.Linear(kv_size, config.hidden_size), 81 | torch.nn.Tanh(), 82 | torch.nn.Linear(config.hidden_size, kv_size) 83 | ) 84 | else: # If prefix projection is disabled, run this block instead. 85 | self.embedding = torch.nn.Embedding(config.pre_seq_len, 86 | config.num_layers * config.kv_channels * config.multi_query_group_num * 2) # Embedding table with config.pre_seq_len entries, each of dimension config.num_layers * config.kv_channels * config.multi_query_group_num * 2. 87 | 88 | 89 | def forward(self, prefix: torch.Tensor): # The forward method takes prefix, a torch.Tensor. 90 | 91 | if self.prefix_projection: # If prefix projection is enabled, run the block below. 92 | prefix_tokens = self.embedding(prefix) # Look up the prefix in the embedding table and store the embeddings in prefix_tokens. 93 | past_key_values = self.trans(prefix_tokens) # Pass the embeddings through self.trans, the two-layer MLP defined above, to obtain past_key_values. 94 | 95 | 
else: # If prefix projection is disabled, run this block instead. 96 | past_key_values = self.embedding(prefix) # Look up the prefix in the embedding table to obtain past_key_values directly. 97 | 98 | return past_key_values # Return past_key_values, the output of the module. 99 | 100 | 101 | 102 | def split_tensor_along_last_dim( 103 | tensor: torch.Tensor, 104 | num_partitions: int, 105 | contiguous_split_chunks: bool = False, 106 | ) -> List[torch.Tensor]: 107 | """Split a tensor along its last dimension. 108 | Arguments: 109 | tensor: input tensor. 110 | num_partitions: number of partitions to split the tensor 111 | contiguous_split_chunks: If True, make each chunk contiguous 112 | in memory. 113 | Returns: 114 | A list of Tensors 115 | """ 116 | # Get the size and dimension. 117 | last_dim = tensor.dim() - 1 118 | last_dim_size = tensor.size()[last_dim] // num_partitions 119 | # Split. 120 | tensor_list = torch.split(tensor, last_dim_size, dim=last_dim) 121 | # Note: torch.split does not create contiguous tensors by default. 122 | if contiguous_split_chunks: 123 | return tuple(chunk.contiguous() for chunk in tensor_list) 124 | 125 | return tensor_list 126 | 127 | 128 | class RotaryEmbedding(nn.Module): 129 | def __init__(self, dim, original_impl=False, device=None, dtype=None): 130 | super().__init__() 131 | inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim)) 132 | self.register_buffer("inv_freq", inv_freq) 133 | self.dim = dim 134 | self.original_impl = original_impl 135 | 136 | def forward_impl( 137 | self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000 138 | ): 139 | """Enhanced Transformer with Rotary Position Embedding. 140 | Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/ 141 | transformers/rope/__init__.py. MIT License: 142 | https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
143 | """ 144 | # $\Theta = \{\theta_i = 10000^{-\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]\}$ 145 | theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=dtype, device=device) / n_elem)) 146 | 147 | # Create position indexes `[0, 1, ..., seq_len - 1]` 148 | seq_idx = torch.arange(seq_len, dtype=dtype, device=device) 149 | 150 | # Calculate the product of position index and $\theta_i$ 151 | idx_theta = torch.outer(seq_idx, theta).float() 152 | 153 | cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1) 154 | 155 | # this is to mimic the behaviour of complex32, else we will get different results 156 | if dtype in (torch.float16, torch.bfloat16, torch.int8): 157 | cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half() 158 | return cache 159 | 160 | def forward(self, max_seq_len, offset=0): 161 | return self.forward_impl( 162 | max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device 163 | ) 164 | 165 | 166 | @torch.jit.script 167 | def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor: 168 | # x: [sq, b, np, hn] 169 | sq, b, np, hn = x.size(0), x.size(1), x.size(2), x.size(3) 170 | rot_dim = rope_cache.shape[-2] * 2 171 | x, x_pass = x[..., :rot_dim], x[..., rot_dim:] 172 | # truncate to support variable sizes 173 | rope_cache = rope_cache[:sq] 174 | xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2) 175 | rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2) 176 | x_out2 = torch.stack( 177 | [ 178 | xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1], 179 | xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1], 180 | ], 181 | -1, 182 | ) 183 | x_out2 = x_out2.flatten(3) 184 | return torch.cat((x_out2, x_pass), dim=-1) 185 | 186 | 187 | class RMSNorm(torch.nn.Module): 188 | def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs): 189 | super().__init__() 190 | self.weight = 
torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype)) 191 | self.eps = eps 192 | 193 | def forward(self, hidden_states: torch.Tensor): 194 | input_dtype = hidden_states.dtype 195 | variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True) 196 | hidden_states = hidden_states * torch.rsqrt(variance + self.eps) 197 | 198 | return (self.weight * hidden_states).to(input_dtype) 199 | 200 | 201 | class CoreAttention(torch.nn.Module): 202 | def __init__(self, config: ChatGLMConfig, layer_number): 203 | super(CoreAttention, self).__init__() 204 | 205 | self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling 206 | self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32 207 | if self.apply_query_key_layer_scaling: 208 | self.attention_softmax_in_fp32 = True 209 | self.layer_number = max(1, layer_number) 210 | 211 | projection_size = config.kv_channels * config.num_attention_heads 212 | 213 | # Per attention head and per partition values. 
214 | self.hidden_size_per_partition = projection_size 215 | self.hidden_size_per_attention_head = projection_size // config.num_attention_heads 216 | self.num_attention_heads_per_partition = config.num_attention_heads 217 | 218 | coeff = None 219 | self.norm_factor = math.sqrt(self.hidden_size_per_attention_head) 220 | if self.apply_query_key_layer_scaling: 221 | coeff = self.layer_number 222 | self.norm_factor *= coeff 223 | self.coeff = coeff 224 | 225 | self.attention_dropout = torch.nn.Dropout(config.attention_dropout) 226 | 227 | def forward(self, query_layer, key_layer, value_layer, attention_mask): 228 | pytorch_major_version = int(torch.__version__.split('.')[0]) 229 | if pytorch_major_version >= 2: 230 | query_layer, key_layer, value_layer = [k.permute(1, 2, 0, 3) for k in [query_layer, key_layer, value_layer]] 231 | if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]: 232 | context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, 233 | is_causal=True) 234 | else: 235 | if attention_mask is not None: 236 | attention_mask = ~attention_mask 237 | context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, 238 | attention_mask) 239 | context_layer = context_layer.permute(2, 0, 1, 3) 240 | new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,) 241 | context_layer = context_layer.reshape(*new_context_layer_shape) 242 | else: 243 | # Raw attention scores 244 | 245 | # [b, np, sq, sk] 246 | output_size = (query_layer.size(1), query_layer.size(2), query_layer.size(0), key_layer.size(0)) 247 | 248 | # [sq, b, np, hn] -> [sq, b * np, hn] 249 | query_layer = query_layer.view(output_size[2], output_size[0] * output_size[1], -1) 250 | # [sk, b, np, hn] -> [sk, b * np, hn] 251 | key_layer = key_layer.view(output_size[3], output_size[0] * output_size[1], -1) 252 | 253 | # preallocating input tensor: [b * np, sq, sk] 254 | 
matmul_input_buffer = torch.empty( 255 | output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype, 256 | device=query_layer.device 257 | ) 258 | 259 | # Raw attention scores. [b * np, sq, sk] 260 | matmul_result = torch.baddbmm( 261 | matmul_input_buffer, 262 | query_layer.transpose(0, 1), # [b * np, sq, hn] 263 | key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk] 264 | beta=0.0, 265 | alpha=(1.0 / self.norm_factor), 266 | ) 267 | 268 | # change view to [b, np, sq, sk] 269 | attention_scores = matmul_result.view(*output_size) 270 | 271 | # =========================== 272 | # Attention probs and dropout 273 | # =========================== 274 | 275 | # attention scores and attention mask [b, np, sq, sk] 276 | if self.attention_softmax_in_fp32: 277 | attention_scores = attention_scores.float() 278 | if self.coeff is not None: 279 | attention_scores = attention_scores * self.coeff 280 | if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]: 281 | attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3], 282 | device=attention_scores.device, dtype=torch.bool) 283 | attention_mask.tril_() 284 | attention_mask = ~attention_mask 285 | if attention_mask is not None: 286 | attention_scores = attention_scores.masked_fill(attention_mask, float("-inf")) 287 | attention_probs = F.softmax(attention_scores, dim=-1) 288 | attention_probs = attention_probs.type_as(value_layer) 289 | 290 | # This is actually dropping out entire tokens to attend to, which might 291 | # seem a bit unusual, but is taken from the original Transformer paper. 292 | attention_probs = self.attention_dropout(attention_probs) 293 | # ========================= 294 | # Context layer. [sq, b, hp] 295 | # ========================= 296 | 297 | # value_layer -> context layer. 
298 | # [sk, b, np, hn] --> [b, np, sq, hn] 299 | 300 | # context layer shape: [b, np, sq, hn] 301 | output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3)) 302 | # change view [sk, b * np, hn] 303 | value_layer = value_layer.view(value_layer.size(0), output_size[0] * output_size[1], -1) 304 | # change view [b * np, sq, sk] 305 | attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1) 306 | # matmul: [b * np, sq, hn] 307 | context_layer = torch.bmm(attention_probs, value_layer.transpose(0, 1)) 308 | # change view [b, np, sq, hn] 309 | context_layer = context_layer.view(*output_size) 310 | # [b, np, sq, hn] --> [sq, b, np, hn] 311 | context_layer = context_layer.permute(2, 0, 1, 3).contiguous() 312 | # [sq, b, np, hn] --> [sq, b, hp] 313 | new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,) 314 | context_layer = context_layer.view(*new_context_layer_shape) 315 | 316 | return context_layer 317 | 318 | 319 | class SelfAttention(torch.nn.Module): 320 | """Parallel self-attention layer abstract class. 321 | Self-attention layer takes input with size [s, b, h] 322 | and returns output of the same size. 323 | """ 324 | 325 | def __init__(self, config: ChatGLMConfig, layer_number, device=None): 326 | super(SelfAttention, self).__init__() 327 | self.layer_number = max(1, layer_number) 328 | 329 | self.projection_size = config.kv_channels * config.num_attention_heads 330 | 331 | # Per attention head and per partition values. 
332 | self.hidden_size_per_attention_head = self.projection_size // config.num_attention_heads 333 | self.num_attention_heads_per_partition = config.num_attention_heads 334 | 335 | self.multi_query_attention = config.multi_query_attention 336 | self.qkv_hidden_size = 3 * self.projection_size 337 | if self.multi_query_attention: 338 | self.num_multi_query_groups_per_partition = config.multi_query_group_num 339 | self.qkv_hidden_size = ( 340 | self.projection_size + 2 * self.hidden_size_per_attention_head * config.multi_query_group_num 341 | ) 342 | self.query_key_value = nn.Linear(config.hidden_size, self.qkv_hidden_size, 343 | bias=config.add_bias_linear or config.add_qkv_bias, 344 | device=device, **_config_to_kwargs(config) 345 | ) 346 | 347 | self.core_attention = CoreAttention(config, self.layer_number) 348 | 349 | # Output. 350 | self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear, 351 | device=device, **_config_to_kwargs(config) 352 | ) 353 | 354 | def _allocate_memory(self, inference_max_sequence_len, batch_size, device=None, dtype=None): 355 | if self.multi_query_attention: 356 | num_attention_heads = self.num_multi_query_groups_per_partition 357 | else: 358 | num_attention_heads = self.num_attention_heads_per_partition 359 | return torch.empty( 360 | inference_max_sequence_len, 361 | batch_size, 362 | num_attention_heads, 363 | self.hidden_size_per_attention_head, 364 | dtype=dtype, 365 | device=device, 366 | ) 367 | 368 | def forward( 369 | self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True 370 | ): 371 | # hidden_states: [sq, b, h] 372 | 373 | # ================================================= 374 | # Pre-allocate memory for key-values for inference. 
375 | # ================================================= 376 | # ===================== 377 | # Query, Key, and Value 378 | # ===================== 379 | 380 | # Attention heads [sq, b, h] --> [sq, b, (np * 3 * hn)] 381 | mixed_x_layer = self.query_key_value(hidden_states) 382 | 383 | if self.multi_query_attention: 384 | (query_layer, key_layer, value_layer) = mixed_x_layer.split( 385 | [ 386 | self.num_attention_heads_per_partition * self.hidden_size_per_attention_head, 387 | self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head, 388 | self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head, 389 | ], 390 | dim=-1, 391 | ) 392 | query_layer = query_layer.view( 393 | query_layer.size()[:-1] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head) 394 | ) 395 | key_layer = key_layer.view( 396 | key_layer.size()[:-1] + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head) 397 | ) 398 | value_layer = value_layer.view( 399 | value_layer.size()[:-1] 400 | + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head) 401 | ) 402 | else: 403 | new_tensor_shape = mixed_x_layer.size()[:-1] + \ 404 | (self.num_attention_heads_per_partition, 405 | 3 * self.hidden_size_per_attention_head) 406 | mixed_x_layer = mixed_x_layer.view(*new_tensor_shape) 407 | 408 | # [sq, b, np, 3 * hn] --> 3 [sq, b, np, hn] 409 | (query_layer, key_layer, value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3) 410 | 411 | # apply relative positional encoding (rotary embedding) 412 | if rotary_pos_emb is not None: 413 | query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb) 414 | key_layer = apply_rotary_pos_emb(key_layer, rotary_pos_emb) 415 | 416 | # adjust key and value for inference 417 | if kv_cache is not None: 418 | cache_k, cache_v = kv_cache 419 | key_layer = torch.cat((cache_k, key_layer), dim=0) 420 | value_layer = torch.cat((cache_v, value_layer), dim=0) 
421 | if use_cache: 422 | kv_cache = (key_layer, value_layer) 423 | else: 424 | kv_cache = None 425 | 426 | if self.multi_query_attention: 427 | key_layer = key_layer.unsqueeze(-2) 428 | key_layer = key_layer.expand( 429 | -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1 430 | ) 431 | key_layer = key_layer.contiguous().view( 432 | key_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head) 433 | ) 434 | value_layer = value_layer.unsqueeze(-2) 435 | value_layer = value_layer.expand( 436 | -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1 437 | ) 438 | value_layer = value_layer.contiguous().view( 439 | value_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head) 440 | ) 441 | 442 | # ================================== 443 | # core attention computation 444 | # ================================== 445 | 446 | context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask) 447 | 448 | # ================= 449 | # Output. [sq, b, h] 450 | # ================= 451 | 452 | output = self.dense(context_layer) 453 | 454 | return output, kv_cache 455 | 456 | 457 | def _config_to_kwargs(args): 458 | common_kwargs = { 459 | "dtype": args.torch_dtype, 460 | } 461 | return common_kwargs 462 | 463 | 464 | class MLP(torch.nn.Module): 465 | """MLP. 466 | MLP will take the input with h hidden state, project it to 4*h 467 | hidden dimension, perform nonlinear transformation, and project the 468 | state back into h hidden dimension. 469 | """ 470 | 471 | def __init__(self, config: ChatGLMConfig, device=None): 472 | super(MLP, self).__init__() 473 | 474 | self.add_bias = config.add_bias_linear 475 | 476 | # Project to 4h. 
If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf 477 | self.dense_h_to_4h = nn.Linear( 478 | config.hidden_size, 479 | config.ffn_hidden_size * 2, 480 | bias=self.add_bias, 481 | device=device, 482 | **_config_to_kwargs(config) 483 | ) 484 | 485 | def swiglu(x): 486 | x = torch.chunk(x, 2, dim=-1) 487 | return F.silu(x[0]) * x[1] 488 | 489 | self.activation_func = swiglu 490 | 491 | # Project back to h. 492 | self.dense_4h_to_h = nn.Linear( 493 | config.ffn_hidden_size, 494 | config.hidden_size, 495 | bias=self.add_bias, 496 | device=device, 497 | **_config_to_kwargs(config) 498 | ) 499 | 500 | def forward(self, hidden_states): 501 | # [s, b, 4hp] 502 | intermediate_parallel = self.dense_h_to_4h(hidden_states) 503 | intermediate_parallel = self.activation_func(intermediate_parallel) 504 | # [s, b, h] 505 | output = self.dense_4h_to_h(intermediate_parallel) 506 | return output 507 | 508 | 509 | class GLMBlock(torch.nn.Module): 510 | """A single transformer layer. 511 | Transformer layer takes input with size [s, b, h] and returns an 512 | output of the same size. 513 | """ 514 | 515 | def __init__(self, config: ChatGLMConfig, layer_number, device=None): 516 | super(GLMBlock, self).__init__() 517 | self.layer_number = layer_number 518 | 519 | self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm 520 | 521 | self.fp32_residual_connection = config.fp32_residual_connection 522 | 523 | LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm 524 | # Layernorm on the input data. 525 | self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device, 526 | dtype=config.torch_dtype) 527 | 528 | # Self attention. 
529 | self.self_attention = SelfAttention(config, layer_number, device=device) 530 | self.hidden_dropout = config.hidden_dropout 531 | 532 | # Layernorm on the attention output 533 | self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device, 534 | dtype=config.torch_dtype) 535 | 536 | # MLP 537 | self.mlp = MLP(config, device=device) 538 | 539 | def forward( 540 | self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True, 541 | ): 542 | # hidden_states: [s, b, h] 543 | 544 | # Layer norm at the beginning of the transformer layer. 545 | layernorm_output = self.input_layernorm(hidden_states) 546 | # Self attention. 547 | attention_output, kv_cache = self.self_attention( 548 | layernorm_output, 549 | attention_mask, 550 | rotary_pos_emb, 551 | kv_cache=kv_cache, 552 | use_cache=use_cache 553 | ) 554 | 555 | # Residual connection. 556 | if self.apply_residual_connection_post_layernorm: 557 | residual = layernorm_output 558 | else: 559 | residual = hidden_states 560 | 561 | layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training) 562 | layernorm_input = residual + layernorm_input 563 | 564 | # Layer norm post the self attention. 565 | layernorm_output = self.post_attention_layernorm(layernorm_input) 566 | 567 | # MLP. 568 | mlp_output = self.mlp(layernorm_output) 569 | 570 | # Second residual connection. 
571 | if self.apply_residual_connection_post_layernorm: 572 | residual = layernorm_output 573 | else: 574 | residual = layernorm_input 575 | 576 | output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training) 577 | output = residual + output 578 | 579 | return output, kv_cache 580 | 581 | 582 | class GLMTransformer(torch.nn.Module): 583 | """Transformer class.""" 584 | 585 | def __init__(self, config: ChatGLMConfig, device=None): 586 | super(GLMTransformer, self).__init__() 587 | 588 | self.fp32_residual_connection = config.fp32_residual_connection 589 | self.post_layer_norm = config.post_layer_norm 590 | 591 | # Number of layers. 592 | self.num_layers = config.num_layers 593 | 594 | # Transformer layers. 595 | def build_layer(layer_number): 596 | return GLMBlock(config, layer_number, device=device) 597 | 598 | self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)]) 599 | 600 | if self.post_layer_norm: 601 | LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm 602 | # Final layer norm before output. 603 | self.final_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device, 604 | dtype=config.torch_dtype) 605 | 606 | self.gradient_checkpointing = False 607 | 608 | def _get_layer(self, layer_number): 609 | return self.layers[layer_number] 610 | 611 | def forward( 612 | self, hidden_states, attention_mask, rotary_pos_emb, kv_caches=None, 613 | use_cache: Optional[bool] = True, 614 | output_hidden_states: Optional[bool] = False, 615 | ): 616 | if not kv_caches: 617 | kv_caches = [None for _ in range(self.num_layers)] 618 | presents = () if use_cache else None 619 | if self.gradient_checkpointing and self.training: 620 | if use_cache: 621 | logger.warning_once( 622 | "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
623 | ) 624 | use_cache = False 625 | 626 | all_self_attentions = None 627 | all_hidden_states = () if output_hidden_states else None 628 | for index in range(self.num_layers): 629 | if output_hidden_states: 630 | all_hidden_states = all_hidden_states + (hidden_states,) 631 | 632 | layer = self._get_layer(index) 633 | if self.gradient_checkpointing and self.training: 634 | layer_ret = torch.utils.checkpoint.checkpoint( 635 | layer, 636 | hidden_states, 637 | attention_mask, 638 | rotary_pos_emb, 639 | kv_caches[index], 640 | use_cache 641 | ) 642 | else: 643 | layer_ret = layer( 644 | hidden_states, 645 | attention_mask, 646 | rotary_pos_emb, 647 | kv_cache=kv_caches[index], 648 | use_cache=use_cache 649 | ) 650 | hidden_states, kv_cache = layer_ret 651 | if use_cache: 652 | presents = presents + (kv_cache,) 653 | 654 | if output_hidden_states: 655 | all_hidden_states = all_hidden_states + (hidden_states,) 656 | 657 | # Final layer norm. 658 | if self.post_layer_norm: 659 | hidden_states = self.final_layernorm(hidden_states) 660 | 661 | return hidden_states, presents, all_hidden_states, all_self_attentions 662 | 663 | 664 | class ChatGLMPreTrainedModel(PreTrainedModel): 665 | """ 666 | An abstract class to handle weights initialization and 667 | a simple interface for downloading and loading pretrained models. 
668 | """ 669 | 670 | is_parallelizable = False 671 | supports_gradient_checkpointing = True 672 | config_class = ChatGLMConfig 673 | base_model_prefix = "transformer" 674 | _no_split_modules = ["GLMBlock"] 675 | 676 | def _init_weights(self, module: nn.Module): 677 | """Initialize the weights.""" 678 | return 679 | 680 | def get_masks(self, input_ids, past_key_values, padding_mask=None): 681 | batch_size, seq_length = input_ids.shape 682 | full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device) 683 | full_attention_mask.tril_() 684 | past_length = 0 685 | if past_key_values: 686 | past_length = past_key_values[0][0].shape[0] 687 | if past_length: 688 | full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length, 689 | device=input_ids.device), full_attention_mask), dim=-1) 690 | if padding_mask is not None: 691 | full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1) 692 | if not past_length and padding_mask is not None: 693 | full_attention_mask -= padding_mask.unsqueeze(-1) - 1 694 | full_attention_mask = (full_attention_mask < 0.5).bool() 695 | full_attention_mask.unsqueeze_(1) 696 | return full_attention_mask 697 | 698 | def get_position_ids(self, input_ids, device): 699 | batch_size, seq_length = input_ids.shape 700 | position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1) 701 | return position_ids 702 | 703 | def _set_gradient_checkpointing(self, module, value=False): 704 | if isinstance(module, GLMTransformer): 705 | module.gradient_checkpointing = value 706 | 707 | 708 | class Embedding(torch.nn.Module): 709 | """Language model embeddings.""" 710 | 711 | def __init__(self, config: ChatGLMConfig, device=None): 712 | super(Embedding, self).__init__() 713 | 714 | self.hidden_size = config.hidden_size 715 | # Word embeddings (parallel). 
716 | self.word_embeddings = nn.Embedding( 717 | config.padded_vocab_size, 718 | self.hidden_size, 719 | dtype=config.torch_dtype, 720 | device=device 721 | ) 722 | self.fp32_residual_connection = config.fp32_residual_connection 723 | 724 | def forward(self, input_ids): 725 | # Embeddings. 726 | words_embeddings = self.word_embeddings(input_ids) 727 | embeddings = words_embeddings 728 | # Data format change to avoid explicit transposes: [b s h] --> [s b h]. 729 | embeddings = embeddings.transpose(0, 1).contiguous() 730 | # If the input flag for fp32 residual connection is set, convert to float. 731 | if self.fp32_residual_connection: 732 | embeddings = embeddings.float() 733 | return embeddings 734 | 735 | 736 | class ChatGLMModel(ChatGLMPreTrainedModel): 737 | def __init__(self, config: ChatGLMConfig, device=None, empty_init=True): 738 | super().__init__(config) 739 | if empty_init: 740 | init_method = skip_init 741 | else: 742 | init_method = default_init 743 | init_kwargs = {} 744 | if device is not None: 745 | init_kwargs["device"] = device 746 | self.embedding = init_method(Embedding, config, **init_kwargs) 747 | self.num_layers = config.num_layers 748 | self.multi_query_group_num = config.multi_query_group_num 749 | self.kv_channels = config.kv_channels 750 | 751 | # Rotary positional embeddings 752 | self.seq_length = config.seq_length 753 | rotary_dim = ( 754 | config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels 755 | ) 756 | 757 | self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, original_impl=config.original_rope, device=device, 758 | dtype=config.torch_dtype) 759 | self.encoder = init_method(GLMTransformer, config, **init_kwargs) 760 | self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False, 761 | dtype=config.torch_dtype, **init_kwargs) 762 | self.pre_seq_len = config.pre_seq_len 763 | self.prefix_projection = config.prefix_projection 764 | if
self.pre_seq_len is not None: 765 | for param in self.parameters(): 766 | param.requires_grad = False 767 | self.prefix_tokens = torch.arange(self.pre_seq_len).long() 768 | self.prefix_encoder = PrefixEncoder(config) 769 | self.dropout = torch.nn.Dropout(0.1) 770 | 771 | def get_input_embeddings(self): 772 | return self.embedding.word_embeddings 773 | 774 | def get_prompt(self, batch_size, device, dtype=torch.half): 775 | prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device) 776 | past_key_values = self.prefix_encoder(prefix_tokens).type(dtype) 777 | past_key_values = past_key_values.view( 778 | batch_size, 779 | self.pre_seq_len, 780 | self.num_layers * 2, 781 | self.multi_query_group_num, 782 | self.kv_channels 783 | ) 784 | # seq_len, b, nh, hidden_size 785 | past_key_values = self.dropout(past_key_values) 786 | past_key_values = past_key_values.permute([2, 1, 0, 3, 4]).split(2) 787 | return past_key_values 788 | 789 | def forward( 790 | self, 791 | input_ids, 792 | position_ids: Optional[torch.Tensor] = None, 793 | attention_mask: Optional[torch.BoolTensor] = None, 794 | full_attention_mask: Optional[torch.BoolTensor] = None, 795 | past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None, 796 | inputs_embeds: Optional[torch.Tensor] = None, 797 | use_cache: Optional[bool] = None, 798 | output_hidden_states: Optional[bool] = None, 799 | return_dict: Optional[bool] = None, 800 | ): 801 | output_hidden_states = ( 802 | output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states 803 | ) 804 | use_cache = use_cache if use_cache is not None else self.config.use_cache 805 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 806 | 807 | batch_size, seq_length = input_ids.shape 808 | 809 | if inputs_embeds is None: 810 | inputs_embeds = self.embedding(input_ids) 811 | 812 | if self.pre_seq_len is not None: 813 | if past_key_values is None: 814 | 
past_key_values = self.get_prompt(batch_size=batch_size, device=input_ids.device, 815 | dtype=inputs_embeds.dtype) 816 | if attention_mask is not None: 817 | attention_mask = torch.cat([attention_mask.new_ones((batch_size, self.pre_seq_len)), 818 | attention_mask], dim=-1) 819 | 820 | if full_attention_mask is None: 821 | if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1): 822 | full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask) 823 | 824 | # Rotary positional embeddings 825 | rotary_pos_emb = self.rotary_pos_emb(self.seq_length) 826 | if position_ids is not None: 827 | rotary_pos_emb = rotary_pos_emb[position_ids] 828 | else: 829 | rotary_pos_emb = rotary_pos_emb[None, :seq_length] 830 | rotary_pos_emb = rotary_pos_emb.transpose(0, 1).contiguous() 831 | 832 | # Run encoder. 833 | hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder( 834 | inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb, 835 | kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states 836 | ) 837 | 838 | if not return_dict: 839 | return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) 840 | 841 | return BaseModelOutputWithPast( 842 | last_hidden_state=hidden_states, 843 | past_key_values=presents, 844 | hidden_states=all_hidden_states, 845 | attentions=all_self_attentions, 846 | ) 847 | 848 | def quantize(self, weight_bit_width: int): 849 | from .quantization import quantize 850 | quantize(self.encoder, weight_bit_width) 851 | return self 852 | 853 | 854 | class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel): 855 | def __init__(self, config: ChatGLMConfig, empty_init=True, device=None): 856 | super().__init__(config) 857 | 858 | self.max_sequence_length = config.max_length 859 | self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device) 860 | self.config 
= config 861 | self.quantized = False 862 | 863 | if self.config.quantization_bit: 864 | self.quantize(self.config.quantization_bit, empty_init=True) 865 | 866 | def _update_model_kwargs_for_generation( 867 | self, 868 | outputs: ModelOutput, 869 | model_kwargs: Dict[str, Any], 870 | is_encoder_decoder: bool = False, 871 | standardize_cache_format: bool = False, 872 | ) -> Dict[str, Any]: 873 | # update past_key_values 874 | model_kwargs["past_key_values"] = self._extract_past_from_model_output( 875 | outputs, standardize_cache_format=standardize_cache_format 876 | ) 877 | 878 | # update attention mask 879 | if "attention_mask" in model_kwargs: 880 | attention_mask = model_kwargs["attention_mask"] 881 | model_kwargs["attention_mask"] = torch.cat( 882 | [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1 883 | ) 884 | 885 | # update position ids 886 | if "position_ids" in model_kwargs: 887 | position_ids = model_kwargs["position_ids"] 888 | new_position_id = position_ids[..., -1:].clone() 889 | new_position_id += 1 890 | model_kwargs["position_ids"] = torch.cat( 891 | [position_ids, new_position_id], dim=-1 892 | ) 893 | 894 | model_kwargs["is_first_forward"] = False 895 | return model_kwargs 896 | 897 | def prepare_inputs_for_generation( 898 | self, 899 | input_ids: torch.LongTensor, 900 | past_key_values: Optional[torch.Tensor] = None, 901 | attention_mask: Optional[torch.Tensor] = None, 902 | position_ids: Optional[torch.Tensor] = None, 903 | is_first_forward: bool = True, 904 | **kwargs 905 | ) -> dict: 906 | # only last token for input_ids if past is not None 907 | if position_ids is None: 908 | position_ids = self.get_position_ids(input_ids, device=input_ids.device) 909 | if not is_first_forward: 910 | position_ids = position_ids[..., -1:] 911 | input_ids = input_ids[:, -1:] 912 | return { 913 | "input_ids": input_ids, 914 | "past_key_values": past_key_values, 915 | "position_ids": position_ids, 916 | "attention_mask": 
attention_mask, 917 | "return_last_logit": True 918 | } 919 | 920 | def forward( 921 | self, 922 | input_ids: Optional[torch.Tensor] = None, 923 | position_ids: Optional[torch.Tensor] = None, 924 | attention_mask: Optional[torch.Tensor] = None, 925 | past_key_values: Optional[Tuple[torch.FloatTensor]] = None, 926 | inputs_embeds: Optional[torch.Tensor] = None, 927 | labels: Optional[torch.Tensor] = None, 928 | use_cache: Optional[bool] = None, 929 | output_attentions: Optional[bool] = None, 930 | output_hidden_states: Optional[bool] = None, 931 | return_dict: Optional[bool] = None, 932 | return_last_logit: Optional[bool] = False, 933 | ): 934 | use_cache = use_cache if use_cache is not None else self.config.use_cache 935 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 936 | 937 | transformer_outputs = self.transformer( 938 | input_ids=input_ids, 939 | position_ids=position_ids, 940 | attention_mask=attention_mask, 941 | past_key_values=past_key_values, 942 | inputs_embeds=inputs_embeds, 943 | use_cache=use_cache, 944 | output_hidden_states=output_hidden_states, 945 | return_dict=return_dict, 946 | ) 947 | 948 | hidden_states = transformer_outputs[0] 949 | if return_last_logit: 950 | hidden_states = hidden_states[-1:] 951 | lm_logits = self.transformer.output_layer(hidden_states) 952 | lm_logits = lm_logits.transpose(0, 1).contiguous() 953 | 954 | loss = None 955 | if labels is not None: 956 | lm_logits = lm_logits.to(torch.float32) 957 | 958 | # Shift so that tokens < n predict n 959 | shift_logits = lm_logits[..., :-1, :].contiguous() 960 | shift_labels = labels[..., 1:].contiguous() 961 | # Flatten the tokens 962 | loss_fct = CrossEntropyLoss(ignore_index=-100) 963 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 964 | 965 | lm_logits = lm_logits.to(hidden_states.dtype) 966 | loss = loss.to(hidden_states.dtype) 967 | 968 | if not return_dict: 969 | output = (lm_logits,) + 
transformer_outputs[1:] 970 | return ((loss,) + output) if loss is not None else output 971 | 972 | return CausalLMOutputWithPast( 973 | loss=loss, 974 | logits=lm_logits, 975 | past_key_values=transformer_outputs.past_key_values, 976 | hidden_states=transformer_outputs.hidden_states, 977 | attentions=transformer_outputs.attentions, 978 | ) 979 | 980 | @staticmethod 981 | def _reorder_cache( 982 | past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor 983 | ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]: 984 | """ 985 | This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or 986 | [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct 987 | beam_idx at every generation step. 988 | Output shares the same memory storage as `past`. 989 | """ 990 | return tuple( 991 | ( 992 | layer_past[0].index_select(1, beam_idx.to(layer_past[0].device)), 993 | layer_past[1].index_select(1, beam_idx.to(layer_past[1].device)), 994 | ) 995 | for layer_past in past 996 | ) 997 | 998 | def process_response(self, response): 999 | response = response.strip() 1000 | response = response.replace("[[训练时间]]", "2023年") 1001 | return response 1002 | 1003 | def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None): 1004 | prompt = tokenizer.build_prompt(query, history=history) 1005 | inputs = tokenizer([prompt], return_tensors="pt") 1006 | inputs = inputs.to(self.device) 1007 | return inputs 1008 | 1009 | def build_stream_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None): 1010 | if history: 1011 | prompt = "\n\n[Round {}]\n\n问:{}\n\n答:".format(len(history) + 1, query) 1012 | input_ids = tokenizer.encode(prompt, add_special_tokens=False) 1013 | input_ids = input_ids[1:] 1014 | inputs = tokenizer.batch_encode_plus([(input_ids, None)], return_tensors="pt", add_special_tokens=False) 1015 | else: 1016 | prompt = "[Round 
{}]\n\n问:{}\n\n答:".format(len(history) + 1, query) 1017 | inputs = tokenizer([prompt], return_tensors="pt") 1018 | inputs = inputs.to(self.device) 1019 | return inputs 1020 | 1021 | @torch.inference_mode() 1022 | def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 8192, num_beams=1, 1023 | do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None, **kwargs): 1024 | if history is None: 1025 | history = [] 1026 | if logits_processor is None: 1027 | logits_processor = LogitsProcessorList() 1028 | logits_processor.append(InvalidScoreLogitsProcessor()) 1029 | gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p, 1030 | "temperature": temperature, "logits_processor": logits_processor, **kwargs} 1031 | inputs = self.build_inputs(tokenizer, query, history=history) 1032 | outputs = self.generate(**inputs, **gen_kwargs) 1033 | outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):] 1034 | response = tokenizer.decode(outputs) 1035 | response = self.process_response(response) 1036 | history = history + [(query, response)] 1037 | return response, history 1038 | 1039 | @torch.inference_mode() 1040 | def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, past_key_values=None, 1041 | max_length: int = 8192, do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None, 1042 | return_past_key_values=False, **kwargs): 1043 | if history is None: 1044 | history = [] 1045 | if logits_processor is None: 1046 | logits_processor = LogitsProcessorList() 1047 | logits_processor.append(InvalidScoreLogitsProcessor()) 1048 | gen_kwargs = {"max_length": max_length, "do_sample": do_sample, "top_p": top_p, 1049 | "temperature": temperature, "logits_processor": logits_processor, **kwargs} 1050 | if past_key_values is None and not return_past_key_values: 1051 | inputs = self.build_inputs(tokenizer, query, history=history) 1052 | else: 1053 | inputs = 
self.build_stream_inputs(tokenizer, query, history=history) 1054 | if past_key_values is not None: 1055 | past_length = past_key_values[0][0].shape[0] 1056 | if self.transformer.pre_seq_len is not None: 1057 | past_length -= self.transformer.pre_seq_len 1058 | inputs.position_ids += past_length 1059 | attention_mask = inputs.attention_mask 1060 | attention_mask = torch.cat((attention_mask.new_ones(1, past_length), attention_mask), dim=1) 1061 | inputs['attention_mask'] = attention_mask 1062 | for outputs in self.stream_generate(**inputs, past_key_values=past_key_values, 1063 | return_past_key_values=return_past_key_values, **gen_kwargs): 1064 | if return_past_key_values: 1065 | outputs, past_key_values = outputs 1066 | outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):] 1067 | response = tokenizer.decode(outputs) 1068 | if response and response[-1] != "�": 1069 | response = self.process_response(response) 1070 | new_history = history + [(query, response)] 1071 | if return_past_key_values: 1072 | yield response, new_history, past_key_values 1073 | else: 1074 | yield response, new_history 1075 | 1076 | @torch.inference_mode() 1077 | def stream_generate( 1078 | self, 1079 | input_ids, 1080 | generation_config: Optional[GenerationConfig] = None, 1081 | logits_processor: Optional[LogitsProcessorList] = None, 1082 | stopping_criteria: Optional[StoppingCriteriaList] = None, 1083 | prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None, 1084 | return_past_key_values=False, 1085 | **kwargs, 1086 | ): 1087 | batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] 1088 | 1089 | if generation_config is None: 1090 | generation_config = self.generation_config 1091 | generation_config = copy.deepcopy(generation_config) 1092 | model_kwargs = generation_config.update(**kwargs) 1093 | bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id 1094 | 1095 | if isinstance(eos_token_id, int): 1096 
| eos_token_id = [eos_token_id] 1097 | 1098 | has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None 1099 | if has_default_max_length and generation_config.max_new_tokens is None: 1100 | warnings.warn( 1101 | f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. " 1102 | "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we" 1103 | " recommend using `max_new_tokens` to control the maximum length of the generation.", 1104 | UserWarning, 1105 | ) 1106 | elif generation_config.max_new_tokens is not None: 1107 | generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length 1108 | if not has_default_max_length: 1109 | logger.warning( 1110 | f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(=" 1111 | f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. " 1112 | "Please refer to the documentation for more information. " 1113 | "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)", 1114 | # No `UserWarning` argument here: logger methods treat extra positional args as %-format values. 1115 | ) 1116 | 1117 | if input_ids_seq_length >= generation_config.max_length: 1118 | input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids" 1119 | logger.warning( 1120 | f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to" 1121 | f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider" 1122 | " increasing `max_new_tokens`." 1123 | ) 1124 | 1125 | # 2.
Set generation parameters if not already defined 1126 | logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList() 1127 | stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() 1128 | 1129 | logits_processor = self._get_logits_processor( 1130 | generation_config=generation_config, 1131 | input_ids_seq_length=input_ids_seq_length, 1132 | encoder_input_ids=input_ids, 1133 | prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, 1134 | logits_processor=logits_processor, 1135 | ) 1136 | 1137 | stopping_criteria = self._get_stopping_criteria( 1138 | generation_config=generation_config, stopping_criteria=stopping_criteria 1139 | ) 1140 | logits_warper = self._get_logits_warper(generation_config) 1141 | 1142 | unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1) 1143 | scores = None 1144 | while True: 1145 | model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) 1146 | # forward pass to get next token 1147 | outputs = self( 1148 | **model_inputs, 1149 | return_dict=True, 1150 | output_attentions=False, 1151 | output_hidden_states=False, 1152 | ) 1153 | 1154 | next_token_logits = outputs.logits[:, -1, :] 1155 | 1156 | # pre-process distribution 1157 | next_token_scores = logits_processor(input_ids, next_token_logits) 1158 | next_token_scores = logits_warper(input_ids, next_token_scores) 1159 | 1160 | # sample 1161 | probs = nn.functional.softmax(next_token_scores, dim=-1) 1162 | if generation_config.do_sample: 1163 | next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) 1164 | else: 1165 | next_tokens = torch.argmax(probs, dim=-1) 1166 | 1167 | # update generated ids, model inputs, and length for next step 1168 | input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1) 1169 | model_kwargs = self._update_model_kwargs_for_generation( 1170 | outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder 1171 | ) 1172 | 
unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long()) 1173 | if return_past_key_values: 1174 | yield input_ids, outputs.past_key_values 1175 | else: 1176 | yield input_ids 1177 | # stop when each sentence is finished, or if we exceed the maximum length 1178 | if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores): 1179 | break 1180 | 1181 | def quantize(self, bits: int, empty_init=False, device=None, **kwargs): 1182 | if bits == 0: 1183 | return 1184 | 1185 | from .quantization import quantize 1186 | 1187 | if self.quantized: 1188 | logger.info("Already quantized.") 1189 | return self 1190 | 1191 | self.quantized = True 1192 | 1193 | self.config.quantization_bit = bits 1194 | 1195 | self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device, 1196 | **kwargs) 1197 | return self 1198 | -------------------------------------------------------------------------------- /cli_demo.py: -------------------------------------------------------------------------------- 1 | import os 2 | import platform 3 | import signal 4 | from transformers import AutoTokenizer, AutoModel 5 | import readline 6 | 7 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 8 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda() 9 | # Multi-GPU support: use the two lines below instead of the line above, and set num_gpus to the number of GPUs you actually have 10 | # from utils import load_model_on_gpus 11 | # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2) 12 | model = model.eval() 13 | 14 | os_name = platform.system() 15 | clear_command = 'cls' if os_name == 'Windows' else 'clear' 16 | stop_stream = False 17 | 18 | 19 | def build_prompt(history): 20 | prompt = "欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序" 21 | for query, response in history: 22 | prompt += f"\n\n用户:{query}" 23 | prompt += f"\n\nChatGLM2-6B:{response}" 24 | return prompt 25 | 26 | 27 | def signal_handler(signal,
frame): 28 | global stop_stream 29 | stop_stream = True 30 | 31 | 32 | def main(): 33 | past_key_values, history = None, [] 34 | global stop_stream 35 | print("欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序") 36 | while True: 37 | query = input("\n用户:") 38 | if query.strip() == "stop": 39 | break 40 | if query.strip() == "clear": 41 | past_key_values, history = None, [] 42 | os.system(clear_command) 43 | print("欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序") 44 | continue 45 | print("\nChatGLM:", end="") 46 | current_length = 0 47 | for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history, 48 | past_key_values=past_key_values, 49 | return_past_key_values=True): 50 | if stop_stream: 51 | stop_stream = False 52 | break 53 | else: 54 | print(response[current_length:], end="", flush=True) 55 | current_length = len(response) 56 | print("") 57 | 58 | 59 | if __name__ == "__main__": 60 | main() 61 | -------------------------------------------------------------------------------- /evaluation/README.md: -------------------------------------------------------------------------------- 1 | First, download the preprocessed C-Eval dataset from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/e84444333b6d434ea7b0) and extract it into the `evaluation` directory. Then run 2 | 3 | ```shell 4 | cd evaluation 5 | python evaluate_ceval.py 6 | ``` 7 | 8 | This script runs predictions on the C-Eval validation set and prints the accuracy. To get results on the test set, change `./CEval/val/**/*.jsonl` in the code to `./CEval/test/**/*.jsonl`, save the results in the format required by C-Eval, and submit them on the [official website](https://cevalbenchmark.com/). 9 | 10 | The reported results were obtained with an internal parallel evaluation framework, so your results may differ slightly. -------------------------------------------------------------------------------- /evaluation/evaluate_ceval.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import re 4 | import json 5 | import torch 6 | import torch.utils.data 7 | from transformers import AutoTokenizer, AutoModel 8 | from tqdm import tqdm 9 | 10 | tokenizer =
AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 11 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).bfloat16().cuda() 12 | 13 | choices = ["A", "B", "C", "D"] 14 | choice_tokens = [tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices] 15 | 16 | 17 | def build_prompt(text): 18 | return "[Round {}]\n\n问:{}\n\n答:".format(1, text) 19 | 20 | 21 | extraction_prompt = '综上所述,ABCD中正确的选项是:' 22 | 23 | accuracy_dict, count_dict = {}, {} 24 | with torch.no_grad(): 25 | for entry in glob.glob("./CEval/val/**/*.jsonl", recursive=True): 26 | dataset = [] 27 | with open(entry, encoding='utf-8') as file: 28 | for line in file: 29 | dataset.append(json.loads(line)) 30 | correct = 0 31 | dataloader = torch.utils.data.DataLoader(dataset, batch_size=8) 32 | for batch in tqdm(dataloader): 33 | texts = batch["inputs_pretokenized"] 34 | queries = [build_prompt(query) for query in texts] 35 | inputs = tokenizer(queries, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda') 36 | outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512) 37 | intermediate_outputs = [] 38 | for idx in range(len(outputs)): 39 | output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):] 40 | response = tokenizer.decode(output) 41 | intermediate_outputs.append(response) 42 | answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in 43 | zip(texts, intermediate_outputs)] 44 | input_tokens = [build_prompt(answer_text) for answer_text in answer_texts] 45 | inputs = tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda') 46 | outputs = model(**inputs, return_last_logit=True) 47 | logits = outputs.logits[:, -1] 48 | logits = logits[:, choice_tokens] 49 | preds = logits.argmax(dim=-1) 50 | correct += (preds.cpu() == batch["label"]).sum().item() 51 | accuracy = correct / len(dataset) 52 | print(entry, accuracy) 53 | 
accuracy_dict[entry] = accuracy 54 | count_dict[entry] = len(dataset) 55 | 56 | acc_total, count_total = 0.0, 0 57 | for key in accuracy_dict: 58 | acc_total += accuracy_dict[key] * count_dict[key] 59 | count_total += count_dict[key] 60 | print(acc_total / count_total) -------------------------------------------------------------------------------- /openai_api.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Implements API for ChatGLM2-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat) 3 | # Usage: python openai_api.py 4 | # Visit http://localhost:8000/docs for documents. 5 | 6 | 7 | import time 8 | import torch 9 | import uvicorn 10 | from pydantic import BaseModel, Field 11 | from fastapi import FastAPI, HTTPException 12 | from fastapi.middleware.cors import CORSMiddleware 13 | from contextlib import asynccontextmanager 14 | from typing import Any, Dict, List, Literal, Optional, Union 15 | from transformers import AutoTokenizer, AutoModel 16 | from sse_starlette.sse import ServerSentEvent, EventSourceResponse 17 | 18 | 19 | @asynccontextmanager 20 | async def lifespan(app: FastAPI): # collects GPU memory 21 | yield 22 | if torch.cuda.is_available(): 23 | torch.cuda.empty_cache() 24 | torch.cuda.ipc_collect() 25 | 26 | 27 | app = FastAPI(lifespan=lifespan) 28 | 29 | app.add_middleware( 30 | CORSMiddleware, 31 | allow_origins=["*"], 32 | allow_credentials=True, 33 | allow_methods=["*"], 34 | allow_headers=["*"], 35 | ) 36 | 37 | class ModelCard(BaseModel): 38 | id: str 39 | object: str = "model" 40 | created: int = Field(default_factory=lambda: int(time.time())) 41 | owned_by: str = "owner" 42 | root: Optional[str] = None 43 | parent: Optional[str] = None 44 | permission: Optional[list] = None 45 | 46 | 47 | class ModelList(BaseModel): 48 | object: str = "list" 49 | data: List[ModelCard] = [] 50 | 51 | 52 | class ChatMessage(BaseModel): 53 | role: Literal["user", "assistant", 
"system"] 54 | content: str 55 | 56 | 57 | class DeltaMessage(BaseModel): 58 | role: Optional[Literal["user", "assistant", "system"]] = None 59 | content: Optional[str] = None 60 | 61 | 62 | class ChatCompletionRequest(BaseModel): 63 | model: str 64 | messages: List[ChatMessage] 65 | temperature: Optional[float] = None 66 | top_p: Optional[float] = None 67 | max_length: Optional[int] = None 68 | stream: Optional[bool] = False 69 | 70 | 71 | class ChatCompletionResponseChoice(BaseModel): 72 | index: int 73 | message: ChatMessage 74 | finish_reason: Literal["stop", "length"] 75 | 76 | 77 | class ChatCompletionResponseStreamChoice(BaseModel): 78 | index: int 79 | delta: DeltaMessage 80 | finish_reason: Optional[Literal["stop", "length"]] 81 | 82 | 83 | class ChatCompletionResponse(BaseModel): 84 | model: str 85 | object: Literal["chat.completion", "chat.completion.chunk"] 86 | choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]] 87 | created: Optional[int] = Field(default_factory=lambda: int(time.time())) 88 | 89 | 90 | @app.get("/v1/models", response_model=ModelList) 91 | async def list_models(): 92 | global model_args 93 | model_card = ModelCard(id="gpt-3.5-turbo") 94 | return ModelList(data=[model_card]) 95 | 96 | 97 | @app.post("/v1/chat/completions", response_model=ChatCompletionResponse) 98 | async def create_chat_completion(request: ChatCompletionRequest): 99 | global model, tokenizer 100 | 101 | if request.messages[-1].role != "user": 102 | raise HTTPException(status_code=400, detail="Invalid request") 103 | query = request.messages[-1].content 104 | 105 | prev_messages = request.messages[:-1] 106 | if len(prev_messages) > 0 and prev_messages[0].role == "system": 107 | query = prev_messages.pop(0).content + query 108 | 109 | history = [] 110 | if len(prev_messages) % 2 == 0: 111 | for i in range(0, len(prev_messages), 2): 112 | if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant": 113 | 
history.append([prev_messages[i].content, prev_messages[i+1].content]) 114 | 115 | if request.stream: 116 | generate = predict(query, history, request.model) 117 | return EventSourceResponse(generate, media_type="text/event-stream") 118 | 119 | response, _ = model.chat(tokenizer, query, history=history) 120 | choice_data = ChatCompletionResponseChoice( 121 | index=0, 122 | message=ChatMessage(role="assistant", content=response), 123 | finish_reason="stop" 124 | ) 125 | 126 | return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion") 127 | 128 | 129 | async def predict(query: str, history: List[List[str]], model_id: str): 130 | global model, tokenizer 131 | 132 | choice_data = ChatCompletionResponseStreamChoice( 133 | index=0, 134 | delta=DeltaMessage(role="assistant"), 135 | finish_reason=None 136 | ) 137 | chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk") 138 | yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False)) 139 | 140 | current_length = 0 141 | 142 | for new_response, _ in model.stream_chat(tokenizer, query, history): 143 | if len(new_response) == current_length: 144 | continue 145 | 146 | new_text = new_response[current_length:] 147 | current_length = len(new_response) 148 | 149 | choice_data = ChatCompletionResponseStreamChoice( 150 | index=0, 151 | delta=DeltaMessage(content=new_text), 152 | finish_reason=None 153 | ) 154 | chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk") 155 | yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False)) 156 | 157 | 158 | choice_data = ChatCompletionResponseStreamChoice( 159 | index=0, 160 | delta=DeltaMessage(), 161 | finish_reason="stop" 162 | ) 163 | chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk") 164 | yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False)) 165 | yield '[DONE]' 166 | 
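# Illustrative sketch, not part of the original server: the message handling in
# create_chat_completion above, pulled out as a standalone function so it can be
# checked without loading the model. The name `split_messages` and the plain-dict
# message format are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def split_messages(messages: List[Dict[str, str]]) -> Tuple[str, List[List[str]]]:
    """Return (query, history) from OpenAI-style chat messages (hypothetical helper)."""
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")
    query = messages[-1]["content"]
    prev = list(messages[:-1])
    # A leading system message is prepended to the query, as in the handler above.
    if prev and prev[0]["role"] == "system":
        query = prev.pop(0)["content"] + query
    history: List[List[str]] = []
    # History is only built when the remaining messages can form user/assistant pairs.
    if len(prev) % 2 == 0:
        for i in range(0, len(prev), 2):
            if prev[i]["role"] == "user" and prev[i + 1]["role"] == "assistant":
                history.append([prev[i]["content"], prev[i + 1]["content"]])
    return query, history
```

# Note: as in the handler, an odd number of earlier messages yields an empty history
# rather than an error, so callers may want to validate their message lists first.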
167 | 168 | 169 | if __name__ == "__main__": 170 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 171 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda() 172 | # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to the number of GPUs you actually have 173 | # from utils import load_model_on_gpus 174 | # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2) 175 | model.eval() 176 | 177 | uvicorn.run(app, host='0.0.0.0', port=8000, workers=1) 178 | -------------------------------------------------------------------------------- /ptuning/README.md: -------------------------------------------------------------------------------- 1 | # ChatGLM2-6B-PT 2 | This repository implements fine-tuning of the ChatGLM2-6B model based on [P-Tuning v2](https://github.com/THUDM/P-tuning-v2). P-Tuning v2 reduces the number of parameters that need to be fine-tuned to 0.1% of the original; combined with model quantization, gradient checkpointing, and similar methods, it can run with as little as 7GB of GPU memory. 3 | 4 | The following uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example to show how to use the code. 5 | 6 | ## Software Dependencies 7 | In addition to the dependencies of ChatGLM2-6B, fine-tuning requires the following packages: 8 | ``` 9 | pip install rouge_chinese nltk jieba datasets 10 | ``` 11 | ## Usage 12 | 13 | ### Download the Dataset 14 | The ADGEN task is to generate a piece of advertising copy (summary) from the input (content). 15 | 16 | ```json 17 | { 18 | "content": "类型#上衣*版型#宽松*版型#显瘦*图案#线条*衣样式#衬衫*衣袖型#泡泡袖*衣款式#抽绳", 19 | "summary": "这件衬衫的款式非常的宽松,利落的线条可以很好的隐藏身材上的小缺点,穿在身上有着很好的显瘦效果。领口装饰了一个可爱的抽绳,漂亮的绳结展现出了十足的个性,配合时尚的泡泡袖型,尽显女性甜美可爱的气息。" 20 | } 21 | ``` 22 | 23 | Download the preprocessed ADGEN dataset from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1), and place the extracted `AdvertiseGen` directory in this directory. 24 | 25 | ### Training 26 | 27 | #### P-Tuning v2 28 | 29 | Run the following command to start training: 30 | ```shell 31 | bash train.sh 32 | ``` 33 | In `train.sh`, `PRE_SEQ_LEN` and `LR` are the soft prompt length and the training learning rate, respectively; tune them to get the best results. P-Tuning v2 freezes all of the original model's parameters. The quantization level of the original model can be changed via `quantization_bit`; without this option the model is loaded in FP16 precision. 34 |
35 | With the default configuration `quantization_bit=4`, `per_device_train_batch_size=1`, `gradient_accumulation_steps=16`, the INT4 model parameters are frozen, and each training iteration runs 16 accumulated forward and backward passes at a batch size of 1, which is equivalent to a total batch size of 16. In this setting as little as 6.7G of GPU memory is needed. To improve training efficiency at the same total batch size, increase `per_device_train_batch_size` while keeping the product of the two values unchanged; this also uses more GPU memory, so adjust it to fit your hardware. 36 | 37 | If you want to [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` in `train.sh` to your local model path. 38 | 39 | #### Finetune 40 | 41 | For full-parameter fine-tuning, install [Deepspeed](https://github.com/microsoft/DeepSpeed) and then run the following command: 42 | 43 | ```shell 44 | bash ds_train_finetune.sh 45 | ``` 46 | 47 | ### Inference 48 | 49 | P-Tuning v2 training saves only the PrefixEncoder parameters, so inference must load both the original ChatGLM2-6B model and the PrefixEncoder weights. Specify the following arguments in `evaluate.sh`: 50 | 51 | ```shell 52 | --model_name_or_path THUDM/chatglm2-6b 53 | --ptuning_checkpoint $CHECKPOINT_PATH 54 | ``` 55 | 56 | For a full-parameter fine-tuned checkpoint, just set `model_name_or_path` as before: 57 | 58 | ```shell 59 | --model_name_or_path $CHECKPOINT_PATH 60 | ``` 61 | 62 | The evaluation metrics are Chinese Rouge score and BLEU-4. The generated results are saved in 63 | `./output/adgen-chatglm2-6b-pt-128-2e-2/generated_predictions.txt`. 64 |
65 | ### Examples 66 | #### Example 1 67 | * Input: 类型#上衣\*材质#牛仔布\*颜色#白色\*风格#简约\*图案#刺绣\*衣样式#外套\*衣款式#破洞 68 | * Label: 简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。 69 | * Output [before fine-tuning]: 这件上衣的材质是牛仔布,颜色是白色,风格是简约,图案是刺绣,衣样式是外套,衣款式是破洞。 70 | * Output [after fine-tuning]: 这是一款简约的牛仔外套,破洞设计,将牛仔布破洞,带来一种随意与个性。破洞的牛仔外套,展现出时尚气息,带来一种休闲感。同时,刺绣图案,让整件外套更加立体。 71 | 72 | #### Example 2 73 | 74 | * Input: 类型#裙\*版型#显瘦\*风格#文艺\*风格#简约\*图案#印花\*图案#撞色\*裙下摆#压褶\*裙长#连衣裙\*裙领型#圆领 75 | * Label: 文艺个性的印花连衣裙,藏青色底蕴,低调又大气,撞色太阳花分布整个裙身,绚丽而美好,带来时尚减龄的气质。基础款的舒适圆领,简约不失大方,勾勒精致脸庞。领后是一粒包布扣固定,穿脱十分方便。前片立体的打褶设计,搭配后片压褶的做工,增添层次和空间感,显瘦又有型。 76 | * Output [before fine-tuning]: 类型#裙*版型#显瘦*风格#文艺*风格#简约*图案#印花*图案#撞色*裙下摆#压褶*裙长#连衣裙*裙领型#圆领 1\. 连衣裙:简约风格,裙长为膝盖以上,裙领型为圆领。2\. 裙下摆:压褶设计,使裙摆呈现出流畅的褶皱效果。3\. 裙领型:裙领型为圆领,使穿上连衣裙后更加有型。4\. 版型:采用显瘦设计,让连衣裙看起来更加苗条。5\. 风格:文艺风格,让连衣裙更加有内涵和品味。6\. 图案:印花设计,在连衣裙上印有独特的图案。7\.
撞色:采用撞色设计,让连衣裙在色彩上更加鲜明、富有层次感。 77 | * Output [after fine-tuning]: 这是一款文艺范的连衣裙,以印花为元素,采用简约的印花,既能够突出文艺气质,又能够展现简约风。在印花的同时又有领子和裙摆的压褶设计,更加凸显文艺气质。简约而不会过于单调,搭配出街,穿着十分舒适。 78 | 79 | 80 | ## Model Deployment 81 | First load the tokenizer: 82 | 83 | ```python 84 | from transformers import AutoConfig, AutoModel, AutoTokenizer 85 | 86 | # Load the tokenizer 87 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True) 88 | ``` 89 | 90 | 1. To load a P-Tuning checkpoint: 91 | 92 | ```python 93 | config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, pre_seq_len=128) 94 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", config=config, trust_remote_code=True) 95 | prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin")) 96 | new_prefix_state_dict = {} 97 | for k, v in prefix_state_dict.items(): 98 | if k.startswith("transformer.prefix_encoder."): 99 | new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v 100 | model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict) 101 | ``` 102 | Note that you may need to change `pre_seq_len` to the value actually used during training. If you [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` to the local model path (note: not the checkpoint path). The snippet also assumes that `os` and `torch` have been imported and that `CHECKPOINT_PATH` points to your checkpoint directory. 103 | 104 | 2. To load a full-parameter fine-tuned checkpoint, load the whole checkpoint directly: 105 | 106 | ```python 107 | model = AutoModel.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True) 108 | ``` 109 | 110 | After that, quantize the model if needed, or use it directly: 111 | 112 | ```python 113 | # Comment out the following line if you don't use quantization 114 | model = model.quantize(4) 115 | model = model.cuda() 116 | model = model.eval() 117 | 118 | response, history = model.chat(tokenizer, "你好", history=[]) 119 | ``` 120 | 121 | You can also run the [web demo](./web_demo.py), which supports loading a P-Tuning v2 checkpoint: 122 | ```shell 123 | bash web_demo.sh 124 | ``` 125 | You may need to modify [web_demo.sh](./web_demo.sh) to match your actual checkpoint. 126 |
127 | ## Using Your Own Dataset 128 | Change `train_file`, `validation_file`, and `test_file` in `train.sh` and `evaluate.sh` to the paths of your own JSON-format dataset, and change `prompt_column` and `response_column` to the keys that hold the input text and output text in the JSON file. You may also need to increase `max_source_length` and `max_target_length` to match the maximum input and output lengths in your dataset. 129 | 130 | ## Dialogue Dataset 131 | 132 | To fine-tune the model on multi-turn dialogue data, provide the chat history. For example, the following is the training data for a three-turn dialogue: 133 | 134 | ```json lines 135 | {"prompt": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "response": "用电脑能读数据流吗?水温多少", "history": []} 136 | {"prompt": "95", "response": "上下水管温差怎么样啊?空气是不是都排干净了呢?", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"]]} 137 | {"prompt": "是的。上下水管都好的", "response": "那就要检查线路了,一般风扇继电器是由电脑控制吸合的,如果电路存在断路,或者电脑坏了的话会出现继电器不吸合的情况!", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"], ["95", "上下水管温差怎么样啊?空气是不是都排干净了呢?"]]} 138 | ``` 139 | 140 | For training, set `--history_column` to the key of the chat history in the data (`history` in this example); the chat history is then concatenated automatically. Note that content beyond the input length `max_source_length` is truncated. 141 | 142 | You can refer to the following command: 143 | 144 | ```shell 145 | bash train_chat.sh 146 | ``` 147 | 148 | ## Citation 149 | 150 | ``` 151 | @inproceedings{liu2022p, 152 | title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks}, 153 | author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie}, 154 |
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)}, 155 | pages={61--68}, 156 | year={2022} 157 | } 158 | ``` 159 | 160 | 161 | 162 | -------------------------------------------------------------------------------- /ptuning/arguments.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from typing import Optional 3 | 4 | 5 | @dataclass 6 | class ModelArguments: 7 | """ 8 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 9 | """ 10 | 11 | model_name_or_path: str = field( 12 | metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} 13 | ) 14 | ptuning_checkpoint: str = field( 15 | default=None, metadata={"help": "Path to p-tuning v2 checkpoints"} 16 | ) 17 | config_name: Optional[str] = field( 18 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 19 | ) 20 | tokenizer_name: Optional[str] = field( 21 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 22 | ) 23 | cache_dir: Optional[str] = field( 24 | default=None, 25 | metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"}, 26 | ) 27 | use_fast_tokenizer: bool = field( 28 | default=True, 29 | metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, 30 | ) 31 | model_revision: str = field( 32 | default="main", 33 | metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, 34 | ) 35 | use_auth_token: bool = field( 36 | default=False, 37 | metadata={ 38 | "help": ( 39 | "Will use the token generated when running `huggingface-cli login` (necessary to use this script " 40 | "with private models)." 
41 | ) 42 | }, 43 | ) 44 | resize_position_embeddings: Optional[bool] = field( 45 | default=None, 46 | metadata={ 47 | "help": ( 48 | "Whether to automatically resize the position embeddings if `max_source_length` exceeds " 49 | "the model's position embeddings." 50 | ) 51 | }, 52 | ) 53 | quantization_bit: Optional[int] = field( 54 | default=None 55 | ) 56 | pre_seq_len: Optional[int] = field( 57 | default=None 58 | ) 59 | prefix_projection: bool = field( 60 | default=False 61 | ) 62 | 63 | 64 | @dataclass 65 | class DataTrainingArguments: 66 | """ 67 | Arguments pertaining to what data we are going to input our model for training and eval. 68 | """ 69 | 70 | lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."}) 71 | 72 | dataset_name: Optional[str] = field( 73 | default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} 74 | ) 75 | dataset_config_name: Optional[str] = field( 76 | default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} 77 | ) 78 | prompt_column: Optional[str] = field( 79 | default=None, 80 | metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."}, 81 | ) 82 | response_column: Optional[str] = field( 83 | default=None, 84 | metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."}, 85 | ) 86 | history_column: Optional[str] = field( 87 | default=None, 88 | metadata={"help": "The name of the column in the datasets containing the history of chat."}, 89 | ) 90 | train_file: Optional[str] = field( 91 | default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."} 92 | ) 93 | validation_file: Optional[str] = field( 94 | default=None, 95 | metadata={ 96 | "help": ( 97 | "An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file)." 
98 | ) 99 | }, 100 | ) 101 | test_file: Optional[str] = field( 102 | default=None, 103 | metadata={ 104 | "help": "An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file)." 105 | }, 106 | ) 107 | overwrite_cache: bool = field( 108 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} 109 | ) 110 | preprocessing_num_workers: Optional[int] = field( 111 | default=None, 112 | metadata={"help": "The number of processes to use for the preprocessing."}, 113 | ) 114 | max_source_length: Optional[int] = field( 115 | default=1024, 116 | metadata={ 117 | "help": ( 118 | "The maximum total input sequence length after tokenization. Sequences longer " 119 | "than this will be truncated, sequences shorter will be padded." 120 | ) 121 | }, 122 | ) 123 | max_target_length: Optional[int] = field( 124 | default=128, 125 | metadata={ 126 | "help": ( 127 | "The maximum total sequence length for target text after tokenization. Sequences longer " 128 | "than this will be truncated, sequences shorter will be padded." 129 | ) 130 | }, 131 | ) 132 | val_max_target_length: Optional[int] = field( 133 | default=None, 134 | metadata={ 135 | "help": ( 136 | "The maximum total sequence length for validation target text after tokenization. Sequences longer " 137 | "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." 138 | "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " 139 | "during ``evaluate`` and ``predict``." 140 | ) 141 | }, 142 | ) 143 | pad_to_max_length: bool = field( 144 | default=False, 145 | metadata={ 146 | "help": ( 147 | "Whether to pad all samples to model maximum sentence length. " 148 | "If False, will pad the samples dynamically when batching to the maximum length in the batch. More " 149 | "efficient on GPU but very bad for TPU." 
150 | ) 151 | }, 152 | ) 153 | max_train_samples: Optional[int] = field( 154 | default=None, 155 | metadata={ 156 | "help": ( 157 | "For debugging purposes or quicker training, truncate the number of training examples to this " 158 | "value if set." 159 | ) 160 | }, 161 | ) 162 | max_eval_samples: Optional[int] = field( 163 | default=None, 164 | metadata={ 165 | "help": ( 166 | "For debugging purposes or quicker training, truncate the number of evaluation examples to this " 167 | "value if set." 168 | ) 169 | }, 170 | ) 171 | max_predict_samples: Optional[int] = field( 172 | default=None, 173 | metadata={ 174 | "help": ( 175 | "For debugging purposes or quicker training, truncate the number of prediction examples to this " 176 | "value if set." 177 | ) 178 | }, 179 | ) 180 | num_beams: Optional[int] = field( 181 | default=None, 182 | metadata={ 183 | "help": ( 184 | "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, " 185 | "which is used during ``evaluate`` and ``predict``." 186 | ) 187 | }, 188 | ) 189 | ignore_pad_token_for_loss: bool = field( 190 | default=True, 191 | metadata={ 192 | "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not." 193 | }, 194 | ) 195 | source_prefix: Optional[str] = field( 196 | default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."} 197 | ) 198 | 199 | forced_bos_token: Optional[str] = field( 200 | default=None, 201 | metadata={ 202 | "help": ( 203 | "The token to force as the first generated token after the decoder_start_token_id." 
204 | "Useful for multilingual models like mBART where the first generated token" 205 | "needs to be the target language token (Usually it is the target language token)" 206 | ) 207 | }, 208 | ) 209 | 210 | 211 | 212 | def __post_init__(self): 213 | if self.dataset_name is None and self.train_file is None and self.validation_file is None and self.test_file is None: 214 | raise ValueError("Need either a dataset name or a training/validation/test file.") 215 | else: 216 | if self.train_file is not None: 217 | extension = self.train_file.split(".")[-1] 218 | assert extension in ["csv", "json"], "`train_file` should be a csv or a json file." 219 | if self.validation_file is not None: 220 | extension = self.validation_file.split(".")[-1] 221 | assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." 222 | if self.val_max_target_length is None: 223 | self.val_max_target_length = self.max_target_length 224 | 225 | -------------------------------------------------------------------------------- /ptuning/deepspeed.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_micro_batch_size_per_gpu": "auto", 3 | "zero_allow_untested_optimizer": true, 4 | "fp16": { 5 | "enabled": "auto", 6 | "loss_scale": 0, 7 | "initial_scale_power": 16, 8 | "loss_scale_window": 1000, 9 | "hysteresis": 2, 10 | "min_loss_scale": 1 11 | }, 12 | "zero_optimization": { 13 | "stage": 2, 14 | "allgather_partitions": true, 15 | "allgather_bucket_size": 5e8, 16 | "overlap_comm": false, 17 | "reduce_scatter": true, 18 | "reduce_bucket_size": 5e8, 19 | "contiguous_gradients" : true 20 | } 21 | } -------------------------------------------------------------------------------- /ptuning/ds_train_finetune.sh: -------------------------------------------------------------------------------- 1 | 2 | LR=1e-4 3 | 4 | MASTER_PORT=$(shuf -n 1 -i 10000-65535) 5 | 6 | deepspeed --num_gpus=4 --master_port $MASTER_PORT main.py \ 7 | 
--deepspeed deepspeed.json \ 8 | --do_train \ 9 | --train_file AdvertiseGen/train.json \ 10 | --test_file AdvertiseGen/dev.json \ 11 | --prompt_column content \ 12 | --response_column summary \ 13 | --overwrite_cache \ 14 | --model_name_or_path THUDM/chatglm2-6b \ 15 | --output_dir ./output/adgen-chatglm2-6b-ft-$LR \ 16 | --overwrite_output_dir \ 17 | --max_source_length 64 \ 18 | --max_target_length 64 \ 19 | --per_device_train_batch_size 4 \ 20 | --per_device_eval_batch_size 1 \ 21 | --gradient_accumulation_steps 1 \ 22 | --predict_with_generate \ 23 | --max_steps 5000 \ 24 | --logging_steps 10 \ 25 | --save_steps 1000 \ 26 | --learning_rate $LR \ 27 | --fp16 28 | 29 | -------------------------------------------------------------------------------- /ptuning/evaluate.sh: -------------------------------------------------------------------------------- 1 | PRE_SEQ_LEN=128 2 | CHECKPOINT=adgen-chatglm2-6b-pt-128-2e-2 3 | STEP=3000 4 | NUM_GPUS=1 5 | 6 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \ 7 | --do_predict \ 8 | --validation_file AdvertiseGen/dev.json \ 9 | --test_file AdvertiseGen/dev.json \ 10 | --overwrite_cache \ 11 | --prompt_column content \ 12 | --response_column summary \ 13 | --model_name_or_path THUDM/chatglm2-6b \ 14 | --ptuning_checkpoint ./output/$CHECKPOINT/checkpoint-$STEP \ 15 | --output_dir ./output/$CHECKPOINT \ 16 | --overwrite_output_dir \ 17 | --max_source_length 64 \ 18 | --max_target_length 64 \ 19 | --per_device_eval_batch_size 1 \ 20 | --predict_with_generate \ 21 | --pre_seq_len $PRE_SEQ_LEN \ 22 | --quantization_bit 4 23 | -------------------------------------------------------------------------------- /ptuning/evaluate_finetune.sh: -------------------------------------------------------------------------------- 1 | CHECKPOINT=adgen-chatglm2-6b-ft-1e-4 2 | STEP=3000 3 | NUM_GPUS=1 4 | 5 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \ 6 | --do_predict \ 7 | --validation_file 
AdvertiseGen/dev.json \ 8 | --test_file AdvertiseGen/dev.json \ 9 | --overwrite_cache \ 10 | --prompt_column content \ 11 | --response_column summary \ 12 | --model_name_or_path ./output/$CHECKPOINT/checkpoint-$STEP \ 13 | --output_dir ./output/$CHECKPOINT \ 14 | --overwrite_output_dir \ 15 | --max_source_length 256 \ 16 | --max_target_length 256 \ 17 | --per_device_eval_batch_size 1 \ 18 | --predict_with_generate \ 19 | --fp16_full_eval 20 | -------------------------------------------------------------------------------- /ptuning/main.py: -------------------------------------------------------------------------------- 1 | # Color-highlighted version on CSDN: 2 | # ChatGLM2-6B source code walkthrough, ./ptuning/main.py (part 1) https://zengxiaojian.blog.csdn.net/article/details/131617133?spm=1001.2014.3001.5502 3 | 4 | #!/usr/bin/env python 5 | # coding=utf-8 6 | # Copyright 2021 The HuggingFace Team. All rights reserved. 7 | # 8 | # Licensed under the Apache License, Version 2.0 (the "License"); 9 | # you may not use this file except in compliance with the License. 10 | # You may obtain a copy of the License at 11 | # 12 | # http://www.apache.org/licenses/LICENSE-2.0 13 | # 14 | # Unless required by applicable law or agreed to in writing, software 15 | # distributed under the License is distributed on an "AS IS" BASIS, 16 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | # See the License for the specific language governing permissions and 18 | # limitations under the License. 19 | """ 20 | Fine-tuning the library models for sequence to sequence. 21 | """ 22 | # You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.
23 | 24 | import logging 25 | import os 26 | import sys 27 | import json 28 | 29 | import numpy as np 30 | from datasets import load_dataset # load_dataset from the Hugging Face datasets library loads the dataset files. 31 | import jieba 32 | from rouge_chinese import Rouge # Rouge computes ROUGE scores, a metric of how similar machine-generated text (e.g. translation or summarization output) is to human reference text. 33 | from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction # sentence_bleu computes the BLEU score of a single sentence; 34 | # SmoothingFunction handles the cases where the BLEU computation would otherwise yield a score of 0. 35 | import torch 36 | 37 | # transformers provides many pretrained neural network models for a variety of natural language processing tasks. 38 | import transformers 39 | from transformers import ( 40 | AutoConfig, # loads a model's configuration from the pretrained model's name or path. 41 | AutoModel, # loads a pretrained model, choosing the right model class from the name or path. 42 | AutoTokenizer, # loads the tokenizer of a pretrained model, choosing the right tokenizer class from the name or path. 43 | DataCollatorForSeq2Seq, # data collator for sequence-to-sequence (seq2seq) models; gathers samples into a batch for training or evaluation. 44 | HfArgumentParser, # command-line argument parser that integrates with the rest of the Hugging Face tooling. 45 | Seq2SeqTrainingArguments, # training arguments for seq2seq models. 46 | set_seed, # sets the random seed so that experiments are reproducible. 47 | ) 48 | 49 | # Seq2SeqTrainer, imported from the local trainer_seq2seq module, trains seq2seq models. 50 | from trainer_seq2seq import Seq2SeqTrainer 51 | 52 | from arguments import ModelArguments, DataTrainingArguments # the two dataclasses that define the model and data command-line arguments. 53 | 54 | logger = logging.getLogger(__name__) # logger used to report what the script is doing. 55 | 56 | def main(): 57 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments)) # parser that fills instances of ModelArguments, DataTrainingArguments and Seq2SeqTrainingArguments. 58 | if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): # if the only command-line argument is a .json file, read the arguments from that file. 59 | # If we pass only one argument to the script and it's the path to a json file, 60 | # let's parse it to get our
arguments. 61 | # read the arguments from the .json file into model_args, data_args and training_args. 62 | model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) 63 | else: 64 | # otherwise, parse the arguments directly from the command line. 65 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 66 | 67 | # Setup logging 68 | # basic logging configuration: message format, date format and handlers. 69 | logging.basicConfig( 70 | format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", 71 | datefmt="%m/%d/%Y %H:%M:%S", 72 | handlers=[logging.StreamHandler(sys.stdout)], 73 | ) 74 | 75 | # if training_args.should_log is true (this process should log), set the log level to info. 76 | if training_args.should_log: 77 | # The default of training_args.log_level is passive, so we set log level at info here to have that default. 78 | transformers.utils.logging.set_verbosity_info() 79 | 80 | log_level = training_args.get_process_log_level() 81 | # set the log level of this script's logger. 82 | logger.setLevel(log_level) # controls how detailed the messages from this script's logger are. 83 | #datasets.utils.logging.set_verbosity(log_level) 84 | transformers.utils.logging.set_verbosity(log_level) # use the same log level for the logging module of the transformers package. 85 | transformers.utils.logging.enable_default_handler() # enable the default log handler, and 86 | transformers.utils.logging.enable_explicit_format() # enable the explicit log format. The default handler usually sends messages to the console; the explicit format fixes the layout of each message. 87 | 88 | # Log on each process the small summary: 89 | logger.warning( # logger.warning and logger.info print basic information about the run: the device, the distributed training setup, whether 16-bit training is used, and so on. 90 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 91 | + f", distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 92 | ) 93 | logger.info(f"Training/evaluation parameters {training_args}") 94 | 95 | # Set seed before initializing model.
96 | set_seed(training_args.seed) # set the random seed so that repeated runs give the same results. 97 | 98 | # Load dataset 99 | data_files = {} 100 | if data_args.train_file is not None: 101 | data_files["train"] = data_args.train_file 102 | extension = data_args.train_file.split(".")[-1] 103 | if data_args.validation_file is not None: 104 | data_files["validation"] = data_args.validation_file 105 | extension = data_args.validation_file.split(".")[-1] 106 | if data_args.test_file is not None: 107 | data_files["test"] = data_args.test_file 108 | extension = data_args.test_file.split(".")[-1] 109 | 110 | raw_datasets = load_dataset( 111 | extension, 112 | data_files=data_files, 113 | cache_dir=model_args.cache_dir, 114 | use_auth_token=True if model_args.use_auth_token else None, 115 | ) 116 | 117 | # Load pretrained model and tokenizer 118 | config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) 119 | config.pre_seq_len = model_args.pre_seq_len 120 | config.prefix_projection = model_args.prefix_projection 121 | 122 | tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) 123 | 124 | if model_args.ptuning_checkpoint is not None: 125 | # Evaluation 126 | # Loading extra state dict of prefix encoder 127 | model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True) 128 | prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin")) 129 | new_prefix_state_dict = {} 130 | for k, v in prefix_state_dict.items(): 131 | if k.startswith("transformer.prefix_encoder."): 132 | new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v 133 | model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict) 134 | else: 135 | model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True) 136 | 137 | if model_args.quantization_bit is not None: 138 | print(f"Quantized to {model_args.quantization_bit} bit") 139 |
model = model.quantize(model_args.quantization_bit) 140 | if model_args.pre_seq_len is not None: 141 | # P-tuning v2 142 | model = model.half() 143 | model.transformer.prefix_encoder.float() 144 | else: 145 | # Finetune 146 | model = model.float() 147 | 148 | prefix = data_args.source_prefix if data_args.source_prefix is not None else "" 149 | 150 | # Preprocessing the datasets. 151 | # We need to tokenize inputs and targets. 152 | if training_args.do_train: 153 | column_names = raw_datasets["train"].column_names 154 | elif training_args.do_eval: 155 | column_names = raw_datasets["validation"].column_names 156 | elif training_args.do_predict: 157 | column_names = raw_datasets["test"].column_names 158 | else: 159 | logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") 160 | return 161 | 162 | # Get the column names for input/target. 163 | prompt_column = data_args.prompt_column # name of the prompt column, i.e. the column that holds the question. 164 | response_column = data_args.response_column # name of the response column, i.e. the column that holds the answer or target. 165 | history_column = data_args.history_column # name of the chat history column; if present, the history is used as context for the prompt. 166 | 167 | # Temporarily set max_target_length for training.
168 | max_target_length = data_args.max_target_length 169 | 170 | # The preprocessing functions below format and tokenize the input and target columns; the results are used for training and validation. 171 | def preprocess_function_eval(examples): # this function and preprocess_function_train prepare data for evaluation and training: they extract prompts and responses from the examples, format and tokenize them, and return the result as model_inputs. 172 | inputs, targets = [], [] 173 | for i in range(len(examples[prompt_column])): # iterate over every element of examples[prompt_column], the prompt column taken from the dataset. 174 | if examples[prompt_column][i] and examples[response_column][i]: # check that the i-th prompt and its response both exist; if either is missing, skip this sample. 175 | query = examples[prompt_column][i] # the i-th prompt. 176 | history = examples[history_column][i] if history_column is not None else None # the i-th chat history if a history column is configured, otherwise None. 177 | prompt = tokenizer.build_prompt(query, history) # combine the prompt and chat history into the model input with the tokenizer's build_prompt method, which applies the model's prompt format. 178 | inputs.append(prompt) # add the generated input to the inputs list. 179 | targets.append(examples[response_column][i]) # add the i-th response to the targets list. 180 | # after this loop, inputs holds all input samples and targets holds all target samples; both are used for training or evaluation. 181 | 182 | inputs = [prefix + inp for inp in inputs] # prepend prefix to every input; the prefix is whatever the model needs in front of the text, such as a special leading marker. 183 | model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True) # tokenize the inputs: inputs longer than data_args.max_source_length are truncated, and padding=True pads shorter inputs to the longest sequence in the batch. model_inputs is a dictionary containing the encoded inputs. 184 | labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True) # tokenize the targets (the expected outputs) in the same way to get the labels. 185 | 186 | if data_args.ignore_pad_token_for_loss: # if padding tokens should be ignored in the loss: 187 | labels["input_ids"] = [ # replace every padding token ID in the labels with -100 and keep the others; -100 is the default ignore_index of PyTorch's cross-entropy loss, so those positions do not contribute to the loss. 188 | [(l if l !=
tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"] 189 | ] 190 | model_inputs["labels"] = labels["input_ids"] #这行代码将处理后的标签添加到模型的输入中。这样,模型的输入就包含了输入和对应的标签,可以直接用于训练。 191 | 192 | return model_inputs 193 | 194 | #函数preprocess_function_train(examples)的主要目标是为模型训练阶段预处理数据。给定一些训练样例examples,它将为每个样例生成模型需要的输入和标签。这个过程包括以下几个步骤: 195 | def preprocess_function_train(examples): 196 | max_seq_length = data_args.max_source_length + data_args.max_target_length + 1 #定义最大序列长度为源长度上限(即问题长度上限)加上目标长度上限(即答案长度上限)再加1。这个1通常是为特殊标记(比如序列结束标记)预留的空间。 197 | 198 | model_inputs = { #初始化model_inputs字典,用于存储模型输入的数据。 199 | "input_ids": [], 200 | "labels": [], 201 | } 202 | for i in range(len(examples[prompt_column])): #遍历每一个样例。 203 | if examples[prompt_column][i] and examples[response_column][i]: #如果样例的问题和答案都存在,那么处理这个样例。 204 | query, answer = examples[prompt_column][i], examples[response_column][i] #获取问题和答案。 205 | 206 | history = examples[history_column][i] if history_column is not None else None #获取历史对话,如果存在的话。 207 | prompt = tokenizer.build_prompt(query, history) #用tokenizer.build_prompt方法来根据问题和历史对话构建提示。 208 | 209 | prompt = prefix + prompt #在提示前面添加前缀。 210 | a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True, #将提示编码成模型可以理解的形式,得到输入ID序列a_ids。 211 | max_length=data_args.max_source_length) 212 | b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True, #同样地,将答案编码成模型可以理解的形式,得到答案ID序列b_ids。 213 | max_length=data_args.max_target_length) 214 | 215 | context_length = len(a_ids) #计算输入的长度。 216 | input_ids = a_ids + b_ids + [tokenizer.eos_token_id] #将输入和答案的ID序列拼接起来,并在最后添加一个序列结束标记的ID,得到完整的输入序列。 217 | labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id] #标签序列的前context_length部分是填充标记的ID,后面是答案的ID序列和一个序列结束标记的ID。这样设置的原因是,我们只关心模型对答案部分的预测。 218 | 219 | pad_len = max_seq_length - len(input_ids) #计算需要填充的长度。 220 | input_ids = input_ids + [tokenizer.pad_token_id] * pad_len #在输入序列后面添加填充标记,使其长度达到max_seq_length。 221 | 
                labels = labels + [tokenizer.pad_token_id] * pad_len  # pad the labels the same way
                if data_args.ignore_pad_token_for_loss:
                    # Replace pad token ids in the labels with -100 so the loss ignores them.
                    labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

                # Store the processed input ids and labels.
                model_inputs["input_ids"].append(input_ids)
                model_inputs["labels"].append(labels)

        # All examples processed: model_inputs now holds the inputs and labels used for training.
        return model_inputs

    # The code below prepares the train, validation and test datasets for training and prediction.
    def print_dataset_example(example):
        # Print the token ids and the decoded text of one example's inputs and labels.
        print("input_ids", example["input_ids"])
        print("inputs", tokenizer.decode(example["input_ids"]))
        print("label_ids", example["labels"])
        print("labels", tokenizer.decode(example["labels"]))

    if training_args.do_train:  # training requested
        if "train" not in raw_datasets:  # a train split is required
            raise ValueError("--do_train requires a train dataset")
        train_dataset = raw_datasets["train"]
        if data_args.max_train_samples is not None:  # optional cap on the number of training samples
            # Use the smaller of the dataset size and the configured cap.
            max_train_samples = min(len(train_dataset), data_args.max_train_samples)
            train_dataset = train_dataset.select(range(max_train_samples))
        # In distributed runs the main process performs the preprocessing first, so the
        # other processes can reuse its cached result.
        with training_args.main_process_first(desc="train dataset map pre-processing"):
            train_dataset = train_dataset.map(  # apply preprocess_function_train defined above
                preprocess_function_train,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on train dataset",
            )
        print_dataset_example(train_dataset[0])  # show the first processed training example

    if training_args.do_eval:  # evaluation requested: preprocess the validation set
        max_target_length = data_args.val_max_target_length  # target length limit used for validation
        if "validation" not in raw_datasets:  # a validation split is required
            raise ValueError("--do_eval requires a validation dataset")
        eval_dataset = raw_datasets["validation"]
        if data_args.max_eval_samples is not None:  # optional cap on the number of validation samples
            # Use the smaller of the dataset size and the configured cap.
            max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)
            eval_dataset = eval_dataset.select(range(max_eval_samples))
        # Preprocess the validation set with preprocess_function_eval via .map().
        with training_args.main_process_first(desc="validation dataset map pre-processing"):
            eval_dataset = eval_dataset.map(
                preprocess_function_eval,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on validation dataset",
            )
        print_dataset_example(eval_dataset[0])  # show the first processed validation example as a sanity check

    if training_args.do_predict:
        max_target_length = data_args.val_max_target_length
        if "test" not in raw_datasets:
            raise ValueError("--do_predict requires a test dataset")
        predict_dataset = raw_datasets["test"]
        if data_args.max_predict_samples is not None:
            max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples)
            predict_dataset = predict_dataset.select(range(max_predict_samples))
        with training_args.main_process_first(desc="prediction dataset map pre-processing"):
            predict_dataset = predict_dataset.map(
                preprocess_function_eval,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on prediction dataset",
            )
        print_dataset_example(predict_dataset[0])

    # Data collator
    # Padding id for labels: -100 when pad tokens are ignored in the loss
    # (data_args.ignore_pad_token_for_loss), otherwise the tokenizer's pad token id.
    label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
    # Data collator for sequence-to-sequence tasks: it assembles a batch of examples into
    # model-ready tensors. tokenizer is used for padding and model is the pretrained model.
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        label_pad_token_id=label_pad_token_id,  # padding id used for labels
        pad_to_multiple_of=None,  # do not round sequence lengths up to a multiple
        padding=False  # do not pad the inputs at collation time
    )

    # Metric
    # compute_metrics receives the predictions and labels, decodes both from token ids
    # to text, and returns the mean of each metric (ROUGE and BLEU).
    def compute_metrics(eval_preds):
        preds, labels = eval_preds  # unpack predictions and labels
        if isinstance(preds, tuple):  # if the predictions come as a tuple, keep only the first element
            preds = preds[0]
        # Decode the predicted token ids to text, skipping special tokens.
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        if data_args.ignore_pad_token_for_loss:
            # Replace -100 in the labels as we can't decode them.
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        # Decode the label token ids to text, skipping special tokens.
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        score_dict = {  # per-sample scores for rouge-1, rouge-2, rouge-l and bleu-4
            "rouge-1": [],
            "rouge-2": [],
            "rouge-l": [],
            "bleu-4": []
        }
        for pred, label in zip(decoded_preds, decoded_labels):  # iterate over (prediction, label) pairs
            # Segment both texts with jieba to get the hypothesis and reference token lists.
            hypothesis = list(jieba.cut(pred))
            reference = list(jieba.cut(label))
            rouge = Rouge()
            scores = rouge.get_scores(' '.join(hypothesis), ' '.join(reference))  # ROUGE scores
            result = scores[0]

            for k, v in result.items():  # for each ROUGE metric (rouge-1, rouge-2, rouge-l)
                score_dict[k].append(round(v["f"] * 100, 4))  # store the F-score as a percentage
            # Character-level BLEU with smoothing (method3).
            bleu_score = sentence_bleu([list(label)], list(pred), smoothing_function=SmoothingFunction().method3)
            score_dict["bleu-4"].append(round(bleu_score * 100, 4))  # store the BLEU score as a percentage

        for k, v in score_dict.items():  # average each metric over all samples
            score_dict[k] = float(np.mean(v))
        return score_dict  # dict of mean scores per metric

    # Override the decoding parameters of Seq2SeqTrainer
    # Maximum generation length: use the value from the training arguments when it is set,
    # otherwise fall back to the validation target length limit.
    training_args.generation_max_length = (
        training_args.generation_max_length
        if training_args.generation_max_length is not None
        else data_args.val_max_target_length
    )
    training_args.generation_num_beams = (
        data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams
    )
    # Initialize our Trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
train_dataset=train_dataset if training_args.do_train else None, 354 | eval_dataset=eval_dataset if training_args.do_eval else None, 355 | tokenizer=tokenizer, 356 | data_collator=data_collator, 357 | compute_metrics=compute_metrics if training_args.predict_with_generate else None, 358 | save_changed=model_args.pre_seq_len is not None 359 | ) 360 | 361 | # Training 362 | if training_args.do_train: 363 | checkpoint = None 364 | if training_args.resume_from_checkpoint is not None: 365 | checkpoint = training_args.resume_from_checkpoint 366 | # elif last_checkpoint is not None: 367 | # checkpoint = last_checkpoint 368 | model.gradient_checkpointing_enable() 369 | model.enable_input_require_grads() 370 | train_result = trainer.train(resume_from_checkpoint=checkpoint) 371 | # trainer.save_model() # Saves the tokenizer too for easy upload 372 | 373 | metrics = train_result.metrics 374 | max_train_samples = ( 375 | data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) 376 | ) 377 | metrics["train_samples"] = min(max_train_samples, len(train_dataset)) 378 | 379 | trainer.log_metrics("train", metrics) 380 | trainer.save_metrics("train", metrics) 381 | trainer.save_state() 382 | 383 | # Evaluation 384 | results = {} 385 | max_seq_length = data_args.max_source_length + data_args.max_target_length + 1 386 | if training_args.do_eval: 387 | logger.info("*** Evaluate ***") 388 | metrics = trainer.evaluate(metric_key_prefix="eval", do_sample=True, top_p=0.7, max_length=max_seq_length, temperature=0.95) 389 | max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) 390 | metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) 391 | 392 | trainer.log_metrics("eval", metrics) 393 | trainer.save_metrics("eval", metrics) 394 | 395 | if training_args.do_predict: 396 | logger.info("*** Predict ***") 397 | predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict", 
max_length=max_seq_length, do_sample=True, top_p=0.7, temperature=0.95)
        metrics = predict_results.metrics
        max_predict_samples = (
            data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset)
        )
        metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset))

        trainer.log_metrics("predict", metrics)
        trainer.save_metrics("predict", metrics)

        if trainer.is_world_process_zero():
            if training_args.predict_with_generate:
                predictions = tokenizer.batch_decode(
                    predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                predictions = [pred.strip() for pred in predictions]
                labels = tokenizer.batch_decode(
                    predict_results.label_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                labels = [label.strip() for label in labels]
                output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt")
                with open(output_prediction_file, "w", encoding="utf-8") as writer:
                    for p, l in zip(predictions, labels):
                        res = json.dumps({"labels": l, "predict": p}, ensure_ascii=False)
                        writer.write(f"{res}\n")
    return results


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/ptuning/train.sh:
--------------------------------------------------------------------------------
# PRE_SEQ_LEN is the soft prompt (prefix) length used by P-Tuning v2, not the maximum input length.
PRE_SEQ_LEN=128
# LR is the learning rate, here 0.02. Too large a value can make training unstable;
# too small a value makes training slow.
LR=2e-2
NUM_GPUS=1

# Notes on the command below. A comment placed after a trailing backslash breaks the
# line continuation, so the notes are collected here instead:
#   torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS
#       launches a single-node distributed run with one process per GPU.
#   --do_train: run training.
#   --preprocessing_num_workers: number of processes used for preprocessing.
#   --prompt_column / --response_column: names of the dataset columns holding the
#       model input and the expected response.
#   --overwrite_cache: ignore cached preprocessed data and preprocess again.
#   --model_name_or_path: name or path of the pretrained model to fine-tune.
#   --overwrite_output_dir: overwrite the contents of the output directory if it already exists.
#   --max_source_length / --max_target_length: maximum length of the source input
#       and of the target output.
#   --per_device_train_batch_size / --per_device_eval_batch_size: training and
#       evaluation batch size per device (GPU).
#   --gradient_accumulation_steps: number of batches whose gradients are accumulated
#       before one parameter update; this saves memory when a large batch does not fit.
#   --predict_with_generate: use generation (autoregressive decoding) for prediction.
#   --max_steps: maximum number of training steps.
#   --logging_steps / --save_steps: step interval for logging and for saving checkpoints.
#   --learning_rate / --pre_seq_len: apply LR and PRE_SEQ_LEN defined above.
#   --quantization_bit 4: quantize the model weights to 4 bits, so each weight maps to
#       one of 16 (2^4) values. This reduces memory use and may speed up inference,
#       possibly at the cost of accuracy.
torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --preprocessing_num_workers 10 \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm2-6b \
    --output_dir output/adgen-chatglm2-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4


--------------------------------------------------------------------------------
/ptuning/train_chat.sh:
--------------------------------------------------------------------------------
PRE_SEQ_LEN=128
LR=1e-2
NUM_GPUS=1

torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
    --do_train \
    --train_file $CHAT_TRAIN_DATA \
    --validation_file $CHAT_VAL_DATA \
    --preprocessing_num_workers 10 \
    --prompt_column prompt \
    --response_column response \
    --history_column history \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm2-6b \
    --output_dir $CHECKPOINT_NAME \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 256 \
--per_device_train_batch_size 1 \ 20 | --per_device_eval_batch_size 1 \ 21 | --gradient_accumulation_steps 16 \ 22 | --predict_with_generate \ 23 | --max_steps 3000 \ 24 | --logging_steps 10 \ 25 | --save_steps 1000 \ 26 | --learning_rate $LR \ 27 | --pre_seq_len $PRE_SEQ_LEN \ 28 | --quantization_bit 4 29 | 30 | -------------------------------------------------------------------------------- /ptuning/trainer.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2020-present the HuggingFace Inc. team. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """ 16 | The Trainer class, to easily train a 🤗 Transformers from scratch or finetune it on a new task. 
17 | """ 18 | import os 19 | from typing import Optional 20 | from transformers import Trainer 21 | 22 | import torch 23 | from transformers.modeling_utils import PreTrainedModel, unwrap_model 24 | from transformers.utils import logging 25 | 26 | logger = logging.get_logger(__name__) 27 | 28 | WEIGHTS_NAME = "pytorch_model.bin" 29 | TRAINING_ARGS_NAME = "training_args.bin" 30 | 31 | 32 | class PrefixTrainer(Trainer): 33 | def __init__(self, *args, save_changed=False, **kwargs): 34 | self.save_changed = save_changed 35 | super().__init__(*args, **kwargs) 36 | 37 | def _save(self, output_dir: Optional[str] = None, state_dict=None): 38 | # If we are executing this function, we are the process zero, so we don't check for that. 39 | output_dir = output_dir if output_dir is not None else self.args.output_dir 40 | os.makedirs(output_dir, exist_ok=True) 41 | logger.info(f"Saving model checkpoint to {output_dir}") 42 | # Save a trained model and configuration using `save_pretrained()`. 43 | # They can then be reloaded using `from_pretrained()` 44 | if not isinstance(self.model, PreTrainedModel): 45 | if isinstance(unwrap_model(self.model), PreTrainedModel): 46 | if state_dict is None: 47 | state_dict = self.model.state_dict() 48 | unwrap_model(self.model).save_pretrained(output_dir, state_dict=state_dict) 49 | else: 50 | logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.") 51 | if state_dict is None: 52 | state_dict = self.model.state_dict() 53 | torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) 54 | else: 55 | if self.save_changed: 56 | print("Saving PrefixEncoder") 57 | state_dict = self.model.state_dict() 58 | filtered_state_dict = {} 59 | for k, v in self.model.named_parameters(): 60 | if v.requires_grad: 61 | filtered_state_dict[k] = state_dict[k] 62 | self.model.save_pretrained(output_dir, state_dict=filtered_state_dict) 63 | else: 64 | print("Saving the whole model") 65 | self.model.save_pretrained(output_dir, 
state_dict=state_dict) 66 | if self.tokenizer is not None: 67 | self.tokenizer.save_pretrained(output_dir) 68 | 69 | # Good practice: save your training arguments together with the trained model 70 | torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME)) 71 | -------------------------------------------------------------------------------- /ptuning/trainer_seq2seq.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 The HuggingFace Team. All rights reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from typing import Any, Dict, List, Optional, Tuple, Union 16 | 17 | import torch 18 | from torch import nn 19 | from torch.utils.data import Dataset 20 | 21 | from transformers.deepspeed import is_deepspeed_zero3_enabled 22 | from trainer import PrefixTrainer 23 | from transformers.trainer_utils import PredictionOutput 24 | from transformers.utils import logging 25 | 26 | 27 | logger = logging.get_logger(__name__) 28 | 29 | 30 | class Seq2SeqTrainer(PrefixTrainer): 31 | def evaluate( 32 | self, 33 | eval_dataset: Optional[Dataset] = None, 34 | ignore_keys: Optional[List[str]] = None, 35 | metric_key_prefix: str = "eval", 36 | **gen_kwargs 37 | ) -> Dict[str, float]: 38 | """ 39 | Run evaluation and returns metrics. 
40 | 41 | The calling script will be responsible for providing a method to compute metrics, as they are task-dependent 42 | (pass it to the init `compute_metrics` argument). 43 | 44 | You can also subclass and override this method to inject custom behavior. 45 | 46 | Args: 47 | eval_dataset (`Dataset`, *optional*): 48 | Pass a dataset if you wish to override `self.eval_dataset`. If it is an [`~datasets.Dataset`], columns 49 | not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__` 50 | method. 51 | ignore_keys (`List[str]`, *optional*): 52 | A list of keys in the output of your model (if it is a dictionary) that should be ignored when 53 | gathering predictions. 54 | metric_key_prefix (`str`, *optional*, defaults to `"eval"`): 55 | An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named 56 | "eval_bleu" if the prefix is `"eval"` (default) 57 | max_length (`int`, *optional*): 58 | The maximum target length to use when predicting with the generate method. 59 | num_beams (`int`, *optional*): 60 | Number of beams for beam search that will be used when predicting with the generate method. 1 means no 61 | beam search. 62 | gen_kwargs: 63 | Additional `generate` specific kwargs. 64 | 65 | Returns: 66 | A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The 67 | dictionary also contains the epoch number which comes from the training state. 
68 | """ 69 | 70 | gen_kwargs = gen_kwargs.copy() 71 | if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: 72 | gen_kwargs["max_length"] = self.args.generation_max_length 73 | gen_kwargs["num_beams"] = ( 74 | gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams 75 | ) 76 | self._gen_kwargs = gen_kwargs 77 | 78 | return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix) 79 | 80 | def predict( 81 | self, 82 | test_dataset: Dataset, 83 | ignore_keys: Optional[List[str]] = None, 84 | metric_key_prefix: str = "test", 85 | **gen_kwargs 86 | ) -> PredictionOutput: 87 | """ 88 | Run prediction and returns predictions and potential metrics. 89 | 90 | Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method 91 | will also return metrics, like in `evaluate()`. 92 | 93 | Args: 94 | test_dataset (`Dataset`): 95 | Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the 96 | `model.forward()` method are automatically removed. Has to implement the method `__len__` 97 | ignore_keys (`List[str]`, *optional*): 98 | A list of keys in the output of your model (if it is a dictionary) that should be ignored when 99 | gathering predictions. 100 | metric_key_prefix (`str`, *optional*, defaults to `"eval"`): 101 | An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named 102 | "eval_bleu" if the prefix is `"eval"` (default) 103 | max_length (`int`, *optional*): 104 | The maximum target length to use when predicting with the generate method. 105 | num_beams (`int`, *optional*): 106 | Number of beams for beam search that will be used when predicting with the generate method. 1 means no 107 | beam search. 108 | gen_kwargs: 109 | Additional `generate` specific kwargs. 
110 | 111 | 112 | 113 | If your predictions or labels have different sequence lengths (for instance because you're doing dynamic 114 | padding in a token classification task) the predictions will be padded (on the right) to allow for 115 | concatenation into one array. The padding index is -100. 116 | 117 | 118 | 119 | Returns: *NamedTuple* A namedtuple with the following keys: 120 | 121 | - predictions (`np.ndarray`): The predictions on `test_dataset`. 122 | - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some). 123 | - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained 124 | labels). 125 | """ 126 | 127 | gen_kwargs = gen_kwargs.copy() 128 | if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: 129 | gen_kwargs["max_length"] = self.args.generation_max_length 130 | gen_kwargs["num_beams"] = ( 131 | gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams 132 | ) 133 | self._gen_kwargs = gen_kwargs 134 | 135 | 136 | return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix) 137 | 138 | def prediction_step( 139 | self, 140 | model: nn.Module, 141 | inputs: Dict[str, Union[torch.Tensor, Any]], 142 | prediction_loss_only: bool, 143 | ignore_keys: Optional[List[str]] = None, 144 | ) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: 145 | """ 146 | Perform an evaluation step on `model` using `inputs`. 147 | 148 | Subclass and override to inject custom behavior. 149 | 150 | Args: 151 | model (`nn.Module`): 152 | The model to evaluate. 153 | inputs (`Dict[str, Union[torch.Tensor, Any]]`): 154 | The inputs and targets of the model. 155 | 156 | The dictionary will be unpacked before being fed to the model. Most models expect the targets under the 157 | argument `labels`. Check your model's documentation for all accepted arguments. 
158 | prediction_loss_only (`bool`): 159 | Whether or not to return the loss only. 160 | 161 | Return: 162 | Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and 163 | labels (each being optional). 164 | """ 165 | 166 | if not self.args.predict_with_generate or prediction_loss_only: 167 | return super().prediction_step( 168 | model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys 169 | ) 170 | 171 | has_labels = "labels" in inputs 172 | inputs = self._prepare_inputs(inputs) 173 | 174 | # XXX: adapt synced_gpus for fairscale as well 175 | gen_kwargs = self._gen_kwargs.copy() 176 | if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: 177 | gen_kwargs["max_length"] = self.model.config.max_length 178 | gen_kwargs["num_beams"] = ( 179 | gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams 180 | ) 181 | default_synced_gpus = True if is_deepspeed_zero3_enabled() else False 182 | gen_kwargs["synced_gpus"] = ( 183 | gen_kwargs["synced_gpus"] if gen_kwargs.get("synced_gpus") is not None else default_synced_gpus 184 | ) 185 | 186 | if "attention_mask" in inputs: 187 | gen_kwargs["attention_mask"] = inputs.get("attention_mask", None) 188 | if "position_ids" in inputs: 189 | gen_kwargs["position_ids"] = inputs.get("position_ids", None) 190 | if "global_attention_mask" in inputs: 191 | gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None) 192 | 193 | # prepare generation inputs 194 | # some encoder-decoder models can have varying encoder's and thus 195 | # varying model input names 196 | if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name: 197 | generation_inputs = inputs[self.model.encoder.main_input_name] 198 | else: 199 | generation_inputs = inputs[self.model.main_input_name] 200 | 201 | gen_kwargs["input_ids"] = generation_inputs 202 | 
generated_tokens = self.model.generate(**gen_kwargs) 203 | generated_tokens = generated_tokens[:, generation_inputs.size()[-1]:] 204 | 205 | # in case the batch is shorter than max length, the output should be padded 206 | if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]: 207 | generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"]) 208 | elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < ( 209 | gen_kwargs["max_new_tokens"] + 1 210 | ): 211 | generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1) 212 | 213 | loss = None 214 | 215 | if self.args.prediction_loss_only: 216 | return (loss, None, None) 217 | 218 | if has_labels: 219 | labels = inputs["labels"] 220 | if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]: 221 | labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"]) 222 | elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < ( 223 | gen_kwargs["max_new_tokens"] + 1 224 | ): 225 | labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1)) 226 | else: 227 | labels = None 228 | 229 | return (loss, generated_tokens, labels) 230 | 231 | def _pad_tensors_to_max_len(self, tensor, max_length): 232 | if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"): 233 | # If PAD token is not defined at least EOS token has to be defined 234 | pad_token_id = ( 235 | self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id 236 | ) 237 | else: 238 | if self.model.config.pad_token_id is not None: 239 | pad_token_id = self.model.config.pad_token_id 240 | else: 241 | raise ValueError("Pad_token_id must be set in the configuration of the model, in order to pad tensors") 242 | 243 | padded_tensor = pad_token_id * torch.ones( 244 | (tensor.shape[0], 
max_length), dtype=tensor.dtype, device=tensor.device
        )
        padded_tensor[:, : tensor.shape[-1]] = tensor
        return padded_tensor

--------------------------------------------------------------------------------
/ptuning/web_demo.py:
--------------------------------------------------------------------------------
import os, sys

import gradio as gr
import mdtex2html

import torch
import transformers
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    Seq2SeqTrainingArguments,
    set_seed,
)

from arguments import ModelArguments, DataTrainingArguments


model = None
tokenizer = None

"""Override Chatbot.postprocess"""


def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert((message)),
            None if response is None else mdtex2html.convert(response),
        )
    return y


gr.Chatbot.postprocess = postprocess


def parse_text(text):
    """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split('`')
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f'<br></code></pre>'
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace("`", "\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>"+line
    text = "".join(lines)
    return text


def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
    chatbot.append((parse_text(input), ""))
    for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
                                                                return_past_key_values=True,
                                                                max_length=max_length, top_p=top_p,
                                                                temperature=temperature):
        chatbot[-1] = (parse_text(input), parse_text(response))

        yield chatbot, history, past_key_values


def reset_user_input():
    return gr.update(value='')


def reset_state():
    return [], [], None


with gr.Blocks() as demo:
    gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")

    chatbot = gr.Chatbot()
    with gr.Row():
        with gr.Column(scale=4):
            with gr.Column(scale=12):
                user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
                    container=False)
            with gr.Column(min_width=32, scale=1):
                submitBtn = gr.Button("Submit", variant="primary")
        with gr.Column(scale=1):
            emptyBtn = gr.Button("Clear History")
            max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
            top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
            temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)

    history = gr.State([])
    past_key_values = gr.State(None)

    submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
                    [chatbot, history, past_key_values], show_progress=True)
    submitBtn.click(reset_user_input, [], [user_input])

    emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)


def main():
    global model, tokenizer

    parser = HfArgumentParser((
        ModelArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
129 | model_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))[0] 130 | else: 131 | model_args = parser.parse_args_into_dataclasses()[0] 132 | 133 | tokenizer = AutoTokenizer.from_pretrained( 134 | model_args.model_name_or_path, trust_remote_code=True) 135 | config = AutoConfig.from_pretrained( 136 | model_args.model_name_or_path, trust_remote_code=True) 137 | 138 | config.pre_seq_len = model_args.pre_seq_len 139 | config.prefix_projection = model_args.prefix_projection 140 | 141 | if model_args.ptuning_checkpoint is not None: 142 | print(f"Loading prefix_encoder weight from {model_args.ptuning_checkpoint}") 143 | model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True) 144 | prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin")) 145 | new_prefix_state_dict = {} 146 | for k, v in prefix_state_dict.items(): 147 | if k.startswith("transformer.prefix_encoder."): 148 | new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v 149 | model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict) 150 | else: 151 | model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True) 152 | 153 | if model_args.quantization_bit is not None: 154 | print(f"Quantized to {model_args.quantization_bit} bit") 155 | model = model.quantize(model_args.quantization_bit) 156 | model = model.cuda() 157 | if model_args.pre_seq_len is not None: 158 | # P-tuning v2 159 | model.transformer.prefix_encoder.float() 160 | 161 | model = model.eval() 162 | demo.queue().launch(share=False, inbrowser=True) 163 | 164 | 165 | 166 | if __name__ == "__main__": 167 | main() -------------------------------------------------------------------------------- /ptuning/web_demo.sh: -------------------------------------------------------------------------------- 1 | PRE_SEQ_LEN=128 2 | 3 | CUDA_VISIBLE_DEVICES=0 python3 web_demo.py \ 4 | 
--model_name_or_path THUDM/chatglm2-6b \ 5 | --ptuning_checkpoint output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-3000 \ 6 | --pre_seq_len $PRE_SEQ_LEN 7 | 8 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | protobuf 2 | transformers==4.30.2 3 | cpm_kernels 4 | torch>=2.0 5 | gradio 6 | mdtex2html 7 | sentencepiece 8 | accelerate 9 | sse-starlette 10 | streamlit>=1.24.0 -------------------------------------------------------------------------------- /resources/WECHAT.md: -------------------------------------------------------------------------------- 1 |
扫码关注公众号,加入「ChatGLM交流群」

Scan the QR code to follow the official account and join the "ChatGLM Discussion Group"

--------------------------------------------------------------------------------
/resources/cli-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/cli-demo.png

--------------------------------------------------------------------------------
/resources/knowledge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/knowledge.png

--------------------------------------------------------------------------------
/resources/long-context.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/long-context.png

--------------------------------------------------------------------------------
/resources/math.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/math.png

--------------------------------------------------------------------------------
/resources/web-demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/web-demo.gif

--------------------------------------------------------------------------------
/resources/web-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/web-demo.png

--------------------------------------------------------------------------------
/resources/wechat.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/wechat.jpg

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import os
from typing import Dict, Tuple, Union, Optional

from torch.nn import Module
from transformers import AutoModel


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings counts as 1 layer
    # transformer.final_layernorm and lm_head together count as 1 layer
    # transformer.layers accounts for 28 layers
    # 30 layers in total are distributed across num_gpus cards
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding can end up
    # on different devices, which raises a RuntimeError
    # on Windows, model.device is set to transformer.word_embeddings.device
    # on Linux, model.device is set to lm_head.device
    # when chat or stream_chat is called, input_ids is placed on model.device
    # if transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first card
    # this file comes from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py
    # with only minor changes here to support ChatGLM2
    device_map = {
        'transformer.embedding.word_embeddings': 0,
        'transformer.encoder.final_layernorm': 0,
        'transformer.output_layer': 0,
        'transformer.rotary_pos_emb': 0,
        'lm_head': 0
    }

    # The first card already holds the 2 "virtual" layers pinned above.
    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.encoder.layers.{i}'] = gpu_target
        used += 1

    return device_map


def load_model_on_gpus(checkpoint_path:
                       Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model

        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()

        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)

        model = dispatch_model(model, device_map=device_map)

    return model

--------------------------------------------------------------------------------
/web_demo.py:
--------------------------------------------------------------------------------
from transformers import AutoModel, AutoTokenizer
import gradio as gr
import mdtex2html
from utils import load_model_on_gpus

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
# Multi-GPU support: use the two lines below instead of the line above,
# and set num_gpus to the number of GPUs you actually have
# from utils import load_model_on_gpus
# model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
model = model.eval()

"""Override Chatbot.postprocess"""


def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert((message)),
            None if response is None else mdtex2html.convert(response),
        )
    return y


gr.Chatbot.postprocess = postprocess


def parse_text(text):
    """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            # A fence line toggles between opening and closing a code block.
            count += 1
            items = line.split('`')
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f'<br></code></pre>'
        else:
            if i > 0:
                if count % 2 == 1:
                    # Inside a code block: escape characters that HTML or Markdown would interpret.
                    line = line.replace("`", "\\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>"+line
    text = "".join(lines)
    return text


def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
    chatbot.append((parse_text(input), ""))
    for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
                                                                return_past_key_values=True,
                                                                max_length=max_length, top_p=top_p,
                                                                temperature=temperature):
        chatbot[-1] = (parse_text(input), parse_text(response))

        yield chatbot, history, past_key_values


def reset_user_input():
    return gr.update(value='')


def reset_state():
    return [], [], None


with gr.Blocks() as demo:
    gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")

    chatbot = gr.Chatbot()
    with gr.Row():
        with gr.Column(scale=4):
            with gr.Column(scale=12):
                user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
                    container=False)
            with gr.Column(min_width=32, scale=1):
                submitBtn = gr.Button("Submit", variant="primary")
        with gr.Column(scale=1):
            emptyBtn = gr.Button("Clear History")
            max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
            top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
            temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)

    history = gr.State([])
    past_key_values = gr.State(None)

    submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
                    [chatbot, history, past_key_values], show_progress=True)
    submitBtn.click(reset_user_input, [], [user_input])

    emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)

demo.queue().launch(share=False, inbrowser=True)

--------------------------------------------------------------------------------
/web_demo2.py:
--------------------------------------------------------------------------------
from transformers import AutoModel, AutoTokenizer
import streamlit as st


st.set_page_config(
    page_title="ChatGLM2-6b Demo",
    page_icon=":robot:",
    layout='wide'
)


@st.cache_resource
def get_model():
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
    # Multi-GPU support: use the two lines below instead of the line above,
    # and set num_gpus to the number of GPUs you actually have
    # from utils import load_model_on_gpus
    # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
    model = model.eval()
    return tokenizer, model


tokenizer, model = get_model()

st.title("ChatGLM2-6B")

max_length = st.sidebar.slider(
    'max_length', 0, 32768, 8192, step=1
)
top_p = st.sidebar.slider(
    'top_p', 0.0, 1.0, 0.8, step=0.01
)
temperature = st.sidebar.slider(
    'temperature', 0.0, 1.0, 0.8, step=0.01
)

if 'history' not in st.session_state:
    st.session_state.history = []

if 'past_key_values' not in st.session_state:
    st.session_state.past_key_values = None

# Replay earlier turns, then reserve placeholders for the new exchange.
for i, (query, response) in enumerate(st.session_state.history):
    with st.chat_message(name="user", avatar="user"):
        st.markdown(query)
    with st.chat_message(name="assistant", avatar="assistant"):
        st.markdown(response)
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

prompt_text = st.text_area(label="User command input",
                           height=100,
                           placeholder="Enter your command here")

button = st.button("Send", key="predict")

if button:
    input_placeholder.markdown(prompt_text)
    history, past_key_values = st.session_state.history, st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(tokenizer, prompt_text, history,
                                                                past_key_values=past_key_values,
                                                                max_length=max_length, top_p=top_p,
                                                                temperature=temperature,
                                                                return_past_key_values=True):
        message_placeholder.markdown(response)

    st.session_state.history = history
    st.session_state.past_key_values = past_key_values

--------------------------------------------------------------------------------
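The layer-splitting rule used by `auto_configure_device_map` in `/utils.py` can be checked without any GPU. The sketch below re-implements only that assignment loop in plain Python, so the resulting split can be inspected for a given `num_gpus`. The names `sketch_device_map` and `layers_per_gpu` are hypothetical helpers introduced for illustration and are not part of the repository; the layer count of 28 and the five entries pinned to card 0 are taken from `utils.py`.

```python
from collections import Counter
from typing import Dict


def sketch_device_map(num_gpus: int, num_trans_layers: int = 28) -> Dict[str, int]:
    """Hypothetical stand-in that mirrors the assignment loop in utils.py."""
    # Embedding and final_layernorm/lm_head are counted as 2 extra "layers".
    per_gpu_layers = (num_trans_layers + 2) / num_gpus
    device_map = {
        "transformer.embedding.word_embeddings": 0,
        "transformer.encoder.final_layernorm": 0,
        "transformer.output_layer": 0,
        "transformer.rotary_pos_emb": 0,
        "lm_head": 0,
    }
    used, gpu_target = 2, 0  # card 0 starts with the 2 pinned "layers"
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f"transformer.encoder.layers.{i}"] = gpu_target
        used += 1
    return device_map


def layers_per_gpu(device_map: Dict[str, int]) -> Dict[int, int]:
    """Count how many transformer blocks land on each card."""
    counts = Counter(gpu for name, gpu in device_map.items() if ".layers." in name)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(layers_per_gpu(sketch_device_map(2)))  # {0: 13, 1: 15}
    print(layers_per_gpu(sketch_device_map(3)))  # {0: 8, 1: 10, 2: 10}
```

Card 0 receives fewer transformer blocks than the others because it also hosts the embedding and output modules. A dictionary of this shape is what `load_model_on_gpus` accepts through its `device_map` argument, so a hand-written split for cards with unequal memory could be passed in the same form.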