├── FAQ.md
├── MODEL_LICENSE
├── README.md
├── README2.md
├── README_EN.md
├── api.py
├── chatglm2PT
    ├── configuration_chatglm.py
    └── modelling_chatglm.py
├── cli_demo.py
├── evaluation
    ├── README.md
    └── evaluate_ceval.py
├── openai_api.py
├── ptuning
    ├── README.md
    ├── arguments.py
    ├── deepspeed.json
    ├── ds_train_finetune.sh
    ├── evaluate.sh
    ├── evaluate_finetune.sh
    ├── main.py
    ├── train.sh
    ├── train_chat.sh
    ├── trainer.py
    ├── trainer_seq2seq.py
    ├── web_demo.py
    └── web_demo.sh
├── requirements.txt
├── resources
    ├── WECHAT.md
    ├── cli-demo.png
    ├── knowledge.png
    ├── long-context.png
    ├── math.png
    ├── web-demo.gif
    ├── web-demo.png
    └── wechat.jpg
├── utils.py
├── web_demo.py
└── web_demo2.py


/FAQ.md:
--------------------------------------------------------------------------------
 1 | ## Q1
 2 | 
 3 | **Loading the quantized model directly on a Mac fails with `clang: error: unsupported option '-fopenmp'`**
 4 | 
 5 | This happens because macOS does not ship with OpenMP. The model still runs, but only on a single core. Installing the OpenMP dependency separately enables OMP on the Mac:
 6 | 
 7 | ```bash
 8 | # See https://mac.r-project.org/openmp/
 9 | # Assumption: gcc (clang) is version 14.x; for other versions, see the table provided by R-Project
10 | curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
11 | sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
12 | ```
13 | This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
14 | 
15 | > Note: if an earlier run of the `ChatGLM2-6B` project failed, it is best to clear the Hugging Face cache first; by default that is `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Because this uses the `rm` command, make sure you know exactly what you are deleting.
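As a quick sanity check after unpacking the tarball, the four files listed above can be looked up from Python. This is only a minimal sketch: the helper name `missing_openmp_files` and its `root` parameter are hypothetical and exist purely for illustration; the file paths are the ones named in this FAQ.

```python
from pathlib import Path

# Files this FAQ says the R-Project OpenMP tarball installs.
EXPECTED_OPENMP_FILES = (
    "/usr/local/lib/libomp.dylib",
    "/usr/local/include/ompt.h",
    "/usr/local/include/omp.h",
    "/usr/local/include/omp-tools.h",
)


def missing_openmp_files(root="/"):
    """Return the expected OpenMP files that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    return [p for p in EXPECTED_OPENMP_FILES if not (base / p.lstrip("/")).exists()]


if __name__ == "__main__":
    missing = missing_openmp_files()
    print("OpenMP files found" if not missing else f"Missing: {missing}")
```

An empty result only shows that the files are present; it does not prove that the compiler picks them up.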
16 | 


--------------------------------------------------------------------------------
/MODEL_LICENSE:
--------------------------------------------------------------------------------
 1 | The ChatGLM-6B License
 2 | 
 3 | 一、定义
 4 | 
 5 | “许可方”是指分发其软件的 ChatGLM2-6B 模型团队。
 6 | 
 7 | “软件”是指根据本许可提供的 ChatGLM2-6B 模型参数。
 8 | 
 9 | 2. 许可授予
10 | 
11 | 根据本许可的条款和条件，许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可，仅用于您的非商业研究目的。
12 | 
13 | 上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
14 | 
15 | 3.限制
16 | 
17 | 您不得出于任何商业、军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
18 | 
19 | 您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
20 | 
21 | 4.免责声明
22 | 
23 | 本软件“按原样”提供，不提供任何明示或暗示的保证，包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下，作者或版权持有人均不对任何索赔、损害或其他责任负责，无论是在合同诉讼、侵权行为还是其他方面，由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
24 | 
25 | 5. 责任限制
26 | 
27 | 除适用法律禁止的范围外，在任何情况下且根据任何法律理论，无论是基于侵权行为、疏忽、合同、责任或其他原因，任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害，或任何其他商业损失，即使许可人已被告知此类损害的可能性。
28 | 
29 | 6.争议解决
30 | 
31 | 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
32 | 
33 | 请注意，许可证可能会更新到更全面的版本。 有关许可和版权的任何问题，请通过 glm-130b@googlegroups.com 与我们联系。
34 | 
35 | 1. Definitions
36 | 
37 | “Licensor” means the ChatGLM2-6B Model Team that distributes its Software.
38 | 
39 | “Software” means the ChatGLM2-6B model parameters made available under this license.
40 | 
41 | 2. License Grant
42 | 
43 | Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
44 | 
45 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
46 | 
47 | 3. Restriction
48 | 
49 | You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
50 | 
51 | You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
52 | 
53 | 4. Disclaimer
54 | 
55 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
56 | 
57 | 5. Limitation of Liability
58 | 
59 | EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
60 | 
61 | 6. Dispute Resolution
62 | 
63 | This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
64 | 
65 | Note that the license is subject to update to a more comprehensive version.  For any questions related to the license and copyright, please contact us at glm-130b@googlegroups.com.
66 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ChatGLM2-6B-Explained
 2 | 
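 3 | A line-by-line walkthrough of the ChatGLM2-6B code.  
 4 | Updated incrementally; stars, forks, and PRs are welcome.   
 5 | Note: entries shown as `x` are placeholder directories, not valid paths.
 6 | 
 8 | This project focuses on the data flow, testing, and P-Tuning v2 fine-tuning. To understand how the underlying large model works, see [GLM-Explained](https://github.com/ArtificialZeng/GLM-Explained).
 9 | 
10 | Large models also build on two essential base libraries, [transformers](https://github.com/ArtificialZeng/tranformers-expalined) and [pytorch](https://github.com/ArtificialZeng/pytorch-explained); line-by-line walkthroughs of the key code in both libraries are available as well.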
12 | 
13 | 
14 | 
15 | * [x/](./src)
16 |   * [x/](./src/utils)
17 |     * [main.py](./ptuning/main.py)
18 |     * [train.sh parameter notes](./ptuning/train.sh) 
19 |   * [x.py](./src/train_sft.py)
20 | * [chatglm2PT](./chatglm2PT)
21 |   * [/configuration_chatglm.py](./chatglm2PT/configuration_chatglm.py)  This code defines a class named ChatGLMConfig, used to configure and manage the ChatGLM model.
22 |   * [/modelling_chatglm.py](./chatglm2PT/modelling_chatglm.py)
24 | * [x/](./examples)
25 |   * [x.md](./examples/ads_generation.md)
26 | * [README.md](./README.md)
27 | 
28 | 
29 | # CSDN blog version (syntax-highlighted):
30 | * [ChatGLM1/2 source code walkthrough series: column page](https://blog.csdn.net/sinat_37574187/category_12365053.html) 
31 |   * [/src/utils/](./ChatGLM-Efficient-Tuning-Explained/src/utils)
32 |     * [CSDN syntax-highlighted walkthrough of main.py (part 1)](https://zengxiaojian.blog.csdn.net/article/details/131617133?spm=1001.2014.3001.5502)
33 |     * [CSDN syntax-highlighted walkthrough of main.py (part 2)](https://blog.csdn.net/sinat_37574187/article/details/131621397)
34 | * [ChatGLM2-6B source walkthrough: web_demo.py](https://blog.csdn.net/sinat_37574187/article/details/131404024)
35 | * [README.md](./ChatGLM-Efficient-Tuning-Explained/README.md)
36 | 
37 | 
38 | ## Reference: source project
39 | 


--------------------------------------------------------------------------------
/README2.md:
--------------------------------------------------------------------------------
  1 | # ChatGLM2-6B
  2 | 
  3 | <p align="center">
  4 | 🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
  5 | </p>
  6 | <p align="center">
  7 |     👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1y7pqoloy-9b1g6T6JjA8J0KxvUjbwJw" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
  8 | </p>
  9 | 
 10 | *Read this in [English](README_EN.md)*
 11 | 
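 12 | ## Introduction
 13 | 
 14 | ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains many strengths of the first-generation model, such as fluent dialogue and a low deployment threshold, and introduces the following new features:
 15 | 
 16 | 1. **Stronger performance**: Building on the experience of developing the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM) and has undergone pre-training on 1.4T Chinese and English tokens followed by human preference alignment training. The [evaluation results](#评测结果) show that, compared with the first-generation model, ChatGLM2-6B improves substantially on MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), making it strongly competitive among open-source models of the same size.
 17 | 2. **Longer context**: Using [FlashAttention](https://github.com/HazyResearch/flash-attention), we extended the context length of the base model from 2K in ChatGLM-6B to 32K and trained with an 8K context length during the dialogue stage, which allows more rounds of dialogue. The current version of ChatGLM2-6B still has limited ability to understand very long documents in a single turn; we will focus on this in later iterations.
 18 | 3. **More efficient inference**: Using [Multi-Query Attention](http://arxiv.org/abs/1911.02150), ChatGLM2-6B achieves faster inference and lower GPU memory usage: with the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6 GB of GPU memory grows from 1K to 8K.
 19 | 4. **More open license**: The ChatGLM2-6B weights are **fully open** to academic research, and **commercial use is permitted** once official written permission has been obtained. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3.
 20 | 
 21 | -----
 22 | 
 23 | The open-source ChatGLM2-6B model is intended to advance large-model technology together with the open-source community. We ask developers and all users to comply with the [open-source license](MODEL_LICENSE) and not to use the open-source model, the code, or any derivatives of this project for purposes that may harm the country or society, or for any service that has not undergone safety assessment and registration. **At present, the project team has not developed any application based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows apps.**
 24 | 
 25 | Although every effort was made at each stage of training to keep the data compliant and accurate, ChatGLM2-6B is a relatively small model and its output is subject to probabilistic randomness, so the accuracy of its output cannot be guaranteed and the model is easily misled. **This project assumes no risk or responsibility for data security or public opinion issues caused by the open-source model and code, or for any consequences of the model being misled, abused, disseminated, or improperly used.**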
 26 | 
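 27 | ## Updates
 28 | **[2023/07/04]** Released P-Tuning v2 and full-parameter fine-tuning scripts; see [P-Tuning](./ptuning).
 29 | 
 30 | ## Related Projects
 31 | Open-source projects that accelerate ChatGLM2:
 32 | * [fastllm](https://github.com/ztxz16/fastllm/): a cross-platform accelerated inference solution; batched inference on a single GPU can reach 10000+ tokens per second, and it runs in real time on phones with as little as 3 GB of memory (about 4~5 tokens/s on a Snapdragon 865)
 33 | * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): a quantized CPU inference solution similar to llama.cpp, enabling real-time chat on a Mac laptop
 34 | 
 35 | ## Evaluation Results
 36 | We evaluated the model on a selection of typical Chinese and English datasets. Below are the results of ChatGLM2-6B on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (mathematics), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). A script for running the C-Eval evaluation is provided in [evaluation](./evaluation/README.md).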
 37 | 
 38 | ### MMLU
 39 | 
 40 | | Model | Average | STEM | Social Sciences | Humanities | Others |
 41 | | ----- | ----- | ---- | ----- | ----- | ----- |
 42 | | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
 43 | | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
 44 | | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
 45 | 
 46 | > The chat model is evaluated with zero-shot CoT (Chain-of-Thought); the base model is evaluated with few-shot answer-only prompting
 47 | 
 48 | ### C-Eval
 49 | 
 50 | | Model | Average | STEM | Social Sciences | Humanities | Others |
 51 | | ----- | ---- | ---- | ----- | ----- | ----- |
 52 | | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
 53 | | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
 54 | | ChatGLM2-6B | 50.1 | 46.4	| 60.4 | 50.6 | 46.9 | 
 55 | 
 56 | > The chat model is evaluated with zero-shot CoT; the base model is evaluated with few-shot answer-only prompting
 57 | 
 58 | ### GSM8K
 59 | 
 60 | | Model | Accuracy | Accuracy (Chinese)* |
 61 | | ----- | ----- | ----- |
 62 | | ChatGLM-6B | 4.82 | 5.85 |
 63 | | ChatGLM2-6B (base) | 32.37 | 28.95 |
 64 | | ChatGLM2-6B | 28.05 | 20.45 |
 65 | 
 66 | > All models are evaluated with few-shot CoT; the CoT prompts come from http://arxiv.org/abs/2201.11903
 67 | > 
 68 | > \* We translated 500 GSM8K problems and the CoT prompts with a translation API and then proofread them manually
 69 | 
 70 | 
 71 | ### BBH
 72 | 
 73 | | Model | Accuracy |
 74 | | ----- | ----- |
 75 | | ChatGLM-6B | 18.73 |
 76 | | ChatGLM2-6B (base) | 33.68 |
 77 | | ChatGLM2-6B | 30.00 |
 78 | 
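 79 | > All models are evaluated with few-shot CoT; the CoT prompts come from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
 80 | 
 81 | ## Inference Performance
 82 | ChatGLM2-6B uses [Multi-Query Attention](http://arxiv.org/abs/1911.02150), which speeds up generation. The average speed when generating 2000 characters compares as follows:
 83 | 
 84 | | Model | Inference speed (characters/s) |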
 85 | | ----  | -----  |
 86 | | ChatGLM-6B  | 31.49 |
 87 | | ChatGLM2-6B | 44.62 |
 88 | 
 89 | > Measured with the official implementation, batch size = 1, max length = 2048, bf16 precision, on an A100-SXM4-80G with PyTorch 2.0.1
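The 42% speed-up quoted in the introduction can be reproduced from the two figures in the table above. The short sketch below does that arithmetic; the helper name `relative_gain` is made up for illustration.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Characters per second, taken from the table above.
chatglm_6b, chatglm2_6b = 31.49, 44.62
print(f"{relative_gain(chatglm_6b, chatglm2_6b):.1f}%")  # prints 41.7%
```

Rounded to a whole number, 41.7% gives the 42% stated earlier.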
 90 | 
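 91 | Multi-Query Attention also reduces the GPU memory taken by the KV Cache during generation. In addition, ChatGLM2-6B is trained for dialogue with a Causal Mask, so the KV Cache from earlier rounds can be reused in a continuing conversation, which reduces memory usage further. As a result, when running INT4-quantized inference on a GPU with 6 GB of memory, the first-generation ChatGLM-6B runs out of memory after generating at most 1119 characters, whereas ChatGLM2-6B can generate at least 8192 characters.
 92 | 
 93 | | **Quantization level** | **Minimum GPU memory to encode a length of 2048** | **Minimum GPU memory to generate a length of 8192** |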
 94 | | -------------- |---------------------|---------------------|
 95 | | FP16 / BF16 | 13.1 GB             | 12.8 GB             | 
 96 | | INT8           | 8.2 GB              | 8.1 GB              |
 97 | | INT4           | 5.5 GB              | 5.1 GB              |
 98 | 
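 99 | > ChatGLM2-6B uses `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient attention computation. With an older PyTorch version it falls back to a naive attention implementation, and GPU memory usage may exceed the figures in the table above.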
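To see roughly why Multi-Query Attention shrinks the KV Cache, the cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per value. The sketch below applies that formula. The shape used (28 layers, 32 attention heads, 2 key/value groups, head dimension 128) is an assumption about a ChatGLM2-6B-like configuration and is not stated in this README, so the printed sizes are illustrative rather than measured.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (fp16/bf16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed model shape (not taken from this README).
LAYERS, HEADS, KV_GROUPS, HEAD_DIM = 28, 32, 2, 128

full = kv_cache_bytes(8192, LAYERS, HEADS, HEAD_DIM)      # one K/V pair per attention head
mqa = kv_cache_bytes(8192, LAYERS, KV_GROUPS, HEAD_DIM)   # K/V shared across head groups
print(f"multi-head: {full / 2**30:.2f} GiB, multi-query: {mqa / 2**20:.0f} MiB")
# prints: multi-head: 3.50 GiB, multi-query: 224 MiB
```

Under these assumptions the cache for 8192 positions is 16 times smaller, which is in line with the longer generations reported above for a 6 GB card.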
100 | 
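101 | We also tested the effect of quantization on model performance. The results show that the effect is within an acceptable range.
102 | 
103 | | Quantization level | Accuracy (MMLU) | Accuracy (C-Eval dev) |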
104 | | ----- | ----- |-----------------------|
105 | | BF16 | 45.47 | 53.57                 |
106 | | INT4 | 43.13 | 50.30                 |
107 | 
108 | 
109 | 
110 | ## ChatGLM2-6B Examples
111 | 
112 | Compared with the first-generation model, ChatGLM2-6B has improved along several dimensions. Some comparison examples follow. There is much more that ChatGLM2-6B can do, waiting for you to explore!
113 | 
114 | <details><summary><b>Mathematics and Logic</b></summary>
115 | 
116 | ![](resources/math.png)
117 | 
118 | </details>
119 | 
120 | <details><summary><b>Knowledge Reasoning</b></summary>
121 | 
122 | ![](resources/knowledge.png)
123 | 
124 | </details>
125 | 
126 | <details><summary><b>Long Document Understanding</b></summary>
127 | 
128 | ![](resources/long-context.png)
129 | 
130 | </details>
131 | 
132 | ## Getting Started
133 | ### Environment Setup
134 | First download this repository:
135 | ```shell
136 | git clone https://github.com/THUDM/ChatGLM2-6B
137 | cd ChatGLM2-6B
138 | ```
139 | 
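140 | Then install the dependencies with pip:
141 | ```shell
142 | pip install -r requirements.txt
143 | ```
144 | The recommended version of the `transformers` library is `4.30.2`, and `torch` 2.0 or later is recommended for the best inference performance.
145 | 
146 | ### Calling from Code
147 | 
148 | The following code calls the ChatGLM2-6B model to generate a dialogue: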
149 | 
150 | ```python
151 | >>> from transformers import AutoTokenizer, AutoModel
152 | >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
153 | >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda')
154 | >>> model = model.eval()
155 | >>> response, history = model.chat(tokenizer, "你好", history=[])
156 | >>> print(response)
157 | 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
158 | >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
159 | >>> print(response)
160 | 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
161 | 
162 | 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
163 | 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
164 | 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
165 | 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
166 | 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
167 | 6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
168 | 
169 | 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
170 | ```
171 | 
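172 | #### Loading the Model Locally
173 | The code above lets `transformers` download the model implementation and parameters automatically. The complete model implementation is on the [Hugging Face Hub](https://huggingface.co/THUDM/chatglm2-6b). On a poor network connection, downloading the parameters may take a long time or even fail. In that case, download the model to local disk first and load it from there.
174 | 
175 | Downloading the model from the Hugging Face Hub requires [installing Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage) first, then running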
176 | ```Shell
177 | git clone https://huggingface.co/THUDM/chatglm2-6b
178 | ```
179 | 
180 | If downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation
181 | ```Shell
182 | GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b
183 | ```
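184 | Then download the model parameter files manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/) and place them in the local `chatglm2-6b` directory, replacing the existing files.
185 | 
186 | 
187 | Once the model is downloaded, replace `THUDM/chatglm2-6b` in the code above with the path to your local `chatglm2-6b` folder to load the model locally.
188 | 
189 | The model implementation is still changing. To pin the implementation for compatibility, add the `revision="v1.0"` parameter to the `from_pretrained` call. `v1.0` is the latest version at present; for the full list of versions, see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).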
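A minimal sketch of pinning the revision, assuming the same `transformers` calls as in the example above. The helper names `pinned_kwargs` and `load_pinned` are hypothetical; `revision` and `trust_remote_code` are the `from_pretrained` arguments mentioned in this section.

```python
def pinned_kwargs(revision="v1.0"):
    """Keyword arguments that pin the remote model code to one revision."""
    return {"trust_remote_code": True, "revision": revision}


def load_pinned(model_id="THUDM/chatglm2-6b", revision="v1.0"):
    """Load tokenizer and model at a fixed revision (downloads on first use)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import AutoModel, AutoTokenizer

    kwargs = pinned_kwargs(revision)
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModel.from_pretrained(model_id, **kwargs).eval()
    return tokenizer, model
```

Passing the same `revision` to both the tokenizer and the model keeps the two in step if the hosted implementation changes later.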
190 | 
191 | ### Web Demo
192 | 
193 | ![web-demo](resources/web-demo.gif)
194 | 
195 | Start the Streamlit-based web demo with the following command:
196 | ```shell
197 | streamlit run web_demo2.py
198 | ```
199 | 
200 | The program starts a web server and prints its address. Open that address in a browser to use the demo.
201 | 
202 | 
203 | [web_demo.py](./web_demo.py) provides the older Gradio-based web demo, which can be run with:
204 | ```shell
205 | python web_demo.py
206 | ```
207 | In our tests, the Streamlit-based web demo runs more smoothly when the input prompt is long.
208 | 
209 | ### Command-Line Demo
210 | 
211 | ![cli-demo](resources/cli-demo.png)
212 | 
213 | Run [cli_demo.py](cli_demo.py) from the repository:
214 | 
215 | ```shell
216 | python cli_demo.py
217 | ```
218 | 
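219 | The program runs an interactive dialogue in the terminal. Type an instruction and press Enter to generate a reply; type `clear` to clear the dialogue history, or `stop` to terminate the program.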
220 | 
221 | ### API Deployment
222 | First install the additional dependencies with `pip install fastapi uvicorn`, then run [api.py](api.py) from the repository:
223 | ```shell
224 | python api.py
225 | ```
226 | By default the service is deployed on local port 8000 and is called with the POST method
227 | ```shell
228 | curl -X POST "http://127.0.0.1:8000" \
229 |      -H 'Content-Type: application/json' \
230 |      -d '{"prompt": "你好", "history": []}'
231 | ```
232 | The returned value is
233 | ```json
234 | {
235 |   "response":"你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。",
236 |   "history":[["你好","你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。"]],
237 |   "status":200,
238 |   "time":"2023-03-23 21:38:40"
239 | }
240 | ```
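The same call can be made from Python with only the standard library. This is a sketch that mirrors the curl command and the response fields shown above; the function names `build_chat_request` and `chat` are made up for illustration, and the URL assumes the default local deployment on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, history=None, url="http://127.0.0.1:8000"):
    """Build the POST request that mirrors the curl call above."""
    payload = {"prompt": prompt, "history": history or []}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url="http://127.0.0.1:8000"):
    """Send one turn to a running api.py and return (response, updated history)."""
    with urllib.request.urlopen(build_chat_request(prompt, history, url)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]
```

Feeding the returned `history` back into the next `chat` call continues the same conversation, in the same way the `history` field is used in the JSON above.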
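241 | Thanks to [@hiyouga]() for implementing a streaming API deployment in the OpenAI format, which can serve as the backend for any ChatGPT-based application, such as [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web). Deploy it by running [openai_api.py](openai_api.py) from the repository: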
242 | ```shell
243 | python openai_api.py
244 | ```
245 | Example code for calling the API:
246 | ```python
247 | import openai
248 | if __name__ == "__main__":
249 |     openai.api_base = "http://localhost:8000/v1"
250 |     openai.api_key = "none"
251 |     for chunk in openai.ChatCompletion.create(
252 |         model="chatglm2-6b",
253 |         messages=[
254 |             {"role": "user", "content": "你好"}
255 |         ],
256 |         stream=True
257 |     ):
258 |         if hasattr(chunk.choices[0].delta, "content"):
259 |             print(chunk.choices[0].delta.content, end="", flush=True)
260 | ```
261 | 
262 | 
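263 | ## Low-Cost Deployment
264 | 
265 | ### Model Quantization
266 | 
267 | By default the model is loaded in FP16 precision, and running the code above requires about 13 GB of GPU memory. If your GPU memory is limited, you can try loading the model with quantization, as follows:
268 | 
269 | ```python
270 | # Adjust as needed; only 4-bit and 8-bit quantization are currently supported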
271 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
272 | ```
273 | 
274 | Quantization costs some model performance. In our tests, ChatGLM2-6B still generates natural, fluent text under 4-bit quantization.
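One way to choose a quantization level is to compare the free GPU memory with the minimum figures from the table in the inference performance section (13.1 GB for FP16/BF16, 8.2 GB for INT8, 5.5 GB for INT4 when encoding a length of 2048). The helper below is a hypothetical sketch built only on those three numbers; real requirements also depend on sequence length and PyTorch version.

```python
# Minimum GPU memory in GB to encode a length of 2048, from the table
# in the inference performance section (16 stands for FP16 / BF16).
MIN_MEMORY_GB = {16: 13.1, 8: 8.2, 4: 5.5}


def pick_bits(free_gb):
    """Return the highest precision (16, 8 or 4 bits) that fits, or None if none does."""
    for bits in (16, 8, 4):
        if free_gb >= MIN_MEMORY_GB[bits]:
            return bits
    return None


print(pick_bits(24.0), pick_bits(10.0), pick_bits(6.0), pick_bits(4.0))
# prints: 16 8 4 None
```

A result of 8 or 4 corresponds to the argument of `.quantize(...)` in the snippet above; 16 means no quantization is needed.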
275 | 
276 | If your memory is insufficient, you can load the already-quantized model directly:
277 | ```python
278 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()
279 | ```
280 | 
281 | <!-- 量化模型的参数文件也可以从[这里](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/)手动下载。 -->
282 | 
283 | ### CPU Deployment
284 | 
285 | If you have no GPU hardware, you can also run inference on the CPU, although it will be slower. Use the following (about 32 GB of memory is required)
286 | ```python
287 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
288 | ```
289 | If your memory is insufficient, you can also use the quantized model
290 | ```python
291 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float()
292 | ```
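293 | Running the quantized model on the CPU requires `gcc` and `openmp`. Most Linux distributions have them installed by default. On Windows, tick `openmp` when installing [TDM-GCC](https://jmeubank.github.io/tdm-gcc/). The `gcc` version in the Windows test environment is `TDM-GCC 10.3.0`, and on Linux it is `gcc 11.3.0`. On macOS, see [Q1](FAQ.md#q1).
294 | 
295 | ### Mac Deployment
296 | 
297 | On Macs with Apple Silicon or an AMD GPU, the MPS backend can run ChatGLM2-6B on the GPU. Follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number looks like 2.x.x.dev2023xxxx, not 2.x.x).
298 | 
299 | Currently macOS only supports [loading the model locally](README.md#从本地加载模型). Change the model loading in the code to load from a local path and use the mps backend: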
300 | ```python
301 | model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
302 | ```
303 | 
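304 | Loading the half-precision ChatGLM2-6B model requires about 13 GB of memory. On machines with less memory (such as a MacBook Pro with 16 GB), the system falls back to virtual memory on disk when free memory runs short, which slows inference severely.
305 | In that case you can use the quantized model chatglm2-6b-int4. Because the GPU quantization kernels are written in CUDA, they cannot be used on macOS, so inference has to run on the CPU.
306 | To make full use of CPU parallelism, you also need to [install OpenMP separately](FAQ.md#q1).
307 | 
308 | Inference on a Mac can also use [ChatGLM.cpp](https://github.com/li-plus/chatglm.cpp)
309 | 
310 | ### Multi-GPU Deployment
311 | If you have several GPUs but none has enough memory to hold the whole model, you can split the model across them. First install accelerate with `pip install accelerate`, then load the model as follows: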
312 | ```python
313 | from utils import load_model_on_gpus
314 | model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
315 | ```
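316 | This deploys the model on two GPUs for inference. Change `num_gpus` to the number of GPUs you want to use. The model is split evenly by default; you can also pass a `device_map` argument to specify the split yourself. 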
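To illustrate what a custom `device_map` might look like, the sketch below assigns transformer layers to GPUs in equal-sized chunks. It is not the implementation in `utils.py`: the helper name `even_device_map`, the module names, and the default of 28 layers are all assumptions about the model's layout and should be checked against the loaded model before use.

```python
def even_device_map(num_gpus, num_layers=28):
    """Spread transformer layers evenly over `num_gpus` devices (hypothetical helper)."""
    # Module names are assumptions about the model layout, not taken from this README.
    device_map = {
        "transformer.embedding": 0,
        "transformer.rotary_pos_emb": 0,
        "transformer.encoder.final_layernorm": 0,
        "transformer.output_layer": 0,
    }
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    for i in range(num_layers):
        device_map[f"transformer.encoder.layers.{i}"] = i // per_gpu
    return device_map


layout = even_device_map(num_gpus=2)
print(layout["transformer.encoder.layers.13"], layout["transformer.encoder.layers.14"])
# prints: 0 1
```

A dictionary of this shape maps module names to GPU indices, which is the form a `device_map` argument is expected to take when the split is specified by hand.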
317 | 
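318 | ## License
319 | 
320 | The code in this repository is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, while use of the ChatGLM2-6B model weights must follow the [Model License](MODEL_LICENSE). The ChatGLM2-6B weights are **fully open** to academic research, and **commercial use is permitted** once official written permission has been obtained. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3. To apply for a commercial license or to donate, contact [license@zhipuai.cn](mailto:license@zhipuai.cn). 
321 | 
322 | 
323 | ## Citation
324 | 
325 | If you find our work helpful, please consider citing the following papers. The ChatGLM2-6B paper will be released soon, so stay tuned.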
326 | 
327 | ```
328 | @article{zeng2022glm,
329 |   title={Glm-130b: An open bilingual pre-trained model},
330 |   author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
331 |   journal={arXiv preprint arXiv:2210.02414},
332 |   year={2022}
333 | }
334 | ```
335 | ```
336 | @inproceedings{du2022glm,
337 |   title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
338 |   author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
339 |   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
340 |   pages={320--335},
341 |   year={2022}
342 | }
343 | ```
344 | 


--------------------------------------------------------------------------------
/README_EN.md:
--------------------------------------------------------------------------------
  1 | <p align="center">
  2 | 🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
  3 | </p>
  4 | <p align="center">
  5 |     👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1y7pqoloy-9b1g6T6JjA8J0KxvUjbwJw" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
  6 | </p>
  7 | 
  8 | ## Introduction
  9 | 
 10 | ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:
 11 | 
 12 | 1. **Stronger Performance**: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM), and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The [evaluation results](#evaluation) show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size.
 13 | 2. **Longer Context**: Based on the [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
 14 | 3. **More Efficient Inference**: Based on the [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B has faster inference and lower GPU memory usage: under the official implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6 GB of GPU memory has increased from 1K to 8K.
 15 | 4. **More Open License**: The weights of ChatGLM2-6B are **fully open** to academic research, and with our official written permission, the weights of ChatGLM2-6B are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3.
 16 | 
 17 | -----
 18 | 
 19 | The open-source ChatGLM2-6B is intended to promote the development of LLMs together with the open-source community. We earnestly request developers and everyone to abide by the [open-source license](MODEL_LICENSE). Do not use the open-source model, code, or any derivatives from the open-source project for any purposes that may harm nations or societies, or for any services that have not undergone safety assessments and legal approval. **At present, our project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows App applications.**
 20 | 
 21 | Although the model strives to ensure the compliance and accuracy of data at each stage of training, due to the smaller scale of the ChatGLM2-6B model, and its susceptibility to probabilistic randomness, the accuracy of output content cannot be guaranteed, and the model can easily be misled. **Our project does not assume any risks or responsibilities arising from data security, public opinion risks, or any instances of the model being misled, abused, disseminated, or improperly used due to the open-source model and code.**
 22 | 
 23 | ## Projects
 24 | Open source projects that accelerate ChatGLM2:
 25 | * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): Real-time CPU inference on a MacBook accelerated by quantization, similar to llama.cpp.
 26 | 
 27 | ## Evaluation
 28 | We selected some typical Chinese and English datasets for evaluation. Below are the evaluation results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English).
 29 | 
 30 | ### MMLU
 31 | 
 32 | | Model | Average | STEM | Social Sciences | Humanities | Others |
 33 | | ----- | ----- | ---- | ----- | ----- | ----- |
 34 | | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
 35 | | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
 36 | | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
 37 | 
 38 | > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
 39 | 
 40 | ### C-Eval
 41 | 
 42 | | Model | Average | STEM | Social Sciences | Humanities | Others |
 43 | | ----- | ---- | ---- | ----- | ----- | ----- |
 44 | | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
 45 | | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
 46 | | ChatGLM2-6B | 50.1 | 46.4	| 60.4 | 50.6 | 46.9 | 
 47 | 
 48 | > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
 49 | 
 50 | ### GSM8K
 51 | 
 52 | | Model | Accuracy | Accuracy (Chinese)* |
 53 | | ----- | ----- | ----- |
 54 | | ChatGLM-6B | 4.82 | 5.85 |
 55 | | ChatGLM2-6B (base) | 32.37 | 28.95 |
 56 | | ChatGLM2-6B | 28.05 | 20.45 |
 57 | 
 58 | > All model versions are evaluated under few-shot CoT, and CoT prompts are from http://arxiv.org/abs/2201.11903
 59 | > \* We translated a 500-query subset of GSM8K and its corresponding CoT prompts with a machine translation API, followed by human proofreading.
 60 | 
 61 | 
 62 | ### BBH
 63 | 
 64 | | Model | Accuracy |
 65 | | ----- | ----- |
 66 | | ChatGLM-6B | 18.73 |
 67 | | ChatGLM2-6B (base) | 33.68 |
 68 | | ChatGLM2-6B | 30.00 |
 69 | 
 70 | > All model versions are evaluated under few-shot CoT, and CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
 71 | 
 72 | ## Inference Efficiency
 73 | ChatGLM2-6B employs [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to improve inference speed. Here is a comparison of the average speed for generating 2000 tokens.
 74 | 
 75 | 
 76 | | Model | Inference Speed (tokens/s) |
 77 | | ----  | -----  |
 78 | | ChatGLM-6B  | 31.49 |
 79 | | ChatGLM2-6B | 44.62 |
 80 | 
 81 | > Under our official implementation, batch size = 1, max length = 2048, bf16 precision, tested with an A100-SXM-80G and PyTorch 2.0 environment
 82 | 
 83 | Multi-Query Attention also reduces the GPU memory usage of the KV Cache during inference. Additionally, ChatGLM2-6B uses Causal Mask for dialogue training, which allows the reuse of the KV Cache from previous rounds in continuous dialogues, further optimizing GPU memory usage. Therefore, when performing INT4 quantization inference with a 6GB GPU, while the first-generation ChatGLM-6B can only generate a maximum of 1119 tokens before running out of memory, ChatGLM2-6B can generate at least 8192 tokens.
 84 | 
 85 | | **Quantization** | **Encoding 2048 Tokens** | **Decoding 8192 Tokens** |
 86 | | -------------- | --------------------- | --------------- |
 87 | | FP16 / BF16 | 13.1 GB             | 12.8 GB             | 
 88 | | INT8           | 8.2 GB              | 8.1 GB              |
 89 | | INT4           | 5.5 GB              | 5.1 GB              |
 90 | 
 91 | > ChatGLM2-6B takes advantage of `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient Attention computation. With an older PyTorch version, it falls back to the naive Attention implementation, which may result in higher GPU memory usage than shown in the table above.
 92 | 
 93 | We also tested the impact of quantization on model performance. The results show that the impact of quantization on model performance is within an acceptable range.
 94 | 
 95 | | Quantization | Accuracy (MMLU) | Accuracy (C-Eval dev) |
 96 | | ----- | ----- |-----------------------|
 97 | | BF16 | 45.47 | 53.57                 |
 98 | | INT4 | 43.13 | 50.30                 |
 99 | 
100 | 
101 | ## ChatGLM2-6B Examples
102 | 
103 | Compared to the first-generation model, ChatGLM2-6B has made improvements in multiple dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover!
104 | 
105 | <details><summary><b>Mathematics and Logic</b></summary>
106 | 
107 | ![](resources/math.png)
108 | 
109 | </details>
110 | 
111 | <details><summary><b>Knowledge Reasoning</b></summary>
112 | 
113 | ![](resources/knowledge.png)
114 | 
115 | </details>
116 | 
117 | <details><summary><b>Long Document Understanding</b></summary>
118 | 
119 | ![](resources/long-context.png)
120 | 
121 | </details>
122 | 
123 | ## Getting Started
124 | ### Environment Setup
125 | 
126 | Install dependencies with pip: `pip install -r requirements.txt`. It's recommended to use version `4.27.1` for the `transformers` library and use version 2.0 or higher for `torch` to achieve the best inference performance.
127 | 
128 | We provide a web page demo and a command line demo. You need to download this repository to use them:
129 | 
130 | ```shell
131 | git clone https://github.com/THUDM/ChatGLM2-6B
132 | cd ChatGLM2-6B
133 | ```
134 | 
135 | ### Usage
136 | 
137 | Generate dialogue with the following code:
138 | 
139 | ```python
140 | >>> from transformers import AutoTokenizer, AutoModel
141 | >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
142 | >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda').eval()
143 | >>> response, history = model.chat(tokenizer, "你好", history=[])
144 | >>> print(response)
145 | 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
146 | >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
147 | >>> print(response)
148 | 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
149 | 
150 | 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
151 | 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
152 | 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
153 | 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
154 | 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
155 | 6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
156 | 
157 | 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
158 | ```
159 | The model implementation is still under development. To pin the implementation and ensure compatibility, add the `revision="v1.0"` parameter to the `from_pretrained` call. `v1.0` is the latest version number. For a complete list of versions, see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).
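
A minimal sketch of pinning the revision, passing `revision` to both `from_pretrained` calls. The helper names `pinned_load_kwargs` and `load_pinned` are hypothetical and only serve this illustration; `transformers` is imported lazily so that defining the helpers needs neither the library nor a download.

```python
def pinned_load_kwargs(revision="v1.0"):
    """Keyword arguments shared by the tokenizer and model loaders."""
    return {"trust_remote_code": True, "revision": revision}


def load_pinned(model_id="THUDM/chatglm2-6b", revision="v1.0"):
    """Load tokenizer and model at a fixed revision (downloads on first use)."""
    # Imported here so the helper above stays usable without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    kwargs = pinned_load_kwargs(revision)
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModel.from_pretrained(model_id, **kwargs).eval()
    return tokenizer, model
```

Calling `load_pinned()` returns the same `(tokenizer, model)` pair used in the dialogue example, with the implementation fixed at `v1.0`.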
160 | 
161 | ### Web Demo
162 | 
163 | ![web-demo](resources/web-demo.gif)
164 | 
165 | Install Gradio with `pip install gradio`, and run [web_demo.py](web_demo.py):
166 | 
167 | ```shell
168 | python web_demo.py
169 | ```
170 | 
171 | The program runs a web server and outputs the URL. Open the URL in the browser to use the web demo.
172 | 
173 | ### CLI Demo
174 | 
175 | ![cli-demo](resources/cli-demo.png)
176 | 
177 | Run [cli_demo.py](cli_demo.py) in the repo:
178 | 
179 | ```shell
180 | python cli_demo.py
181 | ```
182 | 
183 | The command runs an interactive program in the shell. Type your instruction in the shell and hit enter to generate the response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
184 | 
185 | ## API Deployment
186 | First install the additional dependencies with `pip install fastapi uvicorn`. Then run [api.py](api.py) in the repo:
187 | ```shell
188 | python api.py
189 | ```
190 | By default the API runs on port `8000` of the local machine. You can call it via:
191 | ```shell
192 | curl -X POST "http://127.0.0.1:8000" \
193 |      -H 'Content-Type: application/json' \
194 |      -d '{"prompt": "你好", "history": []}'
195 | ```
196 | The returned value is:
197 | ```json
198 | {
199 |   "response":"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。",
200 |   "history":[["你好","你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。"]],
201 |   "status":200,
202 |   "time":"2023-03-23 21:38:40"
203 | }
204 | ```
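
The same call can be made from Python. The sketch below uses only the standard library and assumes the server from `api.py` is running at its default address; `build_request` and `chat` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # default address served by api.py


def build_request(prompt, history=None, url=API_URL):
    """Build the POST request that mirrors the curl call."""
    payload = {"prompt": prompt, "history": history or []}
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url=API_URL):
    """Send one turn and return (response, updated_history)."""
    with urllib.request.urlopen(build_request(prompt, history, url)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]

# Usage (requires `python api.py` to be running):
#   reply, history = chat("你好")
#   reply, history = chat("晚上睡不着应该怎么办", history)
```

Passing the returned `history` back into the next call keeps the conversation context, just as the `history` field does in the JSON payload.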
205 | ## Deployment
206 | 
207 | ### Quantization
208 | 
209 | By default, the model parameters are loaded in FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can load the model with quantization:
210 | 
211 | ```python
212 | # Change according to your hardware. Only 4-bit and 8-bit quantization are supported for now.
213 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
214 | ```
215 | 
216 | Quantization causes some loss of benchmark performance, but in our tests ChatGLM2-6B still generates natural, fluent text under 4-bit quantization.
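
As a rough back-of-the-envelope check on these memory figures, the weights alone take about `parameters × bits / 8` bytes. The sketch below assumes roughly 6.2 billion parameters (an assumption, not a figure stated in this section) and ignores activations and the KV cache, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 6.2e9  # assumed parameter count, for illustration only

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB for weights")
```

Under that assumption the FP16 estimate lands close to the 13GB quoted for the default load, and each halving of the bit width halves the weight footprint.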
217 | 
218 | ### CPU Deployment
219 | 
220 | If your computer has no GPU, you can also run inference on the CPU, but it is slow and takes about 32GB of memory:
221 | ```python
222 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
223 | ```
224 | 
225 | ### Inference on Mac
226 | 
227 | For Macs (and MacBooks) with Apple Silicon, you can use the MPS backend to run ChatGLM2-6B on the GPU. First, follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the version number should be 2.1.0.dev2023xxxx, not 2.0.0).
228 | 
229 | Currently you must [load the model locally](README_en.md#load-the-model-locally) on macOS. Change the code to load the model from your local path and use the `mps` backend:
230 | ```python
231 | model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
232 | ```
233 | 
234 | Loading an FP16 ChatGLM2-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook Pro with 16GB) fall back to virtual memory on disk when free memory runs out, which slows inference severely.
235 | 
236 | ## License
237 | 
238 | The code of this repository is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The use of the ChatGLM2-6B model weights is subject to the [Model License](MODEL_LICENSE). ChatGLM2-6B weights are **completely open** for academic research, and **commercial use** is also allowed after **obtaining official written permission**. If you find our open source model useful for your business, we welcome your donations towards the development of the next generation model, ChatGLM3. For related matters, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn).
239 | 
240 | ## Citation
241 | 
242 | If you find our work useful, please consider citing the following papers. The technical report for ChatGLM2-6B will be out soon.
243 | 
244 | ```
245 | @article{zeng2022glm,
246 |   title={Glm-130b: An open bilingual pre-trained model},
247 |   author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
248 |   journal={arXiv preprint arXiv:2210.02414},
249 |   year={2022}
250 | }
251 | ```
252 | ```
253 | @inproceedings{du2022glm,
254 |   title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
255 |   author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
256 |   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
257 |   pages={320--335},
258 |   year={2022}
259 | }
260 | ```


--------------------------------------------------------------------------------
/api.py:
--------------------------------------------------------------------------------
 1 | from fastapi import FastAPI, Request
 2 | from transformers import AutoTokenizer, AutoModel
 3 | import uvicorn, json, datetime
 4 | import torch
 5 | 
 6 | DEVICE = "cuda"
 7 | DEVICE_ID = "0"
 8 | CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
 9 | 
10 | 
11 | def torch_gc():
12 |     if torch.cuda.is_available():
13 |         with torch.cuda.device(CUDA_DEVICE):
14 |             torch.cuda.empty_cache()
15 |             torch.cuda.ipc_collect()
16 | 
17 | 
18 | app = FastAPI()
19 | 
20 | 
21 | @app.post("/")
22 | async def create_item(request: Request):
23 |     global model, tokenizer
24 |     json_post_raw = await request.json()
25 |     json_post = json.dumps(json_post_raw)
26 |     json_post_list = json.loads(json_post)
27 |     prompt = json_post_list.get('prompt')
28 |     history = json_post_list.get('history')
29 |     max_length = json_post_list.get('max_length')
30 |     top_p = json_post_list.get('top_p')
31 |     temperature = json_post_list.get('temperature')
32 |     response, history = model.chat(tokenizer,
33 |                                    prompt,
34 |                                    history=history,
35 |                                    max_length=max_length if max_length else 2048,
36 |                                    top_p=top_p if top_p else 0.7,
37 |                                    temperature=temperature if temperature else 0.95)
38 |     now = datetime.datetime.now()
39 |     time = now.strftime("%Y-%m-%d %H:%M:%S")
40 |     answer = {
41 |         "response": response,
42 |         "history": history,
43 |         "status": 200,
44 |         "time": time
45 |     }
46 |     log = f'[{time}] prompt: "{prompt}", response: {response!r}'
47 |     print(log)
48 |     torch_gc()
49 |     return answer
50 | 
51 | 
52 | if __name__ == '__main__':
53 |     tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
54 |     model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
55 |     # Multi-GPU support: use the three lines below in place of the two lines above, and set num_gpus to the number of GPUs you actually have
56 |     # model_path = "THUDM/chatglm2-6b"
57 |     # tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
58 |     # model = load_model_on_gpus(model_path, num_gpus=2)
59 |     model.eval()
60 |     uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
61 | 
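
`create_item` picks defaults with `x if x else default`, so any falsy value in the request (for example an explicit `0`) is replaced by the default as well. A minimal sketch of a stricter alternative, assuming the intent is to substitute defaults only for missing or null fields; `with_defaults` is a hypothetical helper and not part of `api.py`:

```python
DEFAULTS = {"max_length": 2048, "top_p": 0.7, "temperature": 0.95}


def with_defaults(body, defaults=DEFAULTS):
    """Return generation settings, using a default only when a field is absent or null."""
    return {
        key: body[key] if body.get(key) is not None else default
        for key, default in defaults.items()
    }


print(with_defaults({"top_p": 0.9}))
```

Whether a given value is valid for generation is a separate check; this only distinguishes "not provided" from "provided but falsy".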


--------------------------------------------------------------------------------
/chatglm2PT/configuration_chatglm.py:
--------------------------------------------------------------------------------
 1 | # Import PretrainedConfig from the transformers library
 2 | from transformers import PretrainedConfig
 3 | 
 4 | # Define ChatGLMConfig, a subclass of PretrainedConfig
 5 | class ChatGLMConfig(PretrainedConfig):
 6 |     # Model type identifier: "chatglm"
 7 |     model_type = "chatglm"
 8 | 
 9 |     # Initializer: sets the model's configuration parameters and their defaults
10 |     def __init__(
11 |         self,
12 |         num_layers=28,  # number of layers in the model, default 28
13 |         padded_vocab_size=65024,  # vocabulary size, default 65024
14 |         hidden_size=4096,  # hidden size, default 4096
15 |         ffn_hidden_size=13696,  # hidden size of the feed-forward network, default 13696
16 |         kv_channels=128,  # number of key/value channels per attention head, default 128
17 |         num_attention_heads=32,  # number of attention heads, default 32
18 |         seq_length=2048,  # sequence length, default 2048
19 |         hidden_dropout=0.0,  # dropout rate for hidden layers, default 0
20 |         attention_dropout=0.0,  # dropout rate for attention, default 0
21 |         layernorm_epsilon=1e-5,  # small constant used in LayerNorm, default 1e-5
22 |         rmsnorm=True,  # whether to use RMS normalization, default True
23 |         apply_residual_connection_post_layernorm=False,  # whether to apply the residual connection after LayerNorm, default False
24 |         post_layer_norm=True,  # whether to apply post-layer norm, default True
25 |         add_bias_linear=False,  # whether linear layers have a bias term, default False
26 |         add_qkv_bias=False,  # whether the query/key/value projection has a bias term, default False
27 |         bias_dropout_fusion=True,  # whether to fuse bias and dropout, default True
28 |         multi_query_attention=False,  # whether to use multi-query attention, default False
29 |         multi_query_group_num=1,  # number of multi-query groups, default 1
30 |         apply_query_key_layer_scaling=True,  # whether to apply query-key layer scaling, default True
31 |         attention_softmax_in_fp32=True,  # whether to compute the attention softmax in FP32, default True
32 |         fp32_residual_connection=False,  # whether to use FP32 in the residual connection, default False
33 |         quantization_bit=0,  # number of quantization bits, default 0
34 |         pre_seq_len=None,  # prefix sequence length, default None
35 |         prefix_projection=False,  # whether to apply prefix projection, default False
36 |         **kwargs  # any other keyword arguments
37 |     ):  # ChatGLMConfig holds and manages the configuration of the ChatGLM model.
38 |         
39 |         
40 |         self.num_layers = num_layers
41 |         self.vocab_size = padded_vocab_size
42 |         self.padded_vocab_size = padded_vocab_size
43 |         self.hidden_size = hidden_size
44 |         self.ffn_hidden_size = ffn_hidden_size
45 |         self.kv_channels = kv_channels
46 |         self.num_attention_heads = num_attention_heads
47 |         self.seq_length = seq_length
48 |         self.hidden_dropout = hidden_dropout
49 |         self.attention_dropout = attention_dropout
50 |         self.layernorm_epsilon = layernorm_epsilon
51 |         self.rmsnorm = rmsnorm
52 |         self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm
53 |         self.post_layer_norm = post_layer_norm
54 |         self.add_bias_linear = add_bias_linear
55 |         self.add_qkv_bias = add_qkv_bias
56 |         self.bias_dropout_fusion = bias_dropout_fusion
57 |         self.multi_query_attention = multi_query_attention
58 |         self.multi_query_group_num = multi_query_group_num
59 |         self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
60 |         self.attention_softmax_in_fp32 = attention_softmax_in_fp32
61 |         self.fp32_residual_connection = fp32_residual_connection
62 |         self.quantization_bit = quantization_bit
63 |         self.pre_seq_len = pre_seq_len
64 |         self.prefix_projection = prefix_projection
65 |         super().__init__(**kwargs)
66 | 
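
The attention projection sizes follow directly from these defaults. The sketch below repeats, in plain Python, the arithmetic that `SelfAttention.__init__` in `modelling_chatglm.py` performs; `attention_sizes` is a hypothetical helper for illustration only.

```python
def attention_sizes(kv_channels=128, num_attention_heads=32,
                    multi_query_attention=False, multi_query_group_num=1):
    """Derive the attention projection sizes from config values."""
    projection_size = kv_channels * num_attention_heads
    head_dim = projection_size // num_attention_heads
    if multi_query_attention:
        # Full-width queries, plus keys and values shared across query groups.
        qkv_hidden_size = projection_size + 2 * head_dim * multi_query_group_num
    else:
        # Separate full-width query, key, and value projections.
        qkv_hidden_size = 3 * projection_size
    return {
        "projection_size": projection_size,
        "head_dim": head_dim,
        "qkv_hidden_size": qkv_hidden_size,
    }


print(attention_sizes())
print(attention_sizes(multi_query_attention=True, multi_query_group_num=2))
```

With the class defaults the fused QKV projection is `3 * 4096 = 12288` wide; enabling multi-query attention with two groups shrinks it to `4096 + 2 * 128 * 2 = 4608`, which also shrinks the key/value cache.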


--------------------------------------------------------------------------------
/chatglm2PT/modelling_chatglm.py:
--------------------------------------------------------------------------------
   1 | """ PyTorch ChatGLM model. """
   2 | 
   3 | import math
   4 | import copy
   5 | import warnings
   6 | import re
   7 | import sys
   8 | 
   9 | import torch
  10 | import torch.utils.checkpoint
  11 | import torch.nn.functional as F
  12 | from torch import nn
  13 | from torch.nn import CrossEntropyLoss, LayerNorm  # CrossEntropyLoss (loss function) and LayerNorm (layer normalization)
  14 | from torch.nn.utils import skip_init  # skip_init constructs a module without running its weight initialization
  15 | from typing import Optional, Tuple, Union, List, Callable, Dict, Any  # type hints for variables, parameters, and return values
  16 | 
  17 | from transformers.modeling_outputs import (
  18 |     BaseModelOutputWithPast,
  19 |     CausalLMOutputWithPast,
  20 | )  # output classes BaseModelOutputWithPast and CausalLMOutputWithPast from Hugging Face transformers
  21 | 
  22 | from transformers.modeling_utils import PreTrainedModel  # PreTrainedModel is the base class of all pretrained models in transformers
  23 | from transformers.utils import logging  # logging utilities for creating and configuring loggers
  24 | 
  25 | from transformers.generation.logits_process import LogitsProcessor  # LogitsProcessor can inspect and modify logits during generation
  26 | from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput  # four generation helper classes
  27 | 
  28 | from .configuration_chatglm import ChatGLMConfig  # ChatGLMConfig is the ChatGLM-specific configuration class from this package
  29 | 
  30 | 
  31 | # flags required to enable jit fusion kernels
  32 | 
  33 | if sys.platform != 'darwin':  # Skip on 'darwin', which identifies macOS.
  34 |     torch._C._jit_set_profiling_mode(False)  # Disable the PyTorch JIT profiling mode.
  35 |     torch._C._jit_set_profiling_executor(False)  # Disable the PyTorch JIT profiling executor.
  36 |     torch._C._jit_override_can_fuse_on_cpu(True)  # Allow the JIT to fuse operations on CPU, merging several ops into one for efficiency.
  37 |     torch._C._jit_override_can_fuse_on_gpu(True)  # Allow the JIT to fuse operations on GPU for efficiency.
  38 | 
  39 | logger = logging.get_logger(__name__)  # Create a logger named after the current module.
  40 | 
  41 | _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM2-6B"  # Checkpoint name of the pretrained model, used in documentation.
  42 | _CONFIG_FOR_DOC = "ChatGLM6BConfig"  # Name of the pretrained model's configuration class, used in documentation.
  43 | 
  44 | CHATGLM_6B_PRETRAINED_MODEL_ARCHIVE_LIST = [
  45 |     "THUDM/chatglm2-6b",
  46 |     # See all ChatGLM models at https://huggingface.co/models?filter=chatglm
  47 | ]
  48 | 
  49 | 
  50 | def default_init(cls, *args, **kwargs):  # Takes a class (cls), positional arguments (*args), and keyword arguments (**kwargs).
  51 |     return cls(*args, **kwargs)  # Construct an instance of cls and return it.
  52 | 
  53 | class InvalidScoreLogitsProcessor(LogitsProcessor):  # A LogitsProcessor subclass that repairs invalid scores.
  54 | 
  55 |     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:  # Takes input_ids (a long tensor of input IDs) and scores (a float tensor) and returns a float tensor.
  56 |         if torch.isnan(scores).any() or torch.isinf(scores).any():  # If any score is NaN or infinite, zero all scores and set index 5 to 5e4 (50000).
  57 |             scores.zero_()
  58 |             scores[..., 5] = 5e4
  59 |         return scores  # Return the processed scores.
  60 | 
  61 | 
  62 | 
  63 | class PrefixEncoder(torch.nn.Module):  # PrefixEncoder subclasses torch.nn.Module, the base class of all PyTorch neural network modules.
  64 |     """
  65 |     The torch.nn model to encode the prefix
  66 |     Input shape: (batch-size, prefix-length)
  67 |     Output shape: (batch-size, prefix-length, 2*layers*hidden)
  68 |     """  # Docstring: describes what the class does and its input and output shapes.
  69 | 
  70 |     def __init__(self, config: ChatGLMConfig):  # Initializer; config is a ChatGLMConfig instance.
  71 |         super().__init__()  # Run the parent class initializer.
  72 |         self.prefix_projection = config.prefix_projection  # Store prefix_projection from the config on the instance.
  73 | 
  74 |         if self.prefix_projection:  # Taken when prefix projection is enabled.
  75 |             # Use a two-layer MLP to encode the prefix
  76 |             kv_size = config.num_layers * config.kv_channels * config.multi_query_group_num * 2  # Size of the key/value pairs across all layers.
  77 |             self.embedding = torch.nn.Embedding(config.pre_seq_len, kv_size)  # Embedding layer with config.pre_seq_len entries, each of size kv_size.
  78 | 
  79 |             self.trans = torch.nn.Sequential(  # Two linear layers with a Tanh activation between them.
  80 |                 torch.nn.Linear(kv_size, config.hidden_size),
  81 |                 torch.nn.Tanh(),
  82 |                 torch.nn.Linear(config.hidden_size, kv_size)
  83 |             )
  84 |         else:  # Taken when prefix projection is disabled.
  85 |             self.embedding = torch.nn.Embedding(config.pre_seq_len,
  86 |                                                 config.num_layers * config.kv_channels * config.multi_query_group_num * 2)  # Embedding layer with config.pre_seq_len entries, each of size num_layers * kv_channels * multi_query_group_num * 2.
  87 | 
  88 | 
  89 |     def forward(self, prefix: torch.Tensor):  # Forward pass; prefix is a torch.Tensor of prefix token indices.
  90 | 
  91 |         if self.prefix_projection:  # Taken when prefix projection is enabled.
  92 |             prefix_tokens = self.embedding(prefix)  # Embed the prefix indices.
  93 |             past_key_values = self.trans(prefix_tokens)  # Map the embeddings to past_key_values with self.trans, the two-layer MLP defined above.
  94 | 
  95 |         else:  # Taken when prefix projection is disabled.
  96 |             past_key_values = self.embedding(prefix)  # The embedding output is used directly as past_key_values.
  97 | 
  98 |         return past_key_values  # Return past_key_values, the module's output.
  99 | 
 100 | 
 101 | 
 102 | def split_tensor_along_last_dim(
 103 |         tensor: torch.Tensor,
 104 |         num_partitions: int,
 105 |         contiguous_split_chunks: bool = False,
 106 | ) -> List[torch.Tensor]:
 107 |     """Split a tensor along its last dimension.
 108 |     Arguments:
 109 |         tensor: input tensor.
 110 |         num_partitions: number of partitions to split the tensor
 111 |         contiguous_split_chunks: If True, make each chunk contiguous
 112 |                                  in memory.
 113 |     Returns:
 114 |         A list of Tensors
 115 |     """
 116 |     # Get the size and dimension.
 117 |     last_dim = tensor.dim() - 1
 118 |     last_dim_size = tensor.size()[last_dim] // num_partitions
 119 |     # Split.
 120 |     tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
 121 |     # Note: torch.split does not create contiguous tensors by default.
 122 |     if contiguous_split_chunks:
 123 |         return tuple(chunk.contiguous() for chunk in tensor_list)
 124 | 
 125 |     return tensor_list
 126 | 
 127 | 
 128 | class RotaryEmbedding(nn.Module):
 129 |     def __init__(self, dim, original_impl=False, device=None, dtype=None):
 130 |         super().__init__()
 131 |         inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
 132 |         self.register_buffer("inv_freq", inv_freq)
 133 |         self.dim = dim
 134 |         self.original_impl = original_impl
 135 | 
 136 |     def forward_impl(
 137 |             self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
 138 |     ):
 139 |         """Enhanced Transformer with Rotary Position Embedding.
 140 |         Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/
 141 |         transformers/rope/__init__.py. MIT License:
 142 |         https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
 143 |         """
 144 |         # $\Theta = {\theta_i = 10000^{-\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
 145 |         theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=dtype, device=device) / n_elem))
 146 | 
 147 |         # Create position indexes `[0, 1, ..., seq_len - 1]`
 148 |         seq_idx = torch.arange(seq_len, dtype=dtype, device=device)
 149 | 
 150 |         # Calculate the product of position index and $\theta_i$
 151 |         idx_theta = torch.outer(seq_idx, theta).float()
 152 | 
 153 |         cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)
 154 | 
 155 |         # this is to mimic the behaviour of complex32, else we will get different results
 156 |         if dtype in (torch.float16, torch.bfloat16, torch.int8):
 157 |             cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half()
 158 |         return cache
 159 | 
 160 |     def forward(self, max_seq_len, offset=0):
 161 |         return self.forward_impl(
 162 |             max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
 163 |         )
 164 | 
 165 | 
 166 | @torch.jit.script
 167 | def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
 168 |     # x: [sq, b, np, hn]
 169 |     sq, b, np, hn = x.size(0), x.size(1), x.size(2), x.size(3)
 170 |     rot_dim = rope_cache.shape[-2] * 2
 171 |     x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
 172 |     # truncate to support variable sizes
 173 |     rope_cache = rope_cache[:sq]
 174 |     xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
 175 |     rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
 176 |     x_out2 = torch.stack(
 177 |         [
 178 |             xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
 179 |             xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
 180 |         ],
 181 |         -1,
 182 |     )
 183 |     x_out2 = x_out2.flatten(3)
 184 |     return torch.cat((x_out2, x_pass), dim=-1)
 185 | 
 186 | 
 187 | class RMSNorm(torch.nn.Module):
 188 |     def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
 189 |         super().__init__()
 190 |         self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
 191 |         self.eps = eps
 192 | 
 193 |     def forward(self, hidden_states: torch.Tensor):
 194 |         input_dtype = hidden_states.dtype
 195 |         variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
 196 |         hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
 197 | 
 198 |         return (self.weight * hidden_states).to(input_dtype)
 199 | 
 200 | 
 201 | class CoreAttention(torch.nn.Module):
 202 |     def __init__(self, config: ChatGLMConfig, layer_number):
 203 |         super(CoreAttention, self).__init__()
 204 | 
 205 |         self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
 206 |         self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
 207 |         if self.apply_query_key_layer_scaling:
 208 |             self.attention_softmax_in_fp32 = True
 209 |         self.layer_number = max(1, layer_number)
 210 | 
 211 |         projection_size = config.kv_channels * config.num_attention_heads
 212 | 
 213 |         # Per attention head and per partition values.
 214 |         self.hidden_size_per_partition = projection_size
 215 |         self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
 216 |         self.num_attention_heads_per_partition = config.num_attention_heads
 217 | 
 218 |         coeff = None
 219 |         self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
 220 |         if self.apply_query_key_layer_scaling:
 221 |             coeff = self.layer_number
 222 |             self.norm_factor *= coeff
 223 |         self.coeff = coeff
 224 | 
 225 |         self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
 226 | 
 227 |     def forward(self, query_layer, key_layer, value_layer, attention_mask):
 228 |         pytorch_major_version = int(torch.__version__.split('.')[0])
 229 |         if pytorch_major_version >= 2:
 230 |             query_layer, key_layer, value_layer = [k.permute(1, 2, 0, 3) for k in [query_layer, key_layer, value_layer]]
 231 |             if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
 232 |                 context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
 233 |                                                                                  is_causal=True)
 234 |             else:
 235 |                 if attention_mask is not None:
 236 |                     attention_mask = ~attention_mask
 237 |                 context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
 238 |                                                                                  attention_mask)
 239 |             context_layer = context_layer.permute(2, 0, 1, 3)
 240 |             new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
 241 |             context_layer = context_layer.reshape(*new_context_layer_shape)
 242 |         else:
 243 |             # Raw attention scores
 244 | 
 245 |             # [b, np, sq, sk]
 246 |             output_size = (query_layer.size(1), query_layer.size(2), query_layer.size(0), key_layer.size(0))
 247 | 
 248 |             # [sq, b, np, hn] -> [sq, b * np, hn]
 249 |             query_layer = query_layer.view(output_size[2], output_size[0] * output_size[1], -1)
 250 |             # [sk, b, np, hn] -> [sk, b * np, hn]
 251 |             key_layer = key_layer.view(output_size[3], output_size[0] * output_size[1], -1)
 252 | 
 253 |             # preallocating input tensor: [b * np, sq, sk]
 254 |             matmul_input_buffer = torch.empty(
 255 |                 output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
 256 |                 device=query_layer.device
 257 |             )
 258 | 
 259 |             # Raw attention scores. [b * np, sq, sk]
 260 |             matmul_result = torch.baddbmm(
 261 |                 matmul_input_buffer,
 262 |                 query_layer.transpose(0, 1),  # [b * np, sq, hn]
 263 |                 key_layer.transpose(0, 1).transpose(1, 2),  # [b * np, hn, sk]
 264 |                 beta=0.0,
 265 |                 alpha=(1.0 / self.norm_factor),
 266 |             )
 267 | 
 268 |             # change view to [b, np, sq, sk]
 269 |             attention_scores = matmul_result.view(*output_size)
 270 | 
 271 |             # ===========================
 272 |             # Attention probs and dropout
 273 |             # ===========================
 274 | 
 275 |             # attention scores and attention mask [b, np, sq, sk]
 276 |             if self.attention_softmax_in_fp32:
 277 |                 attention_scores = attention_scores.float()
 278 |             if self.coeff is not None:
 279 |                 attention_scores = attention_scores * self.coeff
 280 |             if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
 281 |                 attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
 282 |                                             device=attention_scores.device, dtype=torch.bool)
 283 |                 attention_mask.tril_()
 284 |                 attention_mask = ~attention_mask
 285 |             if attention_mask is not None:
 286 |                 attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
 287 |             attention_probs = F.softmax(attention_scores, dim=-1)
 288 |             attention_probs = attention_probs.type_as(value_layer)
 289 | 
 290 |             # This is actually dropping out entire tokens to attend to, which might
 291 |             # seem a bit unusual, but is taken from the original Transformer paper.
 292 |             attention_probs = self.attention_dropout(attention_probs)
 293 |             # =========================
 294 |             # Context layer. [sq, b, hp]
 295 |             # =========================
 296 | 
 297 |             # value_layer -> context layer.
 298 |             # [sk, b, np, hn] --> [b, np, sq, hn]
 299 | 
 300 |             # context layer shape: [b, np, sq, hn]
 301 |             output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3))
 302 |             # change view [sk, b * np, hn]
 303 |             value_layer = value_layer.view(value_layer.size(0), output_size[0] * output_size[1], -1)
 304 |             # change view [b * np, sq, sk]
 305 |             attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
 306 |             # matmul: [b * np, sq, hn]
 307 |             context_layer = torch.bmm(attention_probs, value_layer.transpose(0, 1))
 308 |             # change view [b, np, sq, hn]
 309 |             context_layer = context_layer.view(*output_size)
 310 |             # [b, np, sq, hn] --> [sq, b, np, hn]
 311 |             context_layer = context_layer.permute(2, 0, 1, 3).contiguous()
 312 |             # [sq, b, np, hn] --> [sq, b, hp]
 313 |             new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
 314 |             context_layer = context_layer.view(*new_context_layer_shape)
 315 | 
 316 |         return context_layer
 317 | 
 318 | 
 319 | class SelfAttention(torch.nn.Module):
 320 |     """Parallel self-attention layer abstract class.
 321 |     Self-attention layer takes input with size [s, b, h]
 322 |     and returns output of the same size.
 323 |     """
 324 | 
 325 |     def __init__(self, config: ChatGLMConfig, layer_number, device=None):
 326 |         super(SelfAttention, self).__init__()
 327 |         self.layer_number = max(1, layer_number)
 328 | 
 329 |         self.projection_size = config.kv_channels * config.num_attention_heads
 330 | 
 331 |         # Per attention head and per partition values.
 332 |         self.hidden_size_per_attention_head = self.projection_size // config.num_attention_heads
 333 |         self.num_attention_heads_per_partition = config.num_attention_heads
 334 | 
 335 |         self.multi_query_attention = config.multi_query_attention
 336 |         self.qkv_hidden_size = 3 * self.projection_size
 337 |         if self.multi_query_attention:
 338 |             self.num_multi_query_groups_per_partition = config.multi_query_group_num
 339 |             self.qkv_hidden_size = (
 340 |                     self.projection_size + 2 * self.hidden_size_per_attention_head * config.multi_query_group_num
 341 |             )
 342 |         self.query_key_value = nn.Linear(config.hidden_size, self.qkv_hidden_size,
 343 |                                          bias=config.add_bias_linear or config.add_qkv_bias,
 344 |                                          device=device, **_config_to_kwargs(config)
 345 |                                          )
 346 | 
 347 |         self.core_attention = CoreAttention(config, self.layer_number)
 348 | 
 349 |         # Output.
 350 |         self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
 351 |                                device=device, **_config_to_kwargs(config)
 352 |                                )
 353 | 
 354 |     def _allocate_memory(self, inference_max_sequence_len, batch_size, device=None, dtype=None):
 355 |         if self.multi_query_attention:
 356 |             num_attention_heads = self.num_multi_query_groups_per_partition
 357 |         else:
 358 |             num_attention_heads = self.num_attention_heads_per_partition
 359 |         return torch.empty(
 360 |             inference_max_sequence_len,
 361 |             batch_size,
 362 |             num_attention_heads,
 363 |             self.hidden_size_per_attention_head,
 364 |             dtype=dtype,
 365 |             device=device,
 366 |         )
 367 | 
 368 |     def forward(
 369 |             self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True
 370 |     ):
 371 |         # hidden_states: [sq, b, h]
 372 | 
 373 |         # =================================================
 374 |         # Pre-allocate memory for key-values for inference.
 375 |         # =================================================
 376 |         # =====================
 377 |         # Query, Key, and Value
 378 |         # =====================
 379 | 
 380 |         # Attention heads [sq, b, h] --> [sq, b, (np * 3 * hn)]
 381 |         mixed_x_layer = self.query_key_value(hidden_states)
 382 | 
 383 |         if self.multi_query_attention:
 384 |             (query_layer, key_layer, value_layer) = mixed_x_layer.split(
 385 |                 [
 386 |                     self.num_attention_heads_per_partition * self.hidden_size_per_attention_head,
 387 |                     self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
 388 |                     self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
 389 |                 ],
 390 |                 dim=-1,
 391 |             )
 392 |             query_layer = query_layer.view(
 393 |                 query_layer.size()[:-1] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
 394 |             )
 395 |             key_layer = key_layer.view(
 396 |                 key_layer.size()[:-1] + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
 397 |             )
 398 |             value_layer = value_layer.view(
 399 |                 value_layer.size()[:-1]
 400 |                 + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
 401 |             )
 402 |         else:
 403 |             new_tensor_shape = mixed_x_layer.size()[:-1] + \
 404 |                                (self.num_attention_heads_per_partition,
 405 |                                 3 * self.hidden_size_per_attention_head)
 406 |             mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)
 407 | 
 408 |             # [sq, b, np, 3 * hn] --> 3 [sq, b, np, hn]
 409 |             (query_layer, key_layer, value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)
 410 | 
 411 |         # apply relative positional encoding (rotary embedding)
 412 |         if rotary_pos_emb is not None:
 413 |             query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
 414 |             key_layer = apply_rotary_pos_emb(key_layer, rotary_pos_emb)
 415 | 
 416 |         # adjust key and value for inference
 417 |         if kv_cache is not None:
 418 |             cache_k, cache_v = kv_cache
 419 |             key_layer = torch.cat((cache_k, key_layer), dim=0)
 420 |             value_layer = torch.cat((cache_v, value_layer), dim=0)
 421 |         if use_cache:
 422 |             kv_cache = (key_layer, value_layer)
 423 |         else:
 424 |             kv_cache = None
 425 | 
 426 |         if self.multi_query_attention:
 427 |             key_layer = key_layer.unsqueeze(-2)
 428 |             key_layer = key_layer.expand(
 429 |                 -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1
 430 |             )
 431 |             key_layer = key_layer.contiguous().view(
 432 |                 key_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
 433 |             )
 434 |             value_layer = value_layer.unsqueeze(-2)
 435 |             value_layer = value_layer.expand(
 436 |                 -1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1
 437 |             )
 438 |             value_layer = value_layer.contiguous().view(
 439 |                 value_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
 440 |             )
 441 | 
 442 |         # ==================================
 443 |         # core attention computation
 444 |         # ==================================
 445 | 
 446 |         context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
 447 | 
 448 |         # =================
 449 |         # Output. [sq, b, h]
 450 |         # =================
 451 | 
 452 |         output = self.dense(context_layer)
 453 | 
 454 |         return output, kv_cache
 455 | 
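# The fused projection width and the key/value sharing in SelfAttention above
# can be checked with plain arithmetic. This is a minimal sketch, not part of
# the original module: `qkv_width` and `kv_group_for_head` are hypothetical
# helpers, and the sizes (32 heads, head size 128, 2 groups) are assumed
# purely for illustration.

```python
from typing import Optional


def qkv_width(num_heads: int, head_dim: int, num_groups: Optional[int] = None) -> int:
    """Output width of the fused query_key_value projection."""
    projection_size = num_heads * head_dim
    if num_groups is None:
        # Standard multi-head attention: Q, K and V each use the full projection size.
        return 3 * projection_size
    # Multi-query attention: full-width Q, but K and V only cover `num_groups` heads.
    return projection_size + 2 * head_dim * num_groups


def kv_group_for_head(head: int, num_heads: int, num_groups: int) -> int:
    """Index of the K/V group a query head reads after unsqueeze/expand/view."""
    heads_per_group = num_heads // num_groups
    return head // heads_per_group


print(qkv_width(32, 128))                              # 12288
print(qkv_width(32, 128, 2))                           # 4608
print([kv_group_for_head(h, 4, 2) for h in range(4)])  # [0, 0, 1, 1]
```

# Consecutive query heads share one key/value group, which is why the cache
# only needs to store `num_groups` heads per layer rather than all of them.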
 456 | 
 457 | def _config_to_kwargs(args):
 458 |     common_kwargs = {
 459 |         "dtype": args.torch_dtype,
 460 |     }
 461 |     return common_kwargs
 462 | 
 463 | 
 464 | class MLP(torch.nn.Module):
 465 |     """MLP.
 466 |     MLP takes input with hidden size h, projects it to 2 * ffn_hidden_size
 467 |     (the gate and value halves used by SwiGLU), applies the gated activation,
 468 |     and projects the ffn_hidden_size result back to hidden size h.
 469 |     """
 470 | 
 471 |     def __init__(self, config: ChatGLMConfig, device=None):
 472 |         super(MLP, self).__init__()
 473 | 
 474 |         self.add_bias = config.add_bias_linear
 475 | 
 476 |         # Project to 4h. If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf
 477 |         self.dense_h_to_4h = nn.Linear(
 478 |             config.hidden_size,
 479 |             config.ffn_hidden_size * 2,
 480 |             bias=self.add_bias,
 481 |             device=device,
 482 |             **_config_to_kwargs(config)
 483 |         )
 484 | 
 485 |         def swiglu(x):
 486 |             x = torch.chunk(x, 2, dim=-1)
 487 |             return F.silu(x[0]) * x[1]
 488 | 
 489 |         self.activation_func = swiglu
 490 | 
 491 |         # Project back to h.
 492 |         self.dense_4h_to_h = nn.Linear(
 493 |             config.ffn_hidden_size,
 494 |             config.hidden_size,
 495 |             bias=self.add_bias,
 496 |             device=device,
 497 |             **_config_to_kwargs(config)
 498 |         )
 499 | 
 500 |     def forward(self, hidden_states):
 501 |         # [s, b, 4hp]
 502 |         intermediate_parallel = self.dense_h_to_4h(hidden_states)
 503 |         intermediate_parallel = self.activation_func(intermediate_parallel)
 504 |         # [s, b, h]
 505 |         output = self.dense_4h_to_h(intermediate_parallel)
 506 |         return output
 507 | 
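# The SwiGLU activation used by MLP above can be reproduced on a plain list
# to see why the first projection is twice as wide as the second one's input.
# This is a minimal sketch using only the standard library; `silu` and
# `swiglu` here are illustrative stand-ins, not the torch functions.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(x: List[float]) -> List[float]:
    """Split x into two halves and gate the second half with silu(first half)."""
    half = len(x) // 2
    gate, value = x[:half], x[half:]
    return [silu(g) * v for g, v in zip(gate, value)]


out = swiglu([0.0, 1.0, 2.0, 3.0])
print(len(out))  # 2: the output is half as wide as the input
```

# The halving is the reason `dense_h_to_4h` emits `ffn_hidden_size * 2`
# features while `dense_4h_to_h` consumes only `ffn_hidden_size`.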
 508 | 
 509 | class GLMBlock(torch.nn.Module):
 510 |     """A single transformer layer.
 511 |     Transformer layer takes input with size [s, b, h] and returns an
 512 |     output of the same size.
 513 |     """
 514 | 
 515 |     def __init__(self, config: ChatGLMConfig, layer_number, device=None):
 516 |         super(GLMBlock, self).__init__()
 517 |         self.layer_number = layer_number
 518 | 
 519 |         self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm
 520 | 
 521 |         self.fp32_residual_connection = config.fp32_residual_connection
 522 | 
 523 |         LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
 524 |         # Layernorm on the input data.
 525 |         self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
 526 |                                              dtype=config.torch_dtype)
 527 | 
 528 |         # Self attention.
 529 |         self.self_attention = SelfAttention(config, layer_number, device=device)
 530 |         self.hidden_dropout = config.hidden_dropout
 531 | 
 532 |         # Layernorm on the attention output
 533 |         self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
 534 |                                                       dtype=config.torch_dtype)
 535 | 
 536 |         # MLP
 537 |         self.mlp = MLP(config, device=device)
 538 | 
 539 |     def forward(
 540 |             self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True,
 541 |     ):
 542 |         # hidden_states: [s, b, h]
 543 | 
 544 |         # Layer norm at the beginning of the transformer layer.
 545 |         layernorm_output = self.input_layernorm(hidden_states)
 546 |         # Self attention.
 547 |         attention_output, kv_cache = self.self_attention(
 548 |             layernorm_output,
 549 |             attention_mask,
 550 |             rotary_pos_emb,
 551 |             kv_cache=kv_cache,
 552 |             use_cache=use_cache
 553 |         )
 554 | 
 555 |         # Residual connection.
 556 |         if self.apply_residual_connection_post_layernorm:
 557 |             residual = layernorm_output
 558 |         else:
 559 |             residual = hidden_states
 560 | 
 561 |         layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
 562 |         layernorm_input = residual + layernorm_input
 563 | 
 564 |         # Layer norm post the self attention.
 565 |         layernorm_output = self.post_attention_layernorm(layernorm_input)
 566 | 
 567 |         # MLP.
 568 |         mlp_output = self.mlp(layernorm_output)
 569 | 
 570 |         # Second residual connection.
 571 |         if self.apply_residual_connection_post_layernorm:
 572 |             residual = layernorm_output
 573 |         else:
 574 |             residual = layernorm_input
 575 | 
 576 |         output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
 577 |         output = residual + output
 578 | 
 579 |         return output, kv_cache
 580 | 
 581 | 
 582 | class GLMTransformer(torch.nn.Module):
 583 |     """Transformer class."""
 584 | 
 585 |     def __init__(self, config: ChatGLMConfig, device=None):
 586 |         super(GLMTransformer, self).__init__()
 587 | 
 588 |         self.fp32_residual_connection = config.fp32_residual_connection
 589 |         self.post_layer_norm = config.post_layer_norm
 590 | 
 591 |         # Number of layers.
 592 |         self.num_layers = config.num_layers
 593 | 
 594 |         # Transformer layers.
 595 |         def build_layer(layer_number):
 596 |             return GLMBlock(config, layer_number, device=device)
 597 | 
 598 |         self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])
 599 | 
 600 |         if self.post_layer_norm:
 601 |             LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
 602 |             # Final layer norm before output.
 603 |             self.final_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
 604 |                                                  dtype=config.torch_dtype)
 605 | 
 606 |         self.gradient_checkpointing = False
 607 | 
 608 |     def _get_layer(self, layer_number):
 609 |         return self.layers[layer_number]
 610 | 
 611 |     def forward(
 612 |             self, hidden_states, attention_mask, rotary_pos_emb, kv_caches=None,
 613 |             use_cache: Optional[bool] = True,
 614 |             output_hidden_states: Optional[bool] = False,
 615 |     ):
 616 |         if not kv_caches:
 617 |             kv_caches = [None for _ in range(self.num_layers)]
 618 |         if self.gradient_checkpointing and self.training:
 619 |             if use_cache:
 620 |                 logger.warning_once(
 621 |                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
 622 |                 )
 623 |                 use_cache = False
 624 |         presents = () if use_cache else None
 625 | 
 626 |         all_self_attentions = None
 627 |         all_hidden_states = () if output_hidden_states else None
 628 |         for index in range(self.num_layers):
 629 |             if output_hidden_states:
 630 |                 all_hidden_states = all_hidden_states + (hidden_states,)
 631 | 
 632 |             layer = self._get_layer(index)
 633 |             if self.gradient_checkpointing and self.training:
 634 |                 layer_ret = torch.utils.checkpoint.checkpoint(
 635 |                     layer,
 636 |                     hidden_states,
 637 |                     attention_mask,
 638 |                     rotary_pos_emb,
 639 |                     kv_caches[index],
 640 |                     use_cache
 641 |                 )
 642 |             else:
 643 |                 layer_ret = layer(
 644 |                     hidden_states,
 645 |                     attention_mask,
 646 |                     rotary_pos_emb,
 647 |                     kv_cache=kv_caches[index],
 648 |                     use_cache=use_cache
 649 |                 )
 650 |             hidden_states, kv_cache = layer_ret
 651 |             if use_cache:
 652 |                 presents = presents + (kv_cache,)
 653 | 
 654 |         if output_hidden_states:
 655 |             all_hidden_states = all_hidden_states + (hidden_states,)
 656 | 
 657 |         # Final layer norm.
 658 |         if self.post_layer_norm:
 659 |             hidden_states = self.final_layernorm(hidden_states)
 660 | 
 661 |         return hidden_states, presents, all_hidden_states, all_self_attentions
 662 | 
 663 | 
 664 | class ChatGLMPreTrainedModel(PreTrainedModel):
 665 |     """
 666 |     An abstract class to handle weights initialization and
 667 |     a simple interface for downloading and loading pretrained models.
 668 |     """
 669 | 
 670 |     is_parallelizable = False
 671 |     supports_gradient_checkpointing = True
 672 |     config_class = ChatGLMConfig
 673 |     base_model_prefix = "transformer"
 674 |     _no_split_modules = ["GLMBlock"]
 675 | 
 676 |     def _init_weights(self, module: nn.Module):
 677 |         """Initialize the weights."""
 678 |         return
 679 | 
 680 |     def get_masks(self, input_ids, past_key_values, padding_mask=None):
 681 |         batch_size, seq_length = input_ids.shape
 682 |         full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device)
 683 |         full_attention_mask.tril_()
 684 |         past_length = 0
 685 |         if past_key_values:
 686 |             past_length = past_key_values[0][0].shape[0]
 687 |         if past_length:
 688 |             full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
 689 |                                                         device=input_ids.device), full_attention_mask), dim=-1)
 690 |         if padding_mask is not None:
 691 |             full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
 692 |         if not past_length and padding_mask is not None:
 693 |             full_attention_mask -= padding_mask.unsqueeze(-1) - 1
 694 |         full_attention_mask = (full_attention_mask < 0.5).bool()
 695 |         full_attention_mask.unsqueeze_(1)
 696 |         return full_attention_mask
 697 | 
 698 |     def get_position_ids(self, input_ids, device):
 699 |         batch_size, seq_length = input_ids.shape
 700 |         position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
 701 |         return position_ids
 702 | 
 703 |     def _set_gradient_checkpointing(self, module, value=False):
 704 |         if isinstance(module, GLMTransformer):
 705 |             module.gradient_checkpointing = value
 706 | 
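# The mask built by `get_masks` above (ignoring padding) can be written out
# by hand for small sizes. This is a minimal sketch with a hypothetical
# helper `causal_mask`; as in the original, True marks a key position that
# is masked out.

```python
from typing import List


def causal_mask(seq_length: int, past_length: int = 0) -> List[List[bool]]:
    """Return one row per new query token; True means the key is hidden."""
    mask = []
    for query_pos in range(seq_length):
        row = []
        for key_pos in range(past_length + seq_length):
            # Cached positions are always visible; new positions are visible
            # only up to and including the query's own position.
            visible = key_pos <= past_length + query_pos
            row.append(not visible)
        mask.append(row)
    return mask


for row in causal_mask(seq_length=2, past_length=1):
    print(row)
# [False, False, True]
# [False, False, False]
```

# With one cached token and two new tokens, the first new token cannot see
# the second one, while the cached token is visible to both.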
 707 | 
 708 | class Embedding(torch.nn.Module):
 709 |     """Language model embeddings."""
 710 | 
 711 |     def __init__(self, config: ChatGLMConfig, device=None):
 712 |         super(Embedding, self).__init__()
 713 | 
 714 |         self.hidden_size = config.hidden_size
 715 |         # Word embeddings (parallel).
 716 |         self.word_embeddings = nn.Embedding(
 717 |             config.padded_vocab_size,
 718 |             self.hidden_size,
 719 |             dtype=config.torch_dtype,
 720 |             device=device
 721 |         )
 722 |         self.fp32_residual_connection = config.fp32_residual_connection
 723 | 
 724 |     def forward(self, input_ids):
 725 |         # Embeddings.
 726 |         words_embeddings = self.word_embeddings(input_ids)
 727 |         embeddings = words_embeddings
 728 |         # Data format change to avoid explicit transposes: [b s h] --> [s b h].
 729 |         embeddings = embeddings.transpose(0, 1).contiguous()
 730 |         # If the input flag for fp32 residual connection is set, convert for float.
 731 |         if self.fp32_residual_connection:
 732 |             embeddings = embeddings.float()
 733 |         return embeddings
 734 | 
 735 | 
 736 | class ChatGLMModel(ChatGLMPreTrainedModel):
 737 |     def __init__(self, config: ChatGLMConfig, device=None, empty_init=True):
 738 |         super().__init__(config)
 739 |         if empty_init:
 740 |             init_method = skip_init
 741 |         else:
 742 |             init_method = default_init
 743 |         init_kwargs = {}
 744 |         if device is not None:
 745 |             init_kwargs["device"] = device
 746 |         self.embedding = init_method(Embedding, config, **init_kwargs)
 747 |         self.num_layers = config.num_layers
 748 |         self.multi_query_group_num = config.multi_query_group_num
 749 |         self.kv_channels = config.kv_channels
 750 | 
 751 |         # Rotary positional embeddings
 752 |         self.seq_length = config.seq_length
 753 |         rotary_dim = (
 754 |             config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
 755 |         )
 756 | 
 757 |         self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, original_impl=config.original_rope, device=device,
 758 |                                               dtype=config.torch_dtype)
 759 |         self.encoder = init_method(GLMTransformer, config, **init_kwargs)
 760 |         self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
 761 |                                         dtype=config.torch_dtype, **init_kwargs)
 762 |         self.pre_seq_len = config.pre_seq_len
 763 |         self.prefix_projection = config.prefix_projection
 764 |         if self.pre_seq_len is not None:
 765 |             for param in self.parameters():
 766 |                 param.requires_grad = False
 767 |             self.prefix_tokens = torch.arange(self.pre_seq_len).long()
 768 |             self.prefix_encoder = PrefixEncoder(config)
 769 |             self.dropout = torch.nn.Dropout(0.1)
 770 | 
 771 |     def get_input_embeddings(self):
 772 |         return self.embedding.word_embeddings
 773 | 
 774 |     def get_prompt(self, batch_size, device, dtype=torch.half):
 775 |         prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
 776 |         past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)
 777 |         past_key_values = past_key_values.view(
 778 |             batch_size,
 779 |             self.pre_seq_len,
 780 |             self.num_layers * 2,
 781 |             self.multi_query_group_num,
 782 |             self.kv_channels
 783 |         )
 784 |         # seq_len, b, nh, hidden_size
 785 |         past_key_values = self.dropout(past_key_values)
 786 |         past_key_values = past_key_values.permute([2, 1, 0, 3, 4]).split(2)
 787 |         return past_key_values
 788 | 
 789 |     def forward(
 790 |             self,
 791 |             input_ids,
 792 |             position_ids: Optional[torch.Tensor] = None,
 793 |             attention_mask: Optional[torch.BoolTensor] = None,
 794 |             full_attention_mask: Optional[torch.BoolTensor] = None,
 795 |             past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
 796 |             inputs_embeds: Optional[torch.Tensor] = None,
 797 |             use_cache: Optional[bool] = None,
 798 |             output_hidden_states: Optional[bool] = None,
 799 |             return_dict: Optional[bool] = None,
 800 |     ):
 801 |         output_hidden_states = (
 802 |             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
 803 |         )
 804 |         use_cache = use_cache if use_cache is not None else self.config.use_cache
 805 |         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 806 | 
 807 |         batch_size, seq_length = input_ids.shape
 808 | 
 809 |         if inputs_embeds is None:
 810 |             inputs_embeds = self.embedding(input_ids)
 811 | 
 812 |         if self.pre_seq_len is not None:
 813 |             if past_key_values is None:
 814 |                 past_key_values = self.get_prompt(batch_size=batch_size, device=input_ids.device,
 815 |                                                   dtype=inputs_embeds.dtype)
 816 |             if attention_mask is not None:
 817 |                 attention_mask = torch.cat([attention_mask.new_ones((batch_size, self.pre_seq_len)),
 818 |                                             attention_mask], dim=-1)
 819 | 
 820 |         if full_attention_mask is None:
 821 |             if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1):
 822 |                 full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
 823 | 
 824 |         # Rotary positional embeddings
 825 |         rotary_pos_emb = self.rotary_pos_emb(self.seq_length)
 826 |         if position_ids is not None:
 827 |             rotary_pos_emb = rotary_pos_emb[position_ids]
 828 |         else:
 829 |             rotary_pos_emb = rotary_pos_emb[None, :seq_length]
 830 |         rotary_pos_emb = rotary_pos_emb.transpose(0, 1).contiguous()
 831 | 
 832 |         # Run encoder.
 833 |         hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
 834 |             inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb,
 835 |             kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
 836 |         )
 837 | 
 838 |         if not return_dict:
 839 |             return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
 840 | 
 841 |         return BaseModelOutputWithPast(
 842 |             last_hidden_state=hidden_states,
 843 |             past_key_values=presents,
 844 |             hidden_states=all_hidden_states,
 845 |             attentions=all_self_attentions,
 846 |         )
 847 | 
 848 |     def quantize(self, weight_bit_width: int):
 849 |         from .quantization import quantize
 850 |         quantize(self.encoder, weight_bit_width)
 851 |         return self
 852 | 
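# The reshaping done in `get_prompt` above is easier to follow as shape
# arithmetic. This is a minimal sketch that tracks shapes as tuples only;
# `prefix_cache_shapes` is a hypothetical helper and the sizes in the call
# are assumed for illustration.

```python
from typing import List, Tuple


def prefix_cache_shapes(
    batch_size: int,
    pre_seq_len: int,
    num_layers: int,
    num_groups: int,
    kv_channels: int,
) -> List[Tuple[int, ...]]:
    """Shapes of the per-layer chunks produced by view -> permute -> split(2)."""
    viewed = (batch_size, pre_seq_len, num_layers * 2, num_groups, kv_channels)
    # permute([2, 1, 0, 3, 4]) moves the layer axis first and batch third.
    permuted = tuple(viewed[i] for i in (2, 1, 0, 3, 4))
    # split(2) cuts dim 0 into chunks of size 2: one (key, value) pair per layer.
    num_chunks = permuted[0] // 2
    return [(2,) + permuted[1:] for _ in range(num_chunks)]


shapes = prefix_cache_shapes(batch_size=3, pre_seq_len=5, num_layers=4, num_groups=2, kv_channels=8)
print(len(shapes), shapes[0])  # 4 (2, 5, 3, 2, 8)
```

# Each chunk unpacks into a key and a value of shape
# [pre_seq_len, batch, groups, kv_channels], matching the [sq, b, np, hn]
# layout that SelfAttention concatenates new keys and values onto.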
 853 | 
 854 | class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
 855 |     def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
 856 |         super().__init__(config)
 857 | 
 858 |         self.max_sequence_length = config.max_length
 859 |         self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
 860 |         self.config = config
 861 |         self.quantized = False
 862 | 
 863 |         if self.config.quantization_bit:
 864 |             self.quantize(self.config.quantization_bit, empty_init=True)
 865 | 
 866 |     def _update_model_kwargs_for_generation(
 867 |             self,
 868 |             outputs: ModelOutput,
 869 |             model_kwargs: Dict[str, Any],
 870 |             is_encoder_decoder: bool = False,
 871 |             standardize_cache_format: bool = False,
 872 |     ) -> Dict[str, Any]:
 873 |         # update past_key_values
 874 |         model_kwargs["past_key_values"] = self._extract_past_from_model_output(
 875 |             outputs, standardize_cache_format=standardize_cache_format
 876 |         )
 877 | 
 878 |         # update attention mask
 879 |         if "attention_mask" in model_kwargs:
 880 |             attention_mask = model_kwargs["attention_mask"]
 881 |             model_kwargs["attention_mask"] = torch.cat(
 882 |                 [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
 883 |             )
 884 | 
 885 |         # update position ids
 886 |         if "position_ids" in model_kwargs:
 887 |             position_ids = model_kwargs["position_ids"]
 888 |             new_position_id = position_ids[..., -1:].clone()
 889 |             new_position_id += 1
 890 |             model_kwargs["position_ids"] = torch.cat(
 891 |                 [position_ids, new_position_id], dim=-1
 892 |             )
 893 | 
 894 |         model_kwargs["is_first_forward"] = False
 895 |         return model_kwargs
 896 | 
 897 |     def prepare_inputs_for_generation(
 898 |             self,
 899 |             input_ids: torch.LongTensor,
 900 |             past_key_values: Optional[torch.Tensor] = None,
 901 |             attention_mask: Optional[torch.Tensor] = None,
 902 |             position_ids: Optional[torch.Tensor] = None,
 903 |             is_first_forward: bool = True,
 904 |             **kwargs
 905 |     ) -> dict:
 906 |         # After the first forward pass, feed only the last token and its position id
 907 |         if position_ids is None:
 908 |             position_ids = self.get_position_ids(input_ids, device=input_ids.device)
 909 |         if not is_first_forward:
 910 |             position_ids = position_ids[..., -1:]
 911 |             input_ids = input_ids[:, -1:]
 912 |         return {
 913 |             "input_ids": input_ids,
 914 |             "past_key_values": past_key_values,
 915 |             "position_ids": position_ids,
 916 |             "attention_mask": attention_mask,
 917 |             "return_last_logit": True
 918 |         }
 919 | 
 920 |     def forward(
 921 |             self,
 922 |             input_ids: Optional[torch.Tensor] = None,
 923 |             position_ids: Optional[torch.Tensor] = None,
 924 |             attention_mask: Optional[torch.Tensor] = None,
 925 |             past_key_values: Optional[Tuple[torch.FloatTensor]] = None,
 926 |             inputs_embeds: Optional[torch.Tensor] = None,
 927 |             labels: Optional[torch.Tensor] = None,
 928 |             use_cache: Optional[bool] = None,
 929 |             output_attentions: Optional[bool] = None,
 930 |             output_hidden_states: Optional[bool] = None,
 931 |             return_dict: Optional[bool] = None,
 932 |             return_last_logit: Optional[bool] = False,
 933 |     ):
 934 |         use_cache = use_cache if use_cache is not None else self.config.use_cache
 935 |         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 936 | 
 937 |         transformer_outputs = self.transformer(
 938 |             input_ids=input_ids,
 939 |             position_ids=position_ids,
 940 |             attention_mask=attention_mask,
 941 |             past_key_values=past_key_values,
 942 |             inputs_embeds=inputs_embeds,
 943 |             use_cache=use_cache,
 944 |             output_hidden_states=output_hidden_states,
 945 |             return_dict=return_dict,
 946 |         )
 947 | 
 948 |         hidden_states = transformer_outputs[0]
 949 |         if return_last_logit:
 950 |             hidden_states = hidden_states[-1:]
 951 |         lm_logits = self.transformer.output_layer(hidden_states)
 952 |         lm_logits = lm_logits.transpose(0, 1).contiguous()
 953 | 
 954 |         loss = None
 955 |         if labels is not None:
 956 |             lm_logits = lm_logits.to(torch.float32)
 957 | 
 958 |             # Shift so that tokens < n predict n
 959 |             shift_logits = lm_logits[..., :-1, :].contiguous()
 960 |             shift_labels = labels[..., 1:].contiguous()
 961 |             # Flatten the tokens
 962 |             loss_fct = CrossEntropyLoss(ignore_index=-100)
 963 |             loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
 964 | 
 965 |             lm_logits = lm_logits.to(hidden_states.dtype)
 966 |             loss = loss.to(hidden_states.dtype)
 967 | 
 968 |         if not return_dict:
 969 |             output = (lm_logits,) + transformer_outputs[1:]
 970 |             return ((loss,) + output) if loss is not None else output
 971 | 
 972 |         return CausalLMOutputWithPast(
 973 |             loss=loss,
 974 |             logits=lm_logits,
 975 |             past_key_values=transformer_outputs.past_key_values,
 976 |             hidden_states=transformer_outputs.hidden_states,
 977 |             attentions=transformer_outputs.attentions,
 978 |         )
 979 | 
 980 |     @staticmethod
 981 |     def _reorder_cache(
 982 |             past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
 983 |     ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
 984 |         """
 985 |         This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
 986 |         [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
 987 |         beam_idx at every generation step.
 988 |         Output shares the same memory storage as `past`.
 989 |         """
 990 |         return tuple(
 991 |             (
 992 |                 layer_past[0].index_select(1, beam_idx.to(layer_past[0].device)),
 993 |                 layer_past[1].index_select(1, beam_idx.to(layer_past[1].device)),
 994 |             )
 995 |             for layer_past in past
 996 |         )
 997 | 
 998 |     def process_response(self, response):
 999 |         response = response.strip()
1000 |         response = response.replace("[[训练时间]]", "2023年")  # "[[training time]]" -> "2023" (year)
1001 |         return response
1002 | 
1003 |     def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None):
1004 |         prompt = tokenizer.build_prompt(query, history=history)
1005 |         inputs = tokenizer([prompt], return_tensors="pt")
1006 |         inputs = inputs.to(self.device)
1007 |         return inputs
1008 | 
1009 |     def build_stream_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None):
1010 |         if history:
1011 |             prompt = "\n\n[Round {}]\n\n问：{}\n\n答：".format(len(history) + 1, query)
1012 |             input_ids = tokenizer.encode(prompt, add_special_tokens=False)
1013 |             input_ids = input_ids[1:]
1014 |             inputs = tokenizer.batch_encode_plus([(input_ids, None)], return_tensors="pt", add_special_tokens=False)
1015 |         else:
1016 |             prompt = "[Round 1]\n\n问：{}\n\n答：".format(query)
1017 |             inputs = tokenizer([prompt], return_tensors="pt")
1018 |         inputs = inputs.to(self.device)
1019 |         return inputs
1020 | 
1021 |     @torch.inference_mode()
1022 |     def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 8192, num_beams=1,
1023 |              do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None, **kwargs):
1024 |         if history is None:
1025 |             history = []
1026 |         if logits_processor is None:
1027 |             logits_processor = LogitsProcessorList()
1028 |         logits_processor.append(InvalidScoreLogitsProcessor())
1029 |         gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
1030 |                       "temperature": temperature, "logits_processor": logits_processor, **kwargs}
1031 |         inputs = self.build_inputs(tokenizer, query, history=history)
1032 |         outputs = self.generate(**inputs, **gen_kwargs)
1033 |         outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):]
1034 |         response = tokenizer.decode(outputs)
1035 |         response = self.process_response(response)
1036 |         history = history + [(query, response)]
1037 |         return response, history
1038 | 
1039 |     @torch.inference_mode()
1040 |     def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, past_key_values=None,
1041 |                     max_length: int = 8192, do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None,
1042 |                     return_past_key_values=False, **kwargs):
1043 |         if history is None:
1044 |             history = []
1045 |         if logits_processor is None:
1046 |             logits_processor = LogitsProcessorList()
1047 |         logits_processor.append(InvalidScoreLogitsProcessor())
1048 |         gen_kwargs = {"max_length": max_length, "do_sample": do_sample, "top_p": top_p,
1049 |                       "temperature": temperature, "logits_processor": logits_processor, **kwargs}
1050 |         if past_key_values is None and not return_past_key_values:
1051 |             inputs = self.build_inputs(tokenizer, query, history=history)
1052 |         else:
1053 |             inputs = self.build_stream_inputs(tokenizer, query, history=history)
1054 |         if past_key_values is not None:
1055 |             past_length = past_key_values[0][0].shape[0]
1056 |             if self.transformer.pre_seq_len is not None:
1057 |                 past_length -= self.transformer.pre_seq_len
1058 |             inputs.position_ids += past_length
1059 |             attention_mask = inputs.attention_mask
1060 |             attention_mask = torch.cat((attention_mask.new_ones(1, past_length), attention_mask), dim=1)
1061 |             inputs['attention_mask'] = attention_mask
1062 |         for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
1063 |                                             return_past_key_values=return_past_key_values, **gen_kwargs):
1064 |             if return_past_key_values:
1065 |                 outputs, past_key_values = outputs
1066 |             outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):]
1067 |             response = tokenizer.decode(outputs)
1068 |             if response and response[-1] != "�":
1069 |                 response = self.process_response(response)
1070 |                 new_history = history + [(query, response)]
1071 |                 if return_past_key_values:
1072 |                     yield response, new_history, past_key_values
1073 |                 else:
1074 |                     yield response, new_history
1075 | 
1076 |     @torch.inference_mode()
1077 |     def stream_generate(
1078 |             self,
1079 |             input_ids,
1080 |             generation_config: Optional[GenerationConfig] = None,
1081 |             logits_processor: Optional[LogitsProcessorList] = None,
1082 |             stopping_criteria: Optional[StoppingCriteriaList] = None,
1083 |             prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
1084 |             return_past_key_values=False,
1085 |             **kwargs,
1086 |     ):
1087 |         batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]
1088 | 
1089 |         if generation_config is None:
1090 |             generation_config = self.generation_config
1091 |         generation_config = copy.deepcopy(generation_config)
1092 |         model_kwargs = generation_config.update(**kwargs)
1093 |         bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id
1094 | 
1095 |         if isinstance(eos_token_id, int):
1096 |             eos_token_id = [eos_token_id]
1097 | 
1098 |         has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
1099 |         if has_default_max_length and generation_config.max_new_tokens is None:
1100 |             warnings.warn(
1101 |                 f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
1102 |                 "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
1103 |                 " recommend using `max_new_tokens` to control the maximum length of the generation.",
1104 |                 UserWarning,
1105 |             )
1106 |         elif generation_config.max_new_tokens is not None:
1107 |             generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
1108 |             if not has_default_max_length:
1109 |                 warnings.warn(
1110 |                     f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
1111 |                     f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
1112 |                     "Please refer to the documentation for more information. "
1113 |                     "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
1114 |                     UserWarning,
1115 |                 )
1116 | 
1117 |         if input_ids_seq_length >= generation_config.max_length:
1118 |             input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids"
1119 |             logger.warning(
1120 |                 f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
1121 |                 f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
1122 |                 " increasing `max_new_tokens`."
1123 |             )
1124 | 
1125 |         # 2. Set generation parameters if not already defined
1126 |         logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
1127 |         stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
1128 | 
1129 |         logits_processor = self._get_logits_processor(
1130 |             generation_config=generation_config,
1131 |             input_ids_seq_length=input_ids_seq_length,
1132 |             encoder_input_ids=input_ids,
1133 |             prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
1134 |             logits_processor=logits_processor,
1135 |         )
1136 | 
1137 |         stopping_criteria = self._get_stopping_criteria(
1138 |             generation_config=generation_config, stopping_criteria=stopping_criteria
1139 |         )
1140 |         logits_warper = self._get_logits_warper(generation_config)
1141 | 
1142 |         unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
1143 |         scores = None
1144 |         while True:
1145 |             model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
1146 |             # forward pass to get next token
1147 |             outputs = self(
1148 |                 **model_inputs,
1149 |                 return_dict=True,
1150 |                 output_attentions=False,
1151 |                 output_hidden_states=False,
1152 |             )
1153 | 
1154 |             next_token_logits = outputs.logits[:, -1, :]
1155 | 
1156 |             # pre-process distribution
1157 |             next_token_scores = logits_processor(input_ids, next_token_logits)
1158 |             next_token_scores = logits_warper(input_ids, next_token_scores)
1159 | 
1160 |             # sample
1161 |             probs = nn.functional.softmax(next_token_scores, dim=-1)
1162 |             if generation_config.do_sample:
1163 |                 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
1164 |             else:
1165 |                 next_tokens = torch.argmax(probs, dim=-1)
1166 | 
1167 |             # update generated ids, model inputs, and length for next step
1168 |             input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
1169 |             model_kwargs = self._update_model_kwargs_for_generation(
1170 |                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
1171 |             )
1172 |             unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long())
1173 |             if return_past_key_values:
1174 |                 yield input_ids, outputs.past_key_values
1175 |             else:
1176 |                 yield input_ids
1177 |             # stop when each sentence is finished, or if we exceed the maximum length
1178 |             if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
1179 |                 break
1180 | 
1181 |     def quantize(self, bits: int, empty_init=False, device=None, **kwargs):
1182 |         if bits == 0:
1183 |             return
1184 | 
1185 |         from .quantization import quantize
1186 | 
1187 |         if self.quantized:
1188 |             logger.info("Already quantized.")
1189 |             return self
1190 | 
1191 |         self.quantized = True
1192 | 
1193 |         self.config.quantization_bit = bits
1194 | 
1195 |         self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
1196 |                                             **kwargs)
1197 |         return self
1198 | 


--------------------------------------------------------------------------------
/cli_demo.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import platform
 3 | import signal
 4 | from transformers import AutoTokenizer, AutoModel
 5 | import readline
 6 | 
 7 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
 8 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
 9 | # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to your actual number of GPUs
10 | # from utils import load_model_on_gpus
11 | # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
12 | model = model.eval()
13 | 
14 | os_name = platform.system()
15 | clear_command = 'cls' if os_name == 'Windows' else 'clear'
16 | stop_stream = False
17 | 
18 | 
19 | def build_prompt(history):
20 |     prompt = "欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序"
21 |     for query, response in history:
22 |         prompt += f"\n\n用户：{query}"
23 |         prompt += f"\n\nChatGLM2-6B：{response}"
24 |     return prompt
25 | 
26 | 
27 | def signal_handler(signal, frame):
28 |     global stop_stream
29 |     stop_stream = True
30 | 
31 | 
32 | def main():
33 |     past_key_values, history = None, []
34 |     global stop_stream
35 |     print("欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序")
36 |     while True:
37 |         query = input("\n用户：")
38 |         if query.strip() == "stop":
39 |             break
40 |         if query.strip() == "clear":
41 |             past_key_values, history = None, []
42 |             os.system(clear_command)
43 |             print("欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序")
44 |             continue
45 |         print("\nChatGLM：", end="")
46 |         current_length = 0
47 |         for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history,
48 |                                                                     past_key_values=past_key_values,
49 |                                                                     return_past_key_values=True):
50 |             if stop_stream:
51 |                 stop_stream = False
52 |                 break
53 |             else:
54 |                 print(response[current_length:], end="", flush=True)
55 |                 current_length = len(response)
56 |         print("")
57 | 
58 | 
59 | if __name__ == "__main__":
60 |     main()
61 | 


--------------------------------------------------------------------------------
/evaluation/README.md:
--------------------------------------------------------------------------------
 1 | First download the preprocessed C-Eval dataset from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/e84444333b6d434ea7b0) and extract it into the `evaluation` directory. Then run
 2 | 
 3 | ```shell
 4 | cd evaluation
 5 | python evaluate_ceval.py
 6 | ```
 7 | 
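 8 | This script runs prediction on the C-Eval validation set and prints the accuracy. To obtain results on the test set, change `./CEval/val/**/*.jsonl` in the code to `./CEval/test/**/*.jsonl`, save the results in the format required by C-Eval, and submit them on the [official website](https://cevalbenchmark.com/).
 9 | 
10 | The reported results were obtained with an internal parallel evaluation framework, so results may fluctuate slightly.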


--------------------------------------------------------------------------------
/evaluation/evaluate_ceval.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | import re
 4 | import json
 5 | import torch
 6 | import torch.utils.data
 7 | from transformers import AutoTokenizer, AutoModel
 8 | from tqdm import tqdm
 9 | 
10 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
11 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).bfloat16().cuda()
12 | 
13 | choices = ["A", "B", "C", "D"]
14 | choice_tokens = [tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]
15 | 
16 | 
17 | def build_prompt(text):
18 |     return "[Round {}]\n\n问：{}\n\n答：".format(1, text)
19 | 
20 | 
21 | extraction_prompt = '综上所述，ABCD中正确的选项是：'
22 | 
23 | accuracy_dict, count_dict = {}, {}
24 | with torch.no_grad():
25 |     for entry in glob.glob("./CEval/val/**/*.jsonl", recursive=True):
26 |         dataset = []
27 |         with open(entry, encoding='utf-8') as file:
28 |             for line in file:
29 |                 dataset.append(json.loads(line))
30 |         correct = 0
31 |         dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
32 |         for batch in tqdm(dataloader):
33 |             texts = batch["inputs_pretokenized"]
34 |             queries = [build_prompt(query) for query in texts]
35 |             inputs = tokenizer(queries, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
36 |             outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
37 |             intermediate_outputs = []
38 |             for idx in range(len(outputs)):
39 |                 output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]
40 |                 response = tokenizer.decode(output)
41 |                 intermediate_outputs.append(response)
42 |             answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
43 |                             zip(texts, intermediate_outputs)]
44 |             input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
45 |             inputs = tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
46 |             outputs = model(**inputs, return_last_logit=True)
47 |             logits = outputs.logits[:, -1]
48 |             logits = logits[:, choice_tokens]
49 |             preds = logits.argmax(dim=-1)
50 |             correct += (preds.cpu() == batch["label"]).sum().item()
51 |         accuracy = correct / len(dataset)
52 |         print(entry, accuracy)
53 |         accuracy_dict[entry] = accuracy
54 |         count_dict[entry] = len(dataset)
55 | 
56 | acc_total, count_total = 0.0, 0
57 | for key in accuracy_dict:
58 |     acc_total += accuracy_dict[key] * count_dict[key]
59 |     count_total += count_dict[key]
60 | print(acc_total / count_total)


--------------------------------------------------------------------------------
/openai_api.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Implements API for ChatGLM2-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
  3 | # Usage: python openai_api.py
  4 | # Visit http://localhost:8000/docs for documents.
  5 | 
  6 | 
  7 | import time
  8 | import torch
  9 | import uvicorn
 10 | from pydantic import BaseModel, Field
 11 | from fastapi import FastAPI, HTTPException
 12 | from fastapi.middleware.cors import CORSMiddleware
 13 | from contextlib import asynccontextmanager
 14 | from typing import Any, Dict, List, Literal, Optional, Union
 15 | from transformers import AutoTokenizer, AutoModel
 16 | from sse_starlette.sse import ServerSentEvent, EventSourceResponse
 17 | 
 18 | 
 19 | @asynccontextmanager
 20 | async def lifespan(app: FastAPI): # collects GPU memory
 21 |     yield
 22 |     if torch.cuda.is_available():
 23 |         torch.cuda.empty_cache()
 24 |         torch.cuda.ipc_collect()
 25 | 
 26 | 
 27 | app = FastAPI(lifespan=lifespan)
 28 | 
 29 | app.add_middleware(
 30 |     CORSMiddleware,
 31 |     allow_origins=["*"],
 32 |     allow_credentials=True,
 33 |     allow_methods=["*"],
 34 |     allow_headers=["*"],
 35 | )
 36 | 
 37 | class ModelCard(BaseModel):
 38 |     id: str
 39 |     object: str = "model"
 40 |     created: int = Field(default_factory=lambda: int(time.time()))
 41 |     owned_by: str = "owner"
 42 |     root: Optional[str] = None
 43 |     parent: Optional[str] = None
 44 |     permission: Optional[list] = None
 45 | 
 46 | 
 47 | class ModelList(BaseModel):
 48 |     object: str = "list"
 49 |     data: List[ModelCard] = []
 50 | 
 51 | 
 52 | class ChatMessage(BaseModel):
 53 |     role: Literal["user", "assistant", "system"]
 54 |     content: str
 55 | 
 56 | 
 57 | class DeltaMessage(BaseModel):
 58 |     role: Optional[Literal["user", "assistant", "system"]] = None
 59 |     content: Optional[str] = None
 60 | 
 61 | 
 62 | class ChatCompletionRequest(BaseModel):
 63 |     model: str
 64 |     messages: List[ChatMessage]
 65 |     temperature: Optional[float] = None
 66 |     top_p: Optional[float] = None
 67 |     max_length: Optional[int] = None
 68 |     stream: Optional[bool] = False
 69 | 
 70 | 
 71 | class ChatCompletionResponseChoice(BaseModel):
 72 |     index: int
 73 |     message: ChatMessage
 74 |     finish_reason: Literal["stop", "length"]
 75 | 
 76 | 
 77 | class ChatCompletionResponseStreamChoice(BaseModel):
 78 |     index: int
 79 |     delta: DeltaMessage
 80 |     finish_reason: Optional[Literal["stop", "length"]]
 81 | 
 82 | 
 83 | class ChatCompletionResponse(BaseModel):
 84 |     model: str
 85 |     object: Literal["chat.completion", "chat.completion.chunk"]
 86 |     choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
 87 |     created: Optional[int] = Field(default_factory=lambda: int(time.time()))
 88 | 
 89 | 
 90 | @app.get("/v1/models", response_model=ModelList)
 91 | async def list_models():
 92 |     global model_args
 93 |     model_card = ModelCard(id="gpt-3.5-turbo")
 94 |     return ModelList(data=[model_card])
 95 | 
 96 | 
 97 | @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
 98 | async def create_chat_completion(request: ChatCompletionRequest):
 99 |     global model, tokenizer
100 | 
101 |     if request.messages[-1].role != "user":
102 |         raise HTTPException(status_code=400, detail="Invalid request")
103 |     query = request.messages[-1].content
104 | 
105 |     prev_messages = request.messages[:-1]
106 |     if len(prev_messages) > 0 and prev_messages[0].role == "system":
107 |         query = prev_messages.pop(0).content + query
108 | 
109 |     history = []
110 |     if len(prev_messages) % 2 == 0:
111 |         for i in range(0, len(prev_messages), 2):
112 |             if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant":
113 |                 history.append([prev_messages[i].content, prev_messages[i+1].content])
114 | 
115 |     if request.stream:
116 |         generate = predict(query, history, request.model)
117 |         return EventSourceResponse(generate, media_type="text/event-stream")
118 | 
119 |     response, _ = model.chat(tokenizer, query, history=history)
120 |     choice_data = ChatCompletionResponseChoice(
121 |         index=0,
122 |         message=ChatMessage(role="assistant", content=response),
123 |         finish_reason="stop"
124 |     )
125 | 
126 |     return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion")
127 | 
128 | 
129 | async def predict(query: str, history: List[List[str]], model_id: str):
130 |     global model, tokenizer
131 | 
132 |     choice_data = ChatCompletionResponseStreamChoice(
133 |         index=0,
134 |         delta=DeltaMessage(role="assistant"),
135 |         finish_reason=None
136 |     )
137 |     chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
138 |     yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
139 | 
140 |     current_length = 0
141 | 
142 |     for new_response, _ in model.stream_chat(tokenizer, query, history):
143 |         if len(new_response) == current_length:
144 |             continue
145 | 
146 |         new_text = new_response[current_length:]
147 |         current_length = len(new_response)
148 | 
149 |         choice_data = ChatCompletionResponseStreamChoice(
150 |             index=0,
151 |             delta=DeltaMessage(content=new_text),
152 |             finish_reason=None
153 |         )
154 |         chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
155 |         yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
156 | 
157 | 
158 |     choice_data = ChatCompletionResponseStreamChoice(
159 |         index=0,
160 |         delta=DeltaMessage(),
161 |         finish_reason="stop"
162 |     )
163 |     chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
164 |     yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
165 |     yield '[DONE]'
166 | 
167 | 
168 | 
169 | if __name__ == "__main__":
170 |     tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
171 |     model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
172 |     # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to your actual number of GPUs
173 |     # from utils import load_model_on_gpus
174 |     # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
175 |     model.eval()
176 | 
177 |     uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
178 | 


--------------------------------------------------------------------------------
/ptuning/README.md:
--------------------------------------------------------------------------------
  1 | # ChatGLM2-6B-PT
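  2 | This repository implements fine-tuning of the ChatGLM2-6B model based on [P-Tuning v2](https://github.com/THUDM/P-tuning-v2). P-Tuning v2 reduces the number of parameters that need to be fine-tuned to 0.1% of the original; combined with model quantization, gradient checkpointing, and similar methods, it can run with as little as 7 GB of GPU memory.
  3 | 
  4 | The following uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example to show how to use the code.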
  5 | 
  6 | ## Software dependencies
  7 | In addition to the dependencies of ChatGLM2-6B, fine-tuning requires the following packages:
  8 | ```
  9 | pip install rouge_chinese nltk jieba datasets
 10 | ```
 11 | ## Usage
 12 | 
 13 | ### Download the dataset
 14 | The ADGEN task is to generate an advertising text (summary) from the input (content).
 15 | 
 16 | ```json
 17 | {
 18 |     "content": "类型#上衣*版型#宽松*版型#显瘦*图案#线条*衣样式#衬衫*衣袖型#泡泡袖*衣款式#抽绳",
 19 |     "summary": "这件衬衫的款式非常的宽松，利落的线条可以很好的隐藏身材上的小缺点，穿在身上有着很好的显瘦效果。领口装饰了一个可爱的抽绳，漂亮的绳结展现出了十足的个性，配合时尚的泡泡袖型，尽显女性甜美可爱的气息。"
 20 | }
 21 | ```
 22 | 
 23 | Download the preprocessed ADGEN dataset from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1), and place the extracted `AdvertiseGen` directory in this directory.
 24 | 
 25 | ### Training
 26 | 
 27 | #### P-Tuning v2
 28 | 
 29 | Run the following command to train:
 30 | ```shell
 31 | bash train.sh
 32 | ```
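 33 | `PRE_SEQ_LEN` and `LR` in `train.sh` are the soft prompt length and the training learning rate, respectively; tune them to get the best results. P-Tuning v2 freezes all of the model's parameters. The quantization level of the original model can be changed via `quantization_bit`; if this option is omitted, the model is loaded in FP16 precision.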
 34 | 
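 35 | With the default configuration `quantization_bit=4`, `per_device_train_batch_size=1`, `gradient_accumulation_steps=16`, the INT4 model parameters are frozen, and one training iteration performs 16 accumulated forward and backward passes at a batch size of 1, equivalent to a total batch size of 16; this requires as little as 6.7 GB of GPU memory. To improve training efficiency at the same total batch size, increase `per_device_train_batch_size` while keeping the product of the two values unchanged. This also consumes more GPU memory, so adjust according to your situation.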
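The relationship between the two settings can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `effective_batch_size` and its `num_devices` parameter are assumptions made for this example and are not part of `train.sh` or the training code.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Default configuration described above: batch size 1, 16 accumulation steps.
default = effective_batch_size(1, 16)

# Same total batch size with fewer, larger micro-batches (needs more GPU memory).
alternative = effective_batch_size(4, 4)

print(default, alternative)  # 16 16
```

Any pair of values with the same product gives the same total batch size; larger per-device batches trade memory for fewer accumulation passes.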
 36 | 
 37 | If you want to [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` in `train.sh` to your local model path.
 38 | 
 39 | #### Finetune
 40 | 
 41 | For full-parameter fine-tuning, install [Deepspeed](https://github.com/microsoft/DeepSpeed) and then run the following command:
 42 | 
 43 | ```shell
 44 | bash ds_train_finetune.sh
 45 | ```
 46 | 
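 47 | ### Inference
 48 | 
 49 | P-Tuning v2 training saves only the PrefixEncoder parameters, so inference needs to load both the original ChatGLM2-6B model and the PrefixEncoder weights. Therefore, specify the following arguments in `evaluate.sh`: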
 50 | 
 51 | ```shell
 52 | --model_name_or_path THUDM/chatglm2-6b
 53 | --ptuning_checkpoint $CHECKPOINT_PATH
 54 | ```
 55 | 
 56 | For a full-parameter fine-tuning checkpoint, just set `model_name_or_path` as before:
 57 | 
 58 | ```shell
 59 | --model_name_or_path $CHECKPOINT_PATH
 60 | ```
 61 | 
 62 | The evaluation metrics are Chinese ROUGE score and BLEU-4. The generated results are saved in
 63 | `./output/adgen-chatglm2-6b-pt-128-2e-2/generated_predictions.txt`.
 64 | 
 65 | ### Examples
 66 | #### Example 1
 67 | * Input: 类型#上衣\*材质#牛仔布\*颜色#白色\*风格#简约\*图案#刺绣\*衣样式#外套\*衣款式#破洞
 68 | * Label: 简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。
 69 | * Output [before fine-tuning]: 这件上衣的材质是牛仔布,颜色是白色,风格是简约,图案是刺绣,衣样式是外套,衣款式是破洞。
 70 | * Output [after fine-tuning]: 这是一款简约的牛仔外套,破洞设计,将牛仔布破洞,带来一种随意与个性。破洞的牛仔外套,展现出时尚气息,带来一种休闲感。同时,刺绣图案,让整件外套更加立体。
 71 | 
 72 | #### Example 2
 73 | 
 74 | * Input: 类型#裙\*版型#显瘦\*风格#文艺\*风格#简约\*图案#印花\*图案#撞色\*裙下摆#压褶\*裙长#连衣裙\*裙领型#圆领
 75 | * Label: 文艺个性的印花连衣裙,藏青色底蕴,低调又大气,撞色太阳花分布整个裙身,绚丽而美好,带来时尚减龄的气质。基础款的舒适圆领,简约不失大方,勾勒精致脸庞。领后是一粒包布扣固定,穿脱十分方便。前片立体的打褶设计,搭配后片压褶的做工,增添层次和空间感,显瘦又有型。
 76 | * Output [before fine-tuning]: 类型#裙*版型#显瘦*风格#文艺*风格#简约*图案#印花*图案#撞色*裙下摆#压褶*裙长#连衣裙*裙领型#圆领 1\. 连衣裙:简约风格,裙长为膝盖以上,裙领型为圆领。2\. 裙下摆:压褶设计,使裙摆呈现出流畅的褶皱效果。3\. 裙领型:裙领型为圆领,使穿上连衣裙后更加有型。4\. 版型:采用显瘦设计,让连衣裙看起来更加苗条。5\. 风格:文艺风格,让连衣裙更加有内涵和品味。6\. 图案:印花设计,在连衣裙上印有独特的图案。7\. 撞色:采用撞色设计,让连衣裙在色彩上更加鲜明、富有层次感。
 77 | * Output [after fine-tuning]: 这是一款文艺范的连衣裙,以印花为元素,采用简约的印花,既能够突出文艺气质,又能够展现简约风。在印花的同时又有领子和裙摆的压褶设计,更加凸显文艺气质。简约而不会过于单调,搭配出街,穿着十分舒适。
 78 | 
 79 | 
 80 | ## Model deployment
 81 | First load the tokenizer:
 82 | 
 83 | ```python
 84 | import os
 85 | import torch
 86 | from transformers import AutoConfig, AutoModel, AutoTokenizer
 87 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
 88 | ```
 89 | 
 90 | 1. To load a P-Tuning checkpoint:
 91 | 
 92 | ```python
 93 | config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, pre_seq_len=128)
 94 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", config=config, trust_remote_code=True)
 95 | prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
 96 | new_prefix_state_dict = {}
 97 | for k, v in prefix_state_dict.items():
 98 |     if k.startswith("transformer.prefix_encoder."):
 99 |         new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
100 | model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
101 | ```
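102 | Note that you may need to change `pre_seq_len` to the value actually used in training. If you [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` to the local model path (not the checkpoint path).
103 | 
104 | 2. To load a full-parameter fine-tuning checkpoint, load the whole checkpoint directly: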
105 | 
106 | ```python
107 | model = AutoModel.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True)
108 | ```
109 | 
110 | Afterwards you can quantize the model as needed, or use it directly:
111 | 
112 | ```python
113 | # Comment out the following line if you don't use quantization
114 | model = model.quantize(4)
115 | model = model.cuda()
116 | model = model.eval()
117 | 
118 | response, history = model.chat(tokenizer, "你好", history=[])
119 | ```
120 | 
121 | You can also directly run the [web demo](./web_demo.py), which supports loading P-Tuning v2 checkpoints:
122 | ```shell
123 | bash web_demo.sh
124 | ```
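125 | You may need to modify [web_demo.sh](./web_demo.sh) to match your actual checkpoint.
126 | 
127 | ## Using your own dataset
128 | Change `train_file`, `validation_file`, and `test_file` in `train.sh` and `evaluate.sh` to the paths of your own JSON-format dataset, and change `prompt_column` and `response_column` to the keys for the input text and output text in the JSON file. You may also need to increase `max_source_length` and `max_target_length` to match the maximum input and output lengths in your dataset.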
129 | 
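130 | ## Dialogue dataset
131 | 
132 | To fine-tune the model on multi-turn dialogue data, you can provide the chat history. For example, the following is the training data for a three-turn dialogue: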
133 | 
134 | ```json lines
135 | {"prompt": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "response": "用电脑能读数据流吗？水温多少", "history": []}
136 | {"prompt": "95", "response": "上下水管温差怎么样啊？空气是不是都排干净了呢？", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗？水温多少"]]}
137 | {"prompt": "是的。上下水管都好的", "response": "那就要检查线路了，一般风扇继电器是由电脑控制吸合的，如果电路存在断路，或者电脑坏了的话会出现继电器不吸合的情况！", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗？水温多少"], ["95", "上下水管温差怎么样啊？空气是不是都排干净了呢？"]]}
138 | ```
139 | 
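140 | During training, set `--history_column` to the key of the chat history in the data (`history` in this example); the chat history is then concatenated automatically. Note that content beyond the input length `max_source_length` is truncated.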
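To make the concatenation concrete, the sketch below builds a single prompt from a `history` list and the current `prompt` field. It uses the `[Round N]\n\n问：…\n\n答：` template that appears in the model code in this repository; the layout used for earlier turns and the helper name `concat_history` are assumptions made for illustration, and the actual preprocessing in `main.py` may differ (for example in how truncation is applied).

```python
from typing import List, Sequence


def concat_history(query: str, history: List[Sequence[str]]) -> str:
    """Join earlier (query, response) turns and the current query into one prompt."""
    prompt = ""
    # Earlier turns are written out in full, each with its round number.
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n\n问：{}\n\n答：{}\n\n".format(i + 1, old_query, response)
    # The current turn ends with an open answer slot for the model to fill.
    prompt += "[Round {}]\n\n问：{}\n\n答：".format(len(history) + 1, query)
    return prompt


# A second-turn sample: one earlier exchange plus the current query.
example = concat_history("95", [["q1", "a1"]])
print(example)
```

With an empty `history` the result is just the first-round template, which matches how a single-turn sample would be formatted under the same assumption.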
141 | 
142 | You can refer to the following command:
143 | 
144 | ```shell
145 | bash train_chat.sh
146 | ```
147 | 
148 | ## Citation
149 | 
150 | ```
151 | @inproceedings{liu2022p,
152 |   title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
153 |   author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
154 |   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
155 |   pages={61--68},
156 |   year={2022}
157 | }
158 | ```
159 | 
160 | 
161 | 
162 | 


--------------------------------------------------------------------------------
/ptuning/arguments.py:
--------------------------------------------------------------------------------
  1 | from dataclasses import dataclass, field
  2 | from typing import Optional
  3 | 
  4 | 
  5 | @dataclass
  6 | class ModelArguments:
  7 |     """
  8 |     Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
  9 |     """
 10 | 
 11 |     model_name_or_path: str = field(
 12 |         metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
 13 |     )
 14 |     ptuning_checkpoint: Optional[str] = field(
 15 |         default=None, metadata={"help": "Path to p-tuning v2 checkpoints"}
 16 |     )
 17 |     config_name: Optional[str] = field(
 18 |         default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
 19 |     )
 20 |     tokenizer_name: Optional[str] = field(
 21 |         default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
 22 |     )
 23 |     cache_dir: Optional[str] = field(
 24 |         default=None,
 25 |         metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
 26 |     )
 27 |     use_fast_tokenizer: bool = field(
 28 |         default=True,
 29 |         metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
 30 |     )
 31 |     model_revision: str = field(
 32 |         default="main",
 33 |         metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
 34 |     )
 35 |     use_auth_token: bool = field(
 36 |         default=False,
 37 |         metadata={
 38 |             "help": (
 39 |                 "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
 40 |                 "with private models)."
 41 |             )
 42 |         },
 43 |     )
 44 |     resize_position_embeddings: Optional[bool] = field(
 45 |         default=None,
 46 |         metadata={
 47 |             "help": (
 48 |                 "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
 49 |                 "the model's position embeddings."
 50 |             )
 51 |         },
 52 |     )
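 53 |     quantization_bit: Optional[int] = field(
 54 |         default=None, metadata={"help": "Bit width passed to `model.quantize` in main.py; no quantization if unset."}
 55 |     )
 56 |     pre_seq_len: Optional[int] = field(
 57 |         default=None, metadata={"help": "Prefix length for P-Tuning v2; plain fine-tuning is used if unset."}
 58 |     )
 59 |     prefix_projection: bool = field(
 60 |         default=False, metadata={"help": "Copied to `config.prefix_projection` in main.py when loading the model."}
 61 |     )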
 62 | 
 63 | 
 64 | @dataclass
 65 | class DataTrainingArguments:
 66 |     """
 67 |     Arguments pertaining to what data we are going to input our model for training and eval.
 68 |     """
 69 | 
 70 |     lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."})
 71 | 
 72 |     dataset_name: Optional[str] = field(
 73 |         default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
 74 |     )
 75 |     dataset_config_name: Optional[str] = field(
 76 |         default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
 77 |     )
 78 |     prompt_column: Optional[str] = field(
 79 |         default=None,
 80 |         metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
 81 |     )
 82 |     response_column: Optional[str] = field(
 83 |         default=None,
 84 |         metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
 85 |     )
 86 |     history_column: Optional[str] = field(
 87 |         default=None,
 88 |         metadata={"help": "The name of the column in the datasets containing the history of chat."},
 89 |     )
 90 |     train_file: Optional[str] = field(
 91 |         default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
 92 |     )
 93 |     validation_file: Optional[str] = field(
 94 |         default=None,
 95 |         metadata={
 96 |             "help": (
 97 |                 "An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
 98 |             )
 99 |         },
100 |     )
101 |     test_file: Optional[str] = field(
102 |         default=None,
103 |         metadata={
104 |             "help": "An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
105 |         },
106 |     )
107 |     overwrite_cache: bool = field(
108 |         default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
109 |     )
110 |     preprocessing_num_workers: Optional[int] = field(
111 |         default=None,
112 |         metadata={"help": "The number of processes to use for the preprocessing."},
113 |     )
114 |     max_source_length: Optional[int] = field(
115 |         default=1024,
116 |         metadata={
117 |             "help": (
118 |                 "The maximum total input sequence length after tokenization. Sequences longer "
119 |                 "than this will be truncated, sequences shorter will be padded."
120 |             )
121 |         },
122 |     )
123 |     max_target_length: Optional[int] = field(
124 |         default=128,
125 |         metadata={
126 |             "help": (
127 |                 "The maximum total sequence length for target text after tokenization. Sequences longer "
128 |                 "than this will be truncated, sequences shorter will be padded."
129 |             )
130 |         },
131 |     )
132 |     val_max_target_length: Optional[int] = field(
133 |         default=None,
134 |         metadata={
135 |             "help": (
136 |                 "The maximum total sequence length for validation target text after tokenization. Sequences longer "
137 |                 "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. "
138 |                 "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
139 |                 "during ``evaluate`` and ``predict``."
140 |             )
141 |         },
142 |     )
143 |     pad_to_max_length: bool = field(
144 |         default=False,
145 |         metadata={
146 |             "help": (
147 |                 "Whether to pad all samples to model maximum sentence length. "
148 |                 "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
149 |                 "efficient on GPU but very bad for TPU."
150 |             )
151 |         },
152 |     )
153 |     max_train_samples: Optional[int] = field(
154 |         default=None,
155 |         metadata={
156 |             "help": (
157 |                 "For debugging purposes or quicker training, truncate the number of training examples to this "
158 |                 "value if set."
159 |             )
160 |         },
161 |     )
162 |     max_eval_samples: Optional[int] = field(
163 |         default=None,
164 |         metadata={
165 |             "help": (
166 |                 "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
167 |                 "value if set."
168 |             )
169 |         },
170 |     )
171 |     max_predict_samples: Optional[int] = field(
172 |         default=None,
173 |         metadata={
174 |             "help": (
175 |                 "For debugging purposes or quicker training, truncate the number of prediction examples to this "
176 |                 "value if set."
177 |             )
178 |         },
179 |     )
180 |     num_beams: Optional[int] = field(
181 |         default=None,
182 |         metadata={
183 |             "help": (
184 |                 "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
185 |                 "which is used during ``evaluate`` and ``predict``."
186 |             )
187 |         },
188 |     )
189 |     ignore_pad_token_for_loss: bool = field(
190 |         default=True,
191 |         metadata={
192 |             "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
193 |         },
194 |     )
195 |     source_prefix: Optional[str] = field(
196 |         default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
197 |     )
198 | 
199 |     forced_bos_token: Optional[str] = field(
200 |         default=None,
201 |         metadata={
202 |             "help": (
203 |                 "The token to force as the first generated token after the decoder_start_token_id. "
204 |                 "Useful for multilingual models like mBART where the first generated token "
205 |                 "needs to be the target language token (Usually it is the target language token)"
206 |             )
207 |         },
208 |     )
209 | 
210 | 
211 | 
212 |     def __post_init__(self):
213 |         if self.dataset_name is None and self.train_file is None and self.validation_file is None and self.test_file is None:
214 |             raise ValueError("Need either a dataset name or a training/validation/test file.")
215 |         else:
216 |             if self.train_file is not None:
217 |                 extension = self.train_file.split(".")[-1]
218 |                 assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
219 |             if self.validation_file is not None:
220 |                 extension = self.validation_file.split(".")[-1]
221 |                 assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
222 |         if self.val_max_target_length is None:
223 |             self.val_max_target_length = self.max_target_length
224 | 
225 | 
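
The validation in `DataTrainingArguments.__post_init__` can be tried in isolation. The following is a minimal sketch using only the standard library; `DataArgsSketch` is a hypothetical, trimmed-down stand-in (not the class above), and it raises `ValueError` for a bad extension where the original uses `assert`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgsSketch:
    """Hypothetical, trimmed-down stand-in for DataTrainingArguments."""

    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None
    test_file: Optional[str] = None
    max_target_length: int = 128
    val_max_target_length: Optional[int] = None

    def __post_init__(self):
        # Same rule as the original: at least one data source is required.
        sources = (self.dataset_name, self.train_file, self.validation_file, self.test_file)
        if all(source is None for source in sources):
            raise ValueError("Need either a dataset name or a training/validation/test file.")
        # Same extension check as the original, applied to train and validation files only.
        for name in ("train_file", "validation_file"):
            path = getattr(self, name)
            if path is not None and path.split(".")[-1] not in ("csv", "json"):
                raise ValueError(f"`{name}` should be a csv or a json file.")
        # Fall back to the training target length when no validation length is given.
        if self.val_max_target_length is None:
            self.val_max_target_length = self.max_target_length


args = DataArgsSketch(train_file="AdvertiseGen/train.json", max_target_length=64)
print(args.val_max_target_length)  # 64
```

As in the original, `test_file` counts as a data source but its extension is not checked.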


--------------------------------------------------------------------------------
/ptuning/deepspeed.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "train_micro_batch_size_per_gpu": "auto",
 3 |   "zero_allow_untested_optimizer": true,
 4 |   "fp16": {
 5 |     "enabled": "auto",
 6 |     "loss_scale": 0,
 7 |     "initial_scale_power": 16,
 8 |     "loss_scale_window": 1000,
 9 |     "hysteresis": 2,
10 |     "min_loss_scale": 1
11 |   },
12 |   "zero_optimization": {
13 |     "stage": 2,
14 |     "allgather_partitions": true,
15 |     "allgather_bucket_size": 5e8,
16 |     "overlap_comm": false,
17 |     "reduce_scatter": true,
18 |     "reduce_bucket_size": 5e8,
19 |     "contiguous_gradients" : true
20 |   }
21 | }
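
The `fp16` block above can be read as follows, assuming DeepSpeed's documented semantics: `"loss_scale": 0` selects dynamic loss scaling, and the starting scale is 2 raised to `initial_scale_power`. The sketch below only parses a copy of the config and derives those values; `describe_fp16` is a hypothetical helper, not part of DeepSpeed.

```python
import json

# Copy of the relevant parts of deepspeed.json (trimmed for brevity).
CONFIG_TEXT = """
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {"stage": 2, "reduce_bucket_size": 5e8}
}
"""


def describe_fp16(config: dict) -> dict:
    """Hypothetical helper: summarize the loss-scaling settings of a config."""
    fp16 = config["fp16"]
    dynamic = fp16["loss_scale"] == 0  # assumption: 0 means dynamic loss scaling
    initial = 2 ** fp16["initial_scale_power"] if dynamic else fp16["loss_scale"]
    return {
        "dynamic_loss_scaling": dynamic,
        "initial_loss_scale": initial,
        "min_loss_scale": fp16["min_loss_scale"],
    }


config = json.loads(CONFIG_TEXT)
summary = describe_fp16(config)
bucket_elements = int(config["zero_optimization"]["reduce_bucket_size"])
print(summary)
print(bucket_elements)
```

Note that Python's `json` module parses `5e8` as a float, so the bucket size is converted to an integer before use.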


--------------------------------------------------------------------------------
/ptuning/ds_train_finetune.sh:
--------------------------------------------------------------------------------
 1 | 
 2 | LR=1e-4
 3 | 
 4 | MASTER_PORT=$(shuf -n 1 -i 10000-65535)
 5 | 
 6 | deepspeed --num_gpus=4 --master_port $MASTER_PORT main.py \
 7 |     --deepspeed deepspeed.json \
 8 |     --do_train \
 9 |     --train_file AdvertiseGen/train.json \
10 |     --test_file AdvertiseGen/dev.json \
11 |     --prompt_column content \
12 |     --response_column summary \
13 |     --overwrite_cache \
14 |     --model_name_or_path THUDM/chatglm2-6b \
15 |     --output_dir ./output/adgen-chatglm2-6b-ft-$LR \
16 |     --overwrite_output_dir \
17 |     --max_source_length 64 \
18 |     --max_target_length 64 \
19 |     --per_device_train_batch_size 4 \
20 |     --per_device_eval_batch_size 1 \
21 |     --gradient_accumulation_steps 1 \
22 |     --predict_with_generate \
23 |     --max_steps 5000 \
24 |     --logging_steps 10 \
25 |     --save_steps 1000 \
26 |     --learning_rate $LR \
27 |     --fp16
28 | 
29 | 
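
For orientation, the flags in ds_train_finetune.sh imply the following batch arithmetic. This is a back-of-the-envelope sketch that assumes plain data parallelism across all four GPUs; the padded sequence length follows the `max_source_length + max_target_length + 1` rule used by `preprocess_function_train` in main.py. The function name is hypothetical.

```python
# Values copied from ds_train_finetune.sh.
NUM_GPUS = 4
PER_DEVICE_TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 1
MAX_STEPS = 5000
MAX_SOURCE_LENGTH = 64
MAX_TARGET_LENGTH = 64


def effective_batch_size(num_gpus: int, per_device: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step under data parallelism (assumption)."""
    return num_gpus * per_device * grad_accum


global_batch = effective_batch_size(
    NUM_GPUS, PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS
)
samples_seen = global_batch * MAX_STEPS
# Training sequences are padded to source + target + 1 (room for the EOS token).
padded_seq_len = MAX_SOURCE_LENGTH + MAX_TARGET_LENGTH + 1

print(global_batch)    # 16
print(samples_seen)    # 80000
print(padded_seq_len)  # 129
```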


--------------------------------------------------------------------------------
/ptuning/evaluate.sh:
--------------------------------------------------------------------------------
 1 | PRE_SEQ_LEN=128
 2 | CHECKPOINT=adgen-chatglm2-6b-pt-128-2e-2
 3 | STEP=3000
 4 | NUM_GPUS=1
 5 | 
 6 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
 7 |     --do_predict \
 8 |     --validation_file AdvertiseGen/dev.json \
 9 |     --test_file AdvertiseGen/dev.json \
10 |     --overwrite_cache \
11 |     --prompt_column content \
12 |     --response_column summary \
13 |     --model_name_or_path THUDM/chatglm2-6b \
14 |     --ptuning_checkpoint ./output/$CHECKPOINT/checkpoint-$STEP \
15 |     --output_dir ./output/$CHECKPOINT \
16 |     --overwrite_output_dir \
17 |     --max_source_length 64 \
18 |     --max_target_length 64 \
19 |     --per_device_eval_batch_size 1 \
20 |     --predict_with_generate \
21 |     --pre_seq_len $PRE_SEQ_LEN \
22 |     --quantization_bit 4
23 | 
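
When `--ptuning_checkpoint` is passed, main.py loads `pytorch_model.bin` from that directory and keeps only the entries whose keys start with `transformer.prefix_encoder.`, stripping that prefix before loading them into the prefix encoder. The sketch below reproduces just the key filtering with a plain dictionary; the checkpoint keys and the string values standing in for tensors are hypothetical.

```python
PREFIX = "transformer.prefix_encoder."


def extract_prefix_encoder_state(state_dict: dict) -> dict:
    """Keep only prefix-encoder entries and strip the common key prefix."""
    return {
        key[len(PREFIX):]: value
        for key, value in state_dict.items()
        if key.startswith(PREFIX)
    }


# Hypothetical checkpoint contents; strings stand in for tensors.
checkpoint = {
    "transformer.prefix_encoder.embedding.weight": "prefix-weights",
    "transformer.output_layer.weight": "unrelated-weights",
}

prefix_state = extract_prefix_encoder_state(checkpoint)
print(prefix_state)  # {'embedding.weight': 'prefix-weights'}
```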


--------------------------------------------------------------------------------
/ptuning/evaluate_finetune.sh:
--------------------------------------------------------------------------------
 1 | CHECKPOINT=adgen-chatglm2-6b-ft-1e-4
 2 | STEP=3000
 3 | NUM_GPUS=1
 4 | 
 5 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
 6 |     --do_predict \
 7 |     --validation_file AdvertiseGen/dev.json \
 8 |     --test_file AdvertiseGen/dev.json \
 9 |     --overwrite_cache \
10 |     --prompt_column content \
11 |     --response_column summary \
12 |     --model_name_or_path ./output/$CHECKPOINT/checkpoint-$STEP  \
13 |     --output_dir ./output/$CHECKPOINT \
14 |     --overwrite_output_dir \
15 |     --max_source_length 256 \
16 |     --max_target_length 256 \
17 |     --per_device_eval_batch_size 1 \
18 |     --predict_with_generate \
19 |     --fp16_full_eval
20 | 


--------------------------------------------------------------------------------
/ptuning/main.py:
--------------------------------------------------------------------------------
  1 | # Color-annotated version on CSDN:
  2 | # ChatGLM2-6B source code walkthrough, ./ptuning/main.py (Part 1)  https://zengxiaojian.blog.csdn.net/article/details/131617133?spm=1001.2014.3001.5502
  3 | 
  4 | #!/usr/bin/env python
  5 | # coding=utf-8
  6 | # Copyright 2021 The HuggingFace Team. All rights reserved.
  7 | #
  8 | # Licensed under the Apache License, Version 2.0 (the "License");
  9 | # you may not use this file except in compliance with the License.
 10 | # You may obtain a copy of the License at
 11 | #
 12 | #     http://www.apache.org/licenses/LICENSE-2.0
 13 | #
 14 | # Unless required by applicable law or agreed to in writing, software
 15 | # distributed under the License is distributed on an "AS IS" BASIS,
 16 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 17 | # See the License for the specific language governing permissions and
 18 | # limitations under the License.
 19 | """
 20 | Fine-tuning the library models for sequence to sequence.
 21 | """
 22 | # You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.
 23 | 
 24 | import logging
 25 | import os
 26 | import sys
 27 | import json
 28 | 
 29 | import numpy as np
 30 | from datasets import load_dataset  # Loads datasets (here, local csv/json files) through the Hugging Face datasets library.
 31 | import jieba  # Chinese word segmentation, used to tokenize text before scoring.
 32 | from rouge_chinese import Rouge  # Computes ROUGE scores, which measure the overlap between generated text (e.g. translations or summaries) and human reference text.
 33 | from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # sentence_bleu computes the BLEU score of a single sentence;
 34 | # SmoothingFunction smooths the n-gram precisions so that a missing n-gram match does not force the BLEU score to zero.
 35 | import torch
 36 | 
 37 | # Import transformers and some of its components. The library provides many pretrained models for natural language processing tasks.
 38 | import transformers
 39 | from transformers import (
 40 |     AutoConfig,  # Loads a model's configuration from a pretrained model name or path.
 41 |     AutoModel,  # Loads a pretrained model, selecting the correct model class from the name or path.
 42 |     AutoTokenizer,  # Loads the matching tokenizer, selecting the correct tokenizer class from the name or path.
 43 |     DataCollatorForSeq2Seq,  # Data collator for sequence-to-sequence models: gathers individual samples into a batch for training or evaluation.
 44 |     HfArgumentParser,  # Parses command-line arguments into dataclass instances, which integrates well with the rest of the Hugging Face tooling.
 45 |     Seq2SeqTrainingArguments,  # Training arguments for sequence-to-sequence models.
 46 |     set_seed,  # Sets the random seed so that experiments are reproducible.
 47 | )
 48 | 
 49 | # Import Seq2SeqTrainer from the local trainer_seq2seq module; it trains sequence-to-sequence models.
 50 | from trainer_seq2seq import Seq2SeqTrainer
 51 | 
 52 | from arguments import ModelArguments, DataTrainingArguments  # The two dataclasses that define the model and data command-line arguments.
 53 | 
 54 | logger = logging.getLogger(__name__)  # Create a logger for recording what the script does as it runs.
 55 | 
 56 | def main():
 57 |     parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))  # Create an HfArgumentParser that parses arguments into instances of ModelArguments, DataTrainingArguments, and Seq2SeqTrainingArguments.
 58 |     if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):  # If the only command-line argument is a .json file, read the arguments from that file.
 59 |         # If we pass only one argument to the script and it's the path to a json file,
 60 |         # let's parse it to get our arguments.
 61 |         # Read the arguments from the .json file and assign them to model_args, data_args, and training_args.
 62 |         model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
 63 |     else:
 64 |         # Otherwise, parse the arguments directly from the command line.
 65 |         model_args, data_args, training_args = parser.parse_args_into_dataclasses()
 66 | 
 67 |     # Setup logging
 68 |     # Configure basic logging: message format, date format, and handlers.
 69 |     logging.basicConfig(
 70 |         format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
 71 |         datefmt="%m/%d/%Y %H:%M:%S",
 72 |         handlers=[logging.StreamHandler(sys.stdout)],
 73 |     )
 74 |     
 75 |     # If training_args.should_log is true (this process should log), set the transformers log level to info.
 76 |     if training_args.should_log:
 77 |         # The default of training_args.log_level is passive, so we set log level at info here to have that default.
 78 |         transformers.utils.logging.set_verbosity_info()
 79 | 
 80 |     log_level = training_args.get_process_log_level()
 81 |     # Set the log level of this script's logger.
 82 |     logger.setLevel(log_level)  # Controls how much detail this script's own log output shows.
 83 |     #datasets.utils.logging.set_verbosity(log_level)
 84 |     transformers.utils.logging.set_verbosity(log_level)  # Set the transformers logging level to the same level obtained above.
 85 |     transformers.utils.logging.enable_default_handler()
 86 |     transformers.utils.logging.enable_explicit_format()  # These two lines enable the default log handler and the explicit log format. The default handler normally sends messages to the console; the explicit format fixes the layout of each message.
 87 | 
 88 |     # Log on each process the small summary:
 89 |     logger.warning(  # logger.warning and logger.info print basic information about the run: the device, the distributed-training setup, whether 16-bit training is used, and so on.
 90 |         f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
 91 |         + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
 92 |     )
 93 |     logger.info(f"Training/evaluation parameters {training_args}")
 94 | 
 95 |     # Set seed before initializing model.
 96 |     set_seed(training_args.seed)  # Set the random seed so that repeated runs give the same results.
 97 | 
 98 |     # Load dataset
 99 |     data_files = {}
100 |     if data_args.train_file is not None:
101 |         data_files["train"] = data_args.train_file
102 |         extension = data_args.train_file.split(".")[-1]
103 |     if data_args.validation_file is not None:
104 |         data_files["validation"] = data_args.validation_file
105 |         extension = data_args.validation_file.split(".")[-1]
106 |     if data_args.test_file is not None:
107 |         data_files["test"] = data_args.test_file
108 |         extension = data_args.test_file.split(".")[-1]
109 | 
110 |     raw_datasets = load_dataset(
111 |         extension,
112 |         data_files=data_files,
113 |         cache_dir=model_args.cache_dir,
114 |         use_auth_token=True if model_args.use_auth_token else None,
115 |     )
116 | 
117 |     # Load pretrained model and tokenizer
118 |     config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
119 |     config.pre_seq_len = model_args.pre_seq_len
120 |     config.prefix_projection = model_args.prefix_projection
121 | 
122 |     tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
123 | 
124 |     if model_args.ptuning_checkpoint is not None:
125 |         # Evaluation
126 |         # Loading extra state dict of prefix encoder
127 |         model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
128 |         prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
129 |         new_prefix_state_dict = {}
130 |         for k, v in prefix_state_dict.items():
131 |             if k.startswith("transformer.prefix_encoder."):
132 |                 new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
133 |         model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
134 |     else:
135 |         model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
136 | 
137 |     if model_args.quantization_bit is not None:
138 |         print(f"Quantized to {model_args.quantization_bit} bit")
139 |         model = model.quantize(model_args.quantization_bit)
140 |     if model_args.pre_seq_len is not None:
141 |         # P-tuning v2
142 |         model = model.half()
143 |         model.transformer.prefix_encoder.float()
144 |     else:
145 |         # Finetune
146 |         model = model.float()
147 | 
148 |     prefix = data_args.source_prefix if data_args.source_prefix is not None else ""
149 | 
150 |     # Preprocessing the datasets.
151 |     # We need to tokenize inputs and targets.
152 |     if training_args.do_train: 
153 |         column_names = raw_datasets["train"].column_names
154 |     elif training_args.do_eval:
155 |         column_names = raw_datasets["validation"].column_names
156 |     elif training_args.do_predict:
157 |         column_names = raw_datasets["test"].column_names
158 |     else:
159 |         logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.")
160 |         return
161 | 
162 |     # Get the column names for input/target.
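163 |     prompt_column = data_args.prompt_column  # Name of the prompt column, i.e. the column holding the query.
164 |     response_column = data_args.response_column  # Name of the response column, i.e. the column holding the answer or target.
165 |     history_column = data_args.history_column  # Name of the chat-history column, if any; the history is used as context for the query.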
166 |     
167 |     # Temporarily set max_target_length for training.
168 |     max_target_length = data_args.max_target_length
169 | 
170 |     # The preprocessing functions below format and tokenize the input and target columns. Their output is used for training and evaluation.
171 |     def preprocess_function_eval(examples):  # This function and preprocess_function_train prepare data for evaluation and training respectively. Each extracts prompts and responses from the examples, formats and tokenizes them, and returns the result as a model_inputs dictionary.
172 |         inputs, targets = [], []
173 |         for i in range(len(examples[prompt_column])):  # Iterate over every entry of examples[prompt_column], the prompt column extracted from the dataset.
174 |             if examples[prompt_column][i] and examples[response_column][i]:  # Keep the i-th sample only if both its prompt and its response are non-empty; otherwise skip it.
175 |                 query = examples[prompt_column][i]  # The i-th prompt.
176 |                 history = examples[history_column][i] if history_column is not None else None  # The i-th chat history if a history column is configured, otherwise None.
177 |                 prompt = tokenizer.build_prompt(query, history)  # Use the tokenizer's build_prompt to combine the query and history into the model's input text, in the format the model expects.
178 |                 inputs.append(prompt)  # Collect the prompt; inputs becomes the model input.
179 |                 targets.append(examples[response_column][i])  # Collect the i-th response; targets becomes the model target.
180 |                 # After the loop, inputs holds all input samples and targets holds all target samples.
181 | 
182 |         inputs = [prefix + inp for inp in inputs]  # Prepend prefix to every input. The prefix is an optional string that some models expect at the start of the source text.
183 |         model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)  # Tokenize the inputs. Inputs longer than data_args.max_source_length are truncated; shorter ones are padded to the longest sequence in the batch. model_inputs is a dictionary holding the encoded inputs.
184 |         labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)  # Tokenize the targets (the expected outputs) with truncation to get the labels.
185 |  
186 |         if data_args.ignore_pad_token_for_loss:  # If padding should be ignored in the loss:
187 |             labels["input_ids"] = [  # Replace every pad token ID in the labels with -100 and leave the rest unchanged. -100 is the default ignore_index of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.
188 |                 [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
189 |             ]
190 |         model_inputs["labels"] = labels["input_ids"]  # Attach the processed labels so that model_inputs carries both inputs and labels.
191 | 
192 |         return model_inputs
193 | 
194 |     # preprocess_function_train(examples) preprocesses data for training: for each example it builds the input IDs and labels the model needs, as follows.
195 |     def preprocess_function_train(examples):
196 |         max_seq_length = data_args.max_source_length + data_args.max_target_length + 1  # Maximum sequence length: the source (prompt) limit plus the target (answer) limit plus 1. The extra 1 leaves room for the end-of-sequence token appended below.
197 | 
198 |         model_inputs = {  # Dictionary that collects the model inputs.
199 |             "input_ids": [],
200 |             "labels": [],
201 |         }
202 |         for i in range(len(examples[prompt_column])):  # Iterate over every example.
203 |             if examples[prompt_column][i] and examples[response_column][i]:  # Process the example only if both prompt and answer are non-empty.
204 |                 query, answer = examples[prompt_column][i], examples[response_column][i]  # The prompt and the answer.
205 | 
206 |                 history = examples[history_column][i] if history_column is not None else None  # The chat history, if there is one.
207 |                 prompt = tokenizer.build_prompt(query, history)  # Build the prompt from the query and history with tokenizer.build_prompt.
208 | 
209 |                 prompt = prefix + prompt  # Prepend the prefix to the prompt.
210 |                 a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,  # Encode the prompt into the input ID sequence a_ids.
211 |                                          max_length=data_args.max_source_length)
212 |                 b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,  # Likewise, encode the answer into the ID sequence b_ids.
213 |                                          max_length=data_args.max_target_length)
214 | 
215 |                 context_length = len(a_ids)  # Length of the prompt part.
216 |                 input_ids = a_ids + b_ids + [tokenizer.eos_token_id]  # Concatenate prompt and answer IDs and append the end-of-sequence ID to get the full input sequence.
217 |                 labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]  # The first context_length labels are pad IDs, followed by the answer IDs and the end-of-sequence ID, so that only the answer part is scored.
218 |                 
219 |                 pad_len = max_seq_length - len(input_ids)  # Number of padding tokens needed.
220 |                 input_ids = input_ids + [tokenizer.pad_token_id] * pad_len  # Pad the input sequence to max_seq_length.
221 |                 labels = labels + [tokenizer.pad_token_id] * pad_len  # Pad the label sequence the same way.
222 |                 if data_args.ignore_pad_token_for_loss:  # If padding should be ignored in the loss, replace the pad IDs in the labels with -100.
223 |                     labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
224 | 
225 |                 model_inputs["input_ids"].append(input_ids)  # Add the processed inputs and labels to model_inputs.
226 |                 model_inputs["labels"].append(labels)  # Once every example is processed, model_inputs holds the inputs and labels of all examples and can be used directly for training.
227 | 
228 |         return model_inputs
229 | 
230 |     # The code below prepares the training, validation, and test datasets for training and prediction.
231 |     def print_dataset_example(example):  # Prints the input IDs and label IDs of an example together with their decoded text.
232 |         print("input_ids", example["input_ids"])
233 |         print("inputs", tokenizer.decode(example["input_ids"]))
234 |         print("label_ids", example["labels"])
235 |         print("labels", tokenizer.decode(example["labels"]))
236 | 
237 |     if training_args.do_train:  # If training is requested.
238 |         if "train" not in raw_datasets:  # Raise an error if the raw datasets contain no training split.
239 |             raise ValueError("--do_train requires a train dataset")
240 |         train_dataset = raw_datasets["train"]  # The training dataset.
241 |         if data_args.max_train_samples is not None:  # If a maximum number of training samples is set.
242 |             max_train_samples = min(len(train_dataset), data_args.max_train_samples)  # Use the smaller of the dataset size and the configured maximum.
243 |             train_dataset = train_dataset.select(range(max_train_samples))  # Keep that many training samples.
244 |         with training_args.main_process_first(desc="train dataset map pre-processing"):  # Run the preprocessing on the main process first; the other processes wait and can then reuse its cached result.
245 |             train_dataset = train_dataset.map(  # Apply the preprocessing function defined above, preprocess_function_train, to the training dataset.
246 |                 preprocess_function_train,
247 |                 batched=True,
248 |                 num_proc=data_args.preprocessing_num_workers,
249 |                 remove_columns=column_names,
250 |                 load_from_cache_file=not data_args.overwrite_cache,
251 |                 desc="Running tokenizer on train dataset",
252 |             )
253 |         print_dataset_example(train_dataset[0])  # Print the first preprocessed training example.
254 | 
255 |     if training_args.do_eval:  # If evaluation is requested (do_eval is a boolean), preprocess the validation set here.
256 |         max_target_length = data_args.val_max_target_length  # Use the validation-specific maximum target length.
257 |         if "validation" not in raw_datasets:
258 |             raise ValueError("--do_eval requires a validation dataset")  # Raise an error if the raw datasets contain no validation split.
259 |         eval_dataset = raw_datasets["validation"]  # The validation dataset.
260 |         if data_args.max_eval_samples is not None:  # If a maximum number of validation samples is set, select that many.
261 |             max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)  # Use the smaller of the validation set size and the configured maximum.
262 |             eval_dataset = eval_dataset.select(range(max_eval_samples))  # Keep that many validation samples for preprocessing and evaluation.
263 |         with training_args.main_process_first(desc="validation dataset map pre-processing"):  # The actual preprocessing: .map() applies preprocess_function_eval, defined above, to the validation dataset.
264 |             eval_dataset = eval_dataset.map(
265 |                 preprocess_function_eval,
266 |                 batched=True,
267 |                 num_proc=data_args.preprocessing_num_workers,
268 |                 remove_columns=column_names,
269 |                 load_from_cache_file=not data_args.overwrite_cache,
270 |                 desc="Running tokenizer on validation dataset",
271 |             )
272 |         print_dataset_example(eval_dataset[0])  # Print the first preprocessed validation example to check that preprocessing worked.
273 | 
274 |     if training_args.do_predict:
275 |         max_target_length = data_args.val_max_target_length
276 |         if "test" not in raw_datasets:
277 |             raise ValueError("--do_predict requires a test dataset")
278 |         predict_dataset = raw_datasets["test"]
279 |         if data_args.max_predict_samples is not None:
280 |             max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples)
281 |             predict_dataset = predict_dataset.select(range(max_predict_samples))
282 |         with training_args.main_process_first(desc="prediction dataset map pre-processing"):
283 |             predict_dataset = predict_dataset.map(
284 |                 preprocess_function_eval,
285 |                 batched=True,
286 |                 num_proc=data_args.preprocessing_num_workers,
287 |                 remove_columns=column_names,
288 |                 load_from_cache_file=not data_args.overwrite_cache,
289 |                 desc="Running tokenizer on prediction dataset",
290 |             )
291 |         print_dataset_example(predict_dataset[0])
292 | 
293 |     # Data collator
294 |     # Label padding ID: -100 if padding is ignored in the loss (data_args.ignore_pad_token_for_loss), otherwise the tokenizer's pad token ID.
295 |     label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
296 |     # Create the data collator for the sequence-to-sequence task. It assembles a batch of samples into model-ready form; tokenizer supplies the padding settings, and model is the pretrained model.
297 |     data_collator = DataCollatorForSeq2Seq(
298 |         tokenizer,
299 |         model=model,
300 |         label_pad_token_id=label_pad_token_id,  # Padding ID used for labels.
301 |         pad_to_multiple_of=None,  # Do not pad sequence lengths up to a multiple of some number.
302 |         padding=False  # Do not pad the inputs during collation.
303 |     )
304 | 
305 |     # Metric
306 |     # Defines how evaluation metrics are computed. eval_preds holds the model's predictions and the labels; the function decodes both from token IDs to text, then computes and returns the average of each metric (ROUGE and BLEU).
307 |     def compute_metrics(eval_preds):
308 |         preds, labels = eval_preds  # Unpack predictions and labels.
309 |         if isinstance(preds, tuple):  # If the predictions are a tuple, keep only the first element.
310 |             preds = preds[0]
311 |         decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)  # Batch-decode the predicted token IDs to text, skipping special tokens.
312 |         if data_args.ignore_pad_token_for_loss:  # If padding was ignored in the loss, replace every -100 in the labels (the original pad positions) with the tokenizer's pad token ID.
313 |             # Replace -100 in the labels as we can't decode them.
314 |             labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
315 |         decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)  # Batch-decode the label token IDs to text, skipping special tokens.
316 | 
317 |         score_dict = {  # Dictionary collecting the scores of each metric (rouge-1, rouge-2, rouge-l, and bleu-4).
318 |             "rouge-1": [],
319 |             "rouge-2": [],
320 |             "rouge-l": [],
321 |             "bleu-4": []
322 |         }
323 |         for pred, label in zip(decoded_preds, decoded_labels): #对每一个预测值和标签的配对进行遍历。
324 |             hypothesis = list(jieba.cut(pred))
325 |             reference = list(jieba.cut(label))  #hypothesis = list(jieba.cut(pred)) 和 reference = list(jieba.cut(label)) 使用 jieba 对预测和标签进行分词，生成假设和参考序列。
326 |             rouge = Rouge()
327 |             scores = rouge.get_scores(' '.join(hypothesis) , ' '.join(reference)) #计算 ROUGE 得分。
328 |             result = scores[0]
329 |             
330 |             for k, v in result.items():  # 对于每一个 ROUGE 指标（rouge-1，rouge-2，rouge-l），
331 |                 score_dict[k].append(round(v["f"] * 100, 4))  #将 f-score 存入 score_dict。
332 |             bleu_score = sentence_bleu([list(label)], list(pred), smoothing_function=SmoothingFunction().method3)  #计算 BLEU 得分。
333 |             score_dict["bleu-4"].append(round(bleu_score * 100, 4))  #score_dict["bleu-4"].append(round(bleu_score * 100, 4)) 将 BLEU 得分存入 score_dict。
334 | 
335 |         for k, v in score_dict.items():  # 对于 score_dict 中的每一个指标，计算并存储其平均值。
336 |             score_dict[k] = float(np.mean(v))
337 |         return score_dict  # 返回包含了各个评估指标平均得分的字典。
338 | 
339 |     # Override the decoding parameters of Seq2SeqTrainer
340 |     # Set the maximum generation length: use training_args.generation_max_length if it is set, otherwise fall back to data_args.val_max_target_length.
341 |     training_args.generation_max_length = (
342 |         training_args.generation_max_length
343 |         if training_args.generation_max_length is not None
344 |         else data_args.val_max_target_length
345 |     )
346 |     training_args.generation_num_beams = (
347 |         data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams
348 |     )
349 |     # Initialize our Trainer
350 |     trainer = Seq2SeqTrainer(
351 |         model=model,
352 |         args=training_args,
353 |         train_dataset=train_dataset if training_args.do_train else None,
354 |         eval_dataset=eval_dataset if training_args.do_eval else None,
355 |         tokenizer=tokenizer,
356 |         data_collator=data_collator,
357 |         compute_metrics=compute_metrics if training_args.predict_with_generate else None,
358 |         save_changed=model_args.pre_seq_len is not None
359 |     )
360 | 
361 |     # Training
362 |     if training_args.do_train:
363 |         checkpoint = None
364 |         if training_args.resume_from_checkpoint is not None:
365 |             checkpoint = training_args.resume_from_checkpoint
366 |         # elif last_checkpoint is not None:
367 |         #     checkpoint = last_checkpoint
368 |         model.gradient_checkpointing_enable()
369 |         model.enable_input_require_grads()
370 |         train_result = trainer.train(resume_from_checkpoint=checkpoint)
371 |         # trainer.save_model()  # Saves the tokenizer too for easy upload
372 | 
373 |         metrics = train_result.metrics
374 |         max_train_samples = (
375 |             data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
376 |         )
377 |         metrics["train_samples"] = min(max_train_samples, len(train_dataset))
378 | 
379 |         trainer.log_metrics("train", metrics)
380 |         trainer.save_metrics("train", metrics)
381 |         trainer.save_state()
382 | 
383 |     # Evaluation
384 |     results = {}
385 |     max_seq_length = data_args.max_source_length + data_args.max_target_length + 1
386 |     if training_args.do_eval:
387 |         logger.info("*** Evaluate ***")
388 |         metrics = trainer.evaluate(metric_key_prefix="eval", do_sample=True, top_p=0.7, max_length=max_seq_length, temperature=0.95)
389 |         max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
390 |         metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
391 | 
392 |         trainer.log_metrics("eval", metrics)
393 |         trainer.save_metrics("eval", metrics)
394 | 
395 |     if training_args.do_predict:
396 |         logger.info("*** Predict ***")
397 |         predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict", max_length=max_seq_length, do_sample=True, top_p=0.7, temperature=0.95)
398 |         metrics = predict_results.metrics
399 |         max_predict_samples = (
400 |             data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset)
401 |         )
402 |         metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset))
403 | 
404 |         trainer.log_metrics("predict", metrics)
405 |         trainer.save_metrics("predict", metrics)
406 | 
407 |         if trainer.is_world_process_zero():
408 |             if training_args.predict_with_generate:
409 |                 predictions = tokenizer.batch_decode(
410 |                     predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
411 |                 )
412 |                 predictions = [pred.strip() for pred in predictions]
413 |                 labels = tokenizer.batch_decode(
414 |                     predict_results.label_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
415 |                 )
416 |                 labels = [label.strip() for label in labels]
417 |                 output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt")
418 |                 with open(output_prediction_file, "w", encoding="utf-8") as writer:
419 |                     for p, l in zip(predictions, labels):
420 |                         res = json.dumps({"labels": l, "predict": p}, ensure_ascii=False)
421 |                         writer.write(f"{res}\n")
422 |     return results
423 | 
424 | 
425 | def _mp_fn(index):
426 |     # For xla_spawn (TPUs)
427 |     main()
428 | 
429 | 
430 | if __name__ == "__main__":
431 |     main()
432 | 
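
The metric step in `compute_metrics` first maps the `-100` label padding back to a real pad ID so the labels can be decoded, then collapses each metric's per-sample scores into a mean. A minimal pure-Python sketch of those two steps follows. `PAD_ID = 0` and the sample scores are assumptions made for illustration; the script itself uses `np.where`, `np.mean`, and the tokenizer's own pad ID.

```python
from statistics import mean

IGNORE_INDEX = -100  # label value ignored by the loss
PAD_ID = 0           # hypothetical pad token ID, for illustration only


def restore_pad_ids(labels, pad_id=PAD_ID):
    """Replace IGNORE_INDEX with pad_id so the labels can be decoded."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in labels]


def average_scores(score_dict):
    """Collapse each metric's list of per-sample scores into its mean."""
    return {name: float(mean(values)) for name, values in score_dict.items()}


labels = [[5, 6, -100, -100], [7, -100, -100, -100]]
print(restore_pad_ids(labels))

scores = {"rouge-1": [40.0, 60.0], "bleu-4": [10.0, 30.0]}
print(average_scores(scores))
```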


--------------------------------------------------------------------------------
/ptuning/train.sh:
--------------------------------------------------------------------------------
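 1 | # Comments are grouped here rather than placed inline: in bash, text after a
 2 | # trailing backslash breaks the line continuation and splits the command.
 3 | #
 4 | # PRE_SEQ_LEN: length of the trainable prefix (soft prompt) used by P-tuning v2.
 5 | # LR: learning rate (2e-2 = 0.02). Too large a value can make training unstable; too small a value makes it slow.
 6 | # torchrun: launches main.py on a single node (--nnodes=1) with NUM_GPUS processes, one per GPU.
 7 | # --do_train: run the training loop.
 8 | # --preprocessing_num_workers: number of worker processes used for dataset preprocessing.
 9 | # --prompt_column / --response_column: dataset columns holding the model input and the target response.
10 | # --overwrite_cache: rebuild the cached preprocessed datasets instead of reusing them.
11 | # --model_name_or_path: name or path of the pretrained model to fine-tune.
12 | # --overwrite_output_dir: allow writing into an existing output directory, overwriting its contents.
13 | # --max_source_length / --max_target_length: maximum token lengths of the input and of the target.
14 | # --per_device_train_batch_size / --per_device_eval_batch_size: training and evaluation batch size per device (GPU).
15 | # --gradient_accumulation_steps: number of batches whose gradients are accumulated before each parameter update.
16 | #     This lowers peak memory use, so a larger effective batch fits on a memory-limited GPU.
17 | # --predict_with_generate: use autoregressive generation when predicting.
18 | # --max_steps: total number of training steps.
19 | # --logging_steps / --save_steps: step intervals for logging and for saving checkpoints.
20 | # --learning_rate / --pre_seq_len: pass the LR and PRE_SEQ_LEN values defined below.
21 | # --quantization_bit: bit width for weight quantization. At 4 bits each weight maps to one of 16 (2^4) values,
22 | #     which reduces memory use and may speed up inference, possibly at some cost in accuracy.
23 | PRE_SEQ_LEN=128
24 | LR=2e-2
25 | NUM_GPUS=1
26 | 
27 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
28 |     --do_train \
29 |     --train_file AdvertiseGen/train.json \
30 |     --validation_file AdvertiseGen/dev.json \
31 |     --preprocessing_num_workers 10 \
32 |     --prompt_column content \
33 |     --response_column summary \
34 |     --overwrite_cache \
35 |     --model_name_or_path THUDM/chatglm2-6b \
36 |     --output_dir output/adgen-chatglm2-6b-pt-$PRE_SEQ_LEN-$LR \
37 |     --overwrite_output_dir \
38 |     --max_source_length 64 \
39 |     --max_target_length 128 \
40 |     --per_device_train_batch_size 1 \
41 |     --per_device_eval_batch_size 1 \
42 |     --gradient_accumulation_steps 16 \
43 |     --predict_with_generate \
44 |     --max_steps 3000 \
45 |     --logging_steps 10 \
46 |     --save_steps 1000 \
47 |     --learning_rate $LR \
48 |     --pre_seq_len $PRE_SEQ_LEN \
49 |     --quantization_bit 4
50 | 
51 | 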
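
The flags in train.sh combine into a few derived quantities that are easy to misjudge. The arithmetic below is a small sketch using the values from the script; the variable names are chosen here for illustration, and it assumes each training step is one optimizer update, as in the Hugging Face `Trainer`.

```python
# Values copied from train.sh
num_gpus = 1
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 3000
quantization_bit = 4

# Examples contributing to a single optimizer update
effective_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps

# Examples processed over the whole run (repeats included if the dataset is smaller)
examples_seen = effective_batch_size * max_steps

# Distinct values representable by one quantized weight
quantization_levels = 2 ** quantization_bit

print(effective_batch_size, examples_seen, quantization_levels)  # prints: 16 48000 16
```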


--------------------------------------------------------------------------------
/ptuning/train_chat.sh:
--------------------------------------------------------------------------------
 1 | PRE_SEQ_LEN=128
 2 | LR=1e-2
 3 | NUM_GPUS=1
 4 | 
 5 | torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
 6 |     --do_train \
 7 |     --train_file $CHAT_TRAIN_DATA \
 8 |     --validation_file $CHAT_VAL_DATA \
 9 |     --preprocessing_num_workers 10 \
10 |     --prompt_column prompt \
11 |     --response_column response \
12 |     --history_column history \
13 |     --overwrite_cache \
14 |     --model_name_or_path THUDM/chatglm2-6b \
15 |     --output_dir $CHECKPOINT_NAME \
16 |     --overwrite_output_dir \
17 |     --max_source_length 256 \
18 |     --max_target_length 256 \
19 |     --per_device_train_batch_size 1 \
20 |     --per_device_eval_batch_size 1 \
21 |     --gradient_accumulation_steps 16 \
22 |     --predict_with_generate \
23 |     --max_steps 3000 \
24 |     --logging_steps 10 \
25 |     --save_steps 1000 \
26 |     --learning_rate $LR \
27 |     --pre_seq_len $PRE_SEQ_LEN \
28 |     --quantization_bit 4
29 | 
30 | 


--------------------------------------------------------------------------------
/ptuning/trainer.py:
--------------------------------------------------------------------------------
 1 | # coding=utf-8
 2 | # Copyright 2020-present the HuggingFace Inc. team.
 3 | #
 4 | # Licensed under the Apache License, Version 2.0 (the "License");
 5 | # you may not use this file except in compliance with the License.
 6 | # You may obtain a copy of the License at
 7 | #
 8 | #     http://www.apache.org/licenses/LICENSE-2.0
 9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """
16 | The Trainer class, to easily train a 🤗 Transformers from scratch or finetune it on a new task.
17 | """
18 | import os
19 | from typing import Optional
20 | from transformers import Trainer
21 | 
22 | import torch
23 | from transformers.modeling_utils import PreTrainedModel, unwrap_model
24 | from transformers.utils import logging
25 | 
26 | logger = logging.get_logger(__name__)
27 | 
28 | WEIGHTS_NAME = "pytorch_model.bin"
29 | TRAINING_ARGS_NAME = "training_args.bin"
30 | 
31 | 
32 | class PrefixTrainer(Trainer):
33 |     def __init__(self, *args, save_changed=False, **kwargs):
34 |         self.save_changed = save_changed
35 |         super().__init__(*args, **kwargs)
36 | 
37 |     def _save(self, output_dir: Optional[str] = None, state_dict=None):
38 |         # If we are executing this function, we are the process zero, so we don't check for that.
39 |         output_dir = output_dir if output_dir is not None else self.args.output_dir
40 |         os.makedirs(output_dir, exist_ok=True)
41 |         logger.info(f"Saving model checkpoint to {output_dir}")
42 |         # Save a trained model and configuration using `save_pretrained()`.
43 |         # They can then be reloaded using `from_pretrained()`
44 |         if not isinstance(self.model, PreTrainedModel):
45 |             if isinstance(unwrap_model(self.model), PreTrainedModel):
46 |                 if state_dict is None:
47 |                     state_dict = self.model.state_dict()
48 |                 unwrap_model(self.model).save_pretrained(output_dir, state_dict=state_dict)
49 |             else:
50 |                 logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
51 |                 if state_dict is None:
52 |                     state_dict = self.model.state_dict()
53 |                 torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
54 |         else:
55 |             if self.save_changed:
56 |                 print("Saving PrefixEncoder")
57 |                 state_dict = self.model.state_dict()
58 |                 filtered_state_dict = {}
59 |                 for k, v in self.model.named_parameters():
60 |                     if v.requires_grad:
61 |                         filtered_state_dict[k] = state_dict[k]
62 |                 self.model.save_pretrained(output_dir, state_dict=filtered_state_dict)
63 |             else:
64 |                 print("Saving the whole model")
65 |                 self.model.save_pretrained(output_dir, state_dict=state_dict)
66 |         if self.tokenizer is not None:
67 |             self.tokenizer.save_pretrained(output_dir)
68 | 
69 |         # Good practice: save your training arguments together with the trained model
70 |         torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
71 | 
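
When `save_changed` is set, `_save` writes only the parameters that have `requires_grad=True` (the code logs "Saving PrefixEncoder" on this path), so the checkpoint stays small. The selection logic can be sketched without PyTorch. `FakeParam` and the parameter names below are hypothetical stand-ins, not the real model's tensors.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for a torch parameter."""
    value: float
    requires_grad: bool


def filter_trainable(named_parameters, state_dict):
    """Keep only the state-dict entries whose parameter is trainable."""
    return {name: state_dict[name] for name, p in named_parameters if p.requires_grad}


params = {
    "transformer.prefix_encoder.embedding.weight": FakeParam(0.1, True),
    "transformer.encoder.layers.0.weight": FakeParam(0.2, False),
}
state_dict = {name: p.value for name, p in params.items()}
print(filter_trainable(params.items(), state_dict))
```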


--------------------------------------------------------------------------------
/ptuning/trainer_seq2seq.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2020 The HuggingFace Team. All rights reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | from typing import Any, Dict, List, Optional, Tuple, Union
 16 | 
 17 | import torch
 18 | from torch import nn
 19 | from torch.utils.data import Dataset
 20 | 
 21 | from transformers.deepspeed import is_deepspeed_zero3_enabled
 22 | from trainer import PrefixTrainer
 23 | from transformers.trainer_utils import PredictionOutput
 24 | from transformers.utils import logging
 25 | 
 26 | 
 27 | logger = logging.get_logger(__name__)
 28 | 
 29 | 
 30 | class Seq2SeqTrainer(PrefixTrainer):
 31 |     def evaluate(
 32 |         self,
 33 |         eval_dataset: Optional[Dataset] = None,
 34 |         ignore_keys: Optional[List[str]] = None,
 35 |         metric_key_prefix: str = "eval",
 36 |         **gen_kwargs
 37 |     ) -> Dict[str, float]:
 38 |         """
 39 |         Run evaluation and returns metrics.
 40 | 
 41 |         The calling script will be responsible for providing a method to compute metrics, as they are task-dependent
 42 |         (pass it to the init `compute_metrics` argument).
 43 | 
 44 |         You can also subclass and override this method to inject custom behavior.
 45 | 
 46 |         Args:
 47 |             eval_dataset (`Dataset`, *optional*):
 48 |                 Pass a dataset if you wish to override `self.eval_dataset`. If it is an [`~datasets.Dataset`], columns
 49 |                 not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
 50 |                 method.
 51 |             ignore_keys (`List[str]`, *optional*):
 52 |                 A list of keys in the output of your model (if it is a dictionary) that should be ignored when
 53 |                 gathering predictions.
 54 |             metric_key_prefix (`str`, *optional*, defaults to `"eval"`):
 55 |                 An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
 56 |                 "eval_bleu" if the prefix is `"eval"` (default)
 57 |             max_length (`int`, *optional*):
 58 |                 The maximum target length to use when predicting with the generate method.
 59 |             num_beams (`int`, *optional*):
 60 |                 Number of beams for beam search that will be used when predicting with the generate method. 1 means no
 61 |                 beam search.
 62 |             gen_kwargs:
 63 |                 Additional `generate` specific kwargs.
 64 | 
 65 |         Returns:
 66 |             A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The
 67 |             dictionary also contains the epoch number which comes from the training state.
 68 |         """
 69 | 
 70 |         gen_kwargs = gen_kwargs.copy()
 71 |         if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
 72 |             gen_kwargs["max_length"] = self.args.generation_max_length
 73 |         gen_kwargs["num_beams"] = (
 74 |             gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
 75 |         )
 76 |         self._gen_kwargs = gen_kwargs
 77 | 
 78 |         return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
 79 | 
 80 |     def predict(
 81 |         self,
 82 |         test_dataset: Dataset,
 83 |         ignore_keys: Optional[List[str]] = None,
 84 |         metric_key_prefix: str = "test",
 85 |         **gen_kwargs
 86 |     ) -> PredictionOutput:
 87 |         """
 88 |         Run prediction and returns predictions and potential metrics.
 89 | 
 90 |         Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method
 91 |         will also return metrics, like in `evaluate()`.
 92 | 
 93 |         Args:
 94 |             test_dataset (`Dataset`):
 95 |                 Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the
 96 |                 `model.forward()` method are automatically removed. Has to implement the method `__len__`
 97 |             ignore_keys (`List[str]`, *optional*):
 98 |                 A list of keys in the output of your model (if it is a dictionary) that should be ignored when
 99 |                 gathering predictions.
100 |             metric_key_prefix (`str`, *optional*, defaults to `"test"`):
101 |                 An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
102 |                 "test_bleu" if the prefix is `"test"` (default)
103 |             max_length (`int`, *optional*):
104 |                 The maximum target length to use when predicting with the generate method.
105 |             num_beams (`int`, *optional*):
106 |                 Number of beams for beam search that will be used when predicting with the generate method. 1 means no
107 |                 beam search.
108 |             gen_kwargs:
109 |                 Additional `generate` specific kwargs.
110 | 
111 |         <Tip>
112 | 
113 |         If your predictions or labels have different sequence lengths (for instance because you're doing dynamic
114 |         padding in a token classification task) the predictions will be padded (on the right) to allow for
115 |         concatenation into one array. The padding index is -100.
116 | 
117 |         </Tip>
118 | 
119 |         Returns: *NamedTuple* A namedtuple with the following keys:
120 | 
121 |             - predictions (`np.ndarray`): The predictions on `test_dataset`.
122 |             - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some).
123 |             - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained
124 |               labels).
125 |         """
126 | 
127 |         gen_kwargs = gen_kwargs.copy()
128 |         if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
129 |             gen_kwargs["max_length"] = self.args.generation_max_length
130 |         gen_kwargs["num_beams"] = (
131 |             gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
132 |         )
133 |         self._gen_kwargs = gen_kwargs
134 | 
135 | 
136 |         return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
137 | 
138 |     def prediction_step(
139 |         self,
140 |         model: nn.Module,
141 |         inputs: Dict[str, Union[torch.Tensor, Any]],
142 |         prediction_loss_only: bool,
143 |         ignore_keys: Optional[List[str]] = None,
144 |     ) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
145 |         """
146 |         Perform an evaluation step on `model` using `inputs`.
147 | 
148 |         Subclass and override to inject custom behavior.
149 | 
150 |         Args:
151 |             model (`nn.Module`):
152 |                 The model to evaluate.
153 |             inputs (`Dict[str, Union[torch.Tensor, Any]]`):
154 |                 The inputs and targets of the model.
155 | 
156 |                 The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
157 |                 argument `labels`. Check your model's documentation for all accepted arguments.
158 |             prediction_loss_only (`bool`):
159 |                 Whether or not to return the loss only.
160 | 
161 |         Return:
162 |             Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and
163 |             labels (each being optional).
164 |         """
165 | 
166 |         if not self.args.predict_with_generate or prediction_loss_only:
167 |             return super().prediction_step(
168 |                 model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
169 |             )
170 | 
171 |         has_labels = "labels" in inputs
172 |         inputs = self._prepare_inputs(inputs)
173 | 
174 |         # XXX: adapt synced_gpus for fairscale as well
175 |         gen_kwargs = self._gen_kwargs.copy()
176 |         if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
177 |             gen_kwargs["max_length"] = self.model.config.max_length
178 |         gen_kwargs["num_beams"] = (
179 |             gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams
180 |         )
181 |         default_synced_gpus = True if is_deepspeed_zero3_enabled() else False
182 |         gen_kwargs["synced_gpus"] = (
183 |             gen_kwargs["synced_gpus"] if gen_kwargs.get("synced_gpus") is not None else default_synced_gpus
184 |         )
185 | 
186 |         if "attention_mask" in inputs:
187 |             gen_kwargs["attention_mask"] = inputs.get("attention_mask", None)
188 |         if "position_ids" in inputs:
189 |             gen_kwargs["position_ids"] = inputs.get("position_ids", None)
190 |         if "global_attention_mask" in inputs:
191 |             gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None)
192 | 
193 |         # prepare generation inputs
194 |         # some encoder-decoder models can have varying encoder's and thus
195 |         # varying model input names
196 |         if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name:
197 |             generation_inputs = inputs[self.model.encoder.main_input_name]
198 |         else:
199 |             generation_inputs = inputs[self.model.main_input_name]
200 | 
201 |         gen_kwargs["input_ids"] = generation_inputs
202 |         generated_tokens = self.model.generate(**gen_kwargs)
203 |         generated_tokens = generated_tokens[:, generation_inputs.size()[-1]:]
204 | 
205 |         # in case the batch is shorter than max length, the output should be padded
206 |         if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]:
207 |             generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"])
208 |         elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < (
209 |             gen_kwargs["max_new_tokens"] + 1
210 |         ):
211 |             generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1)
212 | 
213 |         loss = None
214 | 
215 |         if self.args.prediction_loss_only:
216 |             return (loss, None, None)
217 | 
218 |         if has_labels:
219 |             labels = inputs["labels"]
220 |             if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]:
221 |                 labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"])
222 |             elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < (
223 |                 gen_kwargs["max_new_tokens"] + 1
224 |             ):
225 |                 labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1))
226 |         else:
227 |             labels = None
228 | 
229 |         return (loss, generated_tokens, labels)
230 | 
231 |     def _pad_tensors_to_max_len(self, tensor, max_length):
232 |         if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"):
233 |             # If PAD token is not defined at least EOS token has to be defined
234 |             pad_token_id = (
235 |                 self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id
236 |             )
237 |         else:
238 |             if self.model.config.pad_token_id is not None:
239 |                 pad_token_id = self.model.config.pad_token_id
240 |             else:
241 |                 raise ValueError("Pad_token_id must be set in the configuration of the model, in order to pad tensors")
242 | 
243 |         padded_tensor = pad_token_id * torch.ones(
244 |             (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
245 |         )
246 |         padded_tensor[:, : tensor.shape[-1]] = tensor
247 |         return padded_tensor
248 | 
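
In `prediction_step`, the generated sequences still begin with the prompt tokens, so the prompt is sliced off and the remainder is right-padded to a fixed length so that batches can be concatenated. A pure-Python sketch of those two operations on nested lists follows. The token IDs and `pad_token_id=0` are made up for illustration; the trainer performs the same steps on tensors.

```python
def strip_prompt(generated, prompt_len):
    """Drop the first prompt_len tokens of every generated sequence."""
    return [row[prompt_len:] for row in generated]


def pad_to_max_len(rows, max_length, pad_token_id):
    """Right-pad every row with pad_token_id up to max_length.

    Rows already at or beyond max_length are returned unchanged.
    """
    return [row + [pad_token_id] * (max_length - len(row)) for row in rows]


# Two sequences, each a 3-token prompt followed by 2 generated tokens
generated = [[11, 12, 13, 21, 22], [14, 15, 16, 23, 2]]
completions = strip_prompt(generated, prompt_len=3)
print(pad_to_max_len(completions, max_length=4, pad_token_id=0))
```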


--------------------------------------------------------------------------------
/ptuning/web_demo.py:
--------------------------------------------------------------------------------
  1 | import os, sys
  2 | 
  3 | import gradio as gr
  4 | import mdtex2html
  5 | 
  6 | import torch
  7 | import transformers
  8 | from transformers import (
  9 |     AutoConfig,
 10 |     AutoModel,
 11 |     AutoTokenizer,
 12 |     AutoTokenizer,
 13 |     DataCollatorForSeq2Seq,
 14 |     HfArgumentParser,
 15 |     Seq2SeqTrainingArguments,
 16 |     set_seed,
 17 | )
 18 | 
 19 | from arguments import ModelArguments, DataTrainingArguments
 20 | 
 21 | 
 22 | model = None
 23 | tokenizer = None
 24 | 
 25 | """Override Chatbot.postprocess"""
 26 | 
 27 | 
 28 | def postprocess(self, y):
 29 |     if y is None:
 30 |         return []
 31 |     for i, (message, response) in enumerate(y):
 32 |         y[i] = (
 33 |             None if message is None else mdtex2html.convert((message)),
 34 |             None if response is None else mdtex2html.convert(response),
 35 |         )
 36 |     return y
 37 | 
 38 | 
 39 | gr.Chatbot.postprocess = postprocess
 40 | 
 41 | 
 42 | def parse_text(text):
 43 |     """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
 44 |     lines = text.split("\n")
 45 |     lines = [line for line in lines if line != ""]
 46 |     count = 0
 47 |     for i, line in enumerate(lines):
 48 |         if "```" in line:
 49 |             count += 1
 50 |             items = line.split('`')
 51 |             if count % 2 == 1:
 52 |                 lines[i] = f'<pre><code class="language-{items[-1]}">'
 53 |             else:
 54 |                 lines[i] = f'<br></code></pre>'
 55 |         else:
 56 |             if i > 0:
 57 |                 if count % 2 == 1:
 58 |                     line = line.replace("`", "\\`")
 59 |                     line = line.replace("<", "&lt;")
 60 |                     line = line.replace(">", "&gt;")
 61 |                     line = line.replace(" ", "&nbsp;")
 62 |                     line = line.replace("*", "&ast;")
 63 |                     line = line.replace("_", "&lowbar;")
 64 |                     line = line.replace("-", "&#45;")
 65 |                     line = line.replace(".", "&#46;")
 66 |                     line = line.replace("!", "&#33;")
 67 |                     line = line.replace("(", "&#40;")
 68 |                     line = line.replace(")", "&#41;")
 69 |                     line = line.replace("$", "&#36;")
 70 |                 lines[i] = "<br>"+line
 71 |     text = "".join(lines)
 72 |     return text
 73 | 
 74 | 
 75 | def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
 76 |     chatbot.append((parse_text(input), ""))
 77 |     for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
 78 |                                                                 return_past_key_values=True,
 79 |                                                                 max_length=max_length, top_p=top_p,
 80 |                                                                 temperature=temperature):
 81 |         chatbot[-1] = (parse_text(input), parse_text(response))
 82 | 
 83 |         yield chatbot, history, past_key_values
 84 | 
 85 | 
 86 | def reset_user_input():
 87 |     return gr.update(value='')
 88 | 
 89 | 
 90 | def reset_state():
 91 |     return [], [], None
 92 | 
 93 | 
 94 | with gr.Blocks() as demo:
 95 |     gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")
 96 | 
 97 |     chatbot = gr.Chatbot()
 98 |     with gr.Row():
 99 |         with gr.Column(scale=4):
100 |             with gr.Column(scale=12):
101 |                 user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
102 |                     container=False)
103 |             with gr.Column(min_width=32, scale=1):
104 |                 submitBtn = gr.Button("Submit", variant="primary")
105 |         with gr.Column(scale=1):
106 |             emptyBtn = gr.Button("Clear History")
107 |             max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
108 |             top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
109 |             temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
110 | 
111 |     history = gr.State([])
112 |     past_key_values = gr.State(None)
113 | 
114 |     submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
115 |                     [chatbot, history, past_key_values], show_progress=True)
116 |     submitBtn.click(reset_user_input, [], [user_input])
117 | 
118 |     emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)
119 | 
120 | 
121 | def main():
122 |     global model, tokenizer
123 | 
124 |     parser = HfArgumentParser((
125 |         ModelArguments))
126 |     if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
127 |         # If we pass only one argument to the script and it's the path to a json file,
128 |         # let's parse it to get our arguments.
129 |         model_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))[0]
130 |     else:
131 |         model_args = parser.parse_args_into_dataclasses()[0]
132 | 
133 |     tokenizer = AutoTokenizer.from_pretrained(
134 |         model_args.model_name_or_path, trust_remote_code=True)
135 |     config = AutoConfig.from_pretrained(
136 |         model_args.model_name_or_path, trust_remote_code=True)
137 | 
138 |     config.pre_seq_len = model_args.pre_seq_len
139 |     config.prefix_projection = model_args.prefix_projection
140 | 
141 |     if model_args.ptuning_checkpoint is not None:
142 |         print(f"Loading prefix_encoder weight from {model_args.ptuning_checkpoint}")
143 |         model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
144 |         prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
145 |         new_prefix_state_dict = {}
146 |         for k, v in prefix_state_dict.items():
147 |             if k.startswith("transformer.prefix_encoder."):
148 |                 new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
149 |         model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
150 |     else:
151 |         model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
152 | 
153 |     if model_args.quantization_bit is not None:
154 |         print(f"Quantized to {model_args.quantization_bit} bit")
155 |         model = model.quantize(model_args.quantization_bit)
156 |     model = model.cuda()
157 |     if model_args.pre_seq_len is not None:
158 |         # P-tuning v2
159 |         model.transformer.prefix_encoder.float()
160 |     
161 |     model = model.eval()
162 |     demo.queue().launch(share=False, inbrowser=True)
163 | 
164 | 
165 | 
166 | if __name__ == "__main__":
167 |     main()
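
The checkpoint-loading branch in `main()` keeps only the keys that start with `transformer.prefix_encoder.` and strips that prefix before calling `load_state_dict`. The following is a minimal sketch of that filtering step. It uses a plain dictionary with made-up keys and string values in place of a real `pytorch_model.bin`, and the helper name `strip_prefix` is hypothetical, not part of the repository.

```python
PREFIX = "transformer.prefix_encoder."


def strip_prefix(state_dict, prefix=PREFIX):
    """Keep only entries whose key starts with `prefix`, with the prefix removed."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


# Illustrative stand-in for a loaded checkpoint; real values would be tensors.
fake_checkpoint = {
    "transformer.prefix_encoder.embedding.weight": "prefix-weights",
    "transformer.encoder.layers.0.input_layernorm.weight": "other-weights",
}

print(strip_prefix(fake_checkpoint))  # {'embedding.weight': 'prefix-weights'}
```

Only the prefix-encoder entry survives, under the key name the `prefix_encoder` submodule would expect.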


--------------------------------------------------------------------------------
/ptuning/web_demo.sh:
--------------------------------------------------------------------------------
1 | PRE_SEQ_LEN=128
2 | 
3 | CUDA_VISIBLE_DEVICES=0 python3 web_demo.py \
4 |     --model_name_or_path THUDM/chatglm2-6b \
5 |     --ptuning_checkpoint output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-3000 \
6 |     --pre_seq_len $PRE_SEQ_LEN
7 | 
8 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | protobuf
 2 | transformers==4.30.2
 3 | cpm_kernels
 4 | torch>=2.0
 5 | gradio
 6 | mdtex2html
 7 | sentencepiece
 8 | accelerate
 9 | sse-starlette
10 | streamlit>=1.24.0


--------------------------------------------------------------------------------
/resources/WECHAT.md:
--------------------------------------------------------------------------------
1 | <div align="center">
2 | <img src=wechat.jpg width="60%"/>
3 | 
4 | <p> 扫码关注公众号，加入「ChatGLM交流群」 </p>
5 | <p> Scan the QR code to follow the official account and join the "ChatGLM Discussion Group" </p>
6 | </div>
7 | 
8 | 


--------------------------------------------------------------------------------
/resources/cli-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/cli-demo.png


--------------------------------------------------------------------------------
/resources/knowledge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/knowledge.png


--------------------------------------------------------------------------------
/resources/long-context.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/long-context.png


--------------------------------------------------------------------------------
/resources/math.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/math.png


--------------------------------------------------------------------------------
/resources/web-demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/web-demo.gif


--------------------------------------------------------------------------------
/resources/web-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/web-demo.png


--------------------------------------------------------------------------------
/resources/wechat.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArtificialZeng/ChatGLM2-6B-Explained/50c70850d88f846402a504ba18f5841dffefc600/resources/wechat.jpg


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | from typing import Dict, Tuple, Union, Optional
 3 | 
 4 | from torch.nn import Module
 5 | from transformers import AutoModel
 6 | 
 7 | 
 8 | def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
 9 |     # transformer.word_embeddings counts as 1 layer
10 |     # transformer.final_layernorm and lm_head count as 1 layer
11 |     # transformer.layers accounts for 28 layers
12 |     # In total, 30 layers are distributed across num_gpus GPUs
13 |     num_trans_layers = 28
14 |     per_gpu_layers = 30 / num_gpus
15 | 
16 |     # bugfix: on Linux, the weight and input passed to torch.embedding are not on the same device, causing a RuntimeError
17 |     # On Windows, model.device is set to transformer.word_embeddings.device
18 |     # On Linux, model.device is set to lm_head.device
19 |     # When chat or stream_chat is called, input_ids is placed on model.device
20 |     # If transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
21 |     # So transformer.word_embeddings, transformer.final_layernorm, and lm_head are all placed on the first GPU
22 |     # This file is adapted from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py
23 |     # with only minor changes here to support ChatGLM2
24 |     device_map = {
25 |         'transformer.embedding.word_embeddings': 0,
26 |         'transformer.encoder.final_layernorm': 0,
27 |         'transformer.output_layer': 0,
28 |         'transformer.rotary_pos_emb': 0,
29 |         'lm_head': 0
30 |     }
31 | 
32 |     used = 2
33 |     gpu_target = 0
34 |     for i in range(num_trans_layers):
35 |         if used >= per_gpu_layers:
36 |             gpu_target += 1
37 |             used = 0
38 |         assert gpu_target < num_gpus
39 |         device_map[f'transformer.encoder.layers.{i}'] = gpu_target
40 |         used += 1
41 | 
42 |     return device_map
43 | 
44 | 
45 | def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
46 |                        device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
47 |     if num_gpus < 2 and device_map is None:
48 |         model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
49 |     else:
50 |         from accelerate import dispatch_model
51 | 
52 |         model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()
53 | 
54 |         if device_map is None:
55 |             device_map = auto_configure_device_map(num_gpus)
56 | 
57 |         model = dispatch_model(model, device_map=device_map)
58 | 
59 |     return model
60 | 
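
To see how `auto_configure_device_map` spreads the 28 transformer layers, the allocation loop can be replayed without loading a model. This is a sketch that mirrors the loop above under the same constants (28 transformer layers, 30 "slots" in total, 2 slots pre-charged to GPU 0). The function name `layers_per_gpu` is hypothetical and only counts layers; it does not build a real device map.

```python
from collections import Counter


def layers_per_gpu(num_gpus, num_trans_layers=28, total_layers=30):
    """Replay the allocation loop and count transformer layers per GPU."""
    per_gpu_layers = total_layers / num_gpus
    used = 2          # embedding and output modules already sit on GPU 0
    gpu_target = 0
    counts = Counter()
    for _ in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        counts[gpu_target] += 1
        used += 1
    return dict(counts)


print(layers_per_gpu(2))  # {0: 13, 1: 15}
print(layers_per_gpu(4))  # {0: 6, 1: 8, 2: 8, 3: 6}
```

GPU 0 receives fewer transformer layers than the middle GPUs because it also hosts the embedding, final layer norm, and output modules.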


--------------------------------------------------------------------------------
/web_demo.py:
--------------------------------------------------------------------------------
  1 | from transformers import AutoModel, AutoTokenizer
  2 | import gradio as gr
  3 | import mdtex2html
  4 | from utils import load_model_on_gpus
  5 | 
  6 | tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
  7 | model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
  8 | # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to your actual number of GPUs
  9 | # from utils import load_model_on_gpus
 10 | # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
 11 | model = model.eval()
 12 | 
 13 | """Override Chatbot.postprocess"""
 14 | 
 15 | 
 16 | def postprocess(self, y):
 17 |     if y is None:
 18 |         return []
 19 |     for i, (message, response) in enumerate(y):
 20 |         y[i] = (
 21 |             None if message is None else mdtex2html.convert((message)),
 22 |             None if response is None else mdtex2html.convert(response),
 23 |         )
 24 |     return y
 25 | 
 26 | 
 27 | gr.Chatbot.postprocess = postprocess
 28 | 
 29 | 
 30 | def parse_text(text):
 31 |     """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
 32 |     lines = text.split("\n")
 33 |     lines = [line for line in lines if line != ""]
 34 |     count = 0
 35 |     for i, line in enumerate(lines):
 36 |         if "```" in line:
 37 |             count += 1
 38 |             items = line.split('`')
 39 |             if count % 2 == 1:
 40 |                 lines[i] = f'<pre><code class="language-{items[-1]}">'
 41 |             else:
 42 |                 lines[i] = '<br></code></pre>'
 43 |         else:
 44 |             if i > 0:
 45 |                 if count % 2 == 1:
 46 |                     line = line.replace("`", "\\`")
 47 |                     line = line.replace("<", "&lt;")
 48 |                     line = line.replace(">", "&gt;")
 49 |                     line = line.replace(" ", "&nbsp;")
 50 |                     line = line.replace("*", "&ast;")
 51 |                     line = line.replace("_", "&lowbar;")
 52 |                     line = line.replace("-", "&#45;")
 53 |                     line = line.replace(".", "&#46;")
 54 |                     line = line.replace("!", "&#33;")
 55 |                     line = line.replace("(", "&#40;")
 56 |                     line = line.replace(")", "&#41;")
 57 |                     line = line.replace("$", "&#36;")
 58 |                 lines[i] = "<br>"+line
 59 |     text = "".join(lines)
 60 |     return text
 61 | 
 62 | 
 63 | def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
 64 |     chatbot.append((parse_text(input), ""))
 65 |     for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
 66 |                                                                 return_past_key_values=True,
 67 |                                                                 max_length=max_length, top_p=top_p,
 68 |                                                                 temperature=temperature):
 69 |         chatbot[-1] = (parse_text(input), parse_text(response))
 70 | 
 71 |         yield chatbot, history, past_key_values
 72 | 
 73 | 
 74 | def reset_user_input():
 75 |     return gr.update(value='')
 76 | 
 77 | 
 78 | def reset_state():
 79 |     return [], [], None
 80 | 
 81 | 
 82 | with gr.Blocks() as demo:
 83 |     gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")
 84 | 
 85 |     chatbot = gr.Chatbot()
 86 |     with gr.Row():
 87 |         with gr.Column(scale=4):
 88 |             with gr.Column(scale=12):
 89 |                 user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
 90 |                     container=False)
 91 |             with gr.Column(min_width=32, scale=1):
 92 |                 submitBtn = gr.Button("Submit", variant="primary")
 93 |         with gr.Column(scale=1):
 94 |             emptyBtn = gr.Button("Clear History")
 95 |             max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
 96 |             top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
 97 |             temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
 98 | 
 99 |     history = gr.State([])
100 |     past_key_values = gr.State(None)
101 | 
102 |     submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
103 |                     [chatbot, history, past_key_values], show_progress=True)
104 |     submitBtn.click(reset_user_input, [], [user_input])
105 | 
106 |     emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)
107 | 
108 | demo.queue().launch(share=False, inbrowser=True)
109 | 
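
The fence handling in `parse_text` can be hard to follow among the character escapes. The sketch below isolates just the toggling logic: an odd-numbered fence line opens a `<pre><code>` element and an even-numbered one closes it. It is a reduced illustration, not the repository's function: the name `convert_fences` is hypothetical, and the per-character HTML escaping applied inside code blocks is deliberately omitted.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def convert_fences(text):
    """Turn fenced code markers into <pre><code> tags, joining lines with <br>."""
    lines = [line for line in text.split("\n") if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if FENCE in line:
            count += 1
            language = line.split("`")[-1]  # text after the opening fence
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{language}">'
            else:
                lines[i] = "<br></code></pre>"
        elif i > 0:
            lines[i] = "<br>" + line
    return "".join(lines)


sample = "Example:\n" + FENCE + "python\nprint(1)\n" + FENCE
print(convert_fences(sample))
# Example:<pre><code class="language-python"><br>print(1)<br></code></pre>
```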


--------------------------------------------------------------------------------
/web_demo2.py:
--------------------------------------------------------------------------------
 1 | from transformers import AutoModel, AutoTokenizer
 2 | import streamlit as st
 3 | 
 4 | 
 5 | st.set_page_config(
 6 |     page_title="ChatGLM2-6b Demo",
 7 |     page_icon=":robot:",
 8 |     layout='wide'
 9 | )
10 | 
11 | 
12 | @st.cache_resource
13 | def get_model():
14 |     tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
15 |     model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
16 |     # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to your actual number of GPUs
17 |     # from utils import load_model_on_gpus
18 |     # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
19 |     model = model.eval()
20 |     return tokenizer, model
21 | 
22 | 
23 | tokenizer, model = get_model()
24 | 
25 | st.title("ChatGLM2-6B")
26 | 
27 | max_length = st.sidebar.slider(
28 |     'max_length', 0, 32768, 8192, step=1
29 | )
30 | top_p = st.sidebar.slider(
31 |     'top_p', 0.0, 1.0, 0.8, step=0.01
32 | )
33 | temperature = st.sidebar.slider(
34 |     'temperature', 0.0, 1.0, 0.8, step=0.01
35 | )
36 | 
37 | if 'history' not in st.session_state:
38 |     st.session_state.history = []
39 | 
40 | if 'past_key_values' not in st.session_state:
41 |     st.session_state.past_key_values = None
42 | 
43 | for i, (query, response) in enumerate(st.session_state.history):
44 |     with st.chat_message(name="user", avatar="user"):
45 |         st.markdown(query)
46 |     with st.chat_message(name="assistant", avatar="assistant"):
47 |         st.markdown(response)
48 | with st.chat_message(name="user", avatar="user"):
49 |     input_placeholder = st.empty()
50 | with st.chat_message(name="assistant", avatar="assistant"):
51 |     message_placeholder = st.empty()
52 | 
53 | prompt_text = st.text_area(label="User command input",
54 |                            height=100,
55 |                            placeholder="Enter your command here")
56 | 
57 | button = st.button("Send", key="predict")
58 | 
59 | if button:
60 |     input_placeholder.markdown(prompt_text)
61 |     history, past_key_values = st.session_state.history, st.session_state.past_key_values
62 |     for response, history, past_key_values in model.stream_chat(tokenizer, prompt_text, history,
63 |                                                                 past_key_values=past_key_values,
64 |                                                                 max_length=max_length, top_p=top_p,
65 |                                                                 temperature=temperature,
66 |                                                                 return_past_key_values=True):
67 |         message_placeholder.markdown(response)
68 | 
69 |     st.session_state.history = history
70 |     st.session_state.past_key_values = past_key_values
71 | 


--------------------------------------------------------------------------------