├── .gitignore
├── README.md
├── assets
│   ├── dataset.png
│   └── ner_dataset.png
├── notebook
│   ├── .ipynb_checkpoints
│   │   ├── train_qwen2-checkpoint.ipynb
│   │   └── train_qwen2_ner-checkpoint.ipynb
│   ├── train_glm4.ipynb
│   ├── train_glm4_ner.ipynb
│   ├── train_qwen2.ipynb
│   └── train_qwen2_ner.ipynb
├── predict_glm4.py
├── predict_qwen2.py
├── qwen2_vl
│   ├── csv2json.py
│   ├── data2csv.py
│   ├── predict_qwen2_vl.py
│   ├── requirements.txt
│   └── train_qwen2_vl.py
├── requirements.txt
├── train_glm4.py
├── train_glm4_ner.py
├── train_qwen2.py
└── train_qwen2_ner.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /ZhipuAI
2 | /qwen
3 | /output
4 | /*.jsonl
5 | /._____temp
6 | /.ipynb_checkpoints
7 | .DS_Store
8 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LLM Finetune
2 |
3 | Experiment details:
4 | [Qwen2-fintune](https://swanlab.cn/@ZeyiLin/Qwen2-fintune/runs/cfg5f8dzkp6vouxzaxlx6/chart)
5 | [GLM4-fintune](https://swanlab.cn/@ZeyiLin/GLM4-fintune/runs/eabll3xug8orsxzjy4yu4/chart)
6 | [Qwen2-VL-finetune](https://swanlab.cn/@ZeyiLin/Qwen2-VL-finetune/runs/pkgest5xhdn3ukpdy6kv5/chart)
7 |
8 | ## News
9 |
10 | - [Qwen3-Medical-sft](https://github.com/Zeyi-Lin/Qwen3-Medical-SFT): Qwen3 fine-tuning in practice: medical chat in the R1 reasoning style
11 |
12 |
13 | ## Preparation
14 |
15 | Install the environment: `pip install -r requirements.txt`
16 |
17 | Text classification dataset: download `train.jsonl` and `test.jsonl` from [huangjintao/zh_cls_fudan-news](https://modelscope.cn/datasets/swift/zh_cls_fudan-news/files) into the repository root.
18 |
19 | Named entity recognition dataset: download `ccfbdci.jsonl` from [qgyd2021/chinese_ner_sft](https://huggingface.co/datasets/qgyd2021/chinese_ner_sft/tree/main/data) into the repository root.
20 |
21 | Qwen2-VL multimodal dataset: run the following commands to download and convert it:
22 | ```bash
23 | cd ./qwen2_vl
24 | python data2csv.py
25 | python csv2json.py
26 | ```
27 |
28 | ## Training
29 |
30 | | Model | Task | Command | Notebook | Article |
31 | | ---------- | ----------------- | -------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
32 | | Qwen2-VL-2b | Multimodal fine-tuning | Enter the `qwen2_vl` directory and run `python train_qwen2_vl.py` | Pending | [Zhihu](https://zhuanlan.zhihu.com/p/702491999) |
33 | | Qwen2-1.5b | Instruction fine-tuning: text classification | `python train_qwen2.py` | [Jupyter Notebook](notebook/train_qwen2.ipynb) | [Zhihu](https://zhuanlan.zhihu.com/p/702491999) |
34 | | Qwen2-1.5b | Instruction fine-tuning: named entity recognition | `python train_qwen2_ner.py` | [Jupyter Notebook](notebook/train_qwen2_ner.ipynb) | [Zhihu](https://zhuanlan.zhihu.com/p/704463319) |
35 | | GLM4-9b | Instruction fine-tuning: text classification | `python train_glm4.py` | [Jupyter Notebook](notebook/train_glm4.ipynb) | [Zhihu](https://zhuanlan.zhihu.com/p/702608991) |
36 | | GLM4-9b | Instruction fine-tuning: named entity recognition | `python train_glm4_ner.py` | [Jupyter Notebook](notebook/train_glm4_ner.ipynb) | [Zhihu](https://zhuanlan.zhihu.com/p/704719982) |
37 |
38 |
39 | ## Inference
40 |
41 | Qwen2 series:
42 |
43 | ```bash
44 | python predict_qwen2.py
45 | ```
46 |
47 | Qwen2-VL series:
48 |
49 | ```bash
50 | cd ./qwen2_vl
51 | python predict_qwen2_vl.py
52 | ```
53 |
54 | GLM4 series:
55 |
56 | ```bash
57 | python predict_glm4.py
58 | ```
59 |
60 | ## Tools Used
61 |
62 | - [SwanLab](https://github.com/SwanHubX/SwanLab): a tool for recording, analyzing, and visualizing AI training runs
--------------------------------------------------------------------------------
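The README asks for specific dataset files to be placed in the repository root before training. A minimal pre-flight check might look like the sketch below. The file names come from the README; the task keys (`classification`, `ner`) and the helper name `missing_dataset_files` are hypothetical and are not part of the repository.

```python
from pathlib import Path

# File names are taken from the README; the task keys are hypothetical labels.
REQUIRED_FILES = {
    "classification": ["train.jsonl", "test.jsonl"],
    "ner": ["ccfbdci.jsonl"],
}


def missing_dataset_files(task: str, root: str = ".") -> list[str]:
    """Return the dataset files required for `task` that are absent from `root`."""
    root_path = Path(root)
    return [name for name in REQUIRED_FILES[task] if not (root_path / name).is_file()]


if __name__ == "__main__":
    # Report, per task, whether the files the README mentions are in place.
    for task in REQUIRED_FILES:
        missing = missing_dataset_files(task)
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{task}: {status}")
```

Running it from the repository root before `python train_qwen2.py` or `python train_qwen2_ner.py` would surface a missing download early instead of partway through a training script.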
/assets/dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Zeyi-Lin/LLM-Finetune/c523c926d169fa3187de9bdb88b35faaf79680c9/assets/dataset.png
--------------------------------------------------------------------------------
/assets/ner_dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Zeyi-Lin/LLM-Finetune/c523c926d169fa3187de9bdb88b35faaf79680c9/assets/ner_dataset.png
--------------------------------------------------------------------------------
/notebook/.ipynb_checkpoints/train_qwen2-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 👾 Getting Started with Qwen2 Fine-Tuning\n",
8 | "\n",
9 | "Author: 林泽毅\n",
10 | "\n",
11 | "Tutorial article: https://zhuanlan.zhihu.com/p/702491999 \n",
12 | "\n",
13 | "GPU memory required: about 10 GB \n",
14 | "\n",
15 | "Experiment log: https://swanlab.cn/@ZeyiLin/Qwen2-fintune/runs/cfg5f8dzkp6vouxzaxlx6/chart"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## 1. Install the environment\n",
23 | "\n",
24 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, and swanlab==0.3.9."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "name": "stdout",
50 | "output_type": "stream",
51 | "text": [
52 | "\u001b[1m\u001b[34mswanlab\u001b[0m\u001b[0m: You are already logged in. Use `\u001b[1mswanlab login --relogin\u001b[0m` to force relogin.\n"
53 | ]
54 | }
55 | ],
56 | "source": [
57 | "!swanlab login"
58 | ]
59 | },
60 | {
61 | "attachments": {},
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## 2. Load the dataset\n",
66 | "\n",
67 | "1. Download train.jsonl and test.jsonl from [zh_cls_fudan-news - modelscope](https://modelscope.cn/datasets/huangjintao/zh_cls_fudan-news/files) into the same directory as this notebook.\n",
68 | "\n",
69 | ""
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "2. Convert train.jsonl and test.jsonl into new_train.jsonl and new_test.jsonl"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 1,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 2. Convert train.jsonl and test.jsonl into new_train.jsonl and new_test.jsonl\n",
86 | "\n",
87 | "import json\n",
88 | "import pandas as pd\n",
89 | "import os\n",
90 | "\n",
91 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
92 | "    \"\"\"\n",
93 | "    Convert the raw dataset into the format required for LLM fine-tuning.\n",
94 | "    \"\"\"\n",
95 | "    messages = []\n",
96 | "\n",
97 | "    # Read the original JSONL file\n",
98 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
99 | "        for line in file:\n",
100 | "            # Parse the JSON object on each line\n",
101 | "            data = json.loads(line)\n",
102 | "            context = data[\"text\"]\n",
103 | "            category = data[\"category\"]\n",
104 | "            label = data[\"output\"]\n",
105 | "            message = {\n",
106 | "                \"instruction\": \"你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型\",\n",
107 | "                \"input\": f\"文本:{context},类型选型:{category}\",\n",
108 | "                \"output\": label,\n",
109 | "            }\n",
110 | "            messages.append(message)\n",
111 | "\n",
112 | "    # Save the converted JSONL file\n",
113 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
114 | "        for message in messages:\n",
115 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
116 | "\n",
117 | "\n",
118 | "# Load and convert the training and test sets\n",
119 | "train_dataset_path = \"train.jsonl\"\n",
120 | "test_dataset_path = \"test.jsonl\"\n",
121 | "\n",
122 | "train_jsonl_new_path = \"new_train.jsonl\"\n",
123 | "test_jsonl_new_path = \"new_test.jsonl\"\n",
124 | "\n",
125 | "if not os.path.exists(train_jsonl_new_path):\n",
126 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
127 | "if not os.path.exists(test_jsonl_new_path):\n",
128 | "    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)\n",
129 | "\n",
130 | "train_df = pd.read_json(train_jsonl_new_path, lines=True)[:1000]  # Use the first 1000 rows for training (optional)\n",
131 | "test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]  # Use the first 10 rows for subjective evaluation"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## 3. Download and load the model and tokenizer"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "from modelscope import snapshot_download, AutoTokenizer\n",
148 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
149 | "import torch\n",
150 | "\n",
151 | "# Download the Qwen2 model from ModelScope into the local directory\n",
152 | "model_dir = snapshot_download(\"qwen/Qwen2-1.5B-Instruct\", cache_dir=\"./\", revision=\"master\")\n",
153 | "\n",
154 | "# Load the model weights with Transformers\n",
155 | "tokenizer = AutoTokenizer.from_pretrained(\"./qwen/Qwen2-1___5B-Instruct/\", use_fast=False, trust_remote_code=True)\n",
156 | "model = AutoModelForCausalLM.from_pretrained(\"./qwen/Qwen2-1___5B-Instruct/\", device_map=\"auto\", torch_dtype=torch.bfloat16)\n",
157 | "model.enable_input_require_grads()  # Required when gradient checkpointing is enabled"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## 4. Preprocess the training data"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 3,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "def process_func(example):\n",
174 | "    \"\"\"\n",
175 | "    Preprocess one example into model inputs.\n",
176 | "    \"\"\"\n",
177 | "    MAX_LENGTH = 384\n",
178 | "    input_ids, attention_mask, labels = [], [], []\n",
179 | "    instruction = tokenizer(\n",
180 | "        f\"<|im_start|>system\\n你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型<|im_end|>\\n<|im_start|>user\\n{example['input']}<|im_end|>\\n<|im_start|>assistant\\n\",\n",
181 | "        add_special_tokens=False,\n",
182 | "    )\n",
183 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
184 | "    input_ids = (\n",
185 | "        instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
186 | "    )\n",
187 | "    attention_mask = instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
188 | "    labels = (\n",
189 | "        [-100] * len(instruction[\"input_ids\"])\n",
190 | "        + response[\"input_ids\"]\n",
191 | "        + [tokenizer.pad_token_id]\n",
192 | "    )\n",
193 | "    if len(input_ids) > MAX_LENGTH:  # Truncate over-length sequences\n",
194 | "        input_ids = input_ids[:MAX_LENGTH]\n",
195 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
196 | "        labels = labels[:MAX_LENGTH]\n",
197 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "from datasets import Dataset\n",
207 | "\n",
208 | "train_ds = Dataset.from_pandas(train_df)\n",
209 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "## 5. Configure LoRA"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 5,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "from peft import LoraConfig, TaskType, get_peft_model\n",
226 | "\n",
227 | "config = LoraConfig(\n",
228 | "    task_type=TaskType.CAUSAL_LM,\n",
229 | "    target_modules=[\n",
230 | "        \"q_proj\",\n",
231 | "        \"k_proj\",\n",
232 | "        \"v_proj\",\n",
233 | "        \"o_proj\",\n",
234 | "        \"gate_proj\",\n",
235 | "        \"up_proj\",\n",
236 | "        \"down_proj\",\n",
237 | "    ],\n",
238 | "    inference_mode=False,  # Training mode\n",
239 | "    r=8,  # LoRA rank\n",
240 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
241 | "    lora_dropout=0.1,  # Dropout ratio\n",
242 | ")\n",
243 | "\n",
244 | "model = get_peft_model(model, config)"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "## 6. Training"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 6,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "args = TrainingArguments(\n",
261 | "    output_dir=\"./output/Qwen2\",\n",
262 | "    per_device_train_batch_size=4,\n",
263 | "    gradient_accumulation_steps=4,\n",
264 | "    logging_steps=10,\n",
265 | "    num_train_epochs=2,\n",
266 | "    save_steps=100,\n",
267 | "    learning_rate=1e-4,\n",
268 | "    save_on_each_node=True,\n",
269 | "    gradient_checkpointing=True,\n",
270 | "    report_to=\"none\",\n",
271 | ")"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 7,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "from swanlab.integration.huggingface import SwanLabCallback\n",
281 | "import swanlab\n",
282 | "\n",
283 | "swanlab_callback = SwanLabCallback(\n",
284 | "    project=\"Qwen2-fintune\",\n",
285 | "    experiment_name=\"Qwen2-1.5B-Instruct\",\n",
286 | "    description=\"Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on the zh_cls_fudan-news dataset.\",\n",
287 | "    config={\n",
288 | "        \"model\": \"qwen/Qwen2-1.5B-Instruct\",\n",
289 | "        \"dataset\": \"huangjintao/zh_cls_fudan-news\",\n",
290 | "    },\n",
291 | ")"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "trainer = Trainer(\n",
301 | "    model=model,\n",
302 | "    args=args,\n",
303 | "    train_dataset=train_dataset,\n",
304 | "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
305 | "    callbacks=[swanlab_callback],\n",
306 | ")\n",
307 | "\n",
308 | "trainer.train()\n",
309 | "\n",
310 | "\n",
311 | "# ====== Prediction after training ====== #\n",
312 | "\n",
313 | "def predict(messages, model, tokenizer):\n",
314 | "    device = \"cuda\"\n",
315 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
316 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
317 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
318 | "    generated_ids = [\n",
319 | "        output_ids[len(input_ids) :]\n",
320 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
321 | "    ]\n",
322 | "\n",
323 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
324 | "    print(response)\n",
325 | "\n",
326 | "    return response\n",
327 | "    \n",
328 | "\n",
329 | "test_text_list = []\n",
330 | "for index, row in test_df.iterrows():\n",
331 | "    instruction = row[\"instruction\"]\n",
332 | "    input_value = row[\"input\"]\n",
333 | "\n",
334 | "    messages = [\n",
335 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
336 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
337 | "    ]\n",
338 | "\n",
339 | "    response = predict(messages, model, tokenizer)\n",
340 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
341 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
342 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
343 | "\n",
344 | "swanlab.log({\"Prediction\": test_text_list})\n",
345 | "swanlab.finish()"
346 | ]
347 | }
348 | ],
349 | "metadata": {
350 | "kernelspec": {
351 | "display_name": "Python 3 (ipykernel)",
352 | "language": "python",
353 | "name": "python3"
354 | },
355 | "language_info": {
356 | "codemirror_mode": {
357 | "name": "ipython",
358 | "version": 3
359 | },
360 | "file_extension": ".py",
361 | "mimetype": "text/x-python",
362 | "name": "python",
363 | "nbconvert_exporter": "python",
364 | "pygments_lexer": "ipython3",
365 | "version": "3.10.13"
366 | }
367 | },
368 | "nbformat": 4,
369 | "nbformat_minor": 4
370 | }
371 |
--------------------------------------------------------------------------------
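The `process_func` cell in the notebook above builds `input_ids` and `labels` so that only the response tokens contribute to the loss. The sketch below isolates that masking step using plain integer lists instead of a real tokenizer. The helper name `build_example` and the token ids are hypothetical; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, response_ids, end_id, max_length=384):
    """Concatenate prompt and response ids, masking the prompt in the labels."""
    input_ids = prompt_ids + response_ids + [end_id]
    attention_mask = [1] * len(input_ids)
    # Prompt positions get IGNORE_INDEX so only the response and end token are learned.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_id]
    # Truncate all three lists together so they stay aligned.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# Hypothetical token ids standing in for a tokenized prompt and response.
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], end_id=0)
print(example["labels"])  # [-100, -100, -100, 21, 22, 0]
```

Note that truncating from the right, as the notebook does, can cut off the response and the end token when the prompt alone approaches `MAX_LENGTH`; whether that matters depends on the length distribution of the dataset.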
/notebook/.ipynb_checkpoints/train_qwen2_ner-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 👾 Getting Started with Qwen2 Fine-Tuning: Named Entity Recognition\n",
8 | "\n",
9 | "Author: 林泽毅\n",
10 | "\n",
11 | "Tutorial article: https://zhuanlan.zhihu.com/p/704463319\n",
12 | "\n",
13 | "GPU memory required: about 10 GB \n",
14 | "\n",
15 | "Experiment log: https://swanlab.cn/@ZeyiLin/Qwen2-NER-fintune/runs/9gdyrkna1rxjjmz0nks2c/chart"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## 1. Install the environment\n",
23 | "\n",
24 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, and swanlab==0.3.11."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "name": "stdout",
50 | "output_type": "stream",
51 | "text": [
52 | "\u001b[1m\u001b[34mswanlab\u001b[0m\u001b[0m: You are already logged in. Use `\u001b[1mswanlab login --relogin\u001b[0m` to force relogin.\n"
53 | ]
54 | }
55 | ],
56 | "source": [
57 | "!swanlab login"
58 | ]
59 | },
60 | {
61 | "attachments": {},
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## 2. Load the dataset\n",
66 | "\n",
67 | "1. Download ccfbdci.jsonl from [chinese_ner_sft - huggingface](https://huggingface.co/datasets/qgyd2021/chinese_ner_sft/tree/main/data) into the same directory as this notebook.\n",
68 | "\n",
69 | ""
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "2. Convert ccfbdci.jsonl into ccf_train.jsonl, then split it into training and test sets"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 1,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 2. Convert ccfbdci.jsonl into ccf_train.jsonl, then split it into training and test sets\n",
86 | "\n",
87 | "import json\n",
88 | "import pandas as pd\n",
89 | "import os\n",
90 | "\n",
91 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
92 | "    \"\"\"\n",
93 | "    Convert the raw dataset into the format required for LLM fine-tuning.\n",
94 | "    \"\"\"\n",
95 | "    messages = []\n",
96 | "\n",
97 | "    # Read the original JSONL file\n",
98 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
99 | "        for line in file:\n",
100 | "            # Parse the JSON object on each line\n",
101 | "            data = json.loads(line)\n",
102 | "            input_text = data[\"text\"]\n",
103 | "            entities = data[\"entities\"]\n",
104 | "            match_names = [\"地点\", \"人名\", \"地理实体\", \"组织\"]\n",
105 | "            \n",
106 | "            entity_sentence = \"\"\n",
107 | "            for entity in entities:\n",
108 | "                entity_json = dict(entity)\n",
109 | "                entity_text = entity_json[\"entity_text\"]\n",
110 | "                entity_names = entity_json[\"entity_names\"]\n",
111 | "                # Use the first name that is a known label; skip the entity if none matches\n",
112 | "                entity_label = next((name for name in entity_names if name in match_names), None)\n",
113 | "                if entity_label is None:\n",
114 | "                    continue\n",
115 | "                \n",
116 | "                entity_sentence += f\"\"\"{{\"entity_text\": \"{entity_text}\", \"entity_label\": \"{entity_label}\"}}\"\"\"\n",
117 | "            \n",
118 | "            if entity_sentence == \"\":\n",
119 | "                entity_sentence = \"没有找到任何实体\"\n",
120 | "            \n",
121 | "            message = {\n",
122 | "                \"instruction\": \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\". \"\"\",\n",
123 | "                \"input\": f\"文本:{input_text}\",\n",
124 | "                \"output\": entity_sentence,\n",
125 | "            }\n",
126 | "            \n",
127 | "            messages.append(message)\n",
128 | "\n",
129 | "    # Save the converted JSONL file\n",
130 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
131 | "        for message in messages:\n",
132 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
133 | "\n",
134 | "\n",
135 | "# Load and convert the dataset\n",
136 | "train_dataset_path = \"ccfbdci.jsonl\"\n",
137 | "train_jsonl_new_path = \"ccf_train.jsonl\"\n",
138 | "\n",
139 | "if not os.path.exists(train_jsonl_new_path):\n",
140 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
141 | "\n",
142 | "total_df = pd.read_json(train_jsonl_new_path, lines=True)\n",
143 | "train_df = total_df[int(len(total_df) * 0.1):]  # Use the last 90% of the data as the training set\n",
144 | "test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)  # Randomly sample 20 rows from the first 10% as the test set"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "## 3. Download and load the model and tokenizer"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "from modelscope import snapshot_download, AutoTokenizer\n",
161 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
162 | "import torch\n",
163 | "\n",
164 | "model_id = \"qwen/Qwen2-1.5B-Instruct\" \n",
165 | "model_dir = \"./qwen/Qwen2-1___5B-Instruct\"\n",
166 | "\n",
167 | "# Download the Qwen2 model from ModelScope into the local directory\n",
168 | "model_dir = snapshot_download(model_id, cache_dir=\"./\", revision=\"master\")\n",
169 | "\n",
170 | "# Load the model weights with Transformers\n",
171 | "tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)\n",
172 | "model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", torch_dtype=torch.bfloat16)\n",
173 | "model.enable_input_require_grads()  # Required when gradient checkpointing is enabled"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "## 4. Preprocess the training data"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 3,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "def process_func(example):\n",
190 | "    \"\"\"\n",
191 | "    Preprocess one example into the format the model accepts.\n",
192 | "    \"\"\"\n",
193 | "\n",
194 | "    MAX_LENGTH = 384\n",
195 | "    input_ids, attention_mask, labels = [], [], []\n",
196 | "    system_prompt = \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\".\"\"\"\n",
197 | "    \n",
198 | "    instruction = tokenizer(\n",
199 | "        f\"<|im_start|>system\\n{system_prompt}<|im_end|>\\n<|im_start|>user\\n{example['input']}<|im_end|>\\n<|im_start|>assistant\\n\",\n",
200 | "        add_special_tokens=False,\n",
201 | "    )\n",
202 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
203 | "    input_ids = instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
204 | "    attention_mask = (\n",
205 | "        instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
206 | "    )\n",
207 | "    labels = [-100] * len(instruction[\"input_ids\"]) + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
208 | "    if len(input_ids) > MAX_LENGTH:  # Truncate over-length sequences\n",
209 | "        input_ids = input_ids[:MAX_LENGTH]\n",
210 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
211 | "        labels = labels[:MAX_LENGTH]\n",
212 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "from datasets import Dataset\n",
222 | "\n",
223 | "train_ds = Dataset.from_pandas(train_df)\n",
224 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "## 5. Configure LoRA"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 5,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "from peft import LoraConfig, TaskType, get_peft_model\n",
241 | "\n",
242 | "config = LoraConfig(\n",
243 | "    task_type=TaskType.CAUSAL_LM,\n",
244 | "    target_modules=[\n",
245 | "        \"q_proj\",\n",
246 | "        \"k_proj\",\n",
247 | "        \"v_proj\",\n",
248 | "        \"o_proj\",\n",
249 | "        \"gate_proj\",\n",
250 | "        \"up_proj\",\n",
251 | "        \"down_proj\",\n",
252 | "    ],\n",
253 | "    inference_mode=False,  # Training mode\n",
254 | "    r=8,  # LoRA rank\n",
255 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
256 | "    lora_dropout=0.1,  # Dropout ratio\n",
257 | ")\n",
258 | "\n",
259 | "model = get_peft_model(model, config)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "## 6. Training"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 6,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "args = TrainingArguments(\n",
276 | "    output_dir=\"./output/Qwen2-NER\",\n",
277 | "    per_device_train_batch_size=4,\n",
278 | "    per_device_eval_batch_size=4,\n",
279 | "    gradient_accumulation_steps=4,\n",
280 | "    logging_steps=10,\n",
281 | "    num_train_epochs=2,\n",
282 | "    save_steps=100,\n",
283 | "    learning_rate=1e-4,\n",
284 | "    save_on_each_node=True,\n",
285 | "    gradient_checkpointing=True,\n",
286 | "    report_to=\"none\",\n",
287 | ")"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 7,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "from swanlab.integration.huggingface import SwanLabCallback\n",
297 | "import swanlab\n",
298 | "\n",
299 | "swanlab_callback = SwanLabCallback(\n",
300 | "    project=\"Qwen2-NER-fintune\",\n",
301 | "    experiment_name=\"Qwen2-1.5B-Instruct\",\n",
302 | "    description=\"Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on an NER dataset for key entity recognition.\",\n",
303 | "    config={\n",
304 | "        \"model\": model_id,\n",
305 | "        \"model_dir\": model_dir,\n",
306 | "        \"dataset\": \"qgyd2021/chinese_ner_sft\",\n",
307 | "    },\n",
308 | ")"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "trainer = Trainer(\n",
318 | "    model=model,\n",
319 | "    args=args,\n",
320 | "    train_dataset=train_dataset,\n",
321 | "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
322 | "    callbacks=[swanlab_callback],\n",
323 | ")\n",
324 | "\n",
325 | "trainer.train()\n",
326 | "\n",
327 | "\n",
328 | "# ====== Prediction after training ====== #\n",
329 | "\n",
330 | "def predict(messages, model, tokenizer):\n",
331 | "    device = \"cuda\"\n",
332 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
333 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
334 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
335 | "    generated_ids = [\n",
336 | "        output_ids[len(input_ids) :]\n",
337 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
338 | "    ]\n",
339 | "\n",
340 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
341 | "    print(response)\n",
342 | "\n",
343 | "    return response\n",
344 | "    \n",
345 | "\n",
346 | "test_text_list = []\n",
347 | "for index, row in test_df.iterrows():\n",
348 | "    instruction = row[\"instruction\"]\n",
349 | "    input_value = row[\"input\"]\n",
350 | "\n",
351 | "    messages = [\n",
352 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
353 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
354 | "    ]\n",
355 | "\n",
356 | "    response = predict(messages, model, tokenizer)\n",
357 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
358 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
359 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
360 | "\n",
361 | "swanlab.log({\"Prediction\": test_text_list})\n",
362 | "swanlab.finish()"
363 | ]
364 | }
365 | ],
366 | "metadata": {
367 | "kernelspec": {
368 | "display_name": "Python 3 (ipykernel)",
369 | "language": "python",
370 | "name": "python3"
371 | },
372 | "language_info": {
373 | "codemirror_mode": {
374 | "name": "ipython",
375 | "version": 3
376 | },
377 | "file_extension": ".py",
378 | "mimetype": "text/x-python",
379 | "name": "python",
380 | "nbconvert_exporter": "python",
381 | "pygments_lexer": "ipython3",
382 | "version": "3.10.13"
383 | }
384 | },
385 | "nbformat": 4,
386 | "nbformat_minor": 4
387 | }
388 |
--------------------------------------------------------------------------------
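The NER notebook above writes each entity as a JSON object and appends the objects back to back, with the sentinel string `没有找到任何实体` when nothing is found. The sketch below shows one way to produce that same shape with `json.dumps` (which also escapes quotes inside entity text) and to parse such a string back into dictionaries. The helper names `serialize_entities` and `parse_entities` are hypothetical and do not appear in the repository.

```python
import json

NO_ENTITY = "没有找到任何实体"  # sentinel string used by the notebook


def serialize_entities(entities):
    """Serialize (text, label) pairs as back-to-back JSON objects."""
    parts = [
        json.dumps({"entity_text": text, "entity_label": label}, ensure_ascii=False)
        for text, label in entities
    ]
    return "".join(parts) or NO_ENTITY


def parse_entities(output):
    """Parse back-to-back JSON objects from an output string."""
    if output.strip() == NO_ENTITY:
        return []
    decoder = json.JSONDecoder()
    results, pos = [], 0
    while pos < len(output):
        # raw_decode does not skip leading whitespace, so do it here.
        if output[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(output, pos)
        results.append(obj)
    return results


text = serialize_entities([("南京", "地理实体"), ("李白", "人名")])
print(text)
print(parse_entities(text))
```

`parse_entities` raises `json.JSONDecodeError` on malformed output, which can be useful when spot-checking model predictions; whether to catch that error or count it as a failed prediction is a design choice left to the caller.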
/notebook/train_glm4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Getting Started with GLM4 Fine-Tuning\n",
8 | "\n",
9 | "Author: 林泽毅\n",
10 | "\n",
11 | "Tutorial article: https://zhuanlan.zhihu.com/p/702608991\n",
12 | "\n",
13 | "GPU memory required: about 40 GB \n",
14 | "\n",
15 | "Experiment log: https://swanlab.cn/@ZeyiLin/GLM4-fintune/runs/eabll3xug8orsxzjy4yu4/chart"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## 1. Install the environment\n",
23 | "\n",
24 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, swanlab==0.3.10, and tiktoken==0.7.0."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate tiktoken"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "!swanlab login"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "## 2. Load the dataset\n",
57 | "\n",
58 | "1. Download train.jsonl and test.jsonl from [zh_cls_fudan-news - modelscope](https://modelscope.cn/datasets/huangjintao/zh_cls_fudan-news/files) into the same directory as this notebook.\n",
59 | "\n",
60 | ""
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "2. Convert train.jsonl and test.jsonl into new_train.jsonl and new_test.jsonl"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "metadata": {},
74 | "outputs": [],
75 | "source": [
76 | "# 2. Convert train.jsonl and test.jsonl into new_train.jsonl and new_test.jsonl\n",
77 | "\n",
78 | "import json\n",
79 | "import pandas as pd\n",
80 | "import os\n",
81 | "\n",
82 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
83 | "    \"\"\"\n",
84 | "    Convert the raw dataset into the format required for LLM fine-tuning.\n",
85 | "    \"\"\"\n",
86 | "    messages = []\n",
87 | "\n",
88 | "    # Read the original JSONL file\n",
89 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
90 | "        for line in file:\n",
91 | "            # Parse the JSON object on each line\n",
92 | "            data = json.loads(line)\n",
93 | "            context = data[\"text\"]\n",
94 | "            category = data[\"category\"]\n",
95 | "            label = data[\"output\"]\n",
96 | "            message = {\n",
97 | "                \"instruction\": \"你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型\",\n",
98 | "                \"input\": f\"文本:{context},类型选型:{category}\",\n",
99 | "                \"output\": label,\n",
100 | "            }\n",
101 | "            messages.append(message)\n",
102 | "\n",
103 | "    # Save the converted JSONL file\n",
104 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
105 | "        for message in messages:\n",
106 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
107 | "\n",
108 | "\n",
109 | "# Load and convert the training and test sets\n",
110 | "train_dataset_path = \"train.jsonl\"\n",
111 | "test_dataset_path = \"test.jsonl\"\n",
112 | "\n",
113 | "train_jsonl_new_path = \"new_train.jsonl\"\n",
114 | "test_jsonl_new_path = \"new_test.jsonl\"\n",
115 | "\n",
116 | "if not os.path.exists(train_jsonl_new_path):\n",
117 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
118 | "if not os.path.exists(test_jsonl_new_path):\n",
119 | "    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)\n",
120 | "\n",
121 | "train_df = pd.read_json(train_jsonl_new_path, lines=True)[:1000]  # Use the first 1000 rows for training (optional)\n",
122 | "test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]  # Use the first 10 rows for subjective evaluation"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "## 3. Download and load the model and tokenizer"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "from modelscope import snapshot_download, AutoTokenizer\n",
139 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
140 | "import torch\n",
141 | "\n",
142 | "# Download the GLM4 model from modelscope into the local directory\n",
143 | "model_dir = snapshot_download(\"ZhipuAI/glm-4-9b-chat\", cache_dir=\"./\", revision=\"master\")\n",
144 | "\n",
145 | "# Load the model weights with Transformers\n",
146 | "tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)\n",
147 | "model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", torch_dtype=torch.bfloat16, trust_remote_code=True)\n",
148 | "model.enable_input_require_grads()  # Required when gradient checkpointing is enabled"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "## 4. Preprocess the training data"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": null,
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "def process_func(example):\n",
165 | "    \"\"\"\n",
166 | "    Tokenize one example into input_ids, attention_mask, and labels\n",
167 | "    \"\"\"\n",
168 | "    MAX_LENGTH = 384\n",
169 | "    input_ids, attention_mask, labels = [], [], []\n",
170 | "    instruction = tokenizer(\n",
171 | "        f\"<|system|>\\n你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型<|endoftext|>\\n<|user|>\\n{example['input']}<|endoftext|>\\n<|assistant|>\\n\",\n",
172 | "        add_special_tokens=False,\n",
173 | "    )\n",
174 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
175 | "    input_ids = (\n",
176 | "        instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
177 | "    )\n",
178 | "    attention_mask = instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
179 | "    labels = (\n",
180 | "        [-100] * len(instruction[\"input_ids\"])  # Mask the prompt so loss is computed only on the response\n",
181 | "        + response[\"input_ids\"]\n",
182 | "        + [tokenizer.pad_token_id]\n",
183 | "    )\n",
184 | "    if len(input_ids) > MAX_LENGTH:  # Truncate overlong sequences\n",
185 | "        input_ids = input_ids[:MAX_LENGTH]\n",
186 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
187 | "        labels = labels[:MAX_LENGTH]\n",
188 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
197 | "from datasets import Dataset\n",
198 | "\n",
199 | "train_ds = Dataset.from_pandas(train_df)\n",
200 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "## 5. Configure LoRA"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "from peft import LoraConfig, TaskType, get_peft_model\n",
217 | "\n",
218 | "config = LoraConfig(\n",
219 | "    task_type=TaskType.CAUSAL_LM,\n",
220 | "    target_modules=[\"query_key_value\", \"dense\", \"dense_h_to_4h\", \"activation_func\", \"dense_4h_to_h\"],\n",
221 | "    inference_mode=False,  # Training mode\n",
222 | "    r=8,  # LoRA rank\n",
223 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
224 | "    lora_dropout=0.1,  # Dropout rate\n",
225 | ")\n",
226 | "\n",
227 | "model = get_peft_model(model, config)"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "## 6. Train"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "args = TrainingArguments(\n",
244 | " output_dir=\"./output/GLM4-9b\",\n",
245 | " per_device_train_batch_size=4,\n",
246 | " gradient_accumulation_steps=4,\n",
247 | " logging_steps=10,\n",
248 | " num_train_epochs=2,\n",
249 | " save_steps=100,\n",
250 | " learning_rate=1e-4,\n",
251 | " save_on_each_node=True,\n",
252 | " gradient_checkpointing=True,\n",
253 | " report_to=\"none\",\n",
254 | ")"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "from swanlab.integration.huggingface import SwanLabCallback\n",
264 | "import swanlab\n",
265 | "\n",
266 | "swanlab_callback = SwanLabCallback(\n",
267 | " project=\"GLM4-fintune\",\n",
268 | " experiment_name=\"GLM4-9B-Chat\",\n",
269 | "    description=\"Fine-tune the Zhipu GLM4-9B-Chat model on the zh_cls_fudan-news dataset.\",\n",
270 | " config={\n",
271 | " \"model\": \"ZhipuAI/glm-4-9b-chat\",\n",
272 | " \"dataset\": \"huangjintao/zh_cls_fudan-news\",\n",
273 | " },\n",
274 | ")"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "trainer = Trainer(\n",
284 | " model=model,\n",
285 | " args=args,\n",
286 | " train_dataset=train_dataset,\n",
287 | " data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
288 | " callbacks=[swanlab_callback],\n",
289 | ")\n",
290 | "\n",
291 | "trainer.train()\n",
292 | "\n",
293 | "\n",
294 | "# ====== Prediction after training ===== #\n",
295 | "\n",
296 | "def predict(messages, model, tokenizer):\n",
297 | "    device = \"cuda\"\n",
298 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
299 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
300 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
301 | "    generated_ids = [\n",
302 | "        output_ids[len(input_ids) :]\n",
303 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
304 | "    ]\n",
305 | "\n",
306 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
307 | "    print(response)\n",
308 | "\n",
309 | "    return response\n",
310 | "\n",
311 | "\n",
312 | "test_text_list = []\n",
313 | "for index, row in test_df.iterrows():\n",
314 | "    instruction = row[\"instruction\"]\n",
315 | "    input_value = row[\"input\"]\n",
316 | "\n",
317 | "    messages = [\n",
318 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
319 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
320 | "    ]\n",
321 | "\n",
322 | "    response = predict(messages, model, tokenizer)\n",
323 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
324 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
325 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
326 | "\n",
327 | "swanlab.log({\"Prediction\": test_text_list})\n",
328 | "swanlab.finish()"
329 | ]
330 | }
331 | ],
332 | "metadata": {
333 | "kernelspec": {
334 | "display_name": "Python 3 (ipykernel)",
335 | "language": "python",
336 | "name": "python3"
337 | },
338 | "language_info": {
339 | "codemirror_mode": {
340 | "name": "ipython",
341 | "version": 3
342 | },
343 | "file_extension": ".py",
344 | "mimetype": "text/x-python",
345 | "name": "python",
346 | "nbconvert_exporter": "python",
347 | "pygments_lexer": "ipython3",
348 | "version": "3.10.13"
349 | }
350 | },
351 | "nbformat": 4,
352 | "nbformat_minor": 4
353 | }
354 |
--------------------------------------------------------------------------------
/notebook/train_glm4_ner.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Getting Started with GLM4 Fine-tuning: Named Entity Recognition\n",
8 | "\n",
9 | "Author: Zeyi Lin (林泽毅)"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## 1. Install the environment\n",
17 | "\n",
18 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, swanlab==0.3.11, and tiktoken==0.7.0."
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate tiktoken"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "!swanlab login"
44 | ]
45 | },
46 | {
47 | "attachments": {},
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## 2. Load the dataset\n",
52 | "\n",
53 | "1. Download ccfbdci.jsonl from [chinese_ner_sft - huggingface](https://huggingface.co/datasets/qgyd2021/chinese_ner_sft/tree/main/data) into the same directory as this notebook.\n",
54 | "\n",
55 | ""
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "2. Process ccfbdci.jsonl and convert it into ccf_train.jsonl, which is then split into training and test sets"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "# 2. Process ccfbdci.jsonl and convert it into ccf_train.jsonl\n",
72 | "\n",
73 | "import json\n",
74 | "import pandas as pd\n",
75 | "import os\n",
76 | "\n",
77 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
78 | "    \"\"\"\n",
79 | "    Convert the raw dataset into the format required for LLM fine-tuning\n",
80 | "    \"\"\"\n",
81 | "    messages = []\n",
82 | "\n",
83 | "    # Read the original JSONL file\n",
84 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
85 | "        for line in file:\n",
86 | "            # Parse the JSON object on each line\n",
87 | "            data = json.loads(line)\n",
88 | "            input_text = data[\"text\"]\n",
89 | "            entities = data[\"entities\"]\n",
90 | "            match_names = [\"地点\", \"人名\", \"地理实体\", \"组织\"]\n",
91 | "\n",
92 | "            entity_sentence = \"\"\n",
93 | "            for entity in entities:\n",
94 | "                entity_json = dict(entity)\n",
95 | "                entity_text = entity_json[\"entity_text\"]\n",
96 | "                entity_names = entity_json[\"entity_names\"]\n",
97 | "                # Use the first name that is a target label; skip the entity if none matches\n",
98 | "                entity_label = next((name for name in entity_names if name in match_names), None)\n",
99 | "                if entity_label is None:\n",
100 | "                    continue\n",
101 | "\n",
102 | "                entity_sentence += f\"\"\"{{\"entity_text\": \"{entity_text}\", \"entity_label\": \"{entity_label}\"}}\"\"\"\n",
103 | "\n",
104 | "            if entity_sentence == \"\":\n",
105 | "                entity_sentence = \"没有找到任何实体\"\n",
106 | "\n",
107 | "            message = {\n",
108 | "                \"instruction\": \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\". \"\"\",\n",
109 | "                \"input\": f\"文本:{input_text}\",\n",
110 | "                \"output\": entity_sentence,\n",
111 | "            }\n",
112 | "\n",
113 | "            messages.append(message)\n",
114 | "\n",
115 | "    # Save the converted JSONL file\n",
116 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
117 | "        for message in messages:\n",
118 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
119 | "\n",
120 | "\n",
121 | "# Load and convert the dataset, then split it into training and test sets\n",
122 | "train_dataset_path = \"ccfbdci.jsonl\"\n",
123 | "train_jsonl_new_path = \"ccf_train.jsonl\"\n",
124 | "\n",
125 | "if not os.path.exists(train_jsonl_new_path):\n",
126 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
127 | "\n",
128 | "total_df = pd.read_json(train_jsonl_new_path, lines=True)\n",
129 | "train_df = total_df[int(len(total_df) * 0.1):]  # Use the last 90% of the data as the training set\n",
130 | "test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)  # Randomly sample 20 rows from the first 10% as the test set"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "## 3. Download/load the model and tokenizer"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "from modelscope import snapshot_download, AutoTokenizer\n",
147 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
148 | "import torch\n",
149 | "\n",
150 | "model_id = \"ZhipuAI/glm-4-9b-chat\" \n",
151 | "model_dir = \"./ZhipuAI/glm-4-9b-chat/\"\n",
152 | "\n",
153 | "# Download the GLM4 model from modelscope into the local directory\n",
154 | "model_dir = snapshot_download(model_id, cache_dir=\"./\", revision=\"master\")\n",
155 | "\n",
156 | "# Load the model weights with Transformers\n",
157 | "tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)\n",
158 | "model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", torch_dtype=torch.bfloat16, trust_remote_code=True)\n",
159 | "model.enable_input_require_grads()  # Required when gradient checkpointing is enabled"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "## 4. Preprocess the training data"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": null,
172 | "metadata": {},
173 | "outputs": [],
174 | "source": [
175 | "def process_func(example):\n",
176 | "    \"\"\"\n",
177 | "    Preprocess the dataset into a format the model can accept\n",
178 | "    \"\"\"\n",
179 | "\n",
180 | "    MAX_LENGTH = 384\n",
181 | "    input_ids, attention_mask, labels = [], [], []\n",
182 | "    system_prompt = \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\".\"\"\"\n",
183 | "\n",
184 | "    instruction = tokenizer(\n",
185 | "        f\"<|system|>\\n{system_prompt}<|endoftext|>\\n<|user|>\\n{example['input']}<|endoftext|>\\n<|assistant|>\\n\",\n",
186 | "        add_special_tokens=False,\n",
187 | "    )\n",
188 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
189 | "    input_ids = instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
190 | "    attention_mask = (\n",
191 | "        instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
192 | "    )\n",
193 | "    labels = [-100] * len(instruction[\"input_ids\"]) + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
194 | "    if len(input_ids) > MAX_LENGTH:  # Truncate overlong sequences\n",
195 | "        input_ids = input_ids[:MAX_LENGTH]\n",
196 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
197 | "        labels = labels[:MAX_LENGTH]\n",
198 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "from datasets import Dataset\n",
208 | "\n",
209 | "train_ds = Dataset.from_pandas(train_df)\n",
210 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "## 5. Configure LoRA"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "from peft import LoraConfig, TaskType, get_peft_model\n",
227 | "\n",
228 | "config = LoraConfig(\n",
229 | "    task_type=TaskType.CAUSAL_LM,\n",
230 | "    target_modules=[\"query_key_value\", \"dense\", \"dense_h_to_4h\", \"activation_func\", \"dense_4h_to_h\"],\n",
231 | "    inference_mode=False,  # Training mode\n",
232 | "    r=8,  # LoRA rank\n",
233 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
234 | "    lora_dropout=0.1,  # Dropout rate\n",
235 | ")\n",
236 | "\n",
237 | "model = get_peft_model(model, config)"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "## 6. Train"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "args = TrainingArguments(\n",
254 | " output_dir=\"./output/GLM4-NER\",\n",
255 | " per_device_train_batch_size=4,\n",
256 | " per_device_eval_batch_size=4,\n",
257 | " gradient_accumulation_steps=4,\n",
258 | " logging_steps=10,\n",
259 | " num_train_epochs=2,\n",
260 | " save_steps=100,\n",
261 | " learning_rate=1e-4,\n",
262 | " save_on_each_node=True,\n",
263 | " gradient_checkpointing=True,\n",
264 | " report_to=\"none\",\n",
265 | ")"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {},
272 | "outputs": [],
273 | "source": [
274 | "from swanlab.integration.huggingface import SwanLabCallback\n",
275 | "import swanlab\n",
276 | "\n",
277 | "swanlab_callback = SwanLabCallback(\n",
278 | " project=\"GLM4-NER-fintune\",\n",
279 | " experiment_name=\"GLM4-9B-Chat\",\n",
280 | "    description=\"Fine-tune the Zhipu GLM4-9B-Chat model on an NER dataset for key entity recognition.\",\n",
281 | " config={\n",
282 | " \"model\": model_id,\n",
283 | " \"model_dir\": model_dir,\n",
284 | " \"dataset\": \"qgyd2021/chinese_ner_sft\",\n",
285 | " },\n",
286 | ")"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "trainer = Trainer(\n",
296 | " model=model,\n",
297 | " args=args,\n",
298 | " train_dataset=train_dataset,\n",
299 | " data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
300 | " callbacks=[swanlab_callback],\n",
301 | ")\n",
302 | "\n",
303 | "trainer.train()\n",
304 | "\n",
305 | "\n",
306 | "# ====== Prediction after training ===== #\n",
307 | "\n",
308 | "def predict(messages, model, tokenizer):\n",
309 | "    device = \"cuda\"\n",
310 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
311 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
312 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
313 | "    generated_ids = [\n",
314 | "        output_ids[len(input_ids) :]\n",
315 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
316 | "    ]\n",
317 | "\n",
318 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
319 | "    print(response)\n",
320 | "\n",
321 | "    return response\n",
322 | "\n",
323 | "\n",
324 | "test_text_list = []\n",
325 | "for index, row in test_df.iterrows():\n",
326 | "    instruction = row[\"instruction\"]\n",
327 | "    input_value = row[\"input\"]\n",
328 | "\n",
329 | "    messages = [\n",
330 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
331 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
332 | "    ]\n",
333 | "\n",
334 | "    response = predict(messages, model, tokenizer)\n",
335 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
336 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
337 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
338 | "\n",
339 | "swanlab.log({\"Prediction\": test_text_list})\n",
340 | "swanlab.finish()"
341 | ]
342 | }
343 | ],
344 | "metadata": {
345 | "kernelspec": {
346 | "display_name": "Python 3 (ipykernel)",
347 | "language": "python",
348 | "name": "python3"
349 | },
350 | "language_info": {
351 | "codemirror_mode": {
352 | "name": "ipython",
353 | "version": 3
354 | },
355 | "file_extension": ".py",
356 | "mimetype": "text/x-python",
357 | "name": "python",
358 | "nbconvert_exporter": "python",
359 | "pygments_lexer": "ipython3",
360 | "version": "3.10.14"
361 | }
362 | },
363 | "nbformat": 4,
364 | "nbformat_minor": 4
365 | }
366 |
--------------------------------------------------------------------------------
/notebook/train_qwen2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 👾Getting Started with Qwen2 Fine-tuning\n",
8 | "\n",
9 | "Author: Zeyi Lin (林泽毅)\n",
10 | "\n",
11 | "Tutorial article: https://zhuanlan.zhihu.com/p/702491999 \n",
12 | "\n",
13 | "GPU memory required: about 10GB \n",
14 | "\n",
15 | "Training run: https://swanlab.cn/@ZeyiLin/Qwen2-fintune/runs/cfg5f8dzkp6vouxzaxlx6/chart"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## 1. Install the environment\n",
23 | "\n",
24 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, and swanlab==0.3.9."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "name": "stdout",
50 | "output_type": "stream",
51 | "text": [
52 | "\u001b[1m\u001b[34mswanlab\u001b[0m\u001b[0m: You are already logged in. Use `\u001b[1mswanlab login --relogin\u001b[0m` to force relogin.\n"
53 | ]
54 | }
55 | ],
56 | "source": [
57 | "!swanlab login"
58 | ]
59 | },
60 | {
61 | "attachments": {},
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## 2. Load the dataset\n",
66 | "\n",
67 | "1. Download train.jsonl and test.jsonl from [zh_cls_fudan-news - modelscope](https://modelscope.cn/datasets/huangjintao/zh_cls_fudan-news/files) into the same directory as this notebook.\n",
68 | "\n",
69 | ""
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "2. Process train.jsonl and test.jsonl and convert them into new_train.jsonl and new_test.jsonl"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 1,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 2. Process train.jsonl and test.jsonl and convert them into new_train.jsonl and new_test.jsonl\n",
86 | "\n",
87 | "import json\n",
88 | "import pandas as pd\n",
89 | "import os\n",
90 | "\n",
91 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
92 | "    \"\"\"\n",
93 | "    Convert the raw dataset into the format required for LLM fine-tuning\n",
94 | "    \"\"\"\n",
95 | "    messages = []\n",
96 | "\n",
97 | "    # Read the original JSONL file\n",
98 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
99 | "        for line in file:\n",
100 | "            # Parse the JSON object on each line\n",
101 | "            data = json.loads(line)\n",
102 | "            context = data[\"text\"]\n",
103 | "            category = data[\"category\"]\n",
104 | "            label = data[\"output\"]\n",
105 | "            message = {\n",
106 | "                \"instruction\": \"你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型\",\n",
107 | "                \"input\": f\"文本:{context},类型选型:{category}\",\n",
108 | "                \"output\": label,\n",
109 | "            }\n",
110 | "            messages.append(message)\n",
111 | "\n",
112 | "    # Save the converted JSONL file\n",
113 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
114 | "        for message in messages:\n",
115 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
116 | "\n",
117 | "\n",
118 | "# Load and convert the training and test sets\n",
119 | "train_dataset_path = \"train.jsonl\"\n",
120 | "test_dataset_path = \"test.jsonl\"\n",
121 | "\n",
122 | "train_jsonl_new_path = \"new_train.jsonl\"\n",
123 | "test_jsonl_new_path = \"new_test.jsonl\"\n",
124 | "\n",
125 | "if not os.path.exists(train_jsonl_new_path):\n",
126 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
127 | "if not os.path.exists(test_jsonl_new_path):\n",
128 | "    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)\n",
129 | "\n",
130 | "train_df = pd.read_json(train_jsonl_new_path, lines=True)[:1000]  # Use the first 1000 rows for training (optional)\n",
131 | "test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]  # Use the first 10 rows for subjective evaluation"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## 3. Download/load the model and tokenizer"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "from modelscope import snapshot_download, AutoTokenizer\n",
148 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
149 | "import torch\n",
150 | "\n",
151 | "# Download the Qwen2 model from modelscope into the local directory\n",
152 | "model_dir = snapshot_download(\"qwen/Qwen2-1.5B-Instruct\", cache_dir=\"./\", revision=\"master\")\n",
153 | "\n",
154 | "# Load the model weights with Transformers\n",
155 | "tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)\n",
156 | "model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", torch_dtype=torch.bfloat16)\n",
157 | "model.enable_input_require_grads()  # Required when gradient checkpointing is enabled"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## 4. Preprocess the training data"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 3,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "def process_func(example):\n",
174 | "    \"\"\"\n",
175 | "    Tokenize one example into input_ids, attention_mask, and labels\n",
176 | "    \"\"\"\n",
177 | "    MAX_LENGTH = 384\n",
178 | "    input_ids, attention_mask, labels = [], [], []\n",
179 | "    instruction = tokenizer(\n",
180 | "        f\"<|im_start|>system\\n你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型<|im_end|>\\n<|im_start|>user\\n{example['input']}<|im_end|>\\n<|im_start|>assistant\\n\",\n",
181 | "        add_special_tokens=False,\n",
182 | "    )\n",
183 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
184 | "    input_ids = (\n",
185 | "        instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
186 | "    )\n",
187 | "    attention_mask = instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
188 | "    labels = (\n",
189 | "        [-100] * len(instruction[\"input_ids\"])  # Mask the prompt so loss is computed only on the response\n",
190 | "        + response[\"input_ids\"]\n",
191 | "        + [tokenizer.pad_token_id]\n",
192 | "    )\n",
193 | "    if len(input_ids) > MAX_LENGTH:  # Truncate overlong sequences\n",
194 | "        input_ids = input_ids[:MAX_LENGTH]\n",
195 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
196 | "        labels = labels[:MAX_LENGTH]\n",
197 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "from datasets import Dataset\n",
207 | "\n",
208 | "train_ds = Dataset.from_pandas(train_df)\n",
209 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "## 5. Configure LoRA"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 5,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "from peft import LoraConfig, TaskType, get_peft_model\n",
226 | "\n",
227 | "config = LoraConfig(\n",
228 | "    task_type=TaskType.CAUSAL_LM,\n",
229 | "    target_modules=[\n",
230 | "        \"q_proj\",\n",
231 | "        \"k_proj\",\n",
232 | "        \"v_proj\",\n",
233 | "        \"o_proj\",\n",
234 | "        \"gate_proj\",\n",
235 | "        \"up_proj\",\n",
236 | "        \"down_proj\",\n",
237 | "    ],\n",
238 | "    inference_mode=False,  # Training mode\n",
239 | "    r=8,  # LoRA rank\n",
240 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
241 | "    lora_dropout=0.1,  # Dropout rate\n",
242 | ")\n",
243 | "\n",
244 | "model = get_peft_model(model, config)"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "## 6. Train"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 6,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "args = TrainingArguments(\n",
261 | " output_dir=\"./output/Qwen2\",\n",
262 | " per_device_train_batch_size=4,\n",
263 | " gradient_accumulation_steps=4,\n",
264 | " logging_steps=10,\n",
265 | " num_train_epochs=2,\n",
266 | " save_steps=100,\n",
267 | " learning_rate=1e-4,\n",
268 | " save_on_each_node=True,\n",
269 | " gradient_checkpointing=True,\n",
270 | " report_to=\"none\",\n",
271 | ")"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 7,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "from swanlab.integration.huggingface import SwanLabCallback\n",
281 | "import swanlab\n",
282 | "\n",
283 | "swanlab_callback = SwanLabCallback(\n",
284 | " project=\"Qwen2-fintune\",\n",
285 | " experiment_name=\"Qwen2-1.5B-Instruct\",\n",
286 | "    description=\"Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on the zh_cls_fudan-news dataset.\",\n",
287 | " config={\n",
288 | " \"model\": \"qwen/Qwen2-1.5B-Instruct\",\n",
289 | " \"dataset\": \"huangjintao/zh_cls_fudan-news\",\n",
290 | " },\n",
291 | ")"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "trainer = Trainer(\n",
301 | " model=model,\n",
302 | " args=args,\n",
303 | " train_dataset=train_dataset,\n",
304 | " data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
305 | " callbacks=[swanlab_callback],\n",
306 | ")\n",
307 | "\n",
308 | "trainer.train()\n",
309 | "\n",
310 | "\n",
311 | "# ====== Prediction after training ===== #\n",
312 | "\n",
313 | "def predict(messages, model, tokenizer):\n",
314 | "    device = \"cuda\"\n",
315 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
316 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
317 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
318 | "    generated_ids = [\n",
319 | "        output_ids[len(input_ids) :]\n",
320 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
321 | "    ]\n",
322 | "\n",
323 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
324 | "    print(response)\n",
325 | "\n",
326 | "    return response\n",
327 | "\n",
328 | "\n",
329 | "test_text_list = []\n",
330 | "for index, row in test_df.iterrows():\n",
331 | "    instruction = row[\"instruction\"]\n",
332 | "    input_value = row[\"input\"]\n",
333 | "\n",
334 | "    messages = [\n",
335 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
336 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
337 | "    ]\n",
338 | "\n",
339 | "    response = predict(messages, model, tokenizer)\n",
340 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
341 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
342 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
343 | "\n",
344 | "swanlab.log({\"Prediction\": test_text_list})\n",
345 | "swanlab.finish()"
346 | ]
347 | }
348 | ],
349 | "metadata": {
350 | "kernelspec": {
351 | "display_name": "Python 3 (ipykernel)",
352 | "language": "python",
353 | "name": "python3"
354 | },
355 | "language_info": {
356 | "codemirror_mode": {
357 | "name": "ipython",
358 | "version": 3
359 | },
360 | "file_extension": ".py",
361 | "mimetype": "text/x-python",
362 | "name": "python",
363 | "nbconvert_exporter": "python",
364 | "pygments_lexer": "ipython3",
365 | "version": "3.10.13"
366 | }
367 | },
368 | "nbformat": 4,
369 | "nbformat_minor": 4
370 | }
371 |
--------------------------------------------------------------------------------
/notebook/train_qwen2_ner.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 👾Getting Started with Qwen2 Fine-tuning: Named Entity Recognition\n",
8 | "\n",
9 | "Author: Zeyi Lin (林泽毅)\n",
10 | "\n",
11 | "Tutorial article: https://zhuanlan.zhihu.com/p/704463319\n",
12 | "\n",
13 | "GPU memory required: about 10GB \n",
14 | "\n",
15 | "Training run: https://swanlab.cn/@ZeyiLin/Qwen2-NER-fintune/runs/9gdyrkna1rxjjmz0nks2c/chart"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## 1. Install the environment\n",
23 | "\n",
24 | "This example was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, and swanlab==0.3.11."
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "%pip install torch swanlab modelscope transformers datasets peft pandas accelerate"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "name": "stdout",
50 | "output_type": "stream",
51 | "text": [
52 | "\u001b[1m\u001b[34mswanlab\u001b[0m\u001b[0m: You are already logged in. Use `\u001b[1mswanlab login --relogin\u001b[0m` to force relogin.\n"
53 | ]
54 | }
55 | ],
56 | "source": [
57 | "!swanlab login"
58 | ]
59 | },
60 | {
61 | "attachments": {},
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## 2. Load the dataset\n",
66 | "\n",
67 | "1. Download ccfbdci.jsonl from [chinese_ner_sft - huggingface](https://huggingface.co/datasets/qgyd2021/chinese_ner_sft/tree/main/data) into the same directory as this notebook.\n",
68 | "\n",
69 | ""
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "2. Convert ccfbdci.jsonl into ccf_train.jsonl, then split it into a training set and a test set"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 1,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 2. Convert ccfbdci.jsonl into ccf_train.jsonl, then split it into training and test sets\n",
86 | "\n",
87 | "import json\n",
88 | "import pandas as pd\n",
89 | "import os\n",
90 | "\n",
91 | "def dataset_jsonl_transfer(origin_path, new_path):\n",
92 | "    \"\"\"\n",
93 | "    Convert the raw dataset into the instruction/input/output format used for LLM fine-tuning.\n",
94 | "    \"\"\"\n",
95 | "    messages = []\n",
96 | "\n",
97 | "    # Read the original JSONL file\n",
98 | "    with open(origin_path, \"r\", encoding=\"utf-8\") as file:\n",
99 | "        for line in file:\n",
100 | "            # Parse the JSON record on each line\n",
101 | "            data = json.loads(line)\n",
102 | "            input_text = data[\"text\"]\n",
103 | "            entities = data[\"entities\"]\n",
104 | "            match_names = [\"地点\", \"人名\", \"地理实体\", \"组织\"]\n",
105 | "\n",
106 | "            entity_sentence = \"\"\n",
107 | "            for entity in entities:\n",
108 | "                entity_json = dict(entity)\n",
109 | "                entity_text = entity_json[\"entity_text\"]\n",
110 | "                entity_names = entity_json[\"entity_names\"]\n",
111 | "                # Use the first name that is one of the four target labels\n",
112 | "                entity_label = next((name for name in entity_names if name in match_names), None)\n",
113 | "                if entity_label is None:\n",
114 | "                    continue  # no target label: skip rather than reuse a stale or undefined label\n",
115 | "\n",
116 | "                entity_sentence += f\"\"\"{{\"entity_text\": \"{entity_text}\", \"entity_label\": \"{entity_label}\"}}\"\"\"\n",
117 | "\n",
118 | "            if entity_sentence == \"\":\n",
119 | "                entity_sentence = \"没有找到任何实体\"\n",
120 | "\n",
121 | "            message = {\n",
122 | "                \"instruction\": \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\". \"\"\",\n",
123 | "                \"input\": f\"文本:{input_text}\",\n",
124 | "                \"output\": entity_sentence,\n",
125 | "            }\n",
126 | "\n",
127 | "            messages.append(message)\n",
128 | "\n",
129 | "    # Save the converted JSONL file\n",
130 | "    with open(new_path, \"w\", encoding=\"utf-8\") as file:\n",
131 | "        for message in messages:\n",
132 | "            file.write(json.dumps(message, ensure_ascii=False) + \"\\n\")\n",
133 | "\n",
134 | "\n",
135 | "# Load and convert the dataset, then split it into training and test sets\n",
136 | "train_dataset_path = \"ccfbdci.jsonl\"\n",
137 | "train_jsonl_new_path = \"ccf_train.jsonl\"\n",
138 | "\n",
139 | "if not os.path.exists(train_jsonl_new_path):\n",
140 | "    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)\n",
141 | "\n",
142 | "total_df = pd.read_json(train_jsonl_new_path, lines=True)\n",
143 | "train_df = total_df[int(len(total_df) * 0.1):]  # the last 90% of the rows form the training set\n",
144 | "test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)  # 20 random rows from the first 10% form the test set"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "## 3. Download/load the model and tokenizer"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "from modelscope import snapshot_download, AutoTokenizer\n",
161 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq\n",
162 | "import torch\n",
163 | "\n",
164 | "model_id = \"qwen/Qwen2-1.5B-Instruct\" \n",
165 | "model_dir = \"./qwen/Qwen2-1___5B-Instruct\"\n",
166 | "\n",
167 | "# Download the Qwen model from modelscope into a local directory\n",
168 | "model_dir = snapshot_download(model_id, cache_dir=\"./\", revision=\"master\")\n",
169 | "\n",
170 | "# Load the model weights with Transformers\n",
171 | "tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)\n",
172 | "model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", torch_dtype=torch.bfloat16)\n",
173 | "model.enable_input_require_grads()  # required when gradient checkpointing is enabled"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "## 4. Preprocess the training data"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 3,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "def process_func(example):\n",
190 | "    \"\"\"\n",
191 | "    Preprocess one example into the format the model accepts.\n",
192 | "    \"\"\"\n",
193 | "\n",
194 | "    MAX_LENGTH = 384\n",
195 | "    input_ids, attention_mask, labels = [], [], []\n",
196 | "    system_prompt = \"\"\"你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {\"entity_text\": \"南京\", \"entity_label\": \"地理实体\"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出\"没有找到任何实体\".\"\"\"\n",
197 | "\n",
198 | "    instruction = tokenizer(\n",
199 | "        f\"<|im_start|>system\\n{system_prompt}<|im_end|>\\n<|im_start|>user\\n{example['input']}<|im_end|>\\n<|im_start|>assistant\\n\",\n",
200 | "        add_special_tokens=False,\n",
201 | "    )\n",
202 | "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
203 | "    input_ids = instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
204 | "    attention_mask = (\n",
205 | "        instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]\n",
206 | "    )\n",
207 | "    labels = [-100] * len(instruction[\"input_ids\"]) + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
208 | "    if len(input_ids) > MAX_LENGTH:  # truncate over-long sequences\n",
209 | "        input_ids = input_ids[:MAX_LENGTH]\n",
210 | "        attention_mask = attention_mask[:MAX_LENGTH]\n",
211 | "        labels = labels[:MAX_LENGTH]\n",
212 | "    return {\"input_ids\": input_ids, \"attention_mask\": attention_mask, \"labels\": labels}"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "from datasets import Dataset\n",
222 | "\n",
223 | "train_ds = Dataset.from_pandas(train_df)\n",
224 | "train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "## 5. Configure LoRA"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 5,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "from peft import LoraConfig, TaskType, get_peft_model\n",
241 | "\n",
242 | "config = LoraConfig(\n",
243 | "    task_type=TaskType.CAUSAL_LM,\n",
244 | "    target_modules=[\n",
245 | "        \"q_proj\",\n",
246 | "        \"k_proj\",\n",
247 | "        \"v_proj\",\n",
248 | "        \"o_proj\",\n",
249 | "        \"gate_proj\",\n",
250 | "        \"up_proj\",\n",
251 | "        \"down_proj\",\n",
252 | "    ],\n",
253 | "    inference_mode=False,  # training mode\n",
254 | "    r=8,  # LoRA rank\n",
255 | "    lora_alpha=32,  # LoRA alpha; see the LoRA method for its exact role\n",
256 | "    lora_dropout=0.1,  # dropout ratio\n",
257 | ")\n",
258 | "\n",
259 | "model = get_peft_model(model, config)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "## 6. Training"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 6,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "args = TrainingArguments(\n",
276 | "    output_dir=\"./output/Qwen2-NER\",\n",
277 | "    per_device_train_batch_size=4,\n",
278 | "    per_device_eval_batch_size=4,\n",
279 | "    gradient_accumulation_steps=4,\n",
280 | "    logging_steps=10,\n",
281 | "    num_train_epochs=2,\n",
282 | "    save_steps=100,\n",
283 | "    learning_rate=1e-4,\n",
284 | "    save_on_each_node=True,\n",
285 | "    gradient_checkpointing=True,\n",
286 | "    report_to=\"none\",\n",
287 | ")"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 7,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "from swanlab.integration.huggingface import SwanLabCallback\n",
297 | "import swanlab\n",
298 | "\n",
299 | "swanlab_callback = SwanLabCallback(\n",
300 | "    project=\"Qwen2-NER-fintune\",\n",
301 | "    experiment_name=\"Qwen2-1.5B-Instruct\",\n",
302 | "    description=\"Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on an NER dataset for key entity recognition.\",\n",
303 | "    config={\n",
304 | "        \"model\": model_id,\n",
305 | "        \"model_dir\": model_dir,\n",
306 | "        \"dataset\": \"qgyd2021/chinese_ner_sft\",\n",
307 | "    },\n",
308 | ")"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "trainer = Trainer(\n",
318 | "    model=model,\n",
319 | "    args=args,\n",
320 | "    train_dataset=train_dataset,\n",
321 | "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
322 | "    callbacks=[swanlab_callback],\n",
323 | ")\n",
324 | "\n",
325 | "trainer.train()\n",
326 | "\n",
327 | "\n",
328 | "# ====== Prediction after training ====== #\n",
329 | "\n",
330 | "def predict(messages, model, tokenizer):\n",
331 | "    device = \"cuda\"\n",
332 | "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
333 | "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
334 | "    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)\n",
335 | "    generated_ids = [  # drop the prompt tokens so only the newly generated ones remain\n",
336 | "        output_ids[len(input_ids) :]\n",
337 | "        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
338 | "    ]\n",
339 | "\n",
340 | "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
341 | "    print(response)\n",
342 | "\n",
343 | "    return response\n",
344 | "\n",
345 | "\n",
346 | "test_text_list = []\n",
347 | "for index, row in test_df.iterrows():\n",
348 | "    instruction = row[\"instruction\"]\n",
349 | "    input_value = row[\"input\"]\n",
350 | "\n",
351 | "    messages = [\n",
352 | "        {\"role\": \"system\", \"content\": f\"{instruction}\"},\n",
353 | "        {\"role\": \"user\", \"content\": f\"{input_value}\"},\n",
354 | "    ]\n",
355 | "\n",
356 | "    response = predict(messages, model, tokenizer)\n",
357 | "    messages.append({\"role\": \"assistant\", \"content\": f\"{response}\"})\n",
358 | "    result_text = f\"{messages[0]}\\n\\n{messages[1]}\\n\\n{messages[2]}\"\n",
359 | "    test_text_list.append(swanlab.Text(result_text, caption=response))\n",
360 | "\n",
361 | "swanlab.log({\"Prediction\": test_text_list})\n",
362 | "swanlab.finish()"
363 | ]
364 | }
365 | ],
366 | "metadata": {
367 | "kernelspec": {
368 | "display_name": "Python 3 (ipykernel)",
369 | "language": "python",
370 | "name": "python3"
371 | },
372 | "language_info": {
373 | "codemirror_mode": {
374 | "name": "ipython",
375 | "version": 3
376 | },
377 | "file_extension": ".py",
378 | "mimetype": "text/x-python",
379 | "name": "python",
380 | "nbconvert_exporter": "python",
381 | "pygments_lexer": "ipython3",
382 | "version": "3.10.13"
383 | }
384 | },
385 | "nbformat": 4,
386 | "nbformat_minor": 4
387 | }
388 |
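
The prompt masking performed by `process_func` in the notebook above can be looked at in isolation. The following is a minimal sketch, assuming plain Python lists of token ids instead of a real tokenizer; `build_example`, `IGNORE_INDEX`, and the sample ids are hypothetical names and values chosen for illustration and are not part of the notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids, response_ids, end_id, max_length):
    """Concatenate prompt and response ids and hide the prompt from the loss."""
    input_ids = prompt_ids + response_ids + [end_id]
    attention_mask = [1] * len(input_ids)
    # Prompt positions get IGNORE_INDEX, so only the response and the end token are learned.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_id]
    # Truncate all three sequences to the same maximum length.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], end_id=0, max_length=384)
print(example["labels"])
```

The three lists always have the same length, which is what the data collator relies on when it pads a batch.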
--------------------------------------------------------------------------------
/predict_glm4.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import AutoModelForCausalLM, AutoTokenizer
3 | from peft import PeftModel
4 |
5 | def predict(messages, model, tokenizer):
6 |     device = "cuda"
7 |     text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
8 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
9 | 
10 |     generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
11 |     generated_ids = [  # drop the prompt tokens so only the newly generated ones remain
12 |         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
13 |     ]
14 | 
15 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
16 |     return response
17 | 
18 | # Load the tokenizer and base model from the original download path
19 | tokenizer = AutoTokenizer.from_pretrained("./ZhipuAI/glm-4-9b-chat/", use_fast=False, trust_remote_code=True)
20 | model = AutoModelForCausalLM.from_pretrained("./ZhipuAI/glm-4-9b-chat/", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
21 | 
22 | # Load the trained LoRA weights; replace checkpoint-1700 below with the name of your actual checkpoint directory
23 | model = PeftModel.from_pretrained(model, model_id="./output/GLM4-9b/checkpoint-1700")
24 |
25 | test_texts = {
26 | 'instruction': "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型",
27 | 'input': "文本:航空动力学报JOURNAL OF AEROSPACE POWER1998年 第4期 No.4 1998科技期刊管路系统敷设的并行工程模型研究*陈志英* * 马 枚北京航空航天大学【摘要】 提出了一种应用于并行工程模型转换研究的标号法,该法是将现行串行设计过程(As-is)转换为并行设计过程(To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换,应用并行工程过程重构的手段,得到了管路敷设并行过程模型,文中对该模型进行了详细分析。最后对转换前后的模型进行了时间效益分析,得到了对敷管工程有指导意义的结论。\t主题词: 航空发动机 管路系统 计算机辅助设计 建立模型 自由词: 并行工程分类号: V2331 管路系统现行设计过程发动机管路设计的传统过程是串行的,即根据发动机结构总体设计要求,首先设计发动机主体及附件,之后确定各附件位置及相互连接关系。传统的管路敷设方法通常先制作一个与原机比例为1∶1的金属样机,安装上附件模型或实际的附件,敷管工程师用4mm铅丝进行附件间导管连接的实地测量打样,产生导管走向的三维空间形状,进行间隙检查、导管最小弯曲半径检查、导管强度振动分析、可维修性、可达性及热膨胀约束分析等等[1]。历经多次反复敷管并满足所有这些要求后交弯管车间按此铅丝的三维空间形状进行弯管,之后安装导管、管接头、卡箍等。这样每敷设一根管时均需经上述过程,直到敷完所有连接导管。最后装飞机检验图1 管路系统传统设计过程模型整个发动机外廓与飞机短舱间的间隙是否合适,检验发动机各导管与飞机连接是否合理。管路系统传统研制过程模型参见图1。由该过程模型可见,传统的管路系统设计是经验设计,其管路敷设过程涉及总体、附件、材料、工艺、加工、飞机部门等众多部门和环节,而管路设计本身尚不规范,设计研制过程中必然要有大量反复工作存在,从而会大大延长发动机研制周期。面临现代社会越来越激烈的市场竞争,使得产品开发者力争在最短时间内(T),消耗最少的资金(C),生产出满足市场需求(Q)的产品,必须改变产品开发模式,通过过程重构,把传统的串行过程转换为并行设计过程,提高敷管过程的并行度[2]。2 过程重构及标号法过程重构是研究如何将传统的开发过程转换为并行的开发过程,从而使产品研制周期大大缩短,并取得“一次成功”。过程模型的转换可以通过专家经验法实现,然而对于一个复杂过程,如果不借助数学工具,仅靠人工观察是不够的。一个复杂产品的开发需要多部门、多学科的共同参与,因此会有大量活动,而且活动间的约束关系错综复杂[3]。本文提出了进行过程重构的“标号法”算法。思路如下:2.1 概念与定义定义1设有n个活动V1,V2,…,Vn,之间存在有序关系,定义关联矩阵:A=[aij]n×n若Vj是Vi的前序,aij=1;若Vj不是Vi的前序,aij=0;并规定aii=1。定义2 若ai1i2=ai2i3=…=ais-1is=aisi1=1,则称C={Vi1,Vi2,…,Vis}为1个环。定义3 若C为一个环,且C与其它Vj不构成环,称C为饱和环。定义4 L(1),L(2),…,L(k)称为层集。L(i)是Vj或C的集合,且L(1)∪L(2)∪…∪L(k)={1,2,…,n},即L(1),…,L(k)互不相交。2.2 基本原理(1)Vi没有前序,即 Ai*e=1,则Vi为初始层OA;(2){V1,V2,…,Vn}中的初始层有且只有环存在,则A*e>e。2.3 迭代(1)Ae= XJ>eJ可对层进行标号L(s);(2)A=令A=AJ,S=S+1,进行步骤(1);(3)AJ*e>eJ ,搜索法找到一个饱和环C(k);(4)对C(k)进行标号L(k),并判断其前序性。若J=Φ(空)则结束。若J≠Φ,令A=AJ并返回到步骤(1)。3 并行过程模型的建立根据前述标号法方法对图1包含20项任务的管路系统传统设计过程模型建立关联矩阵A,如图2所示(图中数字含义与图1相同)。对此关联矩阵用本文提出的标号法进行过程重构,得到变换后的关联矩阵A′,见图3。从而把原20项任务确定出7个活动层,其中第2层含2个并行活动,第3层包含3个并行活动,第5层包含4个并行活动,第6层是一个饱和环。通过对此实例的过程变换与分析,可以看出“标号法”具有以下特点:(1)总结出此类算法的2条基本原理,证实了算法的可行性;(2)在算法上用单位向量与活动关联阵相乘,求初始层OA,使标号更加明确;(3)寻找层的同时进行了标号,对活动项排序的同时也进行了标号;(4)定义了饱和环概念,消除了嵌环的问题,从而可找到最小环,消除原过程的大循环;(5)用数学的概念进行了算法表达,可对任何过程通过计算进行模型转换。图2 关联矩阵A 图3 
转换后的关联矩阵A′由于工程性问题通常存在任务反馈,反馈对于生产研制过程是必不可少的而又非常重要的。为提高并行度,有的专家提出无条件消除反馈[4],但这势必带来产品设计过程中的不合理性,大大降低了其工程实用性。正是因为反馈参数的存在,才使产品开发过程出现循环反复,延长了研制周期。因此解决反馈问题是工程性课题的重点研究内容。本文提出把“标号法”应用于解决工程实际问题时,可采用修正方法,即:把既是后续任务又具有约束条件的任务项,变换为约束控制类任务提前到被分析任务项前进行,使其成为前序约束任务项来制约过程设计。由此可知任务项11(约束分析)、12(工艺性分析)和任务19(装飞机检验)符合上述原则,可考虑把这三项任务提前到管路敷设任务9前进行。转换后的并行过程模型见图4(图中数字含义与图1相同)。从图中可看出这7个层次,即结构总体设计、外形设计、建模、样机装配、材料准备及约束、敷管和质量检验。图4 管路系统并行设计过程模型(图中数字说明见图1)由此可见,经过重构后的敷管过程具有以下几个特点:(1)敷管过程大循环多反馈变换为小循环少反馈,一些任务项转换为约束控制类任务,使需要多次调整管路走向的敷管任务环内项目数大大减少,缩短了开发周期;(2)对管路敷设有反馈的所有任务项集中在敷管任务项前后进行,突出体现了并行工程强调协作的重要作用意义;例如把装飞机检验提到敷管同时进行,可完全避免大返工现象发生;(3)过程管理流程的层次更加分明,有利于量化管理过程,有利于产品管理、组织管理、资源管理等做相应调整。4 效益分析经分段分析计算,传统的管路敷设过程若一次敷管成功要持续14个月左右,但事实上没有一台新机研制过程中是一次敷管成功的。由上述过程重构分析可见,传统的串行敷管过程基本上一次敷管只偏重满足一种约束要求,而敷管过程是一个多约束控制过程,因此必然造成多次敷管方能满足所有要求。例如,首次敷管偏重于考虑不干涉问题,但当管路敷设完之后,才发现已敷管路不能满足其它敷管要求时,这时调整某些导管,会发生“牵一动百”效应,势必需要再花一段时间(2个月左右)调整管路走向。在实际工程中,由于管路设计任务滞后发动机其它零部件研制阶段,使得留给管路设计任务的周期短而又短,为满足发动机试车任务的急需,可能先敷设一个不考虑任何约束的管路系统。因此在传统敷管方式下多次敷管过程是必然的。当然实际工程中做到同时满足所有约束要求的一次敷管成功,不仅要建立1个并行过程模型,还需要一些技术和管理条件的支持,如到一定阶段加强各部门协调。还有采用CAD技术,金属表1 效益分析传统敷管成功次数传统时间节省时间节省效益一次二次三次…十次14个月16个月18个月32个月5个月7个月9个月23个月36.710%43.755%50.000%71.870%* 节省效益=节省时间/传统时间样机的制作也可相应废除掉,改做电子模型样机,至少可节省数万元的制作费[5]。另外把原每敷设完一根导管就要进行弯管和装管的过程,改为在计算机上敷设完所有管路模型之后进行弯管及装配,可避免大量无效弯管的浪费,也大大节省了开发周期,同时降低开发费用,使一次敷管成功成为可能。管路敷设过程重构后,由于一些任务可同时进行,因此并行敷管过程只需9个月左右时间。效益分析及结果见表1。结束语 改变现行开发模式,建立以CAD为技术支持的管路系统敷设开发数据工具,形成以过程管理为基础的管路敷设系统,将大大缩短开发周期,也必将提高敷管质量,同时也将降低开发费用。参 考 文 献1 马晓锐.航空发动机外部管路敷设的专家系统:[学位论文].航空航天工业部六o六研究所,19932 王甫君.支持并行工程集团群工作的通信、协调、控制系统:[学位论文].北京航空航天大学,19973 Biren Prased.Concurrent Engineering Fundamentals. Volume I,Prentice Hall PTR,USA,19964 A Kusiak, T N Larson.Reengineering of Design and Mannufacturing Processes.Computer and Engingeering,1994,26(3):521-5365 James C B,Henry A. J.Electronic Engine‘Mockup’Shortens Design Time. 
AEROSPACE AMERICA,PP98-100,January 1985(责任编辑 杨再荣)1998年5月收稿;1998年7月收到修改稿。 *潮疚南倒家自然科学基金资助项目,编号:95404F100A(69584001)* *男 38岁 副研究员 北京航空航天大学405教研室 100083,类型选型:['Military', 'Space']'"
28 | }
29 |
30 | instruction = test_texts['instruction']
31 | input_value = test_texts['input']
32 |
33 | messages = [
34 |     {"role": "system", "content": f"{instruction}"},
35 |     {"role": "user", "content": f"{input_value}"}
36 | ]
37 |
38 | response = predict(messages, model, tokenizer)
39 | print(response)
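
The script above hard-codes `checkpoint-1700`. As a hedged alternative, the sketch below picks the highest-numbered `checkpoint-<step>` directory under an output directory, assuming the `checkpoint-<step>` naming that the Hugging Face `Trainer` uses when saving. `latest_checkpoint` is a hypothetical helper introduced for illustration and is not part of this repository.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Only directories named checkpoint-<digits> are candidates.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("./output/GLM4-9b"))
```

The returned path could then be passed as `model_id` to `PeftModel.from_pretrained`; when no checkpoint exists the helper returns `None`, which the caller should handle explicitly.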
--------------------------------------------------------------------------------
/predict_qwen2.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import AutoModelForCausalLM, AutoTokenizer
3 | from peft import PeftModel
4 |
5 | def predict(messages, model, tokenizer):
6 |     device = "cuda"
7 |     text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
8 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
9 | 
10 |     generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
11 |     generated_ids = [  # drop the prompt tokens so only the newly generated ones remain
12 |         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
13 |     ]
14 | 
15 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
16 |     return response
17 | 
18 | # Load the tokenizer and base model from the original download path
19 | tokenizer = AutoTokenizer.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", use_fast=False, trust_remote_code=True)
20 | model = AutoModelForCausalLM.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", device_map="auto", torch_dtype=torch.bfloat16)
21 | 
22 | # Load the trained LoRA weights; replace checkpoint-1700 below with the name of your actual checkpoint directory
23 | model = PeftModel.from_pretrained(model, model_id="./output/Qwen2/checkpoint-1700")
24 |
25 | test_texts = {
26 | 'instruction': "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型",
27 | 'input': "文本:航空动力学报JOURNAL OF AEROSPACE POWER1998年 第4期 No.4 1998科技期刊管路系统敷设的并行工程模型研究*陈志英* * 马 枚北京航空航天大学【摘要】 提出了一种应用于并行工程模型转换研究的标号法,该法是将现行串行设计过程(As-is)转换为并行设计过程(To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换,应用并行工程过程重构的手段,得到了管路敷设并行过程模型,文中对该模型进行了详细分析。最后对转换前后的模型进行了时间效益分析,得到了对敷管工程有指导意义的结论。\t主题词: 航空发动机 管路系统 计算机辅助设计 建立模型 自由词: 并行工程分类号: V2331 管路系统现行设计过程发动机管路设计的传统过程是串行的,即根据发动机结构总体设计要求,首先设计发动机主体及附件,之后确定各附件位置及相互连接关系。传统的管路敷设方法通常先制作一个与原机比例为1∶1的金属样机,安装上附件模型或实际的附件,敷管工程师用4mm铅丝进行附件间导管连接的实地测量打样,产生导管走向的三维空间形状,进行间隙检查、导管最小弯曲半径检查、导管强度振动分析、可维修性、可达性及热膨胀约束分析等等[1]。历经多次反复敷管并满足所有这些要求后交弯管车间按此铅丝的三维空间形状进行弯管,之后安装导管、管接头、卡箍等。这样每敷设一根管时均需经上述过程,直到敷完所有连接导管。最后装飞机检验图1 管路系统传统设计过程模型整个发动机外廓与飞机短舱间的间隙是否合适,检验发动机各导管与飞机连接是否合理。管路系统传统研制过程模型参见图1。由该过程模型可见,传统的管路系统设计是经验设计,其管路敷设过程涉及总体、附件、材料、工艺、加工、飞机部门等众多部门和环节,而管路设计本身尚不规范,设计研制过程中必然要有大量反复工作存在,从而会大大延长发动机研制周期。面临现代社会越来越激烈的市场竞争,使得产品开发者力争在最短时间内(T),消耗最少的资金(C),生产出满足市场需求(Q)的产品,必须改变产品开发模式,通过过程重构,把传统的串行过程转换为并行设计过程,提高敷管过程的并行度[2]。2 过程重构及标号法过程重构是研究如何将传统的开发过程转换为并行的开发过程,从而使产品研制周期大大缩短,并取得“一次成功”。过程模型的转换可以通过专家经验法实现,然而对于一个复杂过程,如果不借助数学工具,仅靠人工观察是不够的。一个复杂产品的开发需要多部门、多学科的共同参与,因此会有大量活动,而且活动间的约束关系错综复杂[3]。本文提出了进行过程重构的“标号法”算法。思路如下:2.1 概念与定义定义1设有n个活动V1,V2,…,Vn,之间存在有序关系,定义关联矩阵:A=[aij]n×n若Vj是Vi的前序,aij=1;若Vj不是Vi的前序,aij=0;并规定aii=1。定义2 若ai1i2=ai2i3=…=ais-1is=aisi1=1,则称C={Vi1,Vi2,…,Vis}为1个环。定义3 若C为一个环,且C与其它Vj不构成环,称C为饱和环。定义4 L(1),L(2),…,L(k)称为层集。L(i)是Vj或C的集合,且L(1)∪L(2)∪…∪L(k)={1,2,…,n},即L(1),…,L(k)互不相交。2.2 基本原理(1)Vi没有前序,即 Ai*e=1,则Vi为初始层OA;(2){V1,V2,…,Vn}中的初始层有且只有环存在,则A*e>e。2.3 迭代(1)Ae= XJ>eJ可对层进行标号L(s);(2)A=令A=AJ,S=S+1,进行步骤(1);(3)AJ*e>eJ ,搜索法找到一个饱和环C(k);(4)对C(k)进行标号L(k),并判断其前序性。若J=Φ(空)则结束。若J≠Φ,令A=AJ并返回到步骤(1)。3 并行过程模型的建立根据前述标号法方法对图1包含20项任务的管路系统传统设计过程模型建立关联矩阵A,如图2所示(图中数字含义与图1相同)。对此关联矩阵用本文提出的标号法进行过程重构,得到变换后的关联矩阵A′,见图3。从而把原20项任务确定出7个活动层,其中第2层含2个并行活动,第3层包含3个并行活动,第5层包含4个并行活动,第6层是一个饱和环。通过对此实例的过程变换与分析,可以看出“标号法”具有以下特点:(1)总结出此类算法的2条基本原理,证实了算法的可行性;(2)在算法上用单位向量与活动关联阵相乘,求初始层OA,使标号更加明确;(3)寻找层的同时进行了标号,对活动项排序的同时也进行了标号;(4)定义了饱和环概念,消除了嵌环的问题,从而可找到最小环,消除原过程的大循环;(5)用数学的概念进行了算法表达,可对任何过程通过计算进行模型转换。图2 关联矩阵A 图3 
转换后的关联矩阵A′由于工程性问题通常存在任务反馈,反馈对于生产研制过程是必不可少的而又非常重要的。为提高并行度,有的专家提出无条件消除反馈[4],但这势必带来产品设计过程中的不合理性,大大降低了其工程实用性。正是因为反馈参数的存在,才使产品开发过程出现循环反复,延长了研制周期。因此解决反馈问题是工程性课题的重点研究内容。本文提出把“标号法”应用于解决工程实际问题时,可采用修正方法,即:把既是后续任务又具有约束条件的任务项,变换为约束控制类任务提前到被分析任务项前进行,使其成为前序约束任务项来制约过程设计。由此可知任务项11(约束分析)、12(工艺性分析)和任务19(装飞机检验)符合上述原则,可考虑把这三项任务提前到管路敷设任务9前进行。转换后的并行过程模型见图4(图中数字含义与图1相同)。从图中可看出这7个层次,即结构总体设计、外形设计、建模、样机装配、材料准备及约束、敷管和质量检验。图4 管路系统并行设计过程模型(图中数字说明见图1)由此可见,经过重构后的敷管过程具有以下几个特点:(1)敷管过程大循环多反馈变换为小循环少反馈,一些任务项转换为约束控制类任务,使需要多次调整管路走向的敷管任务环内项目数大大减少,缩短了开发周期;(2)对管路敷设有反馈的所有任务项集中在敷管任务项前后进行,突出体现了并行工程强调协作的重要作用意义;例如把装飞机检验提到敷管同时进行,可完全避免大返工现象发生;(3)过程管理流程的层次更加分明,有利于量化管理过程,有利于产品管理、组织管理、资源管理等做相应调整。4 效益分析经分段分析计算,传统的管路敷设过程若一次敷管成功要持续14个月左右,但事实上没有一台新机研制过程中是一次敷管成功的。由上述过程重构分析可见,传统的串行敷管过程基本上一次敷管只偏重满足一种约束要求,而敷管过程是一个多约束控制过程,因此必然造成多次敷管方能满足所有要求。例如,首次敷管偏重于考虑不干涉问题,但当管路敷设完之后,才发现已敷管路不能满足其它敷管要求时,这时调整某些导管,会发生“牵一动百”效应,势必需要再花一段时间(2个月左右)调整管路走向。在实际工程中,由于管路设计任务滞后发动机其它零部件研制阶段,使得留给管路设计任务的周期短而又短,为满足发动机试车任务的急需,可能先敷设一个不考虑任何约束的管路系统。因此在传统敷管方式下多次敷管过程是必然的。当然实际工程中做到同时满足所有约束要求的一次敷管成功,不仅要建立1个并行过程模型,还需要一些技术和管理条件的支持,如到一定阶段加强各部门协调。还有采用CAD技术,金属表1 效益分析传统敷管成功次数传统时间节省时间节省效益一次二次三次…十次14个月16个月18个月32个月5个月7个月9个月23个月36.710%43.755%50.000%71.870%* 节省效益=节省时间/传统时间样机的制作也可相应废除掉,改做电子模型样机,至少可节省数万元的制作费[5]。另外把原每敷设完一根导管就要进行弯管和装管的过程,改为在计算机上敷设完所有管路模型之后进行弯管及装配,可避免大量无效弯管的浪费,也大大节省了开发周期,同时降低开发费用,使一次敷管成功成为可能。管路敷设过程重构后,由于一些任务可同时进行,因此并行敷管过程只需9个月左右时间。效益分析及结果见表1。结束语 改变现行开发模式,建立以CAD为技术支持的管路系统敷设开发数据工具,形成以过程管理为基础的管路敷设系统,将大大缩短开发周期,也必将提高敷管质量,同时也将降低开发费用。参 考 文 献1 马晓锐.航空发动机外部管路敷设的专家系统:[学位论文].航空航天工业部六o六研究所,19932 王甫君.支持并行工程集团群工作的通信、协调、控制系统:[学位论文].北京航空航天大学,19973 Biren Prased.Concurrent Engineering Fundamentals. Volume I,Prentice Hall PTR,USA,19964 A Kusiak, T N Larson.Reengineering of Design and Mannufacturing Processes.Computer and Engingeering,1994,26(3):521-5365 James C B,Henry A. J.Electronic Engine‘Mockup’Shortens Design Time. 
AEROSPACE AMERICA,PP98-100,January 1985(责任编辑 杨再荣)1998年5月收稿;1998年7月收到修改稿。 *潮疚南倒家自然科学基金资助项目,编号:95404F100A(69584001)* *男 38岁 副研究员 北京航空航天大学405教研室 100083,类型选型:['Military', 'Space']'"
28 | }
29 |
30 | instruction = test_texts['instruction']
31 | input_value = test_texts['input']
32 |
33 | messages = [
34 |     {"role": "system", "content": f"{instruction}"},
35 |     {"role": "user", "content": f"{input_value}"}
36 | ]
37 |
38 | response = predict(messages, model, tokenizer)
39 | print(response)
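
In `predict` above, the list comprehension slices each generated sequence at the length of its prompt, because `generate` on a decoder-only model returns the prompt tokens followed by the new tokens. The sketch below reproduces only that slicing step on plain Python lists; `strip_prompt` and the sample ids are hypothetical and chosen for illustration.

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Keep only the tokens generated after each prompt."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


prompts = [[1, 2, 3]]          # token ids of the prompt
outputs = [[1, 2, 3, 40, 41]]  # prompt ids followed by two newly generated ids
print(strip_prompt(prompts, outputs))
```

Without this step, decoding the full output would repeat the system and user messages in front of the model's answer.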
--------------------------------------------------------------------------------
/qwen2_vl/csv2json.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import json
3 |
4 | # Load the CSV file
5 | df = pd.read_csv('./coco-2024-dataset.csv')
6 | conversations = []
7 | 
8 | # Build one conversation record per CSV row
9 | for i in range(len(df)):
10 |     conversations.append({
11 |         "id": f"identity_{i+1}",
12 |         "conversations": [
13 |             {
14 |                 "from": "user",
15 |                 "value": f"COCO Yes: <|vision_start|>{df.iloc[i]['image_path']}<|vision_end|>"
16 |             },
17 |             {
18 |                 "from": "assistant",
19 |                 "value": df.iloc[i]['caption']
20 |             }
21 |         ]
22 |     })
23 | 
24 | # Save as JSON
25 | with open('data_vl.json', 'w', encoding='utf-8') as f:
26 |     json.dump(conversations, f, ensure_ascii=False, indent=2)
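
The records written above wrap the image path in `<|vision_start|>` and `<|vision_end|>` markers, and the training script later recovers the path by splitting on those same markers. The sketch below shows that round trip on a single record without pandas; `make_record`, `extract_image_path`, and the sample path are hypothetical names and values used for illustration.

```python
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"


def make_record(index, image_path, caption, prompt="COCO Yes:"):
    """Build one conversation record in the same shape as data_vl.json."""
    return {
        "id": f"identity_{index + 1}",
        "conversations": [
            {"from": "user", "value": f"{prompt} {VISION_START}{image_path}{VISION_END}"},
            {"from": "assistant", "value": caption},
        ],
    }


def extract_image_path(user_value):
    """Recover the image path that sits between the two vision markers."""
    return user_value.split(VISION_START)[1].split(VISION_END)[0]


record = make_record(0, "/data/coco_2014_caption/42.jpg", "A dog on a beach.")
print(extract_image_path(record["conversations"][0]["value"]))
```

Because the path is embedded in a string, it must not itself contain either marker; absolute paths, as produced by `data2csv.py`, keep the records valid regardless of the working directory.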
--------------------------------------------------------------------------------
/qwen2_vl/data2csv.py:
--------------------------------------------------------------------------------
1 | # Import the required libraries
2 | from modelscope.msdatasets import MsDataset
3 | import os
4 | import pandas as pd
5 | 
6 | MAX_DATA_NUMBER = 500
7 | 
8 | # Check whether the output directory already exists
9 | if not os.path.exists('coco_2014_caption'):
10 |     # Download the COCO 2014 image-caption dataset from modelscope
11 |     ds = MsDataset.load('modelscope/coco_2014_caption', subset_name='coco_2014_caption', split='train')
12 |     print(len(ds))
13 |     # Cap the number of images to process
14 |     total = min(MAX_DATA_NUMBER, len(ds))
15 | 
16 |     # Create the directory for the saved images
17 |     os.makedirs('coco_2014_caption', exist_ok=True)
18 | 
19 |     # Lists that collect the image paths and captions
20 |     image_paths = []
21 |     captions = []
22 | 
23 |     for i in range(total):
24 |         # Read the fields of each sample
25 |         item = ds[i]
26 |         image_id = item['image_id']
27 |         caption = item['caption']
28 |         image = item['image']
29 | 
30 |         # Save the image and record its absolute path
31 |         image_path = os.path.abspath(f'coco_2014_caption/{image_id}.jpg')
32 |         image.save(image_path)
33 | 
34 |         # Append the path and the caption to the lists
35 |         image_paths.append(image_path)
36 |         captions.append(caption)
37 | 
38 |         # Print progress every 50 images
39 |         if (i + 1) % 50 == 0:
40 |             print(f'Processing {i+1}/{total} images ({(i+1)/total*100:.1f}%)')
41 | 
42 |     # Collect the image paths and captions in a DataFrame
43 |     df = pd.DataFrame({
44 |         'image_path': image_paths,
45 |         'caption': captions
46 |     })
47 | 
48 |     # Save the data as a CSV file
49 |     df.to_csv('./coco-2024-dataset.csv', index=False)
50 | 
51 |     print(f'Data processing finished: {total} images processed')
52 | 
53 | else:
54 |     print('Directory coco_2014_caption already exists; skipping data processing')
--------------------------------------------------------------------------------
/qwen2_vl/predict_qwen2_vl.py:
--------------------------------------------------------------------------------
1 | from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
2 | from qwen_vl_utils import process_vision_info
3 | from peft import PeftModel, LoraConfig, TaskType
4 |
5 | config = LoraConfig(
6 |     task_type=TaskType.CAUSAL_LM,
7 |     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
8 |     inference_mode=True,
9 |     r=64,  # LoRA rank
10 |     lora_alpha=16,  # LoRA alpha; see the LoRA method for its exact role
11 |     lora_dropout=0.05,  # dropout ratio
12 |     bias="none",
13 | )
14 |
15 | # default: Load the model on the available device(s)
16 | model = Qwen2VLForConditionalGeneration.from_pretrained(
17 |     "./Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
18 | )
19 | model = PeftModel.from_pretrained(model, model_id="./output/Qwen2-VL-2B/checkpoint-62", config=config)
20 | processor = AutoProcessor.from_pretrained("./Qwen/Qwen2-VL-2B-Instruct")
21 | 
22 | messages = [
23 |     {
24 |         "role": "user",
25 |         "content": [
26 |             {
27 |                 "type": "image",
28 |                 "image": "测试图像路径",  # placeholder ("test image path"): replace with the path to your test image
29 |             },
30 |             {"type": "text", "text": "COCO Yes:"},
31 |         ],
32 |     }
33 | ]
34 | 
35 | # Preparation for inference
36 | text = processor.apply_chat_template(
37 |     messages, tokenize=False, add_generation_prompt=True
38 | )
39 | image_inputs, video_inputs = process_vision_info(messages)
40 | inputs = processor(
41 |     text=[text],
42 |     images=image_inputs,
43 |     videos=video_inputs,
44 |     padding=True,
45 |     return_tensors="pt",
46 | )
47 | inputs = inputs.to("cuda")
48 | 
49 | # Inference: Generation of the output
50 | generated_ids = model.generate(**inputs, max_new_tokens=128)
51 | generated_ids_trimmed = [
52 |     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
53 | ]
54 | output_text = processor.batch_decode(
55 |     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
56 | )
57 | print(output_text)
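
The `messages` list above contains a placeholder image path that has to be replaced before the script can run. The sketch below builds the same message structure from a real path and fails early when the file is missing. `build_image_messages` is a hypothetical helper and `test.jpg` a hypothetical file name, both introduced for illustration; only the dictionary layout is taken from the script.

```python
import os


def build_image_messages(image_path, prompt="COCO Yes:"):
    """Return a single-turn user message holding one image and one text prompt."""
    if not os.path.isfile(image_path):
        # Fail before the model is invoked rather than deep inside image loading.
        raise FileNotFoundError(f"Test image not found: {image_path}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


try:
    messages = build_image_messages("test.jpg")
except FileNotFoundError as error:
    print(error)
```

The prompt text defaults to the one used during fine-tuning, since a LoRA adapter trained on a single fixed prompt may respond less reliably to a different one.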
--------------------------------------------------------------------------------
/qwen2_vl/requirements.txt:
--------------------------------------------------------------------------------
1 | torch
2 | torchvision
3 | swanlab>=0.3.25
4 | modelscope>=1.18.0
5 | transformers>=4.46.2
6 | datasets
7 | peft>=0.13.2
8 | accelerate>=1.1.1
9 | qwen-vl-utils>=0.8.0
--------------------------------------------------------------------------------
/qwen2_vl/train_qwen2_vl.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from datasets import Dataset
3 | from modelscope import snapshot_download, AutoTokenizer
4 | from swanlab.integration.transformers import SwanLabCallback
5 | from qwen_vl_utils import process_vision_info
6 | from peft import LoraConfig, TaskType, get_peft_model, PeftModel
7 | from transformers import (
8 | TrainingArguments,
9 | Trainer,
10 | DataCollatorForSeq2Seq,
11 | Qwen2VLForConditionalGeneration,
12 | AutoProcessor,
13 | )
14 | import swanlab
15 | import json
16 |
17 |
18 | def process_func(example):
19 |     """
20 |     Preprocess one example of the dataset.
21 |     """
22 |     MAX_LENGTH = 8192
23 |     input_ids, attention_mask, labels = [], [], []
24 |     conversation = example["conversations"]
25 |     input_content = conversation[0]["value"]
26 |     output_content = conversation[1]["value"]
27 |     file_path = input_content.split("<|vision_start|>")[1].split("<|vision_end|>")[0]  # extract the image path
28 |     messages = [
29 |         {
30 |             "role": "user",
31 |             "content": [
32 |                 {
33 |                     "type": "image",
34 |                     "image": f"{file_path}",
35 |                     "resized_height": 280,
36 |                     "resized_width": 280,
37 |                 },
38 |                 {"type": "text", "text": "COCO Yes:"},
39 |             ],
40 |         }
41 |     ]
42 |     text = processor.apply_chat_template(
43 |         messages, tokenize=False, add_generation_prompt=True
44 |     )  # build the prompt text
45 |     image_inputs, video_inputs = process_vision_info(messages)  # get the preprocessed vision inputs
46 |     inputs = processor(
47 |         text=[text],
48 |         images=image_inputs,
49 |         videos=video_inputs,
50 |         padding=True,
51 |         return_tensors="pt",
52 |     )
53 |     inputs = {key: value.tolist() for key, value in inputs.items()}  # tensor -> list, to make concatenation easier
54 |     instruction = inputs
55 | 
56 |     response = tokenizer(f"{output_content}", add_special_tokens=False)
57 | 
58 |     # Concatenate prompt and response; -100 labels exclude the prompt from the loss
59 |     input_ids = (
60 |         instruction["input_ids"][0] + response["input_ids"] + [tokenizer.pad_token_id]
61 |     )
62 | 
63 |     attention_mask = instruction["attention_mask"][0] + response["attention_mask"] + [1]
64 |     labels = (
65 |         [-100] * len(instruction["input_ids"][0])
66 |         + response["input_ids"]
67 |         + [tokenizer.pad_token_id]
68 |     )
69 |     if len(input_ids) > MAX_LENGTH:  # truncate over-long sequences
70 |         input_ids = input_ids[:MAX_LENGTH]
71 |         attention_mask = attention_mask[:MAX_LENGTH]
72 |         labels = labels[:MAX_LENGTH]
73 | 
74 |     input_ids = torch.tensor(input_ids)
75 |     attention_mask = torch.tensor(attention_mask)
76 |     labels = torch.tensor(labels)
77 |     inputs['pixel_values'] = torch.tensor(inputs['pixel_values'])
78 |     inputs['image_grid_thw'] = torch.tensor(inputs['image_grid_thw']).squeeze(0)  # (1, 3) -> (3,): one (t, h, w) triple
79 |     return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels,
80 |             "pixel_values": inputs['pixel_values'], "image_grid_thw": inputs['image_grid_thw']}
81 |
82 |
83 | def predict(messages, model):
84 |     # Prepare the inputs for inference
85 |     text = processor.apply_chat_template(
86 |         messages, tokenize=False, add_generation_prompt=True
87 |     )
88 |     image_inputs, video_inputs = process_vision_info(messages)
89 |     inputs = processor(
90 |         text=[text],
91 |         images=image_inputs,
92 |         videos=video_inputs,
93 |         padding=True,
94 |         return_tensors="pt",
95 |     )
96 |     inputs = inputs.to("cuda")
97 | 
98 |     # Generate the output
99 |     generated_ids = model.generate(**inputs, max_new_tokens=128)
100 |     generated_ids_trimmed = [
101 |         out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
102 |     ]
103 |     output_text = processor.batch_decode(
104 |         generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
105 |     )
106 | 
107 |     return output_text[0]
108 |
109 |
110 | # Download the Qwen2-VL model from modelscope into a local directory
111 | model_dir = snapshot_download("Qwen/Qwen2-VL-2B-Instruct", cache_dir="./", revision="master")
112 | 
113 | # Load the model weights with Transformers
114 | tokenizer = AutoTokenizer.from_pretrained("./Qwen/Qwen2-VL-2B-Instruct/", use_fast=False, trust_remote_code=True)
115 | processor = AutoProcessor.from_pretrained("./Qwen/Qwen2-VL-2B-Instruct")
116 | 
117 | model = Qwen2VLForConditionalGeneration.from_pretrained("./Qwen/Qwen2-VL-2B-Instruct/", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True,)
118 | model.enable_input_require_grads()  # required when gradient checkpointing is enabled
119 |
120 | # Prepare the dataset: read the JSON file,
121 | # split it into training and test sets, and save them as data_vl_train.json and data_vl_test.json
122 | train_json_path = "data_vl.json"
123 | with open(train_json_path, 'r', encoding='utf-8') as f:
124 |     data = json.load(f)
125 |     train_data = data[:-4]
126 |     test_data = data[-4:]
127 | 
128 | with open("data_vl_train.json", "w") as f:
129 |     json.dump(train_data, f)
130 | 
131 | with open("data_vl_test.json", "w") as f:
132 |     json.dump(test_data, f)
133 | 
134 | train_ds = Dataset.from_json("data_vl_train.json")
135 | train_dataset = train_ds.map(process_func)
136 |
137 | # Configure LoRA
138 | config = LoraConfig(
139 |     task_type=TaskType.CAUSAL_LM,
140 |     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
141 |     inference_mode=False,  # training mode
142 |     r=64,  # LoRA rank
143 |     lora_alpha=16,  # LoRA alpha (scaling factor); see the LoRA paper for details
144 |     lora_dropout=0.05,  # dropout probability
145 |     bias="none",
146 | )
147 |
148 | # Wrap the base model with the LoRA adapter
149 | peft_model = get_peft_model(model, config)
150 |
151 | # Configure the training arguments
152 | args = TrainingArguments(
153 |     output_dir="./output/Qwen2-VL-2B",
154 |     per_device_train_batch_size=4,
155 |     gradient_accumulation_steps=4,
156 |     logging_steps=10,
157 |     logging_first_step=True,  # boolean flag: also log the very first step
158 |     num_train_epochs=2,
159 |     save_steps=100,
160 |     learning_rate=1e-4,
161 |     save_on_each_node=True,
162 |     gradient_checkpointing=True,
163 |     report_to="none",
164 | )
165 |
166 | # Set up the SwanLab callback
167 | swanlab_callback = SwanLabCallback(
168 |     project="Qwen2-VL-finetune",
169 |     experiment_name="qwen2-vl-coco2014",
170 |     config={
171 |         "model": "https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct",
172 |         "dataset": "https://modelscope.cn/datasets/modelscope/coco_2014_caption/quickstart",
173 |         "github": "https://github.com/datawhalechina/self-llm",
174 |         "prompt": "COCO Yes: ",
175 |         "train_data_number": len(train_data),
176 |         "lora_rank": 64,
177 |         "lora_alpha": 16,
178 |         "lora_dropout": 0.05,  # matches lora_dropout in the LoraConfig above
179 |     },
180 | )
181 |
182 | # Configure the Trainer
183 | trainer = Trainer(
184 |     model=peft_model,
185 |     args=args,
186 |     train_dataset=train_dataset,
187 |     data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
188 |     callbacks=[swanlab_callback],
189 | )
190 |
191 | # Start training
192 | trainer.train()
193 |
194 | # ==================== Test mode ===================
195 | # Configure LoRA for inference
196 | val_config = LoraConfig(
197 |     task_type=TaskType.CAUSAL_LM,
198 |     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
199 |     inference_mode=True,  # inference mode
200 |     r=64,  # LoRA rank
201 |     lora_alpha=16,  # LoRA alpha (scaling factor); see the LoRA paper for details
202 |     lora_dropout=0.05,  # dropout probability
203 |     bias="none",
204 | )
205 |
206 | # Load the fine-tuned adapter for testing
207 | val_peft_model = PeftModel.from_pretrained(model, model_id="./output/Qwen2-VL-2B/checkpoint-62", config=val_config)
208 |
209 | # Read the test data
210 | with open("data_vl_test.json", "r") as f:
211 |     test_dataset = json.load(f)
212 |
213 | test_image_list = []
214 | for item in test_dataset:
215 |     input_image_prompt = item["conversations"][0]["value"]
216 |     # Extract the image path between <|vision_start|> and <|vision_end|>
217 |     origin_image_path = input_image_prompt.split("<|vision_start|>")[1].split("<|vision_end|>")[0]
218 |
219 |     messages = [{
220 |         "role": "user",
221 |         "content": [
222 |             {
223 |                 "type": "image",
224 |                 "image": origin_image_path
225 |             },
226 |             {
227 |                 "type": "text",
228 |                 "text": "COCO Yes:"
229 |             }
230 |         ]}]
231 |
232 |     response = predict(messages, val_peft_model)
233 |     messages.append({"role": "assistant", "content": f"{response}"})
234 |     print(messages[-1])
235 |
236 |     test_image_list.append(swanlab.Image(origin_image_path, caption=response))
237 |
238 | swanlab.log({"Prediction": test_image_list})
239 |
240 | # When running in a Jupyter Notebook, call swanlab.finish() to stop SwanLab logging
241 | swanlab.finish()
242 |
--------------------------------------------------------------------------------
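The test stage of the script above loads the adapter from a hard-coded `checkpoint-62` directory, which only exists when the run happens to finish at that step. The following is a minimal sketch, using only the standard library, of how the newest `checkpoint-<step>` directory could be located instead. The helper name `latest_checkpoint` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        # Only directories named exactly "checkpoint-<digits>" are considered.
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), str(child)
    return best_path
```

Under this assumption, the result of `latest_checkpoint("./output/Qwen2-VL-2B")` could be passed as `model_id` to `PeftModel.from_pretrained` after checking that it is not `None`.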
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch
2 | swanlab
3 | modelscope
4 | transformers
5 | datasets
6 | peft
7 | accelerate
8 | pandas
9 | tiktoken
--------------------------------------------------------------------------------
/train_glm4.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pandas as pd
3 | import torch
4 | from datasets import Dataset
5 | from modelscope import snapshot_download, AutoTokenizer
6 | from swanlab.integration.huggingface import SwanLabCallback
7 | from peft import LoraConfig, TaskType, get_peft_model
8 | from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
9 | import os
10 | import swanlab
11 |
12 |
13 | def dataset_jsonl_transfer(origin_path, new_path):
14 |     """
15 |     Convert the raw dataset into the format required for LLM fine-tuning.
16 |     """
17 |     messages = []
18 |
19 |     # Read the original JSONL file
20 |     with open(origin_path, "r", encoding="utf-8") as file:
21 |         for line in file:
22 |             # Parse the JSON object on each line
23 |             data = json.loads(line)
24 |             context = data["text"]
25 |             category = data["category"]
26 |             label = data["output"]
27 |             message = {
28 |                 "instruction": "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型",
29 |                 "input": f"文本:{context},类型选型:{category}",
30 |                 "output": label,
31 |             }
32 |             messages.append(message)
33 |
34 |     # Save the converted JSONL file
35 |     with open(new_path, "w", encoding="utf-8") as file:
36 |         for message in messages:
37 |             file.write(json.dumps(message, ensure_ascii=False) + "\n")
38 |
39 |
40 | def process_func(example):
41 |     """
42 |     Tokenize one sample and build input_ids, attention_mask, and labels.
43 |     """
44 |     MAX_LENGTH = 384
45 |     input_ids, attention_mask, labels = [], [], []
46 |     instruction = tokenizer(
47 |         f"<|system|>\n你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型<|endoftext|>\n<|user|>\n{example['input']}<|endoftext|>\n<|assistant|>\n",
48 |         add_special_tokens=False,
49 |     )
50 |     response = tokenizer(f"{example['output']}", add_special_tokens=False)
51 |     input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
52 |     attention_mask = (
53 |         instruction["attention_mask"] + response["attention_mask"] + [1]
54 |     )
55 |     labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
56 |     if len(input_ids) > MAX_LENGTH:  # truncate to MAX_LENGTH tokens
57 |         input_ids = input_ids[:MAX_LENGTH]
58 |         attention_mask = attention_mask[:MAX_LENGTH]
59 |         labels = labels[:MAX_LENGTH]
60 |     return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
61 |
62 |
63 | def predict(messages, model, tokenizer):
64 |     device = "cuda"
65 |     text = tokenizer.apply_chat_template(
66 |         messages,
67 |         tokenize=False,
68 |         add_generation_prompt=True
69 |     )
70 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
71 |
72 |     generated_ids = model.generate(
73 |         model_inputs.input_ids,
74 |         max_new_tokens=512
75 |     )
76 |     generated_ids = [
77 |         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
78 |     ]
79 |
80 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
81 |
82 |     print(response)
83 |
84 |     return response
85 |
86 | # Download the GLM model from ModelScope into the local directory
87 | model_dir = snapshot_download("ZhipuAI/glm-4-9b-chat", cache_dir="./", revision="master")
88 |
89 | # Load the model weights with Transformers
90 | tokenizer = AutoTokenizer.from_pretrained("./ZhipuAI/glm-4-9b-chat/", use_fast=False, trust_remote_code=True)
91 | model = AutoModelForCausalLM.from_pretrained("./ZhipuAI/glm-4-9b-chat/", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
92 | model.enable_input_require_grads()  # Required when gradient checkpointing is enabled
93 |
94 | # Load and preprocess the training and test sets
95 | train_dataset_path = "train.jsonl"
96 | test_dataset_path = "test.jsonl"
97 |
98 | train_jsonl_new_path = "new_train.jsonl"
99 | test_jsonl_new_path = "new_test.jsonl"
100 |
101 | if not os.path.exists(train_jsonl_new_path):
102 |     dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
103 | if not os.path.exists(test_jsonl_new_path):
104 |     dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)
105 |
106 | # Build the training set
107 | train_df = pd.read_json(train_jsonl_new_path, lines=True)
108 | train_ds = Dataset.from_pandas(train_df)
109 | train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)
110 |
111 | config = LoraConfig(
112 |     task_type=TaskType.CAUSAL_LM,
113 |     target_modules=["query_key_value", "dense", "dense_h_to_4h", "activation_func", "dense_4h_to_h"],
114 |     inference_mode=False,  # training mode
115 |     r=8,  # LoRA rank
116 |     lora_alpha=32,  # LoRA alpha (scaling factor); see the LoRA paper for details
117 |     lora_dropout=0.1,  # dropout probability
118 | )
119 |
120 | model = get_peft_model(model, config)
121 |
122 | args = TrainingArguments(
123 | output_dir="./output/GLM4-9b",
124 | per_device_train_batch_size=4,
125 | gradient_accumulation_steps=4,
126 | logging_steps=10,
127 | num_train_epochs=2,
128 | save_steps=100,
129 | learning_rate=1e-4,
130 | save_on_each_node=True,
131 | gradient_checkpointing=True,
132 | report_to="none",
133 | )
134 |
135 | swanlab_callback = SwanLabCallback(
136 | project="GLM4-fintune",
137 | experiment_name="GLM4-9B-Chat",
138 | description="使用智谱GLM4-9B-Chat模型在zh_cls_fudan-news数据集上微调。",
139 | config={
140 | "model": "ZhipuAI/glm-4-9b-chat",
141 | "dataset": "huangjintao/zh_cls_fudan-news",
142 | },
143 | )
144 |
145 | trainer = Trainer(
146 | model=model,
147 | args=args,
148 | train_dataset=train_dataset,
149 | data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
150 | callbacks=[swanlab_callback],
151 | )
152 |
153 | trainer.train()
154 |
155 | # Test the model on the first 10 samples of the test set
156 | test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]
157 |
158 | test_text_list = []
159 | for index, row in test_df.iterrows():
160 |     instruction = row['instruction']
161 |     input_value = row['input']
162 |
163 |     messages = [
164 |         {"role": "system", "content": f"{instruction}"},
165 |         {"role": "user", "content": f"{input_value}"}
166 |     ]
167 |
168 |     response = predict(messages, model, tokenizer)
169 |     messages.append({"role": "assistant", "content": f"{response}"})
170 |     result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
171 |     test_text_list.append(swanlab.Text(result_text, caption=response))
172 |
173 | swanlab.log({"Prediction": test_text_list})
174 | swanlab.finish()
--------------------------------------------------------------------------------
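The `process_func` in the script above masks the prompt tokens with `-100` so that only the response and the terminating token contribute to the loss. The following is a minimal, tokenizer-free sketch of that masking and truncation logic on plain integer lists. The function name `build_example` and the toy token ids are hypothetical and used for illustration only.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(
    prompt_ids: List[int], response_ids: List[int], eos_id: int, max_length: int
) -> Dict[str, List[int]]:
    """Concatenate prompt and response, mask the prompt in the labels, then truncate."""
    input_ids = prompt_ids + response_ids + [eos_id]
    attention_mask = [1] * len(input_ids)
    # Prompt positions are masked so the loss covers only the response and the end token.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }
```

Note that truncation cuts from the right, so a sample longer than `max_length` loses its end token and possibly part of its response.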
/train_glm4_ner.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pandas as pd
3 | import torch
4 | from datasets import Dataset
5 | from modelscope import snapshot_download, AutoTokenizer
6 | from swanlab.integration.huggingface import SwanLabCallback
7 | from peft import LoraConfig, TaskType, get_peft_model
8 | from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
9 | import os
10 | import swanlab
11 |
12 |
13 | def dataset_jsonl_transfer(origin_path, new_path):
14 |     """
15 |     Convert the raw dataset into the format required for LLM fine-tuning.
16 |     """
17 |     messages = []
18 |
19 |     # Read the original JSONL file
20 |     with open(origin_path, "r", encoding="utf-8") as file:
21 |         for line in file:
22 |             # Parse the JSON object on each line
23 |             data = json.loads(line)
24 |             input_text = data["text"]
25 |             entities = data["entities"]
26 |             match_names = ["地点", "人名", "地理实体", "组织"]
27 |
28 |             entity_sentence = ""
29 |             for entity in entities:
30 |                 entity_json = dict(entity)
31 |                 entity_text = entity_json["entity_text"]
32 |                 entity_names = entity_json["entity_names"]
33 |                 # Use the first matching label; skip the entity if none matches (avoids a stale or unbound label)
34 |                 entity_label = next((name for name in entity_names if name in match_names), None)
35 |                 if entity_label is None:
36 |                     continue
37 |
38 |                 entity_sentence += json.dumps({"entity_text": entity_text, "entity_label": entity_label}, ensure_ascii=False)
39 |
40 |             if entity_sentence == "":
41 |                 entity_sentence = "没有找到任何实体"
42 |
43 |             message = {
44 |                 "instruction": """你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体". """,
45 |                 "input": f"文本:{input_text}",
46 |                 "output": entity_sentence,
47 |             }
48 |
49 |             messages.append(message)
50 |
51 |     # Save the converted JSONL file
52 |     with open(new_path, "w", encoding="utf-8") as file:
53 |         for message in messages:
54 |             file.write(json.dumps(message, ensure_ascii=False) + "\n")
55 |
56 |
57 | def process_func(example):
58 |     """
59 |     Preprocess one sample of the dataset; intended to be called through dataset.map.
60 |     """
61 |
62 |     MAX_LENGTH = 384
63 |     input_ids, attention_mask, labels = [], [], []
64 |     system_prompt = """你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体"."""
65 |
66 |     instruction = tokenizer(
67 |         f"<|system|>\n{system_prompt}<|endoftext|>\n<|user|>\n{example['input']}<|endoftext|>\n<|assistant|>\n",
68 |         add_special_tokens=False,
69 |     )
70 |     response = tokenizer(f"{example['output']}", add_special_tokens=False)
71 |     input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
72 |     attention_mask = (
73 |         instruction["attention_mask"] + response["attention_mask"] + [1]
74 |     )
75 |     labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
76 |     if len(input_ids) > MAX_LENGTH:  # truncate to MAX_LENGTH tokens
77 |         input_ids = input_ids[:MAX_LENGTH]
78 |         attention_mask = attention_mask[:MAX_LENGTH]
79 |         labels = labels[:MAX_LENGTH]
80 |     return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
81 |
82 |
83 | def predict(messages, model, tokenizer):
84 |     """Run model inference on one test conversation and return the predicted text."""
85 |     device = "cuda"
86 |     text = tokenizer.apply_chat_template(
87 |         messages,
88 |         tokenize=False,
89 |         add_generation_prompt=True
90 |     )
91 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
92 |
93 |     generated_ids = model.generate(
94 |         model_inputs.input_ids,
95 |         max_new_tokens=512
96 |     )
97 |     generated_ids = [
98 |         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
99 |     ]
100 |
101 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
102 |
103 |     print(response)
104 |
105 |     return response
106 |
107 |
108 | model_id = "ZhipuAI/glm-4-9b-chat"
109 | model_dir = "./ZhipuAI/glm-4-9b-chat/"
110 |
111 | # Download the GLM4 model from ModelScope into the local directory
112 | model_dir = snapshot_download(model_id, cache_dir="./", revision="master")
113 |
114 | # Load the model weights with Transformers
115 | tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
116 | model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
117 | model.enable_input_require_grads()  # Required when gradient checkpointing is enabled
118 |
119 | # Load and preprocess the dataset
120 | train_dataset_path = "ccfbdci.jsonl"
121 | train_jsonl_new_path = "ccf_train.jsonl"
122 |
123 | if not os.path.exists(train_jsonl_new_path):
124 |     dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
125 |
126 | # Build the training set: the first 10% of rows are held out for testing
127 | total_df = pd.read_json(train_jsonl_new_path, lines=True)
128 | train_df = total_df[int(len(total_df) * 0.1):]
129 | train_ds = Dataset.from_pandas(train_df)
130 | train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)
131 |
132 | # Configure LoRA
133 | config = LoraConfig(
134 |     task_type=TaskType.CAUSAL_LM,
135 |     target_modules=["query_key_value", "dense", "dense_h_to_4h", "activation_func", "dense_4h_to_h"],
136 |     inference_mode=False,  # training mode
137 |     r=8,  # LoRA rank
138 |     lora_alpha=32,  # LoRA alpha (scaling factor); see the LoRA paper for details
139 |     lora_dropout=0.1,  # dropout probability
140 | )
141 |
142 | # Wrap the base model with PEFT
143 | model = get_peft_model(model, config)
144 |
145 | # Configure the Transformers training arguments
146 | args = TrainingArguments(
147 | output_dir="./output/GLM4-NER",
148 | per_device_train_batch_size=4,
149 | per_device_eval_batch_size=4,
150 | gradient_accumulation_steps=4,
151 | logging_steps=10,
152 | num_train_epochs=2,
153 | save_steps=100,
154 | learning_rate=1e-4,
155 | save_on_each_node=True,
156 | gradient_checkpointing=True,
157 | report_to="none",
158 | )
159 |
160 | # Set up the SwanLab callback for Transformers
161 | swanlab_callback = SwanLabCallback(
162 | project="GLM4-NER-fintune",
163 | experiment_name="GLM4-9B-Chat",
164 | description="使用智谱GLM4-9B-Chat模型在NER数据集上微调,实现关键实体识别任务。",
165 | config={
166 | "model": model_id,
167 | "model_dir": model_dir,
168 | "dataset": "qgyd2021/chinese_ner_sft",
169 | },
170 | )
171 |
172 | # Set up the Transformers Trainer
173 | trainer = Trainer(
174 | model=model,
175 | args=args,
176 | train_dataset=train_dataset,
177 | data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
178 | callbacks=[swanlab_callback],
179 | )
180 |
181 | # Start training
182 | trainer.train()
183 |
184 | # Test the model on 20 random samples from the held-out 10%
185 | test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)
186 |
187 | test_text_list = []
188 | for index, row in test_df.iterrows():
189 |     instruction = row['instruction']
190 |     input_value = row['input']
191 |
192 |     messages = [
193 |         {"role": "system", "content": f"{instruction}"},
194 |         {"role": "user", "content": f"{input_value}"}
195 |     ]
196 |
197 |     response = predict(messages, model, tokenizer)
198 |     messages.append({"role": "assistant", "content": f"{response}"})
199 |     result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
200 |     test_text_list.append(swanlab.Text(result_text, caption=response))
201 |
202 | # Log the test results
203 | swanlab.log({"Prediction": test_text_list})
204 | # Stop SwanLab logging
205 | swanlab.finish()
--------------------------------------------------------------------------------
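The NER script above trains the model to emit entity objects concatenated back to back with no separator, or the sentence "没有找到任何实体" when nothing is found. The following is a minimal sketch of how such an output string might be parsed back into a list of dictionaries for evaluation, using `json.JSONDecoder.raw_decode` from the standard library. The helper name `parse_entities` is hypothetical and not part of the repository.

```python
import json
from typing import Dict, List


def parse_entities(output: str) -> List[Dict[str, str]]:
    """Extract every JSON object found in a model response, in order of appearance."""
    decoder = json.JSONDecoder()
    entities: List[Dict[str, str]] = []
    pos = 0
    while pos < len(output):
        start = output.find("{", pos)
        if start == -1:
            break  # no further objects, e.g. the "no entities found" sentence
        try:
            # raw_decode returns the parsed object and the index just past it.
            obj, pos = decoder.raw_decode(output, start)
        except json.JSONDecodeError:
            pos = start + 1  # skip a malformed brace and keep scanning
            continue
        if isinstance(obj, dict):
            entities.append(obj)
    return entities
```

Because the parser scans for opening braces, it would also tolerate newline-separated objects, which is the format the system prompt asks the model for.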
/train_qwen2.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pandas as pd
3 | import torch
4 | from datasets import Dataset
5 | from modelscope import snapshot_download, AutoTokenizer
6 | from swanlab.integration.huggingface import SwanLabCallback
7 | from peft import LoraConfig, TaskType, get_peft_model
8 | from transformers import (
9 | AutoModelForCausalLM,
10 | TrainingArguments,
11 | Trainer,
12 | DataCollatorForSeq2Seq,
13 | )
14 | import os
15 | import swanlab
16 |
17 |
18 | def dataset_jsonl_transfer(origin_path, new_path):
19 |     """
20 |     Convert the raw dataset into the format required for LLM fine-tuning.
21 |     """
22 |     messages = []
23 |
24 |     # Read the original JSONL file
25 |     with open(origin_path, "r", encoding="utf-8") as file:
26 |         for line in file:
27 |             # Parse the JSON object on each line
28 |             data = json.loads(line)
29 |             context = data["text"]
30 |             category = data["category"]
31 |             label = data["output"]
32 |             message = {
33 |                 "instruction": "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型",
34 |                 "input": f"文本:{context},类型选型:{category}",
35 |                 "output": label,
36 |             }
37 |             messages.append(message)
38 |
39 |     # Save the converted JSONL file
40 |     with open(new_path, "w", encoding="utf-8") as file:
41 |         for message in messages:
42 |             file.write(json.dumps(message, ensure_ascii=False) + "\n")
43 |
44 |
45 | def process_func(example):
46 |     """
47 |     Tokenize one sample and build input_ids, attention_mask, and labels.
48 |     """
49 |     MAX_LENGTH = 384
50 |     input_ids, attention_mask, labels = [], [], []
51 |     instruction = tokenizer(
52 |         f"<|im_start|>system\n你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
53 |         add_special_tokens=False,
54 |     )
55 |     response = tokenizer(f"{example['output']}", add_special_tokens=False)
56 |     input_ids = (
57 |         instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
58 |     )
59 |     attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
60 |     labels = (
61 |         [-100] * len(instruction["input_ids"])
62 |         + response["input_ids"]
63 |         + [tokenizer.pad_token_id]
64 |     )
65 |     if len(input_ids) > MAX_LENGTH:  # truncate to MAX_LENGTH tokens
66 |         input_ids = input_ids[:MAX_LENGTH]
67 |         attention_mask = attention_mask[:MAX_LENGTH]
68 |         labels = labels[:MAX_LENGTH]
69 |     return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
70 |
71 |
72 | def predict(messages, model, tokenizer):
73 |     device = "cuda"
74 |     text = tokenizer.apply_chat_template(
75 |         messages, tokenize=False, add_generation_prompt=True
76 |     )
77 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
78 |
79 |     generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
80 |     generated_ids = [
81 |         output_ids[len(input_ids) :]
82 |         for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
83 |     ]
84 |
85 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
86 |
87 |     print(response)
88 |
89 |     return response
90 |
91 |
92 | # Download the Qwen model from ModelScope into the local directory
93 | model_dir = snapshot_download(
94 |     "qwen/Qwen2-1.5B-Instruct", cache_dir="./", revision="master"
95 | )
96 |
97 | # Load the model weights with Transformers
98 | tokenizer = AutoTokenizer.from_pretrained(
99 |     "./qwen/Qwen2-1___5B-Instruct/", use_fast=False, trust_remote_code=True
100 | )
101 | model = AutoModelForCausalLM.from_pretrained(
102 |     "./qwen/Qwen2-1___5B-Instruct/", device_map="auto", torch_dtype=torch.bfloat16
103 | )
104 | model.enable_input_require_grads()  # Required when gradient checkpointing is enabled
105 |
106 | # Load and preprocess the training and test sets
107 | train_dataset_path = "train.jsonl"
108 | test_dataset_path = "test.jsonl"
109 |
110 | train_jsonl_new_path = "new_train.jsonl"
111 | test_jsonl_new_path = "new_test.jsonl"
112 |
113 | if not os.path.exists(train_jsonl_new_path):
114 |     dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
115 | if not os.path.exists(test_jsonl_new_path):
116 |     dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)
117 |
118 | # Build the training set
119 | train_df = pd.read_json(train_jsonl_new_path, lines=True)
120 | train_ds = Dataset.from_pandas(train_df)
121 | train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)
122 |
123 | config = LoraConfig(
124 |     task_type=TaskType.CAUSAL_LM,
125 |     target_modules=[
126 |         "q_proj",
127 |         "k_proj",
128 |         "v_proj",
129 |         "o_proj",
130 |         "gate_proj",
131 |         "up_proj",
132 |         "down_proj",
133 |     ],
134 |     inference_mode=False,  # training mode
135 |     r=8,  # LoRA rank
136 |     lora_alpha=32,  # LoRA alpha (scaling factor); see the LoRA paper for details
137 |     lora_dropout=0.1,  # dropout probability
138 | )
139 |
140 | model = get_peft_model(model, config)
141 |
142 | args = TrainingArguments(
143 | output_dir="./output/Qwen1.5",
144 | per_device_train_batch_size=4,
145 | gradient_accumulation_steps=4,
146 | logging_steps=10,
147 | num_train_epochs=2,
148 | save_steps=100,
149 | learning_rate=1e-4,
150 | save_on_each_node=True,
151 | gradient_checkpointing=True,
152 | report_to="none",
153 | )
154 |
155 | swanlab_callback = SwanLabCallback(
156 | project="Qwen2-fintune",
157 | experiment_name="Qwen2-1.5B-Instruct",
158 | description="使用通义千问Qwen2-1.5B-Instruct模型在zh_cls_fudan-news数据集上微调。",
159 | config={
160 | "model": "qwen/Qwen2-1.5B-Instruct",
161 | "dataset": "huangjintao/zh_cls_fudan-news",
162 | },
163 | )
164 |
165 | trainer = Trainer(
166 | model=model,
167 | args=args,
168 | train_dataset=train_dataset,
169 | data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
170 | callbacks=[swanlab_callback],
171 | )
172 |
173 | trainer.train()
174 |
175 | # Test the model on the first 10 samples of the test set
176 | test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]
177 |
178 | test_text_list = []
179 | for index, row in test_df.iterrows():
180 |     instruction = row["instruction"]
181 |     input_value = row["input"]
182 |
183 |     messages = [
184 |         {"role": "system", "content": f"{instruction}"},
185 |         {"role": "user", "content": f"{input_value}"},
186 |     ]
187 |
188 |     response = predict(messages, model, tokenizer)
189 |     messages.append({"role": "assistant", "content": f"{response}"})
190 |     result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
191 |     test_text_list.append(swanlab.Text(result_text, caption=response))
192 |
193 | swanlab.log({"Prediction": test_text_list})
194 | swanlab.finish()
195 |
--------------------------------------------------------------------------------
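The classification script above only logs the raw predictions for the first 10 test samples and does not compute a score. As a minimal sketch, assuming the reference label is available in each row's `output` column, an exact-match accuracy could be computed as follows. The function name `exact_match_accuracy` is hypothetical and not part of the repository.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

Exact match is a strict criterion: a response that wraps the correct label in extra words counts as a miss, so the value is best read as a lower bound.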
/train_qwen2_ner.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pandas as pd
3 | import torch
4 | from datasets import Dataset
5 | from modelscope import snapshot_download, AutoTokenizer
6 | from swanlab.integration.huggingface import SwanLabCallback
7 | from peft import LoraConfig, TaskType, get_peft_model
8 | from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
9 | import os
10 | import swanlab
11 |
12 |
13 | def dataset_jsonl_transfer(origin_path, new_path):
14 |     """
15 |     Convert the raw dataset into the format required for LLM fine-tuning.
16 |     """
17 |     messages = []
18 |
19 |     # Read the original JSONL file
20 |     with open(origin_path, "r", encoding="utf-8") as file:
21 |         for line in file:
22 |             # Parse the JSON object on each line
23 |             data = json.loads(line)
24 |             input_text = data["text"]
25 |             entities = data["entities"]
26 |             match_names = ["地点", "人名", "地理实体", "组织"]
27 |
28 |             entity_sentence = ""
29 |             for entity in entities:
30 |                 entity_json = dict(entity)
31 |                 entity_text = entity_json["entity_text"]
32 |                 entity_names = entity_json["entity_names"]
33 |                 # Use the first matching label; skip the entity if none matches (avoids a stale or unbound label)
34 |                 entity_label = next((name for name in entity_names if name in match_names), None)
35 |                 if entity_label is None:
36 |                     continue
37 |
38 |                 entity_sentence += json.dumps({"entity_text": entity_text, "entity_label": entity_label}, ensure_ascii=False)
39 |
40 |             if entity_sentence == "":
41 |                 entity_sentence = "没有找到任何实体"
42 |
43 |             message = {
44 |                 "instruction": """你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体". """,
45 |                 "input": f"文本:{input_text}",
46 |                 "output": entity_sentence,
47 |             }
48 |
49 |             messages.append(message)
50 |
51 |     # Save the converted JSONL file
52 |     with open(new_path, "w", encoding="utf-8") as file:
53 |         for message in messages:
54 |             file.write(json.dumps(message, ensure_ascii=False) + "\n")
55 |
56 |
57 | def process_func(example):
58 |     """
59 |     Preprocess one sample of the dataset into the format the model accepts.
60 |     """
61 |
62 |     MAX_LENGTH = 384
63 |     input_ids, attention_mask, labels = [], [], []
64 |     system_prompt = """你是一个文本实体识别领域的专家,你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体"."""
65 |
66 |     instruction = tokenizer(
67 |         f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
68 |         add_special_tokens=False,
69 |     )
70 |     response = tokenizer(f"{example['output']}", add_special_tokens=False)
71 |     input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
72 |     attention_mask = (
73 |         instruction["attention_mask"] + response["attention_mask"] + [1]
74 |     )
75 |     labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
76 |     if len(input_ids) > MAX_LENGTH:  # truncate to MAX_LENGTH tokens
77 |         input_ids = input_ids[:MAX_LENGTH]
78 |         attention_mask = attention_mask[:MAX_LENGTH]
79 |         labels = labels[:MAX_LENGTH]
80 |     return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
81 |
82 |
83 | def predict(messages, model, tokenizer):
84 |     device = "cuda"
85 |     text = tokenizer.apply_chat_template(
86 |         messages,
87 |         tokenize=False,
88 |         add_generation_prompt=True
89 |     )
90 |     model_inputs = tokenizer([text], return_tensors="pt").to(device)
91 |
92 |     generated_ids = model.generate(
93 |         model_inputs.input_ids,
94 |         max_new_tokens=512
95 |     )
96 |     generated_ids = [
97 |         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
98 |     ]
99 |
100 |     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
101 |
102 |     print(response)
103 |
104 |     return response
105 |
106 |
107 | model_id = "qwen/Qwen2-1.5B-Instruct"
108 | model_dir = "./qwen/Qwen2-1___5B-Instruct"
109 |
110 | # Download the Qwen model from ModelScope into the local directory
111 | model_dir = snapshot_download(model_id, cache_dir="./", revision="master")
112 |
113 | # Load the model weights with Transformers
114 | tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
115 | model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
116 | model.enable_input_require_grads()  # Required when gradient checkpointing is enabled
117 |
118 | # Load and preprocess the dataset
119 | train_dataset_path = "ccfbdci.jsonl"
120 | train_jsonl_new_path = "ccf_train.jsonl"
121 |
122 | if not os.path.exists(train_jsonl_new_path):
123 |     dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
124 |
125 | # Build the training set: the first 10% of rows are held out for testing
126 | total_df = pd.read_json(train_jsonl_new_path, lines=True)
127 | train_df = total_df[int(len(total_df) * 0.1):]
128 | train_ds = Dataset.from_pandas(train_df)
129 | train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)
130 |
131 |
132 | config = LoraConfig(
133 |     task_type=TaskType.CAUSAL_LM,
134 |     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
135 |     inference_mode=False,  # training mode
136 |     r=8,  # LoRA rank
137 |     lora_alpha=32,  # LoRA alpha (scaling factor); see the LoRA paper for details
138 |     lora_dropout=0.1,  # dropout probability
139 | )
140 |
141 | model = get_peft_model(model, config)
142 |
143 | args = TrainingArguments(
144 | output_dir="./output/Qwen2-NER",
145 | per_device_train_batch_size=4,
146 | per_device_eval_batch_size=4,
147 | gradient_accumulation_steps=4,
148 | logging_steps=10,
149 | num_train_epochs=2,
150 | save_steps=100,
151 | learning_rate=1e-4,
152 | save_on_each_node=True,
153 | gradient_checkpointing=True,
154 | report_to="none",
155 | )
156 |
157 | swanlab_callback = SwanLabCallback(
158 | project="Qwen2-NER-fintune",
159 | experiment_name="Qwen2-1.5B-Instruct",
160 | description="使用通义千问Qwen2-1.5B-Instruct模型在NER数据集上微调,实现关键实体识别任务。",
161 | config={
162 | "model": model_id,
163 | "model_dir": model_dir,
164 | "dataset": "qgyd2021/chinese_ner_sft",
165 | },
166 | )
167 |
168 | trainer = Trainer(
169 | model=model,
170 | args=args,
171 | train_dataset=train_dataset,
172 | data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
173 | callbacks=[swanlab_callback],
174 | )
175 |
176 | trainer.train()
177 |
178 | # Test the model on 20 random samples from the test set
179 | # Build the test set from the held-out first 10% of rows
180 | test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)
181 |
182 | test_text_list = []
183 | for index, row in test_df.iterrows():
184 |     instruction = row['instruction']
185 |     input_value = row['input']
186 |
187 |     messages = [
188 |         {"role": "system", "content": f"{instruction}"},
189 |         {"role": "user", "content": f"{input_value}"}
190 |     ]
191 |
192 |     response = predict(messages, model, tokenizer)
193 |     messages.append({"role": "assistant", "content": f"{response}"})
194 |     result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
195 |     test_text_list.append(swanlab.Text(result_text, caption=response))
196 |
197 | swanlab.log({"Prediction": test_text_list})
198 | swanlab.finish()
--------------------------------------------------------------------------------
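The NER scripts above hold out the first 10% of rows for testing and then draw 20 of them with an unseeded `sample(n=20)`, so each run evaluates a different subset. The following is a minimal, pandas-free sketch of the same head split with a seeded draw, which makes runs comparable. The names `head_split` and `sample_test_rows` and the default seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def head_split(rows: Sequence[T], test_fraction: float = 0.1) -> Tuple[List[T], List[T]]:
    """Return (train, test), where test is the leading test_fraction of the rows."""
    cut = int(len(rows) * test_fraction)
    return list(rows[cut:]), list(rows[:cut])


def sample_test_rows(test_rows: Sequence[T], n: int = 20, seed: int = 42) -> List[T]:
    """Draw up to n rows without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    # Cap n so that a small test split does not raise ValueError.
    return rng.sample(list(test_rows), min(n, len(test_rows)))
```

With pandas, a comparable effect should be obtainable by passing `random_state` to `DataFrame.sample`.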