├── README.md ├── data_process └── data_process.py ├── image-1.png ├── image-2.png ├── image.png ├── image.webp ├── loss_vs_time_hours.png ├── loss_vs_tokens_millions.png ├── pretrain ├── ds_config.json ├── generate_pretrain_data.py ├── model │ ├── config.json │ ├── configuration_miaomiao.py │ └── modeling_miaomiao.py ├── pretrain.py ├── pretrain.sh ├── pretrain_dataset.py └── test_pretrain_model.py ├── requirements.txt ├── rlhf ├── rlhf │ ├── __pycache__ │ │ ├── ppo_trainer.cpython-311.pyc │ │ └── rlhf_engine.cpython-311.pyc │ ├── ppo_trainer.py │ └── rlhf_engine.py ├── rlhf_data_process.py ├── rw_eval.py ├── step2.py ├── step2.sh ├── step2_eval.sh ├── step3.py ├── step3.sh └── utils │ ├── __pycache__ │ ├── data_utils.cpython-311.pyc │ ├── ds_utils.cpython-311.pyc │ ├── model_utils.cpython-311.pyc │ ├── perf.cpython-311.pyc │ ├── raw_datasets.cpython-311.pyc │ ├── reward_model.cpython-311.pyc │ └── utils.cpython-311.pyc │ ├── data_utils.py │ ├── ds_utils.py │ ├── model_utils.py │ ├── perf.py │ ├── raw_datasets.py │ ├── reward_model.py │ └── utils.py ├── sft ├── ds_config.json ├── model │ ├── config.json │ ├── configuration_miaomiao.py │ ├── merges.txt │ ├── modeling_miaomiao.py │ ├── tokenization_miaomiao.py │ ├── tokenizer.json │ ├── tokenizer_config.json │ └── vocab.json ├── sft.py ├── sft.sh ├── sft_data_filted.py ├── sft_dataset.py └── test_sft_model.py └── train_tokenizer ├── miaomiao_tokenizer ├── merges.txt ├── tokenization_miaomiao.py ├── tokenizer.json ├── tokenizer_config.json └── vocab.json └── train_tokenizer.py /README.md: -------------------------------------------------------------------------------- 1 | # Zero-Chatgpt 2 |

3 | ![Zero-Chatgpt](image.webp) 4 | 

5 | 
6 | The goal of this open-source project is to walk through the ChatGPT technical pipeline from scratch.
7 | It covers: data collection -> data cleaning and deduplication -> tokenizer training -> language model pretraining -> instruction fine-tuning -> reinforcement learning (RLHF, PPO).
8 | The priority is getting the code and the full pipeline running; tuning for quality can come later, time permitting.
9 | Pretraining data: 10B tokens; instruction fine-tuning data: 300k samples; RLHF data: 100k samples; model size: 0.1B parameters.
10 | The training pipeline and code already run end to end. For better results you can scale up directly by editing the model config file; in this project's training experience, larger models and more data improve quality very noticeably.
11 | 
12 | ---
13 | A related open-source multimodal (image-text) project: [Zero-Qwen-VL](https://github.com/AI-Study-Han/Zero-Qwen-VL), which trains an image-text model with friendlier Chinese support from scratch and walks through the multimodal training pipeline. It uses the Qwen-VL image encoder and the Qwen2-0.5B-Instruct language model; with enough compute you can swap in larger models for better results.
14 | ## 1. Training Environment
15 | CUDA 12.1, PyTorch, transformers, DeepSpeed, and other common dependencies; requirements.txt lists the full runtime environment.
16 | 
17 | Compute was 2 A40 GPUs; pretraining took about 2 days.
18 | 
19 | ## 2. Training Data, Model Weights, and Runtime Images
20 | [The pretraining data, fine-tuning data, RLHF data, model weights, and the pretraining/SFT runtime images](https://huggingface.co/My521/Zero-Chatgpt/tree/main) are all hosted here. To load a model, drop the prefix from the weight file name (rename it to model.safetensors or pytorch_model.bin) and place it next to the model code and config files (in the model folder). The pretraining data and the runtime images are large and will be uploaded later.
21 | 
22 | | File | Description |
23 | |------------------------|--------------------------------------------------------|
24 | | [pretrain_model.safetensors](https://huggingface.co/My521/Zero-Chatgpt/blob/main/pretrain_model.safetensors) | Weights of the pretrained model |
25 | | [sft_model.safetensors](https://huggingface.co/My521/Zero-Chatgpt/blob/main/sft_model.safetensors) | Weights of the instruction fine-tuned model |
26 | | [rlhf_pytorch_model.bin](https://huggingface.co/My521/Zero-Chatgpt/blob/main/rlhf_pytorch_model.bin) | Weights of the model after RLHF |
27 | | [pretrain_sft.tar](https://huggingface.co/My521/Zero-Chatgpt/blob/main/pretrain_sft.tar) | Runtime image for pretraining and SFT |
28 | | [rlhf.tar](https://huggingface.co/My521/Zero-Chatgpt/blob/main/rlhf.tar) | Runtime image for RLHF |
29 | | [rlhf.jsonl](https://huggingface.co/My521/Zero-Chatgpt/blob/main/rlhf.jsonl) | RLHF dataset |
30 | | [sft.jsonl](https://huggingface.co/My521/Zero-Chatgpt/blob/main/sft.jsonl) | SFT dataset |
31 | | [pretrain.bin](https://huggingface.co/My521/Zero-Chatgpt/blob/main/pretrain.bin) | Pretraining data |
32 | 
33 | 
34 | ## 3. Data Collection and Cleaning
35 | About 10B tokens of Chinese training corpus were collected, drawn from [Chinese Wikipedia](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/blob/main/wikipedia-cn-20230720-filtered.json), [Baidu Baike](https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M/blob/main/563w_baidubaike.json), and a random sample of [SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B).
36 | 
37 | Chinese Wikipedia and SkyPile-150B are fairly clean, so only Baidu Baike was cleaned and deduplicated: person and product introduction entries and very short entries were removed, followed by strict deduplication, which reduced the original 5.63M entries to a bit over 1.4M.
38 | 
39 | The data processing code is in the data_process folder.
40 | 
41 | ## 4. Tokenizer Training
42 | The tokenizer is trained on a random sample from the three corpora (how much depends on your server's RAM; this project used about 1.5 GB of text). The vocabulary size is set to 32000 (following llama); because the model here is small, a small vocabulary keeps the embedding layer from dominating the parameter count. The special_tokens follow the qwen setup.
43 | 
44 | The tokenizer training code is in the train_tokenizer folder (a minimal sketch of such a training script is shown below).
45 | 
46 | ## 5. Pretraining
47 | The model architecture follows llama (as most open-source models do), and the modeling code follows the huggingface style (an earlier version of the training code was not huggingface-compatible, which caused too much trouble during RLHF, so it was rewritten). Given the compute available, the model size was set to about 0.1B parameters (a rough parameter-count check follows the loss curves below); with more compute you can scale up both the model and the data. In this project's experience, larger models trained on more data give clearly better results, and the difference is obvious.
48 | 
49 | The data is first tokenized and packed into .bin files, and training then uses the huggingface Trainer.
50 | 
51 | The pretraining data generation code, training script, and training code are in the pretrain folder.
52 | 
53 | 
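The train_tokenizer/train_tokenizer.py script itself is not reproduced in this excerpt, so here is a minimal, hypothetical sketch of a byte-level BPE training script along the lines described in section 4. The corpus path, the exact special token strings, and the use of the Hugging Face `tokenizers` library are assumptions for illustration, not necessarily what the project's script does:

```python
# Minimal sketch (assumptions: JSONL corpus with a "text" field, Qwen-style
# special tokens; not necessarily identical to train_tokenizer.py).
import json
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def read_texts(path):
    # train_tokenizer.json is the sampled corpus written by generate_train_tokenizer_data().
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # small vocabulary so the embeddings do not dominate a 0.1B model
    special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>"],  # assumed Qwen-style markers
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(read_texts("./train_tokenizer.json"), trainer=trainer)
tokenizer.save("./miaomiao_tokenizer/tokenizer.json")
```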

54 | ![loss](loss_vs_time_hours.png) 55 | 

56 | 57 |

58 | ![loss](loss_vs_tokens_millions.png) 59 | 
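As a rough sanity check on the "~0.1B" scale mentioned in section 5, the parameter count implied by the config in pretrain/model/config.json (hidden_size 512, intermediate_size 2752, 24 layers, vocab_size 32006, untied embeddings) can be estimated in a few lines. This is only an approximation that ignores the small RMSNorm weights:

```python
# Back-of-the-envelope parameter count for pretrain/model/config.json
# (no attention/MLP biases; RMSNorm weights ignored).
hidden, inter, layers, vocab = 512, 2752, 24, 32006

embed = vocab * hidden                 # input embedding table
lm_head = vocab * hidden               # output projection (tie_word_embeddings is false)
attn_per_layer = 4 * hidden * hidden   # q, k, v, o projections (16 KV heads = full size)
mlp_per_layer = 3 * hidden * inter     # gate, up, down projections
total = embed + lm_head + layers * (attn_per_layer + mlp_per_layer)

print(f"total = {total/1e6:.0f}M, non-embedding = {(total - embed - lm_head)/1e6:.0f}M")
# total is roughly 159M parameters, about 127M excluding the embedding and output
# layers, i.e. the 0.1B scale referred to above.
```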

60 | 
61 | ## 6. Instruction Fine-Tuning
62 | The instruction fine-tuning data comes from [firefly-train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M/blob/main/firefly-train-1.1M.jsonl) and [ruozhiout_qa_cn.jsonl](https://www.modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset/files).
63 | 
64 | The quality of firefly-train-1.1M is not especially high, so the dataset was cleaned and deduplicated, filtering on question length, leaving a bit over 400k samples. Because the model is small and the goal was only single-turn chat ability, only single-turn conversations were kept.
65 | 
66 | SFT used 300k samples, and the results are still not great, probably because of the model size. An earlier experiment trained a 1.5B model on 50B tokens, and 20k SFT samples already gave reasonably good chat ability; for this 0.1B model, chat ability after training on 20k SFT samples was still fairly poor, so more SFT data was needed and 300k samples were used here.
67 | 
68 | The instruction fine-tuning script and code are in the sft folder.
69 | 
70 | A quick test of the fine-tuned model:
71 | ![alt text](image.png)
72 | 
73 | ## 7. Reinforcement Learning
74 | The reinforcement learning data is generated from the SFT data that was not used for fine-tuning: the original answer is kept as "chosen", and an answer generated by the instruction fine-tuned model is used as "rejected", giving 100k pairs in total. The RLHF code follows [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat#readme) with some modifications.
75 | 
76 | 50k of the pairs are used to train the Reward Model (step 2); after 1 epoch its classification accuracy reaches 92%.
77 | 
78 | All 100k pairs are used for the Reinforcement Learning from Human Feedback training (step 3).
79 | 
80 | Judging from the final results, RLHF did not work well: as the number of training steps grows, the model's ability actually degrades (with a small number of steps it is still acceptable), and it refuses to answer more often, although it repeats itself less. The first suspicion was that the learning rate was too high; lowering it helps somewhat, but the results are still not good, so the data quality is probably the main problem.
81 | 
82 | The reinforcement learning script and code are in the rlhf folder.
83 | 
84 | Answers after relatively few training steps:
85 | ![alt text](image-1.png)
86 | 
87 | Answers after more training steps:
88 | ![alt text](image-2.png)
89 | 
90 | 
-------------------------------------------------------------------------------- /data_process/data_process.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import time 4 | from tqdm import tqdm 5 | import re 6 | import os 7 | import pandas as pd 8 | from datasketch import MinHash, MinHashLSH 9 | import random 10 | 11 | 12 | def process_baike(): 13 | input_file = './563w_baidubaike.json' 14 | output_file = './baidubaike_no_depulication.json' 15 | batch_size = 100000 16 | 17 | processed_lines = 0 18 | start_time = time.time() 19 | # 正则表达式模式匹配 [1]、[2]、[3]、[1-2] 等内容 20 | bracket_pattern = re.compile(r'\[\d+(-\d+)?\]') 21 | punctuation_pattern = re.compile(r'[。!?:]$') 22 | chinese_char_pattern = re.compile(r'[\u4e00-\u9fa5]') 23 | repeated_punctuation_pattern = re.compile(r'([。!?])\1+') 24 | whitespace_pattern = re.compile(r'\s+| +') 25 | 26 | 27 | def process_lines(lines, outfile): 28 | nonlocal processed_lines, start_time 29 | for line in lines: 30 | try: 31 | data = json.loads(line) 32 | text = "" 33 | title = data.get("title", "") 34 | summary = data.get("summary", "") 35 | if summary is None or summary.strip() == "": 36 | text = f"{title}。" 37 | elif summary.startswith(title): 38 | text = f"{summary}" 39 | if not punctuation_pattern.search(text): 40 | text += "。" 41 | else: 42 | text = f"{title},{summary}" 43 | if not punctuation_pattern.search(text): 44 | text += "。" 45 | skip_line = False 46 | sections = data.get("sections", []) 47 | for section in sections: 48 | section_title = section.get("title", "") 49 | if "重要参数" in section_title or "项目简介" in section_title or "产品简介" in section_title or "个人资料" in section_title or "个人简介" in section_title: 50 | skip_line = True 51 | break 52 | section_content = section.get("content", "") 53 | text += f"{section_title},{section_content}" 54 | if not punctuation_pattern.search(text): 55 | text += "。" 56 | 57 | chinese_chars = chinese_char_pattern.findall(text) 58 | if skip_line or len(chinese_chars) < 30 or text.count(' ') > 10: 59 | continue 60 | 61 | # 移除所有空白字符(包括全角空格) 62 | text = re.sub(whitespace_pattern, '', text) 63 | # 移除文本中的 [1]、[2]、[3] 等内容 64 | text = re.sub(bracket_pattern, '', text) 65 | # 合并重复的标点符号 66 | text = 
re.sub(repeated_punctuation_pattern, r'\1', text) 67 | new_data = { 68 | "text": text, 69 | "source": "baidubaike" 70 | } 71 | 72 | outfile.write(json.dumps(new_data, ensure_ascii=False) + '\n') 73 | processed_lines += 1 74 | 75 | except json.JSONDecodeError as e: 76 | print(f"Error decoding JSON: {e}") 77 | except Exception as e: 78 | print(f"Error processing line: {e}") 79 | 80 | # Print total processed lines and processing speed 81 | elapsed_time = time.time() - start_time 82 | speed = processed_lines / elapsed_time 83 | tqdm.write(f"Processed {processed_lines} lines at {speed:.2f} lines/second") 84 | 85 | with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile: 86 | batch_lines = [] 87 | for line in tqdm(infile, desc="Reading lines"): 88 | batch_lines.append(line) 89 | if len(batch_lines) == batch_size: 90 | process_lines(batch_lines, outfile) 91 | batch_lines = [] 92 | 93 | # Process remaining lines 94 | if batch_lines: 95 | process_lines(batch_lines, outfile) 96 | 97 | def process_cn_wiki(): 98 | input_file = "./wikipedia-cn-20230720-filtered.json" 99 | output_file = "./wiki_cn.json" 100 | 101 | with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile: 102 | data = json.load(infile) 103 | for entry in data: 104 | text = entry.get("completion", "") 105 | new_entry = { 106 | "text": text, 107 | "source": "wiki_cn" 108 | } 109 | json.dump(new_entry, outfile, ensure_ascii=False) 110 | outfile.write('\n') 111 | 112 | print("Processing complete. Output saved to", output_file) 113 | 114 | def process_skypile(): 115 | input_dir = "./SkyPile-50/" 116 | output_file = "./skypile.json" 117 | 118 | # 获取所有 .jsonl 文件列表 119 | jsonl_files = [f for f in os.listdir(input_dir) if f.endswith(".jsonl")] 120 | 121 | with open(output_file, 'w', encoding='utf-8') as outfile: 122 | # 初始化文件级别的进度条 123 | for filename in tqdm(jsonl_files, desc="Processing files"): 124 | input_file = os.path.join(input_dir, filename) 125 | with open(input_file, 'r', encoding='utf-8') as infile: 126 | for line in infile: 127 | try: 128 | data = json.loads(line) 129 | text = data.get("text", "") 130 | new_entry = { 131 | "text": text, 132 | "source": "skypile" 133 | } 134 | json.dump(new_entry, outfile, ensure_ascii=False) 135 | outfile.write('\n') 136 | except json.JSONDecodeError as e: 137 | print(f"Error decoding JSON in file {filename}: {e}") 138 | except Exception as e: 139 | print(f"Error processing line in file {filename}: {e}") 140 | 141 | print("Processing complete. 
Output saved to", output_file) 142 | 143 | def ngrams(text, n=2): 144 | return [text[i:i+n] for i in range(len(text)-n+1)] 145 | 146 | def process_line(line, num_perm): 147 | data = json.loads(line) 148 | text = data["text"] 149 | minhash = MinHash(num_perm=num_perm) 150 | for d in ngrams(text, 2): 151 | minhash.update(d.encode('utf-8')) 152 | return data, minhash 153 | 154 | def depulication_cn_file(input_file, output_file, threshold): 155 | # MinHash-LSH 参数 156 | num_perm = 128 157 | lsh = MinHashLSH(threshold, num_perm=num_perm) 158 | key_counter = 0 159 | 160 | retained_lines = 0 161 | processed_lines = 0 162 | 163 | # 创建进度条 164 | pbar = tqdm(desc="Processing lines", unit="line", mininterval=0.1) 165 | 166 | with open(output_file, 'w', encoding='utf-8') as out_file: 167 | start_time = time.time() 168 | with open(input_file, 'r', encoding='utf-8') as file: 169 | for line in file: 170 | data, minhash = process_line(line, num_perm) 171 | unique_key = f"{data['source']}_{key_counter}" 172 | key_counter += 1 173 | if not lsh.query(minhash): 174 | lsh.insert(unique_key, minhash) 175 | json.dump(data, out_file, ensure_ascii=False) 176 | out_file.write('\n') 177 | retained_lines += 1 178 | processed_lines += 1 179 | pbar.update(1) 180 | elapsed_time = time.time() - start_time 181 | lines_per_second = processed_lines / elapsed_time if elapsed_time > 0 else 0 182 | pbar.set_postfix({"Retained": retained_lines, "Processed": processed_lines, "Speed": f"{lines_per_second:.2f} lines/sec"}) 183 | 184 | # 关闭进度条 185 | pbar.close() 186 | 187 | def depulication_cn_files(): 188 | # 定义路径 189 | input_dir = "/home/" 190 | output_file = "/home/deduplicated_cn_data.json" 191 | # MinHash-LSH 参数 192 | num_perm = 128 193 | lsh = MinHashLSH(threshold=0.6, num_perm=num_perm) 194 | key_counter = 0 195 | 196 | retained_lines = 0 197 | processed_lines = 0 198 | 199 | # 创建进度条 200 | pbar = tqdm(desc="Processing lines", unit="line", mininterval=0.1) 201 | 202 | with open(output_file, 'w', encoding='utf-8') as out_file: 203 | start_time = time.time() 204 | for filename in os.listdir(input_dir): 205 | if filename.endswith(".json"): 206 | file_path = os.path.join(input_dir, filename) 207 | with open(file_path, 'r', encoding='utf-8') as file: 208 | for line in file: 209 | data, minhash = process_line(line, num_perm) 210 | unique_key = f"{data['source']}_{key_counter}" 211 | key_counter += 1 212 | if not lsh.query(minhash): 213 | lsh.insert(unique_key, minhash) 214 | json.dump(data, out_file, ensure_ascii=False) 215 | out_file.write('\n') 216 | retained_lines += 1 217 | processed_lines += 1 218 | pbar.update(1) 219 | elapsed_time = time.time() - start_time 220 | lines_per_second = processed_lines / elapsed_time if elapsed_time > 0 else 0 221 | pbar.set_postfix({"Retained": retained_lines, "Processed": processed_lines, "Speed": f"{lines_per_second:.2f} lines/sec"}) 222 | 223 | # 关闭进度条 224 | pbar.close() 225 | 226 | 227 | def merge_data(): 228 | input_files = [ 229 | './baidubaike.json', 230 | './wiki_cn.json', 231 | './skypile.json' 232 | ] 233 | output_file = './pretrain.json' 234 | sampling_ratios = [1, 1, 0.57] # 分别从每个文件中抽取100%、100%和57%的数据 235 | 236 | assert len(input_files) == len(sampling_ratios), "输入文件数和抽样比例数不匹配" 237 | 238 | line_counts = {} 239 | 240 | with open(output_file, 'w', encoding='utf-8') as out_f: 241 | for file, ratio in zip(input_files, sampling_ratios): 242 | line_counts[file] = 0 243 | with open(file, 'r', encoding='utf-8') as in_f: 244 | for line in in_f: 245 | if random.random() <= ratio: 246 | data = 
json.loads(line.strip()) 247 | out_f.write(json.dumps(data, ensure_ascii=False) + '\n') 248 | line_counts[file] += 1 249 | 250 | for file, count in line_counts.items(): 251 | print(f"{file} 写入了 {count} 行") 252 | 253 | 254 | def generate_train_tokenizer_data(): 255 | input_files = [ 256 | './baidubaike.json', 257 | './wiki_cn.json', 258 | './skypile.json' 259 | ] 260 | output_file = './train_tokenizer.json' 261 | sampling_ratios = [1, 0.5, 0.02] # 分别从每个文件中抽取100%、50%和2%的数据 262 | 263 | assert len(input_files) == len(sampling_ratios), "输入文件数和抽样比例数不匹配" 264 | 265 | line_counts = {} 266 | 267 | with open(output_file, 'w', encoding='utf-8') as out_f: 268 | for file, ratio in zip(input_files, sampling_ratios): 269 | line_counts[file] = 0 270 | with open(file, 'r', encoding='utf-8') as in_f: 271 | for line in in_f: 272 | if random.random() < ratio: 273 | data = json.loads(line.strip()) 274 | out_f.write(json.dumps(data, ensure_ascii=False) + '\n') 275 | line_counts[file] += 1 276 | 277 | for file, count in line_counts.items(): 278 | print(f"{file} 写入了 {count} 行") 279 | 280 | def sft_process_firefly(): 281 | input_data_path = './firefly-cn-train-1.1M.jsonl' 282 | output_data_path = './processed_firefly.jsonl' 283 | 284 | line_count = 0 285 | 286 | with open(input_data_path, 'r', encoding='utf-8') as infile, open(output_data_path, 'w', encoding='utf-8') as outfile: 287 | for line in infile: 288 | data = json.loads(line) 289 | conversations = data.get("conversations", []) 290 | if len(conversations) == 2 and conversations[0].get("from") == "human" and conversations[1].get("from") == "gpt": 291 | human_value = conversations[0].get("value", "") 292 | if len(human_value) > 5: 293 | new_data = { 294 | "messages": [ 295 | {"from": "user", "value": human_value}, 296 | {"from": "assistant", "value": conversations[1].get("value", "")} 297 | ] 298 | } 299 | outfile.write(json.dumps(new_data, ensure_ascii=False) + '\n') 300 | line_count += 1 301 | 302 | print(f"Total lines written: {line_count}") 303 | 304 | def process_sft_line(line, num_perm): 305 | data = json.loads(line) 306 | messages = data.get("messages", []) 307 | combined_text = ''.join([msg['value'] for msg in messages if msg['from'] in ['user', 'assistant']]) 308 | minhash = MinHash(num_perm=num_perm) 309 | for d in ngrams(combined_text, 2): 310 | minhash.update(d.encode('utf-8')) 311 | return data, minhash 312 | 313 | def depulication_cn_firefly(): 314 | # 定义路径 315 | input_file = "./processed_firefly.jsonl" 316 | output_file = "./depulication_firefly.jsonl" 317 | # MinHash-LSH 参数 318 | num_perm = 128 319 | lsh = MinHashLSH(threshold=0.4, num_perm=num_perm) 320 | key_counter = 0 321 | 322 | retained_lines = 0 323 | processed_lines = 0 324 | 325 | # 创建进度条 326 | pbar = tqdm(desc="Processing lines", unit="line", mininterval=0.1) 327 | 328 | with open(output_file, 'w', encoding='utf-8') as out_file: 329 | start_time = time.time() 330 | with open(input_file, 'r', encoding='utf-8') as file: 331 | for line in file: 332 | data, minhash = process_sft_line(line, num_perm) 333 | unique_key = f"{key_counter}" 334 | key_counter += 1 335 | if not lsh.query(minhash): 336 | lsh.insert(unique_key, minhash) 337 | json.dump(data, out_file, ensure_ascii=False) 338 | out_file.write('\n') 339 | retained_lines += 1 340 | processed_lines += 1 341 | pbar.update(1) 342 | elapsed_time = time.time() - start_time 343 | lines_per_second = processed_lines / elapsed_time if elapsed_time > 0 else 0 344 | pbar.set_postfix({"Retained": retained_lines, "Processed": processed_lines, 
"Speed": f"{lines_per_second:.2f} lines/sec"}) 345 | 346 | # 关闭进度条 347 | pbar.close() 348 | 349 | def generate_sft_rlfh_data(): 350 | cn_firefly_file = './depulication_firefly.jsonl' 351 | ruozhiba_file = './ruozhiout_qa_cn.jsonl' 352 | total_count = 400000 353 | sft_count = 300000 354 | #rlhf_count = total_count - sft_count 355 | output_file_sft = './sft.jsonl' 356 | output_file_rlfh = './rlhf.jsonl' 357 | # Load data from files 358 | with open(ruozhiba_file, 'r', encoding='utf-8') as f: 359 | ruozhiba_data = [json.loads(line) for line in f] 360 | 361 | with open(cn_firefly_file, 'r', encoding='utf-8') as f: 362 | cn_firefly_data = [json.loads(line) for line in f] 363 | 364 | # Extract conversations from ruozhiba_data 365 | sft_data = [] 366 | for item in ruozhiba_data: 367 | for conversation in item['conversations']: 368 | if conversation['from'] == 'human': 369 | prompt = conversation['value'] 370 | elif conversation['from'] == 'gpt': 371 | answer = conversation['value'] 372 | sft_data.append({'prompt': prompt, 'answer': answer}) 373 | 374 | # Calculate the number of entries to pick from cn_firefly_data 375 | ruozhiba_count = len(sft_data) 376 | cn_firefly_count = total_count - ruozhiba_count 377 | 378 | # Randomly select entries from cn_firefly_data 379 | random_cn_firefly_data = random.sample(cn_firefly_data, cn_firefly_count) 380 | 381 | # Extract messages from cn_firefly_data 382 | for item in random_cn_firefly_data: 383 | for message in item['messages']: 384 | if message['from'] == 'user': 385 | prompt = message['value'] 386 | elif message['from'] == 'assistant': 387 | answer = message['value'] 388 | sft_data.append({'prompt': prompt, 'answer': answer}) 389 | 390 | # Randomly shuffle the data 391 | random.shuffle(sft_data) 392 | 393 | # Split the data into two parts 394 | #split_index = len(sft_data) // 2 395 | sft_part = sft_data[:sft_count] 396 | rlhf_part = sft_data[sft_count:] 397 | 398 | # Write data to output files 399 | with open(output_file_sft, 'w', encoding='utf-8') as f: 400 | for item in sft_part: 401 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 402 | 403 | with open(output_file_rlfh, 'w', encoding='utf-8') as f: 404 | for item in rlhf_part: 405 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 406 | 407 | def main(): 408 | # process_baike()#保留2,763,469行 409 | #process_cn_wiki() 410 | #process_skypile() 411 | #百度百科去重 412 | # depulication_cn_file('./baidubaike_no_depulication.json', './baidubaike.json', 0.4) 413 | # merge_data() 414 | # generate_train_tokenizer_data() 415 | #sft_process_firefly() 416 | #depulication_cn_firefly() 417 | generate_sft_rlfh_data() 418 | 419 | 420 | if __name__ == '__main__': 421 | main() -------------------------------------------------------------------------------- /image-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/image-1.png -------------------------------------------------------------------------------- /image-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/image-2.png -------------------------------------------------------------------------------- /image.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/image.png -------------------------------------------------------------------------------- /image.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/image.webp -------------------------------------------------------------------------------- /loss_vs_time_hours.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/loss_vs_time_hours.png -------------------------------------------------------------------------------- /loss_vs_tokens_millions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/loss_vs_tokens_millions.png -------------------------------------------------------------------------------- /pretrain/ds_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "fp16": { 3 | "enabled": "auto", 4 | "loss_scale": 0, 5 | "loss_scale_window": 1000, 6 | "initial_scale_power": 16, 7 | "hysteresis": 2, 8 | "min_loss_scale": 1 9 | }, 10 | 11 | "optimizer": { 12 | "type": "AdamW", 13 | "params": { 14 | "lr": "auto", 15 | "betas": "auto", 16 | "eps": "auto", 17 | "weight_decay": "auto" 18 | } 19 | }, 20 | 21 | "scheduler": { 22 | "type": "WarmupDecayLR", 23 | "params": { 24 | "warmup_min_lr": 1e-5, 25 | "warmup_max_lr": "auto", 26 | "warmup_num_steps": "auto", 27 | "total_num_steps": "auto" 28 | } 29 | }, 30 | 31 | "zero_optimization": { 32 | "stage": 2, 33 | "allgather_partitions": true, 34 | "allgather_bucket_size": 2e8, 35 | "overlap_comm": true, 36 | "reduce_scatter": true, 37 | "reduce_bucket_size": 2e8, 38 | "contiguous_gradients": true 39 | }, 40 | 41 | "gradient_accumulation_steps": "auto", 42 | "gradient_clipping": "auto", 43 | "steps_per_print": 2000, 44 | "train_batch_size": "auto", 45 | "train_micro_batch_size_per_gpu": "auto", 46 | "wall_clock_breakdown": false 47 | } -------------------------------------------------------------------------------- /pretrain/generate_pretrain_data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import time 4 | import random 5 | import numpy as np 6 | from multiprocessing import Process, Manager 7 | from transformers import AutoTokenizer 8 | 9 | def split_file(data_path, num_splits=20): 10 | file_handles = [open(f"{data_path}.part{i}", 'w', encoding='utf-8') for i in range(num_splits)] 11 | 12 | try: 13 | total_lines = 0 14 | with open(data_path, 'r', encoding='utf-8') as f: 15 | for i, line in enumerate(f): 16 | part_idx = i % num_splits 17 | file_handles[part_idx].write(line) 18 | total_lines += 1 19 | if total_lines % 1000 == 0: # 每处理1000行打印一次进度 20 | print(f"Processed lines: {total_lines}") 21 | finally: 22 | for handle in file_handles: 23 | handle.close() 24 | print(f"Total lines processed: {total_lines}") 25 | 26 | def process_file(part_path, bin_path, tokenizer, ratio, result_dict): 27 | source_token_counts = {} 28 | total_token_count = 0 29 | line_count = 0 30 | start_time = time.time() 31 | 32 | with open(part_path, 'r', encoding='utf-8') as f, open(bin_path, 'wb') as f2: 33 | for line in f: 34 | if random.random() > ratio: 35 | 
continue 36 | data = json.loads(line) 37 | text = data['text'] 38 | source = data['source'] 39 | text_id = tokenizer.encode(text, add_special_tokens=False) 40 | text_id.append(tokenizer.eos_token_id) 41 | 42 | token_count = len(text_id) 43 | if source not in source_token_counts: 44 | source_token_counts[source] = 0 45 | source_token_counts[source] += token_count 46 | 47 | total_token_count += token_count 48 | 49 | arr = np.array(text_id, dtype=np.uint16) 50 | f2.write(arr.tobytes()) 51 | 52 | line_count += 1 53 | elapsed_time = time.time() - start_time 54 | print(f"Processed lines: {line_count}, Time elapsed: {elapsed_time:.2f} seconds") 55 | 56 | result_dict[part_path] = (source_token_counts, total_token_count) 57 | 58 | def merge_bins(bin_paths, final_bin_path, chunk_size=10*1024*1024): 59 | with open(final_bin_path, 'wb') as f_out: 60 | for bin_path in bin_paths: 61 | with open(bin_path, 'rb') as f_in: 62 | while True: 63 | chunk = f_in.read(chunk_size) 64 | if not chunk: 65 | break 66 | f_out.write(chunk) 67 | 68 | def main(data_path, bin_path, ratio=1): 69 | tokenizer_path = './miaomiao_tokenizer' 70 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,trust_remote_code=True) # 主进程加载tokenizer 71 | num_splits = 20 72 | 73 | # Split the file into parts 74 | split_file(data_path, num_splits) 75 | 76 | manager = Manager() 77 | result_dict = manager.dict() 78 | 79 | processes = [] 80 | bin_paths = [f"{bin_path}.part{i}.bin" for i in range(num_splits)] 81 | 82 | for i in range(num_splits): 83 | part_path = f"{data_path}.part{i}" 84 | bin_part_path = bin_paths[i] 85 | p = Process(target=process_file, args=(part_path, bin_part_path, tokenizer, ratio, result_dict)) 86 | processes.append(p) 87 | p.start() 88 | 89 | for p in processes: 90 | p.join() 91 | 92 | # Merge binary files 93 | merge_bins(bin_paths, bin_path) 94 | 95 | # Output combined statistics 96 | combined_source_token_counts = {} 97 | combined_total_token_count = 0 98 | 99 | for source_token_counts, total_token_count in result_dict.values(): 100 | for source, count in source_token_counts.items(): 101 | if source not in combined_source_token_counts: 102 | combined_source_token_counts[source] = 0 103 | combined_source_token_counts[source] += count 104 | combined_total_token_count += total_token_count 105 | 106 | print("Token counts by source:", combined_source_token_counts) 107 | print("Total token count:", combined_total_token_count) 108 | 109 | if __name__ == "__main__": 110 | #一共15M行 111 | data_path = "./pretrain_data_train.json" 112 | bin_path = "./pretrain_data_train.bin" 113 | main(data_path, bin_path, ratio=1) 114 | -------------------------------------------------------------------------------- /pretrain/model/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "model_type": "miaomiao", 3 | "architectures": [ 4 | "MiaomiaoModel" 5 | ], 6 | "auto_map": { 7 | "AutoConfig": "configuration_miaomiao.MiaomiaoConfig", 8 | "AutoModel": "modeling_miaomiao.MiaomiaoModel", 9 | "AutoModelForCausalLM": "modeling_miaomiao.MiaomiaoForCausalLM" 10 | }, 11 | "attention_dropout": 0.0, 12 | "bos_token_id": 32005, 13 | "eos_token_id": 32005, 14 | "hidden_act": "silu", 15 | "hidden_size": 512, 16 | "initializer_range": 0.02, 17 | "intermediate_size": 2752, 18 | "max_position_embeddings": 131072, 19 | "max_window_layers": 28, 20 | "num_attention_heads": 16, 21 | "num_hidden_layers": 24, 22 | "num_key_value_heads": 16, 23 | "rms_norm_eps": 1e-06, 24 | "rope_theta": 1000000.0, 25 | 
"sliding_window": 131072, 26 | "tie_word_embeddings": false, 27 | "torch_dtype": "bfloat16", 28 | "transformers_version": "4.37.2", 29 | "use_cache": true, 30 | "use_sliding_window": false, 31 | "vocab_size": 32006 32 | } -------------------------------------------------------------------------------- /pretrain/model/configuration_miaomiao.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | """ Miaomiao model configuration""" 4 | 5 | from transformers.configuration_utils import PretrainedConfig 6 | from transformers.utils import logging 7 | 8 | 9 | logger = logging.get_logger(__name__) 10 | 11 | 12 | class MiaomiaoConfig(PretrainedConfig): 13 | 14 | model_type = "miaomiao" 15 | keys_to_ignore_at_inference = ["past_key_values"] 16 | 17 | def __init__( 18 | self, 19 | vocab_size=32000, 20 | hidden_size=4096, 21 | intermediate_size=11008, 22 | num_hidden_layers=32, 23 | num_attention_heads=32, 24 | num_key_value_heads=None, 25 | hidden_act="silu", 26 | max_position_embeddings=2048, 27 | initializer_range=0.02, 28 | rms_norm_eps=1e-6, 29 | use_cache=True, 30 | pad_token_id=None, 31 | bos_token_id=1, 32 | eos_token_id=2, 33 | pretraining_tp=1, 34 | tie_word_embeddings=False, 35 | rope_theta=10000.0, 36 | rope_scaling=None, 37 | attention_bias=False, 38 | attention_dropout=0.0, 39 | mlp_bias=False, 40 | _attn_implementation="eager", 41 | **kwargs, 42 | ): 43 | self.vocab_size = vocab_size 44 | self.max_position_embeddings = max_position_embeddings 45 | self.hidden_size = hidden_size 46 | self.intermediate_size = intermediate_size 47 | self.num_hidden_layers = num_hidden_layers 48 | self.num_attention_heads = num_attention_heads 49 | 50 | # for backward compatibility 51 | if num_key_value_heads is None: 52 | num_key_value_heads = num_attention_heads 53 | 54 | self.num_key_value_heads = num_key_value_heads 55 | self.hidden_act = hidden_act 56 | self.initializer_range = initializer_range 57 | self.rms_norm_eps = rms_norm_eps 58 | self.pretraining_tp = pretraining_tp 59 | self.use_cache = use_cache 60 | self.rope_theta = rope_theta 61 | self.rope_scaling = rope_scaling 62 | self._rope_scaling_validation() 63 | self.attention_bias = attention_bias 64 | self.attention_dropout = attention_dropout 65 | self.mlp_bias = mlp_bias 66 | self._attn_implementation = _attn_implementation 67 | super().__init__( 68 | pad_token_id=pad_token_id, 69 | bos_token_id=bos_token_id, 70 | eos_token_id=eos_token_id, 71 | tie_word_embeddings=tie_word_embeddings, 72 | **kwargs, 73 | ) 74 | 75 | def _rope_scaling_validation(self): 76 | """ 77 | Validate the `rope_scaling` configuration. 
78 | """ 79 | if self.rope_scaling is None: 80 | return 81 | 82 | if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2: 83 | raise ValueError( 84 | "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, " f"got {self.rope_scaling}" 85 | ) 86 | rope_scaling_type = self.rope_scaling.get("type", None) 87 | rope_scaling_factor = self.rope_scaling.get("factor", None) 88 | if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]: 89 | raise ValueError( 90 | f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}" 91 | ) 92 | if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0: 93 | raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}") 94 | -------------------------------------------------------------------------------- /pretrain/pretrain.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass 2 | from typing import Optional 3 | from transformers.utils.versions import require_version 4 | import transformers 5 | from model.modeling_miaomiao import MiaomiaoForCausalLM 6 | from model.configuration_miaomiao import MiaomiaoConfig 7 | from pretrain_dataset import PretrainDataset 8 | from transformers import ( 9 | CONFIG_MAPPING, 10 | MODEL_FOR_CAUSAL_LM_MAPPING, 11 | AutoConfig, 12 | AutoModelForCausalLM, 13 | HfArgumentParser, 14 | Trainer, 15 | TrainingArguments, 16 | is_torch_tpu_available, 17 | set_seed, 18 | ) 19 | from transformers.trainer_callback import TrainerCallback 20 | import torch 21 | import json 22 | import os 23 | import logging 24 | import glob 25 | import random 26 | import numpy as np 27 | 28 | # 设置随机种子 29 | def set_seed(seed): 30 | random.seed(seed) 31 | np.random.seed(seed) 32 | torch.manual_seed(seed) 33 | if torch.cuda.is_available(): 34 | torch.cuda.manual_seed_all(seed) 35 | 36 | 37 | class LoggingCallback(TrainerCallback): 38 | def __init__(self, logger): 39 | self.logger = logger 40 | 41 | def on_log(self, args, state, control, logs=None, **kwargs): 42 | if logs is not None: 43 | self.logger.info(logs) 44 | 45 | 46 | @dataclass 47 | class ModelArguments: 48 | config_file: Optional[str] = None 49 | torch_dtype: Optional[str] = None 50 | 51 | @dataclass 52 | class DataTrainingArguments: 53 | train_dataset_dir: Optional[str] = None 54 | block_size: Optional[int] = None 55 | overwrite_cache: bool = False 56 | preprocessing_num_workers: Optional[int] = None 57 | 58 | 59 | @dataclass 60 | class MyTrainingArguments(TrainingArguments): 61 | modules_to_save: Optional[str] = None 62 | 63 | 64 | # 模型初始化方式 65 | init_from: Optional[str] = "scratch" 66 | use_device: Optional[str] = 'cuda' 67 | use_compile: Optional[bool] = False 68 | log_file: Optional[str] = None 69 | nnodes: Optional[int] = None 70 | nproc_per_node: Optional[int] = None 71 | 72 | def load_config(file_path): 73 | with open(file_path, 'r', encoding='utf-8') as file: 74 | config = json.load(file) 75 | return config 76 | 77 | def init_model(training_args, model_args): 78 | if training_args.init_from == "scratch": 79 | config = MiaomiaoConfig.from_pretrained(model_args.config_file) 80 | print(config) 81 | model = MiaomiaoForCausalLM(config) 82 | return model 83 | 84 | 85 | 86 | def my_data_collator(input_datas): 87 | # 将所有样本的输入 (`X`) 和标签 (`Y`) 分别堆叠 88 | input_ids = torch.stack([input_data[0] for input_data in input_datas]) 89 | labels = torch.stack([input_data[1] 
for input_data in input_datas]) 90 | 91 | # 返回一个字典,包含模型需要的键和值 92 | return { 93 | "input_ids": input_ids, 94 | "labels": labels 95 | } 96 | 97 | def main(): 98 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, MyTrainingArguments)) 99 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 100 | 101 | # 设置日志记录器 102 | logging.basicConfig(filename=training_args.log_file, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 103 | logger = logging.getLogger(__name__) 104 | # 创建文件处理器,并设置写模式 105 | file_handler = logging.FileHandler(training_args.log_file, mode='w') 106 | file_handler.setLevel(logging.INFO) 107 | file_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') 108 | file_handler.setFormatter(file_formatter) 109 | logger.addHandler(file_handler) 110 | # 输出日志到控制台(可选) 111 | console_handler = logging.StreamHandler() 112 | console_handler.setLevel(logging.INFO) 113 | formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') 114 | console_handler.setFormatter(formatter) 115 | logger.addHandler(console_handler) 116 | 117 | set_seed(training_args.seed) 118 | 119 | model=init_model(training_args, model_args) 120 | model.to(training_args.use_device) 121 | 122 | if training_args.use_compile: 123 | model = torch.compile(model) 124 | 125 | 126 | total_params = sum(p.numel() for p in model.parameters()) 127 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 128 | logger.info(f"总参数: {total_params}") 129 | logger.info(f"可训练参数: {trainable_params}") 130 | 131 | logger.info(f"torch_dtype:{model_args.torch_dtype}") 132 | logger.info(f"training_args.bf16: {training_args.bf16}") 133 | 134 | 135 | train_data_path_list = glob.glob(os.path.join(data_args.train_dataset_dir, '*.bin')) 136 | train_ds = PretrainDataset(train_data_path_list, max_length=data_args.block_size, memmap=True, seed=training_args.seed) 137 | logger.info(f"Train dataset size: {len(train_ds)}") 138 | 139 | trainer = Trainer( 140 | model=model, 141 | args=training_args, 142 | train_dataset=train_ds, 143 | data_collator=my_data_collator, 144 | callbacks=[LoggingCallback(logger)], # 添加自定义回调 145 | ) 146 | print(training_args.bf16) 147 | 148 | trainer.train() 149 | 150 | 151 | 152 | 153 | if __name__ == "__main__": 154 | main() 155 | -------------------------------------------------------------------------------- /pretrain/pretrain.sh: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | lr=4e-4 6 | block_size=1024 7 | 8 | per_device_train_batch_size=24 9 | gradient_accumulation_steps=1 10 | config_file=./model/config.json 11 | train_dataset_dir=./pretrain_data 12 | log_file=./log/pretrain1.log 13 | output_dir=./output 14 | deepspeed_config_file=./ds_config.json 15 | random_seed=42 16 | 17 | torchrun --nnodes 1 --nproc_per_node 2 pretrain.py \ 18 | --deepspeed ${deepspeed_config_file} \ 19 | --config_file ${config_file} \ 20 | --train_dataset_dir ${train_dataset_dir} \ 21 | --per_device_train_batch_size ${per_device_train_batch_size} \ 22 | --do_train \ 23 | --bf16 True\ 24 | --torch_dtype bfloat16 \ 25 | --seed ${random_seed} \ 26 | --num_train_epochs 1 \ 27 | --logging_strategy steps \ 28 | --logging_steps 100 \ 29 | --log_file ${log_file} \ 30 | --logging_first_step True \ 31 | --adam_beta1 0.9 \ 32 | --adam_beta1 0.95 \ 33 | --lr_scheduler_type cosine \ 34 | --learning_rate ${lr} \ 35 | --warmup_ratio 0.05 \ 36 | --weight_decay 0.01 \ 37 | --save_strategy steps \ 38 | 
--save_total_limit 1 \ 39 | --save_steps 0.01 \ 40 | --gradient_accumulation_steps ${gradient_accumulation_steps} \ 41 | --block_size ${block_size} \ 42 | --output_dir ${output_dir} \ 43 | --overwrite_output_dir \ 44 | --ddp_timeout 30000 \ 45 | --init_from scratch \ 46 | --use_device cuda \ 47 | --use_compile False \ -------------------------------------------------------------------------------- /pretrain/pretrain_dataset.py: -------------------------------------------------------------------------------- 1 | 2 | import random 3 | import pandas as pd 4 | import numpy as np 5 | from torch.utils.data import Dataset,DataLoader 6 | import torch 7 | from sklearn.model_selection import train_test_split 8 | 9 | class PretrainDataset(Dataset): 10 | def __init__(self, data_path_lst, max_length=512, memmap=False, seed=42): 11 | super().__init__() 12 | 13 | self.max_length = max_length 14 | self.seed = seed 15 | 16 | if memmap: 17 | with open(data_path_lst[0], 'rb') as f: 18 | nbytes = f.seek(0, 2) 19 | flen = nbytes // np.dtype('int16').itemsize # 使用 int16 数据类型 20 | self.data = np.memmap(data_path_lst[0], dtype=np.dtype('int16'), shape=(flen // max_length, max_length), mode='r') 21 | else: 22 | data_lst = [] 23 | for data_path in data_path_lst: 24 | with open(data_path, 'rb') as f: 25 | data = np.fromfile(f, dtype=np.int16) # 使用 int16 数据类型 26 | data_lst.append(data) 27 | data = np.concatenate(data_lst) 28 | data = data[:max_length * (len(data) // max_length)] 29 | self.data = data.reshape(-1, max_length) 30 | 31 | self.indices = np.arange(len(self.data)) 32 | np.random.shuffle(self.indices) 33 | print("memmap:{} train data.shape:{}".format(memmap, self.data.shape)) 34 | print("downloading finished.....") 35 | 36 | def __len__(self): 37 | return self.data.shape[0] 38 | 39 | def shuffle_indices(self): 40 | np.random.seed(self.seed) 41 | 42 | def __getitem__(self, index: int): 43 | index = self.indices[index] 44 | sample = self.data[index] 45 | X = np.array(sample).astype(np.int64) 46 | Y = np.array(sample).astype(np.int64) 47 | return torch.from_numpy(X), torch.from_numpy(Y) 48 | -------------------------------------------------------------------------------- /pretrain/test_pretrain_model.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | 3 | 4 | device = "cuda" # the device to load the model onto 5 | 6 | model = AutoModelForCausalLM.from_pretrained( 7 | './pretrain/model', 8 | torch_dtype="auto", 9 | device_map="auto", 10 | trust_remote_code=True 11 | ) 12 | tokenizer = AutoTokenizer.from_pretrained('./miaomiao_tokenizer', trust_remote_code=True) 13 | text = "床前明月光," 14 | model_inputs = tokenizer([text], return_tensors="pt").to(device) 15 | print(model_inputs) 16 | generated_ids = model.generate( 17 | **model_inputs, 18 | max_new_tokens=1024 19 | ) 20 | print(generated_ids) 21 | generated_ids = [ 22 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 23 | ] 24 | print(generated_ids) 25 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 26 | print(response) 27 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==2.1.0 2 | accelerate==0.32.1 3 | aiohttp==3.9.5 4 | aiosignal==1.3.1 5 | annotated-types==0.7.0 6 | attrs==23.2.0 7 | blinker==1.4 8 | certifi==2024.7.4 9 | 
charset-normalizer==3.3.2 10 | contourpy==1.2.1 11 | cryptography==3.4.8 12 | cycler==0.12.1 13 | datasets==2.20.0 14 | datasketch==1.6.5 15 | dbus-python==1.2.18 16 | deepspeed==0.14.4 17 | dill==0.3.8 18 | distro==1.7.0 19 | distro-info==1.1+ubuntu0.2 20 | einops==0.8.0 21 | filelock==3.15.4 22 | flash-attn==2.5.9.post1 23 | fonttools==4.53.1 24 | frozenlist==1.4.1 25 | fsspec==2024.5.0 26 | grpcio==1.64.1 27 | hjson==3.1.0 28 | httplib2==0.20.2 29 | huggingface-hub==0.23.4 30 | idna==3.7 31 | importlib-metadata==4.6.4 32 | jeepney==0.7.1 33 | Jinja2==3.1.4 34 | joblib==1.4.2 35 | keyring==23.5.0 36 | kiwisolver==1.4.5 37 | launchpadlib==1.10.16 38 | lazr.restfulclient==0.14.4 39 | lazr.uri==1.0.6 40 | Markdown==3.6 41 | MarkupSafe==2.1.5 42 | matplotlib==3.9.1 43 | more-itertools==8.10.0 44 | mpmath==1.3.0 45 | multidict==6.0.5 46 | multiprocess==0.70.16 47 | networkx==3.3 48 | ninja==1.11.1.1 49 | numpy==1.26.4 50 | nvidia-cublas-cu12==12.1.3.1 51 | nvidia-cuda-cupti-cu12==12.1.105 52 | nvidia-cuda-nvrtc-cu12==12.1.105 53 | nvidia-cuda-runtime-cu12==12.1.105 54 | nvidia-cudnn-cu12==8.9.2.26 55 | nvidia-cufft-cu12==11.0.2.54 56 | nvidia-curand-cu12==10.3.2.106 57 | nvidia-cusolver-cu12==11.4.5.107 58 | nvidia-cusparse-cu12==12.1.0.106 59 | nvidia-ml-py==12.555.43 60 | nvidia-nccl-cu12==2.20.5 61 | nvidia-nvjitlink-cu12==12.5.82 62 | nvidia-nvtx-cu12==12.1.105 63 | oauthlib==3.2.0 64 | packaging==24.1 65 | pandas==2.2.2 66 | pillow==10.4.0 67 | pip==24.1.2 68 | protobuf==4.25.3 69 | psutil==6.0.0 70 | py-cpuinfo==9.0.0 71 | pyarrow==16.1.0 72 | pyarrow-hotfix==0.6 73 | pydantic==2.8.2 74 | pydantic_core==2.20.1 75 | PyGObject==3.42.1 76 | PyJWT==2.3.0 77 | pyparsing==3.1.2 78 | python-apt==2.4.0+ubuntu3 79 | python-dateutil==2.9.0.post0 80 | pytz==2024.1 81 | PyYAML==6.0.1 82 | regex==2024.5.15 83 | requests==2.32.3 84 | safetensors==0.4.3 85 | scikit-learn==1.5.1 86 | scipy==1.14.0 87 | SecretStorage==3.3.1 88 | setuptools==70.3.0 89 | six==1.16.0 90 | sympy==1.13.0 91 | tensorboard==2.17.0 92 | tensorboard-data-server==0.7.2 93 | threadpoolctl==3.5.0 94 | tiktoken==0.7.0 95 | tokenizers==0.19.1 96 | torch==2.3.1 97 | torchaudio==2.3.1 98 | torchvision==0.18.1 99 | tqdm==4.66.4 100 | transformers==4.42.3 101 | triton==2.3.1 102 | typing_extensions==4.12.2 103 | tzdata==2024.1 104 | unattended-upgrades==0.1 105 | urllib3==2.2.2 106 | wadllib==1.3.6 107 | Werkzeug==3.0.3 108 | wheel==0.43.0 109 | xxhash==3.4.1 110 | yarl==1.9.4 111 | zipp==1.0.0 112 | -------------------------------------------------------------------------------- /rlhf/rlhf/__pycache__/ppo_trainer.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/rlhf/__pycache__/ppo_trainer.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/rlhf/__pycache__/rlhf_engine.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/rlhf/__pycache__/rlhf_engine.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/rlhf/ppo_trainer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 
2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | # DeepSpeed Team 5 | import torch 6 | import torch.nn.functional as F 7 | import time 8 | import deepspeed 9 | from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus 10 | from deepspeed.accelerator import get_accelerator 11 | 12 | from utils.utils import print_rank_0 13 | 14 | 15 | def print_all_ranks(tag, value, rank): 16 | world_size = torch.distributed.get_world_size() 17 | all_tensor = torch.zeros(world_size, dtype=torch.float32).to( 18 | get_accelerator().current_device_name()) 19 | all_tensor[rank] = value 20 | torch.distributed.all_reduce(all_tensor, op=torch.distributed.ReduceOp.SUM) 21 | print_rank_0(f'{tag} {all_tensor}', rank) 22 | 23 | 24 | def get_model_norm(model): 25 | with torch.no_grad(): 26 | total = 0.0 27 | for param in model.parameters(): 28 | should_gather = hasattr( 29 | param, 30 | 'ds_id') and param.ds_status == ZeroParamStatus.NOT_AVAILABLE 31 | with deepspeed.zero.GatheredParameters(param, 32 | enabled=should_gather): 33 | total += float(param.float().norm()) 34 | 35 | return total 36 | 37 | 38 | def gather_log_probs(logits, labels): 39 | log_probs = F.log_softmax(logits, dim=-1) 40 | log_probs_labels = log_probs.gather(dim=-1, index=labels.unsqueeze(-1)) 41 | return log_probs_labels.squeeze(-1) 42 | 43 | 44 | class DeepSpeedPPOTrainer(): 45 | 46 | def __init__(self, rlhf_engine, args): 47 | self.rlhf_engine = rlhf_engine 48 | self.actor_model = self.rlhf_engine.actor 49 | self.critic_model = self.rlhf_engine.critic 50 | self.ref_model = self.rlhf_engine.ref 51 | self.reward_model = self.rlhf_engine.reward 52 | self.tokenizer = self.rlhf_engine.tokenizer 53 | self.args = args 54 | self.max_answer_seq_len = args.max_answer_seq_len 55 | self.end_of_conversation_token_id = self.tokenizer.eos_token_id 56 | self.z3_enabled = args.actor_zero_stage == 3 57 | self.compute_fp32_loss = self.args.compute_fp32_loss 58 | 59 | # In case the generated experience is not valid (too short), we use the last valid 60 | # generated experience. Alternatively, we can skip the step (on all workers). 61 | # For now, use the last valid experience which is a simpler solution 62 | self.last_generated_experience = None 63 | 64 | # Those value can be changed 65 | self.kl_ctl = 0.1 66 | self.clip_reward_value = 5 67 | self.cliprange = 0.2 68 | self.cliprange_value = 0.2 69 | self.gamma = 1.0 70 | self.lam = 0.95 71 | self.generate_time = 0.0 72 | 73 | def _generate_sequence(self, prompts, mask, step): 74 | 75 | max_min_length = self.max_answer_seq_len + prompts.shape[1] 76 | 77 | # This has been added due to a probability/nan error that happens after 78 | # meta-llama/Llama-2-7b-hf enabled do_sample: 79 | # https://huggingface.co/meta-llama/Llama-2-7b-hf/commit/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9 80 | 81 | kwargs = dict(do_sample=False) 82 | 83 | 84 | with torch.no_grad(): 85 | seq = self.actor_model.module.generate( 86 | prompts, 87 | attention_mask=mask, 88 | max_length=max_min_length, 89 | pad_token_id=self.tokenizer.pad_token_id, 90 | synced_gpus=self.z3_enabled, 91 | **kwargs) 92 | 93 | # Filter out seq with no answers (or very short). 
This happens when users directly use the pre-training ckpt without supervised finetuning 94 | # NOTE: this will causes each GPU has different number of examples 95 | batch_size = seq.shape[0] 96 | prompt_length = prompts.shape[1] 97 | self.prompt_length = prompt_length 98 | ans = seq[:, prompt_length:] 99 | valid_ans_len = (ans != self.tokenizer.pad_token_id).sum(dim=-1) 100 | 101 | if self.args.print_answers and (step % self.args.print_answers_interval 102 | == 0): 103 | print( 104 | f"--- prompt --> step={step}, rank={torch.distributed.get_rank()}, {self.tokenizer.batch_decode(prompts, skip_special_tokens=True)}" 105 | ) 106 | print( 107 | f"--- ans --> step={step}, rank={torch.distributed.get_rank()}, {self.tokenizer.batch_decode(ans, skip_special_tokens=True)}" 108 | ) 109 | 110 | out_seq = [] 111 | for i in range(batch_size): 112 | if valid_ans_len[ 113 | i] <= 1: # if the answer is shorter than 1 token, drop it 114 | print( 115 | f'Dropping too short generated answer: {step=}: \n' 116 | f'prompts: {self.tokenizer.batch_decode(prompts, skip_special_tokens=False)}\n' 117 | f'answers: {self.tokenizer.batch_decode(ans, skip_special_tokens=False)}' 118 | ) 119 | continue 120 | else: 121 | out_seq.append(seq[i:i + 1]) 122 | 123 | if not out_seq: 124 | print( 125 | f'All generated results are too short for rank={self.args.local_rank} step={step}\n' 126 | f'-> prompts: {self.tokenizer.batch_decode(prompts, skip_special_tokens=False)}\n' 127 | f'-> answers: {self.tokenizer.batch_decode(ans, skip_special_tokens=False)}' 128 | ) 129 | return None 130 | 131 | out_seq = torch.cat(out_seq, dim=0) # concat output in the batch dim 132 | 133 | return out_seq 134 | 135 | def generate_experience(self, prompts, mask, step): 136 | self.eval() 137 | generate_start = time.time() 138 | seq = self._generate_sequence(prompts, mask, step) 139 | generate_end = time.time() 140 | if seq is None: 141 | assert self.last_generated_experience is not None, f'Invalid generated experience at {step=}' 142 | prompts = self.last_generated_experience['prompts'] 143 | seq = self.last_generated_experience['seq'] 144 | else: 145 | self.last_generated_experience = {'prompts': prompts, 'seq': seq} 146 | self.train() 147 | 148 | pad_token_id = self.tokenizer.pad_token_id 149 | attention_mask = seq.not_equal(pad_token_id).long() 150 | with torch.no_grad(): 151 | output = self.actor_model(seq, attention_mask=attention_mask) 152 | output_ref = self.ref_model(seq, attention_mask=attention_mask) 153 | reward_score = self.reward_model.forward_value( 154 | seq, attention_mask, 155 | prompt_length=self.prompt_length)['chosen_end_scores'].detach( 156 | ) 157 | values = self.critic_model.forward_value( 158 | seq, attention_mask, return_value_only=True).detach()[:, :-1] 159 | 160 | logits = output.logits 161 | logits_ref = output_ref.logits 162 | if self.compute_fp32_loss: 163 | logits = logits.to(torch.float) 164 | logits_ref = logits_ref.to(torch.float) 165 | 166 | self.generate_time = generate_end - generate_start 167 | 168 | return { 169 | 'prompts': prompts, 170 | 'logprobs': gather_log_probs(logits[:, :-1, :], seq[:, 1:]), 171 | 'ref_logprobs': gather_log_probs(logits_ref[:, :-1, :], seq[:, 172 | 1:]), 173 | 'value': values, 174 | 'rewards': reward_score, 175 | 'input_ids': seq, 176 | "attention_mask": attention_mask 177 | } 178 | 179 | def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score, 180 | action_mask): 181 | 182 | kl_divergence_estimate = -self.kl_ctl * (log_probs - ref_log_probs) 183 | rewards = 
kl_divergence_estimate 184 | start = prompts.shape[1] - 1 185 | ends = start + action_mask[:, start:].sum(1) + 1 186 | reward_clip = torch.clamp(reward_score, -self.clip_reward_value, 187 | self.clip_reward_value) 188 | batch_size = log_probs.shape[0] 189 | for j in range(batch_size): 190 | rewards[j, start:ends[j]][-1] += reward_clip[j] 191 | 192 | return rewards 193 | 194 | def train_rlhf(self, inputs): 195 | # train the rlhf mode here 196 | ### process the old outputs 197 | prompts = inputs['prompts'] 198 | log_probs = inputs['logprobs'] 199 | ref_log_probs = inputs['ref_logprobs'] 200 | reward_score = inputs['rewards'] 201 | values = inputs['value'] 202 | attention_mask = inputs['attention_mask'] 203 | seq = inputs['input_ids'] 204 | 205 | start = prompts.size()[-1] - 1 206 | action_mask = attention_mask[:, 1:] 207 | 208 | old_values = values 209 | with torch.no_grad(): 210 | old_rewards = self.compute_rewards(prompts, log_probs, 211 | ref_log_probs, reward_score, 212 | action_mask) 213 | ends = start + action_mask[:, start:].sum(1) + 1 214 | # we need to zero out the reward and value after the end of the conversation 215 | # otherwise the advantage/return will be wrong 216 | for i in range(old_rewards.shape[0]): 217 | old_rewards[i, ends[i]:] = 0 218 | old_values[i, ends[i]:] = 0 219 | advantages, returns = self.get_advantages_and_returns( 220 | old_values, old_rewards, start) 221 | 222 | ### process the new outputs 223 | batch = {'input_ids': seq, "attention_mask": attention_mask} 224 | actor_prob = self.actor_model(**batch, use_cache=False).logits 225 | actor_log_prob = gather_log_probs(actor_prob[:, :-1, :], seq[:, 1:]) 226 | actor_loss = self.actor_loss_fn(actor_log_prob[:, start:], 227 | log_probs[:, start:], advantages, 228 | action_mask[:, start:]) 229 | self.actor_model.backward(actor_loss) 230 | 231 | if not self.args.align_overflow: 232 | self.actor_model.step() 233 | 234 | value = self.critic_model.forward_value(**batch, 235 | return_value_only=True, 236 | use_cache=False)[:, :-1] 237 | critic_loss = self.critic_loss_fn(value[:, start:], old_values[:, 238 | start:], 239 | returns, action_mask[:, start:]) 240 | self.critic_model.backward(critic_loss) 241 | 242 | if self.args.align_overflow: 243 | actor_overflow = self.actor_model.optimizer.check_overflow( 244 | external=True) 245 | critic_overflow = self.critic_model.optimizer.check_overflow( 246 | external=True) 247 | 248 | rank = torch.distributed.get_rank() 249 | if actor_overflow and not critic_overflow: 250 | self.critic_model.optimizer.skip_step = True 251 | print_rank_0( 252 | "OVERFLOW: actor overflow, skipping both actor and critic steps", 253 | rank) 254 | elif not actor_overflow and critic_overflow: 255 | self.actor_model.optimizer.skip_step = True 256 | print_rank_0( 257 | "OVERFLOW: critic overflow, skipping both actor and critic steps", 258 | rank) 259 | elif actor_overflow and critic_overflow: 260 | print_rank_0( 261 | "OVERFLOW: actor and critic overflow, skipping both actor and critic steps", 262 | rank) 263 | self.actor_model.step() 264 | 265 | self.critic_model.step() 266 | 267 | return actor_loss, critic_loss 268 | 269 | def get_overflow(self): 270 | # Overflow is not expected when using bf16 271 | # Therefore, DeepSpeed's BF16_Optimizer does not maintain an overflow indication 272 | if self.args.dtype == "bf16": 273 | return False, False 274 | 275 | actor_overflow = self.actor_model.optimizer.overflow 276 | critic_overflow = self.critic_model.optimizer.overflow 277 | 278 | return actor_overflow, 
critic_overflow 279 | 280 | def actor_loss_fn(self, logprobs, old_logprobs, advantages, mask): 281 | ## policy gradient loss 282 | log_ratio = (logprobs - old_logprobs) * mask 283 | ratio = torch.exp(log_ratio) 284 | pg_loss1 = -advantages * ratio 285 | pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - self.cliprange, 286 | 1.0 + self.cliprange) 287 | pg_loss = torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / mask.sum() 288 | return pg_loss 289 | 290 | def critic_loss_fn(self, values, old_values, returns, mask): 291 | ## value loss 292 | values_clipped = torch.clamp( 293 | values, 294 | old_values - self.cliprange_value, 295 | old_values + self.cliprange_value, 296 | ) 297 | if self.compute_fp32_loss: 298 | values = values.float() 299 | values_clipped = values_clipped.float() 300 | vf_loss1 = (values - returns)**2 301 | vf_loss2 = (values_clipped - returns)**2 302 | vf_loss = 0.5 * torch.sum( 303 | torch.max(vf_loss1, vf_loss2) * mask) / mask.sum() 304 | return vf_loss 305 | 306 | def get_advantages_and_returns(self, values, rewards, start): 307 | # Adopted from https://github.com/CarperAI/trlx/blob/main/trlx/models/modeling_ppo.py#L134 308 | lastgaelam = 0 309 | advantages_reversed = [] 310 | length = rewards.size()[-1] 311 | for t in reversed(range(start, length)): 312 | nextvalues = values[:, t + 1] if t < length - 1 else 0.0 313 | delta = rewards[:, t] + self.gamma * nextvalues - values[:, t] 314 | lastgaelam = delta + self.gamma * self.lam * lastgaelam 315 | advantages_reversed.append(lastgaelam) 316 | advantages = torch.stack(advantages_reversed[::-1], dim=1) 317 | returns = advantages + values[:, start:] 318 | return advantages.detach(), returns 319 | 320 | def _validate_training_mode(self): 321 | assert self.actor_model.module.training 322 | assert self.critic_model.module.training 323 | 324 | def _validate_evaluation_mode(self): 325 | assert not self.actor_model.module.training 326 | assert not self.critic_model.module.training 327 | assert not self.ref_model.module.training 328 | assert not self.reward_model.module.training 329 | 330 | def train(self): 331 | self.actor_model.train() 332 | self.critic_model.train() 333 | 334 | def eval(self): 335 | self.actor_model.eval() 336 | self.critic_model.eval() 337 | self.reward_model.eval() 338 | self.ref_model.eval() 339 | 340 | def dump_model_norms(self, tag): 341 | actor_model_norm = get_model_norm(self.actor_model) 342 | ref_model_norm = get_model_norm(self.ref_model) 343 | critic_model_norm = get_model_norm(self.critic_model) 344 | reward_model_norm = get_model_norm(self.reward_model) 345 | print_all_ranks(f'{tag} global_actor_model_norm', actor_model_norm, 346 | self.args.local_rank) 347 | print_all_ranks(f'{tag} global_ref_model_norm', ref_model_norm, 348 | self.args.local_rank) 349 | print_all_ranks(f'{tag} global_critic_model_norm', critic_model_norm, 350 | self.args.local_rank) 351 | print_all_ranks(f'{tag} global_reward_model_norm', reward_model_norm, 352 | self.args.local_rank) 353 | 354 | 355 | class DeepSpeedPPOTrainerUnsupervised(DeepSpeedPPOTrainer): 356 | 357 | def __init__(self, *args, **kwargs): 358 | super().__init__(*args, **kwargs) 359 | 360 | def train_unsupervised(self, inputs, unsup_coef): 361 | # Train the unsupervised model here 362 | self._validate_training_mode() 363 | 364 | outputs = self.actor_model(**inputs, use_cache=False) 365 | loss = outputs.loss 366 | self.actor_model.backward(unsup_coef * loss) 367 | self.actor_model.step() 368 | 369 | return loss 370 | 
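rlhf/rlhf_data_process.py is listed in the tree but not reproduced in this excerpt. The README's section 7 describes the recipe: for prompts held out from SFT, the original answer becomes "chosen" and an answer generated by the SFT model becomes "rejected". A minimal, hypothetical sketch of that step is shown below; the file names, prompt format, and sampling settings are assumptions for illustration:

```python
# Hypothetical sketch of building chosen/rejected pairs for reward-model training.
# Assumptions: rlhf.jsonl holds {"prompt", "answer"} records (see generate_sft_rlfh_data),
# ./sft/model is the instruction fine-tuned checkpoint, and rlhf_pairs.jsonl is a made-up
# output name. In practice the prompt should be wrapped in the same chat template used for SFT.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("./sft/model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./sft/model", torch_dtype="auto", trust_remote_code=True
).to(device)

with open("./rlhf.jsonl", encoding="utf-8") as fin, \
     open("./rlhf_pairs.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        item = json.loads(line)
        inputs = tokenizer([item["prompt"]], return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
        rejected = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        fout.write(json.dumps(
            {"prompt": item["prompt"], "chosen": item["answer"], "rejected": rejected},
            ensure_ascii=False) + "\n")
```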
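For step 2, utils/reward_model.py (also not shown in this excerpt) scores each sequence and is trained so that the "chosen" answer outscores the "rejected" one; the 92% figure in the README is the fraction of pairs ranked correctly. A minimal standalone illustration of the pairwise ranking loss involved:

```python
# Minimal illustration of the pairwise ranking loss behind reward-model training (step 2):
# the reward model should assign the "chosen" answer a higher scalar score than "rejected".
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    # chosen_scores / rejected_scores: shape (batch,), one scalar score per sequence
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Correctly ordered pair -> small loss; reversed pair -> large loss.
print(pairwise_reward_loss(torch.tensor([2.0]), torch.tensor([-1.0])))  # ~0.049
print(pairwise_reward_loss(torch.tensor([-1.0]), torch.tensor([2.0])))  # ~3.049
```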
-------------------------------------------------------------------------------- /rlhf/rlhf/rlhf_engine.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | # DeepSpeed Team 5 | import time 6 | import torch 7 | import deepspeed 8 | from deepspeed.ops.adam import FusedAdam 9 | from deepspeed.ops.adam import DeepSpeedCPUAdam 10 | from transformers import AutoModelForCausalLM, get_scheduler 11 | 12 | from utils.ds_utils import get_train_ds_config, get_eval_ds_config 13 | from utils.model_utils import create_hf_model, create_critic_model 14 | from utils.utils import get_optimizer_grouped_parameters 15 | """ 16 | TODOs: 17 | * support HF models for critic (for debugging), must be a previously saved ckpt from step-2 18 | * determine ds_config/zero_stage based on model size, gpu style, world size, etc 19 | - get model size by creating simple meta model 20 | - 1.3b: zero-2 for actor/ref models, zero-0 for others 21 | - 13b+: zero-3 for all models 22 | """ 23 | 24 | 25 | def log_init(model_name, stime=None): 26 | if torch.distributed.get_rank() == 0: 27 | tag = "start" if stime is None else "end" 28 | suffix = "ing" if stime is None else "ed" 29 | duration = "" 30 | if stime is not None: 31 | duration = "(duration: {:.2f}s)".format(time.time() - stime) 32 | msg = f"[{tag}] Initializ{suffix} {model_name} Model [{tag}] {duration}" 33 | stars = (90 - len(msg)) // 2 34 | extra_star = "*" if (90 - len(msg)) % 2 == 1 else "" 35 | print("*" * stars + msg + "*" * stars + extra_star) 36 | return time.time() 37 | 38 | 39 | class DeepSpeedRLHFEngine(): 40 | 41 | def __init__(self, actor_model_name_or_path, critic_model_name_or_path, 42 | tokenizer, args, num_total_iters): 43 | self.args = args 44 | self.num_total_iters = num_total_iters 45 | self.tokenizer = tokenizer 46 | 47 | self.actor = self._init_actor( 48 | actor_model_name_or_path=actor_model_name_or_path) 49 | self.ref = self._init_ref( 50 | actor_model_name_or_path=actor_model_name_or_path) 51 | self.actor_ema = None 52 | if self.args.enable_ema: 53 | self.actor_ema = self._init_ema( 54 | actor_model_name_or_path=actor_model_name_or_path) 55 | self.critic = self._init_critic( 56 | critic_model_name_or_path=critic_model_name_or_path) 57 | self.reward = self._init_reward( 58 | critic_model_name_or_path=critic_model_name_or_path) 59 | if self.args.critic_gradient_checkpointing: 60 | self.critic.gradient_checkpointing_enable() 61 | 62 | def _init_actor(self, actor_model_name_or_path): 63 | stime = log_init("Actor") 64 | 65 | # DS Config 66 | ds_config = get_train_ds_config( 67 | offload=self.args.offload, 68 | dtype=self.args.dtype, 69 | stage=self.args.actor_zero_stage, 70 | enable_hybrid_engine=self.args.enable_hybrid_engine, 71 | inference_tp_size=self.args.inference_tp_size, 72 | release_inference_cache=self.args.release_inference_cache, 73 | pin_parameters=(not self.args.unpin_actor_parameters), 74 | tp_gather_partition_size=self.args.tp_gather_partition_size, 75 | max_out_tokens=self.args.max_prompt_seq_len + 76 | self.args.max_answer_seq_len, 77 | enable_tensorboard=self.args.enable_tensorboard, 78 | enable_mixed_precision_lora=self.args.enable_mixed_precision_lora, 79 | tb_path=self.args.tensorboard_path, 80 | tb_name="step3_actor") 81 | ds_config[ 82 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 83 | #TODO(jeff): we should probably set grad accumlation steps here as well for clarity 84 | 
ds_config[ 85 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 86 | ) * self.args.gradient_accumulation_steps_actor 87 | 88 | # Model 89 | actor_model = create_hf_model( 90 | model_class=AutoModelForCausalLM, 91 | model_name_or_path=actor_model_name_or_path, 92 | tokenizer=self.tokenizer, 93 | ds_config=ds_config, 94 | dropout=self.args.actor_dropout) 95 | 96 | 97 | # Optimizer 98 | AdamOptimizer = DeepSpeedCPUAdam if self.args.offload else FusedAdam 99 | optim_params = get_optimizer_grouped_parameters( 100 | actor_model, self.args.actor_weight_decay, 101 | self.args.actor_lora_learning_rate) 102 | optim = AdamOptimizer(optim_params, 103 | lr=self.args.actor_learning_rate, 104 | betas=(0.9, 0.95)) 105 | 106 | # LR Scheduler 107 | lr_scheduler = get_scheduler( 108 | name=self.args.lr_scheduler_type, 109 | optimizer=optim, 110 | num_warmup_steps=self.args.num_warmup_steps, 111 | num_training_steps=self.num_total_iters, 112 | ) 113 | 114 | # DeepSpeed Engine 115 | #TODO: move enable_hybrid_engine and pin_parameters to ds_config 116 | actor_engine, *_ = deepspeed.initialize(model=actor_model, 117 | optimizer=optim, 118 | lr_scheduler=lr_scheduler, 119 | config=ds_config) 120 | 121 | log_init("Actor", stime=stime) 122 | 123 | return actor_engine 124 | 125 | def _init_ref(self, actor_model_name_or_path): 126 | stime = log_init("Ref") 127 | # DS Config 128 | zero_stage = self.args.actor_zero_stage 129 | if zero_stage != 3: 130 | # If actor is ZeRO-3 then we use it for everything, otherwise assume we have enough memory for ref model 131 | zero_stage = 0 132 | ds_config = get_eval_ds_config(self.args.offload_reference_model, 133 | self.args.dtype, zero_stage) 134 | ds_config[ 135 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 136 | #TODO(jeff): we should probably set grad accumlation steps here as well for clarity 137 | ds_config[ 138 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 139 | ) * self.args.gradient_accumulation_steps_actor 140 | 141 | ref_model = create_hf_model(AutoModelForCausalLM, 142 | actor_model_name_or_path, self.tokenizer, 143 | ds_config) 144 | 145 | ref_engine, *_ = deepspeed.initialize(model=ref_model, 146 | config=ds_config) 147 | 148 | log_init("Ref", stime=stime) 149 | return ref_engine 150 | 151 | def _init_ema(self, actor_model_name_or_path): 152 | stime = log_init("EMA") 153 | # DS Config 154 | zero_stage = self.args.actor_zero_stage 155 | if zero_stage != 3: 156 | # If actor is ZeRO-3 then we use it for everything, otherwise assume we have enough memory 157 | zero_stage = 0 158 | ds_config = get_eval_ds_config(self.args.offload_reference_model, 159 | self.args.dtype, zero_stage) 160 | ds_config[ 161 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 162 | #TODO(jeff): we should probably set grad accumlation steps here as well for clarity 163 | ds_config[ 164 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 165 | ) * self.args.gradient_accumulation_steps_actor 166 | 167 | actor_model_ema = create_hf_model(AutoModelForCausalLM, 168 | actor_model_name_or_path, 169 | self.tokenizer, ds_config) 170 | 171 | ema_engine, *_ = deepspeed.initialize(model=actor_model_ema, 172 | config=ds_config) 173 | 174 | log_init("EMA", stime=stime) 175 | return ema_engine 176 | 177 | def _init_critic(self, critic_model_name_or_path): 178 | stime = log_init("Critic") 179 | 
ds_config = get_train_ds_config( 180 | offload=self.args.offload, 181 | dtype=self.args.dtype, 182 | stage=self.args.critic_zero_stage, 183 | enable_tensorboard=self.args.enable_tensorboard, 184 | tb_path=self.args.tensorboard_path, 185 | tb_name="step3_critic") 186 | ds_config[ 187 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 188 | #TODO(jeff): we should probably set grad accumlation steps here as well for clarity 189 | ds_config[ 190 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 191 | ) * self.args.gradient_accumulation_steps 192 | 193 | ds_eval_config = get_eval_ds_config(offload=False, 194 | dtype=self.args.dtype, 195 | stage=self.args.critic_zero_stage) 196 | # We need to set train batch size and micro batch size here to pass the sanity check of DeepSpeed engine. 197 | ds_eval_config[ 198 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 199 | ds_eval_config[ 200 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 201 | ) * self.args.gradient_accumulation_steps 202 | 203 | # Model 204 | critic_model = create_critic_model( 205 | model_name_or_path=critic_model_name_or_path, 206 | tokenizer=self.tokenizer, 207 | ds_config=ds_eval_config, 208 | num_padding_at_beginning=self.args.num_padding_at_beginning, 209 | rlhf_training=True, 210 | dropout=self.args.critic_dropout, 211 | zero_stage=self.args.critic_zero_stage) 212 | 213 | # Optimizer 214 | AdamOptimizer = DeepSpeedCPUAdam if self.args.offload else FusedAdam 215 | optim_params = get_optimizer_grouped_parameters( 216 | critic_model, self.args.critic_weight_decay, 217 | self.args.critic_lora_learning_rate) 218 | optim = AdamOptimizer(optim_params, 219 | lr=self.args.critic_learning_rate, 220 | betas=(0.9, 0.95)) 221 | 222 | # LR Scheduler 223 | lr_scheduler = get_scheduler( 224 | name=self.args.lr_scheduler_type, 225 | optimizer=optim, 226 | num_warmup_steps=self.args.num_warmup_steps, 227 | num_training_steps=self.num_total_iters, 228 | ) 229 | 230 | # DeepSpeed Engine 231 | critic_engine, *_ = deepspeed.initialize(model=critic_model, 232 | optimizer=optim, 233 | lr_scheduler=lr_scheduler, 234 | config=ds_config) 235 | 236 | log_init("Critic", stime=stime) 237 | return critic_engine 238 | 239 | def _init_reward(self, critic_model_name_or_path): 240 | stime = log_init("Reward") 241 | # DS Config 242 | zero_stage = self.args.critic_zero_stage 243 | if zero_stage != 3: 244 | # If critic is ZeRO-3 then we use it for everything, otherwise assume we have enough memory 245 | zero_stage = 0 246 | 247 | ds_config = get_eval_ds_config(offload=self.args.offload, 248 | dtype=self.args.dtype, 249 | stage=zero_stage) 250 | ds_config[ 251 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 252 | ds_config[ 253 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 254 | ) * self.args.gradient_accumulation_steps 255 | 256 | ds_eval_config = get_eval_ds_config(offload=False, 257 | dtype=self.args.dtype, 258 | stage=zero_stage) 259 | 260 | # We need to set train batch size and micro batch size here to pass the sanity check of DeepSpeed engine. 
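        # (Editor's note) The reward model itself is built from ds_eval_config and
        # the engine below is created without an optimizer, so these keys never
        # drive real gradient steps. DeepSpeed simply refuses to initialize unless
        # train_batch_size equals
        # train_micro_batch_size_per_gpu * world_size * gradient_accumulation_steps,
        # hence the assignments below.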
261 | ds_eval_config[ 262 | 'train_micro_batch_size_per_gpu'] = self.args.per_device_training_batch_size 263 | ds_eval_config[ 264 | 'train_batch_size'] = self.args.per_device_training_batch_size * torch.distributed.get_world_size( 265 | ) * self.args.gradient_accumulation_steps 266 | 267 | # Model 268 | reward_model = create_critic_model( 269 | model_name_or_path=critic_model_name_or_path, 270 | tokenizer=self.tokenizer, 271 | ds_config=ds_eval_config, 272 | num_padding_at_beginning=self.args.num_padding_at_beginning, 273 | rlhf_training=True, 274 | dropout=self.args.critic_dropout, 275 | zero_stage=zero_stage) 276 | 277 | reward_engine, *_ = deepspeed.initialize(model=reward_model, 278 | config=ds_config) 279 | 280 | log_init("Reward", stime=stime) 281 | return reward_engine 282 | -------------------------------------------------------------------------------- /rlhf/rlhf_data_process.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | import json 3 | from tqdm import tqdm 4 | import torch 5 | import random 6 | import os 7 | def split_jsonl(): 8 | input_file = './rlhf.jsonl' 9 | output_files = ['./rlhf_part1.jsonl', './rlhf_part2.jsonl', './rlhf_part3.jsonl', './rlhf_part4.jsonl'] 10 | 11 | # 读取输入文件的内容 12 | with open(input_file, 'r', encoding='utf-8') as f: 13 | lines = f.readlines() 14 | 15 | # 打乱顺序 16 | random.shuffle(lines) 17 | 18 | # 计算每个文件的行数 19 | num_lines = len(lines) 20 | chunk_size = num_lines // 4 21 | 22 | # 将行分成 4 组 23 | chunks = [lines[i * chunk_size: (i + 1) * chunk_size] for i in range(4)] 24 | 25 | # 如果有多余的行,均匀分配到各个文件 26 | for i in range(num_lines % 4): 27 | chunks[i].append(lines[4 * chunk_size + i]) 28 | 29 | # 将每组写入不同的输出文件 30 | for i, output_file in enumerate(output_files): 31 | with open(output_file, 'w', encoding='utf-8') as out_f: 32 | for line in chunks[i]: 33 | out_f.write(line) 34 | 35 | 36 | def generate_rlhf_data(): 37 | input_file = './rlhf_part4.jsonl' 38 | model_path = './sft_model' 39 | output_file = './rlhf_generate_part4.jsonl' 40 | device = "cuda" 41 | 42 | model = AutoModelForCausalLM.from_pretrained( 43 | model_path, 44 | torch_dtype="auto", 45 | device_map="auto", 46 | trust_remote_code=True 47 | ) 48 | tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) 49 | 50 | # 读取输入文件的内容 51 | with open(input_file, 'r', encoding='utf-8') as f: 52 | lines = f.readlines() 53 | 54 | # 检查 output_file 是否存在以及已经写了多少条数据 55 | existing_data = [] 56 | if os.path.exists(output_file): 57 | with open(output_file, 'r', encoding='utf-8') as out_f: 58 | existing_data = out_f.readlines() 59 | 60 | processed_prompts = set() 61 | for line in existing_data: 62 | data = json.loads(line) 63 | processed_prompts.add(data['prompt']) 64 | 65 | # 处理每一行 JSON 对象 66 | with open(output_file, 'a', encoding='utf-8') as out_f: 67 | for line in tqdm(lines, desc="Processing"): 68 | data = json.loads(line) 69 | prompt = data['prompt'] 70 | 71 | answer = data['answer'] 72 | messages = [ 73 | {"role": "user", "content": prompt} 74 | ] 75 | text = tokenizer.apply_chat_template( 76 | messages, 77 | tokenize=False, 78 | add_generation_prompt=True 79 | ) 80 | if text in processed_prompts: 81 | continue # Skip already processed prompts 82 | model_inputs = tokenizer([text], return_tensors="pt").to(device) 83 | 84 | with torch.no_grad(): 85 | generated_ids = model.generate( 86 | **model_inputs, 87 | max_new_tokens=512, 88 | do_sample=False 89 | ) 90 | generated_ids = [ 91 | 
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 92 | ] 93 | 94 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 95 | result = { 96 | 'prompt': text, 97 | 'response': answer + tokenizer.eos_token, 98 | 'chosen': answer + tokenizer.eos_token, 99 | 'rejected': response + tokenizer.eos_token 100 | } 101 | # 将结果写入输出文件 102 | out_f.write(json.dumps(result, ensure_ascii=False) + '\n') 103 | 104 | 105 | def process_step_2_3_data(): 106 | input_files = [ 107 | './rlhf_generate_part1.jsonl', 108 | './rlhf_generate_part2.jsonl', 109 | './rlhf_generate_part3.jsonl', 110 | './rlhf_generate_part4.jsonl' 111 | ] 112 | output_step2_train_file = './step2_data/train.jsonl' 113 | output_step2_eval_file = './step2_data/eval.jsonl' 114 | output_step3_train_file = './step3_data/train.jsonl' 115 | output_step3_eval_file = './step3_data/eval.jsonl' 116 | 117 | data = [] 118 | 119 | # 读取所有输入文件 120 | for file in input_files: 121 | with open(file, 'r', encoding='utf-8') as f: 122 | for line in f: 123 | data.append(json.loads(line.strip())) 124 | 125 | # 随机打乱数据 126 | random.shuffle(data) 127 | 128 | # 分割数据 129 | total_size = len(data) 130 | step3_train_size = int(total_size * 0.95) 131 | step3_eval_size = total_size - step3_train_size 132 | step2_train_size = int(total_size * 0.475) 133 | step2_eval_size = int(total_size * 0.025) 134 | 135 | step3_train_data = data[:step3_train_size] 136 | step3_eval_data = data[step3_train_size:] 137 | 138 | step2_train_data = data[:step2_train_size] 139 | step2_eval_data = data[step2_train_size:step2_train_size + step2_eval_size] 140 | 141 | # 写入输出文件 142 | with open(output_step3_train_file, 'w', encoding='utf-8') as f: 143 | for item in step3_train_data: 144 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 145 | 146 | with open(output_step3_eval_file, 'w', encoding='utf-8') as f: 147 | for item in step3_eval_data: 148 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 149 | 150 | with open(output_step2_train_file, 'w', encoding='utf-8') as f: 151 | for item in step2_train_data: 152 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 153 | 154 | with open(output_step2_eval_file, 'w', encoding='utf-8') as f: 155 | for item in step2_eval_data: 156 | f.write(json.dumps(item, ensure_ascii=False) + '\n') 157 | 158 | 159 | 160 | 161 | 162 | def main(): 163 | #split_jsonl() 164 | 165 | #generate_rlhf_data() 166 | process_step_2_3_data() 167 | 168 | 169 | 170 | 171 | if __name__ == "__main__": 172 | main() 173 | -------------------------------------------------------------------------------- /rlhf/rw_eval.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Copyright (c) Microsoft Corporation. 
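# (Editor's note) Quick usage, matching step2_eval.sh further down in this repo:
#   python rw_eval.py --model_name_or_path ./step2_output/epoch_1
# run_pair_comparison() feeds a (prompt, good answer, bad answer) pair through the
# trained reward model; the reward model is doing its job when the chosen score
# comes out higher than the rejected score.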
3 | # SPDX-License-Identifier: Apache-2.0 4 | 5 | # DeepSpeed Team 6 | import argparse 7 | import torch 8 | 9 | from utils.model_utils import create_critic_model 10 | from utils.utils import to_device, load_hf_tokenizer 11 | from deepspeed import get_accelerator 12 | 13 | 14 | def parse_args(): 15 | parser = argparse.ArgumentParser( 16 | description="Eval the finetued reward model") 17 | parser.add_argument( 18 | "--model_name_or_path", 19 | type=str, 20 | help= 21 | "Path to pretrained model or model identifier from huggingface.co/models.", 22 | required=True, 23 | ) 24 | parser.add_argument( 25 | "--num_padding_at_beginning", 26 | type=int, 27 | default=1, 28 | help= 29 | "OPT model has a fixed number (1) of padding tokens at the beginning of the input. " 30 | "We did not see this in other models but keep it as an option for now.", 31 | ) 32 | parser.add_argument( 33 | "--add_eot_token", 34 | action='store_true', 35 | help="Add <|endoftext|> as additional special token to tokenizer") 36 | args = parser.parse_args() 37 | return args 38 | 39 | 40 | def load_stuff(model_name_or_path, num_padding_at_beginning, 41 | additional_special_tokens): 42 | 43 | tokenizer = load_hf_tokenizer(model_name_or_path) 44 | model = create_critic_model(model_name_or_path, 45 | tokenizer, 46 | None, 47 | num_padding_at_beginning, 48 | dropout=0) 49 | 50 | return model, tokenizer 51 | 52 | 53 | def prepare_datapair(prompt, 54 | good_ans, 55 | bad_ans, 56 | tokenizer, 57 | max_seq_len=512, 58 | end_of_conversation_token=None): 59 | chosen_sentence = prompt + good_ans 60 | reject_sentence = prompt + bad_ans 61 | chosen_token = tokenizer(chosen_sentence, 62 | max_length=max_seq_len, 63 | padding="max_length", 64 | truncation=True, 65 | return_tensors="pt") 66 | 67 | reject_token = tokenizer(reject_sentence, 68 | max_length=max_seq_len, 69 | padding="max_length", 70 | truncation=True, 71 | return_tensors="pt") 72 | 73 | batch = {} 74 | batch["input_ids"] = torch.cat([chosen_token["input_ids"]] + 75 | [reject_token["input_ids"]], 76 | dim=0) 77 | batch["attention_mask"] = torch.cat([chosen_token["attention_mask"]] + 78 | [reject_token["attention_mask"]], 79 | dim=0) 80 | return batch 81 | 82 | 83 | def prepare_singlesample(prompt, 84 | good_ans, 85 | tokenizer, 86 | max_seq_len=512, 87 | end_of_conversation_token=None): 88 | chosen_sentence = prompt + good_ans + end_of_conversation_token 89 | chosen_token = tokenizer(chosen_sentence, 90 | max_length=max_seq_len, 91 | padding="max_length", 92 | truncation=True, 93 | return_tensors="pt") 94 | 95 | batch = {} 96 | batch["input_ids"] = chosen_token["input_ids"] 97 | batch["attention_mask"] = chosen_token["attention_mask"] 98 | 99 | return batch 100 | 101 | 102 | def run_pair_comparison(): 103 | args = parse_args() 104 | 105 | device = torch.device(get_accelerator().device_name(0)) 106 | 107 | args.end_of_conversation_token = None 108 | additional_special_tokens = args.end_of_conversation_token if args.add_eot_token else None 109 | 110 | rm_model, tokenizer = load_stuff(args.model_name_or_path, 111 | args.num_padding_at_beginning, 112 | additional_special_tokens) 113 | rm_model.to(device) 114 | rm_model.eval() 115 | 116 | prompt_list = [ 117 | "<|im_start|>system\n你是一个由喵阿姨开发的喵喵小助手<|im_end|>\n<|im_start|>user\n帮我生成一些音乐热评<|im_end|>\n<|im_start|>assistant\n", 118 | "<|im_start|>system\n你是一个由喵阿姨开发的喵喵小助手<|im_end|>\n<|im_start|>user\n根据开头,续写古诗:\n翠幄千章荫晚空<|im_end|>\n<|im_start|>assistant\n" 119 | ] 120 | good_ans_list = [ 121 | 
"1、1997年听了耀威的《有缘千里》专辑,到今年20年了,一直关注,有没有像我一样的朋友?\n2、爱的故事·上集·万屡爱意寄窗扉\n爱的故事·下集·我愿他能珍惜你\n爱的故事·曲终·只有我懂得自己<|im_end|>", 122 | "年华心赏两无穷。云头欲落催诗雨,池面微生解愠风。经笥使君谈似绮,仙舟令尹饮如虹。娵隅自适清池乐,不信参军是郝隆。<|im_end|>" 123 | ] 124 | bad_ans_list = [ 125 | "1、我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,我一直都觉得,这首歌是我的最爱,<|im_end|>", 126 | "金蟾照影照金蟾。玉兔飞来玉兔飞,玉兔飞来玉兔飞。<|im_end|>" 127 | ] 128 | 129 | for prompt, good_ans, bad_ans in zip(prompt_list, good_ans_list, 130 | bad_ans_list): 131 | batch = prepare_datapair( 132 | prompt, 133 | good_ans, 134 | bad_ans, 135 | tokenizer, 136 | max_seq_len=512, 137 | end_of_conversation_token=None) 138 | batch = to_device(batch, device) 139 | # Run inference 140 | with torch.no_grad(): 141 | outputs = rm_model(**batch) 142 | print("==================Eval result============================") 143 | print("prompt: ", prompt) 144 | print("\ngood_ans: ", good_ans) 145 | print("\nbad_ans:", bad_ans) 146 | print() 147 | print("=============Scores (higher, better)========================") 148 | print("good_ans score: ", outputs["chosen_mean_scores"].item()) 149 | print("bad_ans score: ", outputs["rejected_mean_scores"].item()) 150 | 151 | 152 | def run_single_sample(): 153 | args = parse_args() 154 | device = torch.device(get_accelerator().device_name()) 155 | 156 | args.end_of_conversation_token = None 157 | additional_special_tokens = args.end_of_conversation_token if args.add_eot_token else None 158 | 159 | rm_model, tokenizer = load_stuff(args.model_name_or_path, 160 | args.num_padding_at_beginning, 161 | additional_special_tokens) 162 | rm_model.to(device) 163 | 164 | prompt = "Human: Explain the moon landing to a 6 year old in a few sentences." 165 | my_ans = "Assistant: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. 
The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another" 166 | 167 | batch = prepare_singlesample( 168 | prompt, 169 | my_ans, 170 | tokenizer, 171 | max_seq_len=512, 172 | end_of_conversation_token=args.end_of_conversation_token) 173 | batch = to_device(batch, device) 174 | 175 | rm_model.eval() 176 | # Run inference 177 | with torch.no_grad(): 178 | outputs = rm_model.forward_value( 179 | **batch, prompt_length=max(2, args.num_padding_at_beginning) 180 | ) # we just need to skip the number of padding tokens at the beginning 181 | print("==================Eval result============================") 182 | print("prompt: ", prompt) 183 | print("my_ans: ", my_ans) 184 | print() 185 | print("=============Scores========================") 186 | print("my_ans score: ", outputs["chosen_end_scores"].item()) 187 | 188 | 189 | if __name__ == "__main__": 190 | run_pair_comparison() 191 | # run_single_sample() 192 | -------------------------------------------------------------------------------- /rlhf/step2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Copyright (c) Microsoft Corporation. 3 | # SPDX-License-Identifier: Apache-2.0 4 | 5 | # DeepSpeed Team 6 | import argparse 7 | import math 8 | 9 | import torch 10 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler 11 | from torch.utils.data.distributed import DistributedSampler 12 | 13 | from transformers import ( 14 | SchedulerType, 15 | get_scheduler, 16 | ) 17 | 18 | import deepspeed 19 | from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam 20 | from deepspeed.accelerator import get_accelerator 21 | 22 | from utils.model_utils import create_critic_model 23 | from utils.data_utils import create_prompt_dataset, DataCollatorReward 24 | from utils.utils import print_rank_0, to_device, save_hf_format, set_random_seed, get_all_reduce_mean, get_optimizer_grouped_parameters, load_hf_tokenizer 25 | from utils.ds_utils import get_train_ds_config 26 | 27 | 28 | def parse_args(): 29 | parser = argparse.ArgumentParser( 30 | description= 31 | "Finetune a transformers model on a causal language modeling task") 32 | parser.add_argument('--data_path', 33 | type=str) 34 | parser.add_argument( 35 | '--data_output_path', 36 | type=str, 37 | default='/tmp/data_files/', 38 | help='Where to store the data-related files such as shuffle index.') 39 | parser.add_argument( 40 | "--model_name_or_path", 41 | type=str, 42 | help= 43 | "Path to pretrained model or model identifier from huggingface.co/models.", 44 | required=True, 45 | ) 46 | parser.add_argument( 47 | "--num_padding_at_beginning", 48 | type=int, 49 | default=1, 50 | help= 51 | "OPT model has a fixed number (1) of padding tokens at the beginning of the input. 
" 52 | "We did not see this in other models but keep it as an option for now.", 53 | ) 54 | parser.add_argument( 55 | "--per_device_train_batch_size", 56 | type=int, 57 | default=16, 58 | help="Batch size (per device) for the training dataloader.", 59 | ) 60 | parser.add_argument( 61 | "--per_device_eval_batch_size", 62 | type=int, 63 | default=16, 64 | help="Batch size (per device) for the evaluation dataloader.", 65 | ) 66 | parser.add_argument( 67 | "--max_seq_len", 68 | type=int, 69 | default=512, 70 | help="The maximum sequence length.", 71 | ) 72 | parser.add_argument( 73 | "--learning_rate", 74 | type=float, 75 | default=5e-5, 76 | help= 77 | "Initial learning rate (after the potential warmup period) to use.", 78 | ) 79 | parser.add_argument("--weight_decay", 80 | type=float, 81 | default=0., 82 | help="Weight decay to use.") 83 | parser.add_argument("--num_train_epochs", 84 | type=int, 85 | default=1, 86 | help="Total number of training epochs to perform.") 87 | parser.add_argument( 88 | "--gradient_accumulation_steps", 89 | type=int, 90 | default=1, 91 | help= 92 | "Number of updates steps to accumulate before performing a backward/update pass.", 93 | ) 94 | parser.add_argument( 95 | "--lr_scheduler_type", 96 | type=SchedulerType, 97 | default="cosine", 98 | help="The scheduler type to use.", 99 | choices=[ 100 | "linear", "cosine", "cosine_with_restarts", "polynomial", 101 | "constant", "constant_with_warmup" 102 | ], 103 | ) 104 | parser.add_argument( 105 | "--num_warmup_steps", 106 | type=int, 107 | default=0, 108 | help="Number of steps for the warmup in the lr scheduler.") 109 | parser.add_argument("--output_dir", 110 | type=str, 111 | default=None, 112 | help="Where to store the model.") 113 | parser.add_argument("--seed", 114 | type=int, 115 | default=1234, 116 | help="A seed for reproducible training.") 117 | parser.add_argument("--local_rank", 118 | type=int, 119 | default=-1, 120 | help="local_rank for distributed training on gpus") 121 | parser.add_argument( 122 | '--gradient_checkpointing', 123 | action='store_true', 124 | help='Enable HF gradient checkpointing for Actor model.') 125 | parser.add_argument( 126 | "--dropout", 127 | type=float, 128 | default=None, 129 | help="If dropout configured, use it. " 130 | "Otherwise, keep the default dropout configuration of the model.") 131 | # deepspeed features 132 | parser.add_argument('--offload', 133 | action='store_true', 134 | help='Enable ZeRO Offload techniques.') 135 | parser.add_argument('--dtype', 136 | type=str, 137 | default='fp16', 138 | choices=['fp16', 'bf16'], 139 | help='Training data type') 140 | parser.add_argument( 141 | '--zero_stage', 142 | type=int, 143 | default=0, 144 | help='ZeRO optimization stage for Actor model (and clones).') 145 | ## LoRA for efficient training setting 146 | parser.add_argument("--lora_dim", 147 | type=int, 148 | default=0, 149 | help="If > 0, use LoRA for efficient training.") 150 | parser.add_argument("--lora_module_name", 151 | type=str, 152 | default="decoder.layers.", 153 | help="The scope of LoRA.") 154 | parser.add_argument('--only_optimize_lora', 155 | action='store_true', 156 | help='Only optimize the LoRA parameters.') 157 | parser.add_argument( 158 | "--lora_learning_rate", 159 | type=float, 160 | default=5e-4, 161 | help= 162 | "Initial LoRA learning rate (after the potential warmup period) to use." 
163 | ) 164 | 165 | # Evaluation 166 | parser.add_argument("--eval_interval", 167 | type=int, 168 | default=0, 169 | help="If > 0, perform evaluation at this interval") 170 | parser.add_argument("--eval_iters", 171 | type=int, 172 | default=100, 173 | help="Maximum evaluation iterations") 174 | ## low precision 175 | parser.add_argument( 176 | '--compute_fp32_loss', 177 | action='store_true', 178 | help='Relevant for low precision dtypes (fp16, bf16, etc.). ' 179 | 'If specified, loss is calculated in fp32.') 180 | 181 | ## Tensorboard logging 182 | parser.add_argument('--enable_tensorboard', 183 | action='store_true', 184 | help='Enable tensorboard logging') 185 | parser.add_argument('--tensorboard_path', 186 | type=str, 187 | default="step2_tensorboard") 188 | ## Tokenizer 189 | parser.add_argument( 190 | "--add_eot_token", 191 | action='store_true', 192 | help="Add <|endoftext|> as additional special token to tokenizer") 193 | parser = deepspeed.add_config_arguments(parser) 194 | args = parser.parse_args() 195 | 196 | return args 197 | 198 | 199 | def main(): 200 | args = parse_args() 201 | 202 | if args.local_rank == -1: 203 | device = torch.device(get_accelerator().device_name()) 204 | else: 205 | get_accelerator().set_device(args.local_rank) 206 | device = torch.device(get_accelerator().device_name(), args.local_rank) 207 | # Initializes the distributed backend which will take care of sychronizing nodes/GPUs 208 | # torch.distributed.init_process_group(backend='nccl') 209 | deepspeed.init_distributed() 210 | 211 | args.global_rank = torch.distributed.get_rank() 212 | 213 | ds_config = get_train_ds_config(offload=args.offload, 214 | dtype=args.dtype, 215 | stage=args.zero_stage, 216 | enable_tensorboard=args.enable_tensorboard, 217 | tb_path=args.tensorboard_path, 218 | tb_name="step2_model") 219 | ds_config[ 220 | 'train_micro_batch_size_per_gpu'] = args.per_device_train_batch_size 221 | ds_config[ 222 | 'train_batch_size'] = args.per_device_train_batch_size * torch.distributed.get_world_size( 223 | ) * args.gradient_accumulation_steps 224 | 225 | # If passed along, set the training seed now. 
226 | set_random_seed(args.seed) 227 | torch.distributed.barrier() 228 | 229 | tokenizer = load_hf_tokenizer(args.model_name_or_path) 230 | rm_model = create_critic_model(args.model_name_or_path, 231 | tokenizer, 232 | ds_config, 233 | args.num_padding_at_beginning, 234 | dropout=args.dropout, 235 | zero_stage=args.zero_stage, 236 | compute_fp32_loss=args.compute_fp32_loss) 237 | 238 | 239 | print_rank_0("create_prompt_dataset前") 240 | train_phase = 2 241 | train_dataset, eval_dataset = create_prompt_dataset( 242 | args.local_rank, args.data_path, train_phase, args.seed, tokenizer,args.max_seq_len) 243 | 244 | # 打印train_dataset的部分内容 245 | 246 | 247 | 248 | data_collator = DataCollatorReward() 249 | 250 | if args.local_rank == -1: 251 | train_sampler = RandomSampler(train_dataset) 252 | eval_sampler = SequentialSampler(eval_dataset) 253 | else: 254 | train_sampler = DistributedSampler(train_dataset) 255 | eval_sampler = DistributedSampler(eval_dataset) 256 | train_dataloader = DataLoader(train_dataset, 257 | collate_fn=data_collator, 258 | sampler=train_sampler, 259 | batch_size=args.per_device_train_batch_size) 260 | eval_dataloader = DataLoader(eval_dataset, 261 | collate_fn=data_collator, 262 | sampler=eval_sampler, 263 | batch_size=args.per_device_eval_batch_size) 264 | 265 | 266 | 267 | 268 | 269 | def evaluation_reward(model, dataloader, eval_iters): 270 | model.eval() 271 | correct_predictions = 0 272 | total_predictions = 0 273 | chosen_scores = 0. 274 | rejected_scores = 0. 275 | for _step, _batch in enumerate(dataloader): 276 | _batch = to_device(_batch, device) 277 | with torch.no_grad(): 278 | _outputs = model(**_batch) 279 | 280 | chosen = _outputs["chosen_mean_scores"] 281 | rejected = _outputs["rejected_mean_scores"] 282 | correct_predictions += (chosen > rejected).sum() 283 | total_predictions += chosen.shape[0] 284 | chosen_scores += _outputs["chosen_mean_scores"].mean().float() 285 | rejected_scores += _outputs["rejected_mean_scores"].mean().float() 286 | if (_step + 1) == eval_iters: 287 | break 288 | _acc = correct_predictions / total_predictions 289 | chosen_scores = chosen_scores / (_step + 1) 290 | rejected_scores = rejected_scores / (_step + 1) 291 | try: 292 | _acc = get_all_reduce_mean(_acc).item() 293 | chosen_scores = get_all_reduce_mean(chosen_scores).item() 294 | rejected_scores = get_all_reduce_mean(rejected_scores).item() 295 | except: 296 | pass 297 | return chosen_scores, rejected_scores, _acc 298 | 299 | # Split weights in two groups, one with weight decay and the other not. 300 | optimizer_grouped_parameters = get_optimizer_grouped_parameters( 301 | rm_model, args.weight_decay, args.lora_learning_rate) 302 | 303 | AdamOptimizer = FusedAdam 304 | 305 | optimizer = AdamOptimizer(optimizer_grouped_parameters, 306 | lr=args.learning_rate, 307 | betas=(0.9, 0.95)) 308 | 309 | num_update_steps_per_epoch = math.ceil( 310 | len(train_dataloader) / args.gradient_accumulation_steps) 311 | 312 | lr_scheduler = get_scheduler( 313 | name=args.lr_scheduler_type, 314 | optimizer=optimizer, 315 | num_warmup_steps=args.num_warmup_steps, 316 | num_training_steps=args.num_train_epochs * num_update_steps_per_epoch, 317 | ) 318 | 319 | rm_model, optimizer, _, lr_scheduler = deepspeed.initialize( 320 | model=rm_model, 321 | optimizer=optimizer, 322 | args=args, 323 | config=ds_config, 324 | lr_scheduler=lr_scheduler, 325 | dist_init_required=True) 326 | 327 | if args.gradient_checkpointing: 328 | rm_model.gradient_checkpointing_enable() 329 | 330 | # Train! 
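    # (Editor's note) Each batch from DataCollatorReward stacks the chosen
    # sequences on top of the rejected ones, so rm_model sees
    # 2 * per_device_train_batch_size rows per step and, assuming the standard
    # DeepSpeed-Chat style RewardModel in utils/reward_model.py, optimizes a
    # pairwise loss of the form -log(sigmoid(score_chosen - score_rejected)).
    # evaluation_reward() above reports the mean chosen/rejected scores and the
    # fraction of pairs ranked correctly (acc).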
331 | print_rank_0("***** Running training *****", args.global_rank) 332 | 333 | print_rank_0( 334 | f"***** Evaluating reward, Epoch {0}/{args.num_train_epochs} *****", 335 | args.global_rank) 336 | reward_score, reject_score, acc = evaluation_reward( 337 | rm_model, eval_dataloader, args.eval_iters) 338 | print_rank_0( 339 | f"chosen_last_scores (higher is better) : {reward_score}, " 340 | f"rejected_last_scores (lower is better) : {reject_score}, " 341 | f"acc (higher is better) : {acc}", args.global_rank) 342 | 343 | total_micro_steps = 0 344 | for epoch in range(args.num_train_epochs): 345 | print_rank_0( 346 | f"Beginning of Epoch {epoch+1}/{args.num_train_epochs}, Total Micro Batches {len(train_dataloader)}", 347 | args.global_rank) 348 | rm_model.train() 349 | mean_loss = 0 350 | for step, batch in enumerate(train_dataloader): 351 | batch = to_device(batch, device) 352 | outputs = rm_model(**batch, use_cache=False) 353 | loss = outputs["loss"] 354 | rm_model.backward(loss) 355 | rm_model.step() 356 | mean_loss += loss.item() 357 | total_micro_steps += 1 358 | gas_boundary = (total_micro_steps % 359 | args.gradient_accumulation_steps == 0) 360 | total_steps = total_micro_steps // args.gradient_accumulation_steps 361 | if args.eval_interval and gas_boundary and ( 362 | total_steps % args.eval_interval == 0): 363 | print_rank_0(f"Iter {total_steps}: Evaluating reward", 364 | args.global_rank) 365 | reward_score, reject_score, acc = evaluation_reward( 366 | rm_model, eval_dataloader, args.eval_iters) 367 | print_rank_0( 368 | f"Iter {total_steps}: c_scores: {reward_score}, r_scores: {reject_score}, " 369 | f"diff: {reward_score - reject_score}, acc: {acc}", 370 | args.global_rank) 371 | rm_model.train() 372 | 373 | print_rank_0( 374 | f"Epoch {epoch+1}/{args.num_train_epochs} with loss {mean_loss/(step+1)}", 375 | args.global_rank) 376 | # Evaluate reward_loss on the validation set. 377 | print_rank_0( 378 | f"***** Evaluating reward, Epoch {epoch+1}/{args.num_train_epochs} *****", 379 | args.global_rank) 380 | reward_score, reject_score, acc = evaluation_reward( 381 | rm_model, eval_dataloader, args.eval_iters) 382 | print_rank_0( 383 | f"chosen_last_scores (higher is better) : {reward_score}, " 384 | f"rejected_last_scores (lower is better) : {reject_score}, " 385 | f"acc (higher is better) : {acc}", args.global_rank) 386 | rm_model.tput_timer.update_epoch_count() 387 | 388 | if args.output_dir is not None: 389 | print_rank_0('saving model ...', args.global_rank) 390 | 391 | if args.global_rank == 0: 392 | save_hf_format(rm_model, tokenizer, args, sub_folder=f"epoch_{epoch+1}") 393 | if args.zero_stage == 3: 394 | raise RuntimeError('不支持zero3') 395 | 396 | 397 | if __name__ == "__main__": 398 | main() 399 | -------------------------------------------------------------------------------- /rlhf/step2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Microsoft Corporation. 
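# (Editor's note) Usage: bash step2.sh [OUTPUT_DIR] [ZERO_STAGE]
# Defaults below: OUTPUT_DIR=./step2_output, ZERO_STAGE=2.
# DeepSpeed's effective global batch size is
#   per_device_train_batch_size * num_gpus * gradient_accumulation_steps,
# e.g. 16 * 2 * 1 = 32 when running on two GPUs with the flags below.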
3 | # SPDX-License-Identifier: Apache-2.0 4 | 5 | # DeepSpeed Team 6 | OUTPUT=$1 7 | ZERO_STAGE=$2 8 | if [ "$OUTPUT" == "" ]; then 9 | OUTPUT=./step2_output 10 | fi 11 | if [ "$ZERO_STAGE" == "" ]; then 12 | ZERO_STAGE=2 13 | fi 14 | mkdir -p $OUTPUT 15 | 16 | deepspeed step2.py \ 17 | --data_path ./step2_data \ 18 | --model_name_or_path ./sft_model \ 19 | --per_device_train_batch_size 16 \ 20 | --per_device_eval_batch_size 16 \ 21 | --max_seq_len 1024 \ 22 | --learning_rate 9.65e-6 \ 23 | --weight_decay 0.1 \ 24 | --num_padding_at_beginning 0 \ 25 | --num_train_epochs 2 \ 26 | --gradient_accumulation_steps 1 \ 27 | --lr_scheduler_type cosine \ 28 | --num_warmup_steps 0 \ 29 | --seed 1234 \ 30 | --zero_stage $ZERO_STAGE \ 31 | --deepspeed \ 32 | --output_dir $OUTPUT \ 33 | &> $OUTPUT/training.log 34 | -------------------------------------------------------------------------------- /rlhf/step2_eval.sh: -------------------------------------------------------------------------------- 1 | python rw_eval.py \ 2 | --model_name_or_path ./step2_output/epoch_1 -------------------------------------------------------------------------------- /rlhf/step3.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Microsoft Corporation. 3 | # SPDX-License-Identifier: Apache-2.0 4 | 5 | # DeepSpeed Team 6 | ACTOR_MODEL_PATH=./sft_model 7 | CRITIC_MODEL_PATH=./step2_output/epoch_1 8 | ACTOR_ZERO_STAGE=2 9 | CRITIC_ZERO_STAGE=2 10 | OUTPUT=$5 11 | if [ "$OUTPUT" == "" ]; then 12 | OUTPUT=./step3_output 13 | fi 14 | if [ "$ACTOR_ZERO_STAGE" == "" ]; then 15 | ACTOR_ZERO_STAGE=2 16 | fi 17 | if [ "$CRITIC_ZERO_STAGE" == "" ]; then 18 | CRITIC_ZERO_STAGE=2 19 | fi 20 | mkdir -p $OUTPUT 21 | 22 | Num_Padding_at_Beginning=0 # this is model related 23 | 24 | Actor_Lr=1e-6 25 | Critic_Lr=1e-6 26 | 27 | deepspeed --master_port 12346 step3.py \ 28 | --data_path ./step3_data \ 29 | --actor_model_name_or_path $ACTOR_MODEL_PATH \ 30 | --critic_model_name_or_path $CRITIC_MODEL_PATH \ 31 | --num_padding_at_beginning 0 \ 32 | --per_device_generation_batch_size 40 \ 33 | --per_device_training_batch_size 40 \ 34 | --generation_batches 1 \ 35 | --ppo_epochs 1 \ 36 | --max_answer_seq_len 128 \ 37 | --max_prompt_seq_len 256 \ 38 | --actor_learning_rate ${Actor_Lr} \ 39 | --critic_learning_rate ${Critic_Lr} \ 40 | --actor_weight_decay 0.1 \ 41 | --critic_weight_decay 0.1 \ 42 | --num_train_epochs 3 \ 43 | --lr_scheduler_type cosine \ 44 | --gradient_accumulation_steps 1 \ 45 | --actor_gradient_checkpointing \ 46 | --critic_gradient_checkpointing \ 47 | --offload_reference_model \ 48 | --enable_ema \ 49 | --actor_dropout 0.0 \ 50 | --num_warmup_steps 100 \ 51 | --deepspeed --seed 1234 \ 52 | --actor_zero_stage $ACTOR_ZERO_STAGE \ 53 | --critic_zero_stage $CRITIC_ZERO_STAGE \ 54 | --enable_hybrid_engine \ 55 | --output_dir $OUTPUT \ 56 | &> $OUTPUT/training.log 57 | -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/data_utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/data_utils.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/ds_utils.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/ds_utils.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/model_utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/model_utils.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/perf.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/perf.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/raw_datasets.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/raw_datasets.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/reward_model.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/reward_model.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/__pycache__/utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-Study-Han/Zero-Chatgpt/03a1d98d5fcf879bf13eb410bdd54547bbd46095/rlhf/utils/__pycache__/utils.cpython-311.pyc -------------------------------------------------------------------------------- /rlhf/utils/data_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, Subset, ConcatDataset 3 | from torch.nn.utils.rnn import pad_sequence 4 | import torch.nn.functional as F 5 | from datasets import load_dataset 6 | import numpy as np 7 | import os 8 | import hashlib 9 | from itertools import chain 10 | from utils.raw_datasets import LocalJsonFileDataset 11 | from deepspeed.accelerator import get_accelerator 12 | 13 | 14 | 15 | def get_raw_dataset(data_path): 16 | 17 | return LocalJsonFileDataset(data_path) 18 | 19 | 20 | 21 | def get_shuffle_idx(seed, size): 22 | np_rng = np.random.RandomState(seed=seed) 23 | dtype_ = np.uint32 24 | if size >= (np.iinfo(np.uint32).max - 1): 25 | dtype_ = np.int64 26 | shuffle_idx = np.arange(start=0, stop=size, step=1, dtype=dtype_) 27 | np_rng.shuffle(shuffle_idx) 28 | return shuffle_idx 29 | 30 | def get_raw_dataset_split_index(seed, data_size): 31 | """ 32 | Generate raw dataset split indices without saving or loading. 
33 | 34 | Parameters: 35 | - seed: int, random seed for shuffling 36 | - data_size: int, size of the dataset 37 | 38 | Returns: 39 | - index_list: list, shuffled index list 40 | """ 41 | shuffle_idx = get_shuffle_idx(seed, data_size) 42 | return shuffle_idx.tolist() 43 | 44 | def create_dataset(data_path, 45 | train_phase, seed, tokenizer, end_of_conversation_token, 46 | max_seq_len): 47 | raw_dataset = get_raw_dataset(data_path) 48 | train_dataset = raw_dataset.get_train_data() 49 | train_dataset = create_dataset_split(train_dataset, raw_dataset, 50 | train_phase, tokenizer, 51 | max_seq_len) 52 | eval_dataset = raw_dataset.get_eval_data() 53 | eval_dataset = create_dataset_split(eval_dataset, raw_dataset, train_phase, 54 | tokenizer, 55 | max_seq_len) 56 | 57 | return train_dataset, eval_dataset 58 | 59 | class PromptDataset(Dataset): 60 | 61 | def __init__(self, prompt_dataset, chosen_dataset, reject_dataset, 62 | pad_token_id, train_phase) -> None: 63 | super().__init__() 64 | self.prompt_dataset = prompt_dataset 65 | self.chosen_dataset = chosen_dataset 66 | self.reject_dataset = reject_dataset 67 | self.pad_token_id = pad_token_id 68 | self.train_phase = train_phase 69 | 70 | def __len__(self): 71 | length = len(self.chosen_dataset) 72 | if self.train_phase == 3: 73 | length = len(self.prompt_dataset) 74 | return length 75 | 76 | def __getitem__(self, idx): 77 | if self.train_phase == 2: 78 | return self.chosen_dataset[idx]["input_ids"], self.chosen_dataset[idx]["attention_mask"], \ 79 | self.reject_dataset[idx]["input_ids"], self.reject_dataset[idx]["attention_mask"] 80 | elif self.train_phase == 3: 81 | return self.prompt_dataset[idx]["input_ids"],self.prompt_dataset[idx]["attention_mask"], \ 82 | self.pad_token_id 83 | 84 | 85 | def create_prompt_dataset(local_rank, 86 | data_path, 87 | train_phase, 88 | seed, 89 | tokenizer, 90 | max_seq_len, 91 | end_of_conversation_token=None): 92 | """ 93 | Creates the prompt dataset 94 | """ 95 | if local_rank <= 0 : 96 | 97 | train_dataset, eval_dataset = create_dataset( 98 | data_path, 99 | train_phase, 100 | seed, 101 | tokenizer, 102 | end_of_conversation_token, 103 | max_seq_len) 104 | return train_dataset, eval_dataset 105 | 106 | torch.distributed.barrier() 107 | return None, None 108 | 109 | def create_dataset_split(current_dataset, raw_dataset, train_phase, tokenizer, max_seq_len): 110 | prompt_dataset = [] 111 | chosen_dataset = [] 112 | reject_dataset = [] 113 | 114 | if train_phase == 2: 115 | for i, tmp_data in enumerate(current_dataset): 116 | # tokenize the text 117 | chosen_sentence = raw_dataset.get_prompt_and_chosen( 118 | tmp_data) # the accept response 119 | reject_sentence = raw_dataset.get_prompt_and_rejected( 120 | tmp_data) # the accept response 121 | if chosen_sentence is not None and reject_sentence is not None: 122 | # chosen_sentence += end_of_conversation_token # the accept response 123 | # reject_sentence += end_of_conversation_token 124 | chosen_token = tokenizer(chosen_sentence, 125 | max_length=max_seq_len, 126 | padding="max_length", 127 | truncation=True, 128 | return_tensors="pt") 129 | reject_token = tokenizer(reject_sentence, 130 | max_length=max_seq_len, 131 | padding="max_length", 132 | truncation=True, 133 | return_tensors="pt") 134 | chosen_token["input_ids"] = chosen_token["input_ids"] 135 | chosen_token["attention_mask"] = chosen_token["attention_mask"] 136 | chosen_dataset.append(chosen_token) 137 | 138 | reject_token["input_ids"] = reject_token["input_ids"] 139 | reject_token["attention_mask"] = 
reject_token["attention_mask"] 140 | reject_dataset.append(reject_token) 141 | print( 142 | f'Creating dataset {raw_dataset.dataset_name_clean} for {train_phase=} size={len(chosen_dataset)}' 143 | ) 144 | 145 | elif train_phase == 3: 146 | filtered = 0 147 | for i, tmp_data in enumerate(current_dataset): 148 | # tokenize the text 149 | prompt = raw_dataset.get_prompt(tmp_data) 150 | if prompt is not None: 151 | prompt_token = tokenizer(prompt, return_tensors="pt") 152 | if prompt_token["input_ids"].size()[-1] <= max_seq_len: 153 | for key_word in ["input_ids", "attention_mask"]: 154 | prompt_token[key_word] = prompt_token[ 155 | key_word].squeeze(0).flip(0) 156 | prompt_dataset.append(prompt_token) 157 | else: 158 | filtered += 1 159 | print(f'Creating dataset {raw_dataset.dataset_name_clean} ' 160 | f'for {train_phase=} size={len(prompt_dataset)} {filtered=}') 161 | 162 | return PromptDataset(prompt_dataset, chosen_dataset, reject_dataset, 163 | tokenizer.pad_token_id, train_phase) 164 | 165 | 166 | class DataCollatorReward: 167 | 168 | def __call__(self, data): 169 | batch = {} 170 | batch["input_ids"] = torch.cat([f[0] 171 | for f in data] + [f[2] for f in data], 172 | dim=0) 173 | batch["attention_mask"] = torch.cat([f[1] for f in data] + 174 | [f[3] for f in data], 175 | dim=0) 176 | return batch 177 | 178 | class MiniDataset: 179 | 180 | def __init__(self, max_size, small_batch_size): 181 | self.dataset = [] 182 | self.max_size = max_size 183 | self.small_batch_size = small_batch_size 184 | 185 | def seperate(self): 186 | small_dataset = [] 187 | for large_batch in self.dataset: 188 | if type(large_batch) == list or type(large_batch) == tuple: 189 | large_size = len(large_batch[0]) 190 | elif type(large_batch) == dict: 191 | large_size = len(large_batch[list(large_batch.keys())[0]]) 192 | else: 193 | large_size = len(large_batch) 194 | for i in range(0, large_size, self.small_batch_size): 195 | if type(large_batch) == list or type(large_batch) == tuple: 196 | small_dataset.append( 197 | [x[i:i + self.small_batch_size] for x in large_batch]) 198 | elif type(large_batch) == dict: 199 | small_dataset.append({ 200 | k: v[i:i + self.small_batch_size] 201 | for k, v in large_batch.items() 202 | }) 203 | else: 204 | small_dataset.append(large_batch[i:i + 205 | self.small_batch_size]) 206 | self.free() 207 | 208 | return small_dataset 209 | 210 | def add(self, data): 211 | if len(self.dataset) < self.max_size: 212 | self.dataset.append(data) 213 | if len(self.dataset) == self.max_size: 214 | return self.seperate() 215 | else: 216 | return None 217 | else: 218 | raise ValueError( 219 | "The dataset is full but we did not stop it. There is a bug in the code." 220 | ) 221 | 222 | def free(self): 223 | self.dataset = [] 224 | 225 | class DataCollatorRLHF: 226 | 227 | def __init__(self, max_token_len, inference_tp_size, pad_token_id): 228 | self.max_token_len = max_token_len 229 | self.inference_tp_size = inference_tp_size 230 | self.pad_token_id = pad_token_id 231 | 232 | def __call__(self, data): 233 | batch = {} 234 | # pad_token_id = data[-1][-1] 235 | 236 | prompt = pad_sequence([f[0] for f in data], 237 | padding_value=self.pad_token_id, 238 | batch_first=True) 239 | prompt_mask = pad_sequence([f[1] for f in data], 240 | padding_value=0, 241 | batch_first=True) 242 | 243 | ### make sure the final ouput is a seqence of 2**? 
244 | length = prompt.size()[-1] 245 | pad_length = self.max_token_len - length 246 | if pad_length > 0: 247 | batch["prompt"] = F.pad(prompt, 248 | pad=(0, pad_length), 249 | mode='constant', 250 | value=self.pad_token_id) 251 | batch["prompt_att_mask"] = F.pad(prompt_mask, 252 | pad=(0, pad_length), 253 | mode='constant', 254 | value=0) 255 | else: 256 | batch["prompt"] = prompt 257 | batch["prompt_att_mask"] = prompt_mask 258 | batch["prompt"] = batch["prompt"].flip(1) 259 | batch["prompt_att_mask"] = batch["prompt_att_mask"].flip(1) 260 | return batch 261 | 262 | 263 | def get_unsupervised_data(args, tokenizer): 264 | unsupervised_raw_datasets = load_dataset( 265 | args.unsupervised_dataset_name, args.unsupervised_dataset_config_name) 266 | column_names = unsupervised_raw_datasets["train"].column_names 267 | text_column_name = "text" if "text" in column_names else column_names[0] 268 | 269 | def tokenize_function(examples): 270 | return tokenizer(examples[text_column_name]) 271 | 272 | tokenized_datasets = unsupervised_raw_datasets.map( 273 | tokenize_function, 274 | batched=True, 275 | num_proc=args.preprocessing_num_workers, 276 | remove_columns=column_names, 277 | load_from_cache_file=True, 278 | desc="Running tokenizer on dataset", 279 | ) 280 | 281 | block_size = args.max_prompt_seq_len + args.max_answer_seq_len 282 | 283 | def group_texts(examples): 284 | # Concatenate all texts. 285 | concatenated_examples = { 286 | k: list(chain(*examples[k])) 287 | for k in examples.keys() 288 | } 289 | total_length = len(concatenated_examples[list(examples.keys())[0]]) 290 | # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can 291 | # customize this part to your needs. 292 | if total_length >= block_size: 293 | total_length = (total_length // block_size) * block_size 294 | # Split by chunks of max_len. 295 | result = { 296 | k: 297 | [t[i:i + block_size] for i in range(0, total_length, block_size)] 298 | for k, t in concatenated_examples.items() 299 | } 300 | result["labels"] = result["input_ids"].copy() 301 | return result 302 | 303 | lm_datasets = tokenized_datasets.map( 304 | group_texts, 305 | batched=True, 306 | num_proc=args.preprocessing_num_workers, 307 | load_from_cache_file=True, 308 | desc=f"Grouping texts in chunks of {block_size}", 309 | ) 310 | 311 | train_dataset = lm_datasets["train"] 312 | 313 | return train_dataset -------------------------------------------------------------------------------- /rlhf/utils/ds_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 
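# (Editor's note) Rough shape of the dict get_train_ds_config() below returns,
# for orientation (the function body is authoritative):
#   {
#     "train_batch_size": 32,               # GLOBAL_BATCH_SIZE placeholder
#     "train_micro_batch_size_per_gpu": 4,  # MICRO_BATCH_SIZE placeholder
#     "zero_optimization": {"stage": <stage>, "offload_param": {...}, ...},
#     "fp16" or "bfloat16": {"enabled": True},
#     "gradient_clipping": 1.0,
#     "hybrid_engine": {...},
#     "tensorboard": {...}
#   }
# Callers in rlhf_engine.py and step2.py always overwrite the two batch-size
# keys before handing the config to deepspeed.initialize().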
2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | # DeepSpeed Team 5 | 6 | import deepspeed.comm as dist 7 | from deepspeed.accelerator import get_accelerator 8 | 9 | GLOBAL_BATCH_SIZE = 32 10 | MICRO_BATCH_SIZE = 4 11 | 12 | 13 | def get_train_ds_config(offload, 14 | dtype, 15 | stage=2, 16 | enable_hybrid_engine=False, 17 | inference_tp_size=1, 18 | release_inference_cache=False, 19 | pin_parameters=True, 20 | tp_gather_partition_size=8, 21 | max_out_tokens=512, 22 | enable_tensorboard=False, 23 | enable_mixed_precision_lora=False, 24 | tb_path="", 25 | tb_name=""): 26 | 27 | device = "cpu" if offload else "none" 28 | if dtype == "fp16": 29 | data_type = "fp16" 30 | dtype_config = {"enabled": True, "loss_scale_window": 100} 31 | elif dtype == "bf16": 32 | data_type = "bfloat16" 33 | dtype_config = {"enabled": True} 34 | zero_opt_dict = { 35 | "stage": stage, 36 | "offload_param": { 37 | "device": device 38 | }, 39 | "offload_optimizer": { 40 | "device": device 41 | }, 42 | "stage3_param_persistence_threshold": 1e4, 43 | "stage3_max_live_parameters": 3e7, 44 | "stage3_prefetch_bucket_size": 3e7, 45 | "memory_efficient_linear": False 46 | } 47 | if enable_mixed_precision_lora: 48 | zero_opt_dict["zero_quantized_nontrainable_weights"] = True 49 | if dist.get_world_size() != get_accelerator().device_count(): 50 | zero_opt_dict["zero_hpz_partition_size"] = get_accelerator( 51 | ).device_count() 52 | return { 53 | "train_batch_size": GLOBAL_BATCH_SIZE, 54 | "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE, 55 | "steps_per_print": 10, 56 | "zero_optimization": zero_opt_dict, 57 | data_type: dtype_config, 58 | "gradient_clipping": 1.0, 59 | "prescale_gradients": False, 60 | "wall_clock_breakdown": False, 61 | "hybrid_engine": { 62 | "enabled": enable_hybrid_engine, 63 | "max_out_tokens": max_out_tokens, 64 | "inference_tp_size": inference_tp_size, 65 | "release_inference_cache": release_inference_cache, 66 | "pin_parameters": pin_parameters, 67 | "tp_gather_partition_size": tp_gather_partition_size, 68 | }, 69 | "tensorboard": { 70 | "enabled": enable_tensorboard, 71 | "output_path": f"{tb_path}/ds_tensorboard_logs/", 72 | "job_name": f"{tb_name}_tensorboard" 73 | } 74 | } 75 | 76 | 77 | def get_eval_ds_config(offload, dtype, stage=0): 78 | device = "cpu" if offload else "none" 79 | if dtype == "fp16": 80 | data_type = "fp16" 81 | dtype_config = { 82 | "enabled": True, 83 | } 84 | elif dtype == "bf16": 85 | data_type = "bfloat16" 86 | dtype_config = {"enabled": True} 87 | zero_opt_dict = { 88 | "stage": stage, 89 | "stage3_param_persistence_threshold": 1e4, 90 | "offload_param": { 91 | "device": device 92 | }, 93 | "memory_efficient_linear": False 94 | } 95 | return { 96 | "train_batch_size": GLOBAL_BATCH_SIZE, 97 | "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE, 98 | "steps_per_print": 10, 99 | "zero_optimization": zero_opt_dict, 100 | data_type: dtype_config, 101 | "gradient_clipping": 1.0, 102 | "prescale_gradients": False, 103 | "wall_clock_breakdown": False 104 | } 105 | -------------------------------------------------------------------------------- /rlhf/utils/model_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import torch 4 | from transformers import ( 5 | AutoConfig, 6 | AutoModel, 7 | ) 8 | from transformers.deepspeed import HfDeepSpeedConfig 9 | from utils.reward_model import RewardModel 10 | from utils.utils import load_state_dict_into_model, print_rank_0 11 | 12 | 13 | 14 | def 
create_hf_model(model_class, 15 | model_name_or_path, 16 | tokenizer, 17 | ds_config=None, 18 | rlhf_training=False, 19 | dropout=None): 20 | model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True) 21 | 22 | 23 | # Note: dschf is defined in function scope to avoid global effects 24 | # https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration 25 | if ds_config is not None and ds_config["zero_optimization"]["stage"] == 3: 26 | dschf = HfDeepSpeedConfig(ds_config) 27 | else: 28 | dschf = None 29 | if rlhf_training: 30 | # the weight loading is handled by create critic model 31 | model = model_class.from_config(model_config, trust_remote_code=True) 32 | else: 33 | model = model_class.from_pretrained( 34 | model_name_or_path, 35 | from_tf=bool(".ckpt" in model_name_or_path), 36 | config=model_config, 37 | trust_remote_code=True) 38 | 39 | model.config.end_token_id = tokenizer.eos_token_id 40 | model.config.pad_token_id = model.config.eos_token_id 41 | model.resize_token_embeddings(int( 42 | 8 * 43 | math.ceil(len(tokenizer) / 8.0))) # make the vocab size multiple of 8 44 | 45 | return model 46 | 47 | def create_critic_model(model_name_or_path, 48 | tokenizer, 49 | ds_config, 50 | num_padding_at_beginning=0, 51 | rlhf_training=False, 52 | dropout=None, 53 | zero_stage=0, 54 | compute_fp32_loss=False): 55 | # OPT model family always put a padding token at the beginning of the sequence, 56 | # we did not see this in other models but not sure if it is a general rule 57 | 58 | import time 59 | 60 | start = time.time() 61 | critic_model = create_hf_model(AutoModel, model_name_or_path, tokenizer, 62 | ds_config, rlhf_training, dropout) 63 | end = time.time() 64 | print_rank_0(f">Creating model from_config took {end - start} seconds", 65 | None) 66 | 67 | critic_model = RewardModel( 68 | critic_model, 69 | tokenizer, 70 | num_padding_at_beginning=num_padding_at_beginning, 71 | compute_fp32_loss=compute_fp32_loss) 72 | 73 | if rlhf_training: 74 | # load critic model from checkpoint 75 | 76 | 77 | model_ckpt_path = os.path.join(model_name_or_path, 'pytorch_model.bin') 78 | assert os.path.exists( 79 | model_ckpt_path 80 | ), f"Cannot find model checkpoint at {model_ckpt_path}" 81 | 82 | start = time.time() 83 | model_ckpt_state_dict = torch.load(model_ckpt_path, map_location='cpu') 84 | end = time.time() 85 | print_rank_0(f">Creating model from_config took {end - start} seconds", 86 | None) 87 | 88 | # load critic model from checkpoint with zero-stage 3 compatibility 89 | # this functionality may be moved to DS checkpoint load API in future 90 | start = time.time() 91 | load_state_dict_into_model(critic_model, 92 | model_ckpt_state_dict, 93 | "", 94 | zero_stage=zero_stage) 95 | end = time.time() 96 | 97 | print_rank_0(f">Creating model from_config took {end - start} seconds", 98 | None) 99 | 100 | return critic_model -------------------------------------------------------------------------------- /rlhf/utils/perf.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 
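# Throughput below is estimated with the Megatron-LM paper's formula (implemented in
# calculate_flops at the bottom of this file):
#
#   flops_per_iteration = 24 * c * B * s * L * h^2 * (1 + s / (6 * h) + V / (16 * L * h))
#
# where c is the activation-checkpointing factor (4 with gradient checkpointing, 3 without,
# reduced further when LoRA is enabled), B the batch size, s the sequence length, L the number
# of hidden layers, h the hidden size and V the vocabulary size. Reported TFLOPs are then
# flops_per_iteration / (elapsed_time * num_gpus * 1e12).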
2 | # SPDX-License-Identifier: Apache-2.0 3 | 4 | # DeepSpeed Team 5 | 6 | import torch 7 | 8 | 9 | # This function can be used to print throughput for Step 1 and 2 only 10 | def print_throughput(hf_model, args, e2e_time, rank=0): 11 | if rank <= 0: 12 | hf_config = hf_model.config 13 | num_layers, hidden_size, vocab_size = get_hf_configs(hf_config) 14 | 15 | gpus_per_model = torch.distributed.get_world_size() 16 | seq_length = args.max_seq_len 17 | batch_size = args.per_device_train_batch_size 18 | samples_per_second = batch_size / e2e_time 19 | checkpoint_activations_factor = 4 if args.gradient_checkpointing else 3 20 | if args.lora_dim > 0: 21 | k = args.lora_dim * 2 / hidden_size 22 | checkpoint_activations_factor -= (1 - k) 23 | 24 | hf_model._num_params = sum([ 25 | p.ds_numel if hasattr(p, "ds_tensor") else p.numel() 26 | for p in hf_model.parameters() 27 | ]) 28 | params_in_billions = hf_model._num_params / (1e9) 29 | 30 | # Megatron paper's formula to calculate training flops 31 | train_flops_per_iteration = calculate_flops( 32 | checkpoint_activations_factor, batch_size, seq_length, hf_config) 33 | 34 | train_tflops = train_flops_per_iteration / (e2e_time * gpus_per_model * 35 | (10**12)) 36 | 37 | param_string = f"{params_in_billions:.3f} B" if params_in_billions != 0 else "NA" 38 | print( 39 | f"Model Parameters: {param_string}, Latency: {e2e_time:.2f}s, TFLOPs: {train_tflops:.2f}, Samples/sec: {samples_per_second:.2f}, Time/seq {e2e_time/batch_size:.2f}s, Batch Size: {batch_size}, Sequence Length: {seq_length}" 40 | ) 41 | 42 | 43 | # Enhanced version of the function above that provides calculations and printing for Step 3 44 | def print_throughput_step3(actor_model, 45 | critic_model, 46 | args, 47 | e2e_time, 48 | gen_exp_time, 49 | train_time, 50 | rank=0): 51 | if rank <= 0: 52 | # Actor model passed here is a HF model. 53 | actor_hf_config = actor_model.config 54 | # Critic model passed here is a DeepSpeed Engine. The module inside is the Reward model (that wraps a HF model). 
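# Unwrapping is needed here: engine.module is the RewardModel defined in reward_model.py, whose
# .config was copied from the wrapped base transformer.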
55 | critic_hf_config = critic_model.module.config 56 | 57 | actor_num_layers, actor_hidden_size, actor_vocab_size = get_hf_configs( 58 | actor_hf_config) 59 | critic_num_layers, critic_hidden_size, critic_vocab_size = get_hf_configs( 60 | critic_hf_config) 61 | 62 | gpus_per_model = torch.distributed.get_world_size() 63 | seq_length = args.max_answer_seq_len + args.max_prompt_seq_len 64 | batch_size = args.per_device_generation_batch_size * args.generation_batches * args.ppo_epochs * gpus_per_model * 1 if args.unsupervised_dataset_name is None else 2 65 | samples_per_second = batch_size / e2e_time 66 | 67 | actor_checkpoint_activations_factor = 4 if args.actor_gradient_checkpointing else 3 68 | critic_checkpoint_activations_factor = 4 if args.critic_gradient_checkpointing else 3 69 | if args.actor_lora_dim > 0: 70 | k = args.actor_lora_dim * 2 / actor_hidden_size 71 | actor_checkpoint_activations_factor -= (1 - k) 72 | if args.critic_lora_dim > 0: 73 | k = args.critic_lora_dim * 2 / critic_hidden_size 74 | critic_checkpoint_activations_factor -= (1 - k) 75 | 76 | actor_model._num_params = sum([ 77 | p.ds_numel if hasattr(p, "ds_tensor") else p.numel() 78 | for p in actor_model.parameters() 79 | ]) 80 | actor_params_in_billions = actor_model._num_params / (1e9) 81 | 82 | critic_model._num_params = sum([ 83 | p.ds_numel if hasattr(p, "ds_tensor") else p.numel() 84 | for p in critic_model.parameters() 85 | ]) 86 | critic_params_in_billions = critic_model._num_params / (1e9) 87 | 88 | # Megatron paper's formula to calculate training flops 89 | 90 | actor_train_flops_per_iteration = calculate_flops( 91 | actor_checkpoint_activations_factor, batch_size, seq_length, 92 | actor_hf_config) 93 | critic_train_flops_per_iteration = calculate_flops( 94 | critic_checkpoint_activations_factor, batch_size, seq_length, 95 | critic_hf_config) 96 | 97 | total_train_flops = actor_train_flops_per_iteration + critic_train_flops_per_iteration 98 | train_tflops = total_train_flops / (train_time * gpus_per_model * 99 | (10**12)) 100 | 101 | gen_bs = args.per_device_generation_batch_size * gpus_per_model 102 | 103 | # Modified formula for calculating flops in the forward pass only 104 | gen_flops_per_iteration = ( 105 | 24 * gen_bs * seq_length * actor_num_layers * 106 | (actor_hidden_size**2)) * ( 107 | 1.0 + (seq_length / (6.0 * actor_hidden_size)) + 108 | (actor_vocab_size / 109 | (16.0 * actor_num_layers * actor_hidden_size))) 110 | 111 | gen_tflops = gen_flops_per_iteration / (gen_exp_time * gpus_per_model * 112 | (10**12)) 113 | 114 | if actor_hf_config.torch_dtype == torch.float16: 115 | num_bytes = 2 116 | elif actor_hf_config.torch_dtype == torch.float32: 117 | num_bytes = 4 118 | else: 119 | num_bytes = -1 120 | 121 | pertok_lat = gen_exp_time / args.max_answer_seq_len 122 | gen_bw = 1 / pertok_lat * actor_model._num_params * num_bytes / 1e9 123 | 124 | total_flops_per_iteration = total_train_flops + gen_flops_per_iteration * args.generation_batches 125 | total_tflops = total_flops_per_iteration / (e2e_time * gpus_per_model * 126 | (10**12)) 127 | 128 | print( 129 | f"End-to-End => Latency: {e2e_time:.2f}s, TFLOPs: {total_tflops:.2f}, Samples/sec: {samples_per_second:.2f}, Time/seq {e2e_time/batch_size:.2f}s, Batch Size: {batch_size}, Total Seq. Length: {seq_length}" 130 | ) 131 | print( 132 | f"Generation => Latency: {gen_exp_time:.2f}s, Per-token Latency {pertok_lat*1000:.2f} ms, TFLOPs: {gen_tflops:.2f}, BW: {gen_bw if num_bytes > 0 else num_bytes:.2f} GB/sec, Answer Seq. 
Length: {args.max_answer_seq_len}" 133 | ) 134 | print( 135 | f"Training => Latency: {train_time:.2f}s, TFLOPs: {train_tflops:.2f}" 136 | ) 137 | actor_param_string = f"{actor_params_in_billions:.3f} B" if actor_params_in_billions != 0 else "NA" 138 | critic_param_string = f"{critic_params_in_billions:.3f} B" if critic_params_in_billions != 0 else "NA" 139 | print( 140 | f"Actor Model Parameters => {actor_param_string}, Critic Model Parameters => {critic_param_string}" 141 | ) 142 | 143 | 144 | # Helper function to calculate FLOPs using the Megatron-LM paper's formula 145 | def calculate_flops(checkpoint_activations_factor, batch_size, seq_length, 146 | hf_config): 147 | num_layers, hidden_size, vocab_size = get_hf_configs(hf_config) 148 | flops_per_iteration = (24 * checkpoint_activations_factor * batch_size * 149 | seq_length * num_layers * (hidden_size**2)) * ( 150 | 1.0 + (seq_length / (6.0 * hidden_size)) + 151 | (vocab_size / 152 | (16.0 * num_layers * hidden_size))) 153 | return flops_per_iteration 154 | 155 | 156 | def get_hf_configs(hf_config): 157 | num_layers = getattr(hf_config, "num_hidden_layers", 158 | getattr(hf_config, "n_layer", None)) 159 | hidden_size = getattr(hf_config, "hidden_size", 160 | getattr(hf_config, "n_embd", None)) 161 | vocab_size = getattr(hf_config, "vocab_size", None) 162 | assert all( 163 | (num_layers, hidden_size, vocab_size) 164 | ), "Could not determine number of layers, hidden size, and vocab size of the model" 165 | 166 | return num_layers, hidden_size, vocab_size 167 | -------------------------------------------------------------------------------- /rlhf/utils/raw_datasets.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | # DeepSpeed Team 4 | from datasets import load_dataset, load_from_disk 5 | from torch.utils.data import Subset 6 | import re 7 | 8 | class LocalJsonFileDataset(object): 9 | 10 | def __init__(self, data_path): 11 | self.dataset_name = "local/jsonfile" 12 | self.dataset_name_clean = "jsonfile" 13 | self.raw_datasets = load_dataset('json', 14 | data_files={ 15 | "train": 16 | data_path + '/train.jsonl', 17 | "eval": 18 | data_path + '/eval.jsonl' 19 | }) 20 | 21 | def get_train_data(self): 22 | if self.raw_datasets['train'] is not None: 23 | return self.raw_datasets['train'] 24 | return None 25 | 26 | def get_eval_data(self): 27 | if self.raw_datasets['eval'] is not None: 28 | return self.raw_datasets['eval'] 29 | return None 30 | 31 | # The prompt should be in the format of: " Human: " + actual_prompt_sentence + " Assistant:" 32 | def get_prompt(self, sample): 33 | if sample['prompt'] is not None: 34 | return sample['prompt'] 35 | return None 36 | 37 | # The chosen response should be in the format of: " " + actual_response_sentence 38 | def get_chosen(self, sample): 39 | if sample['chosen'] is not None: 40 | return sample['chosen'] 41 | return None 42 | 43 | # The rejected response should be in the format of: " " + actual_response_sentence 44 | # If the dataset does not have rejected response, return None 45 | def get_rejected(self, sample): 46 | if sample['rejected'] is not None: 47 | return sample['rejected'] 48 | return None 49 | 50 | def get_prompt_and_chosen(self, sample): 51 | if sample['prompt'] is not None and sample['chosen'] is not None: 52 | return sample['prompt'] + sample['chosen'] 53 | return None 54 | 55 | def get_prompt_and_rejected(self, sample): 56 | if sample['prompt'] is not None and sample['rejected'] is not None: 57 | return sample['prompt'] + 
sample['rejected'] 58 | return None -------------------------------------------------------------------------------- /rlhf/utils/reward_model.py: -------------------------------------------------------------------------------- 1 | # DeepSpeed Team 2 | import torch 3 | from torch import nn 4 | 5 | 6 | ## Note that the following code is modified from 7 | ## https://github.com/CarperAI/trlx/blob/main/examples/summarize_rlhf/reward_model/reward_model.py 8 | class RewardModel(nn.Module): 9 | 10 | def __init__(self, 11 | base_model, 12 | tokenizer, 13 | num_padding_at_beginning=0, 14 | compute_fp32_loss=False): 15 | super().__init__() 16 | self.config = base_model.config 17 | self.num_padding_at_beginning = num_padding_at_beginning 18 | if hasattr(self.config, "word_embed_proj_dim"): 19 | # `OPT` models use word_embed_proj_dim as final output 20 | # https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L497 21 | self.v_head = nn.Linear(self.config.word_embed_proj_dim, 22 | 1, 23 | bias=False) 24 | else: 25 | # `gpt-neo(x)` models use `hidden_size` attribute names instead of `n_embd`` 26 | self.config.n_embd = self.config.hidden_size if hasattr( 27 | self.config, "hidden_size") else self.config.n_embd 28 | self.v_head = nn.Linear(self.config.n_embd, 1, bias=False) 29 | self.rwtransformer = base_model 30 | self.PAD_ID = tokenizer.pad_token_id 31 | self.compute_fp32_loss = compute_fp32_loss 32 | 33 | def gradient_checkpointing_enable(self): 34 | self.rwtransformer.gradient_checkpointing_enable() 35 | 36 | def gradient_checkpointing_disable(self): 37 | self.rwtransformer.gradient_checkpointing_disable() 38 | 39 | def forward(self, 40 | input_ids=None, 41 | past_key_values=None, 42 | attention_mask=None, 43 | position_ids=None, 44 | head_mask=None, 45 | inputs_embeds=None, 46 | use_cache=False): 47 | loss = None 48 | 49 | 50 | kwargs = dict() 51 | 52 | transformer_outputs = self.rwtransformer( 53 | input_ids, 54 | past_key_values=past_key_values, 55 | attention_mask=attention_mask, 56 | inputs_embeds=inputs_embeds, 57 | use_cache=use_cache, 58 | **kwargs) 59 | 60 | hidden_states = transformer_outputs[0] 61 | rewards = self.v_head(hidden_states).squeeze(-1) 62 | chosen_mean_scores = [] 63 | rejected_mean_scores = [] 64 | 65 | # Split the inputs and rewards into two parts, chosen and rejected 66 | assert len(input_ids.shape) == 2 67 | bs = input_ids.shape[0] // 2 68 | seq_len = input_ids.shape[1] 69 | 70 | chosen_ids = input_ids[:bs] # bs x seq x 1 71 | rejected_ids = input_ids[bs:] 72 | chosen_rewards = rewards[:bs] 73 | rejected_rewards = rewards[bs:] 74 | 75 | # Compute pairwise loss. Only backprop on the different tokens before padding 76 | loss = 0. 
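# Pairwise ranking loss: for each (chosen, rejected) pair, token-level rewards are compared from
# the first position where the two sequences diverge up to the end of the longer answer, and
# -logsigmoid(r_chosen - r_rejected) is accumulated; the "mean scores" returned below are the
# rewards taken at the last real (non-padding) token of each sequence.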
77 | for i in range(bs): 78 | chosen_id = chosen_ids[i] 79 | rejected_id = rejected_ids[i] 80 | chosen_reward = chosen_rewards[i] 81 | rejected_reward = rejected_rewards[i] 82 | 83 | c_inds = (chosen_id == self.PAD_ID).nonzero() 84 | c_ind = c_inds[self.num_padding_at_beginning].item() if len( 85 | c_inds 86 | ) > self.num_padding_at_beginning else seq_len # OPT model pads the first token, so we need to use the second padding token as the end of the sequence 87 | check_divergence = (chosen_id != rejected_id).nonzero() 88 | 89 | if len(check_divergence) == 0: 90 | end_ind = rejected_reward.size(-1) 91 | divergence_ind = end_ind - 1 92 | r_ind = c_ind 93 | else: 94 | # Check if there is any padding otherwise take length of sequence 95 | r_inds = (rejected_id == self.PAD_ID).nonzero() 96 | r_ind = r_inds[self.num_padding_at_beginning].item( 97 | ) if len(r_inds) > self.num_padding_at_beginning else seq_len 98 | end_ind = max(c_ind, r_ind) 99 | divergence_ind = check_divergence[0] 100 | assert divergence_ind > 0 101 | c_truncated_reward = chosen_reward[divergence_ind:end_ind] 102 | r_truncated_reward = rejected_reward[divergence_ind:end_ind] 103 | chosen_mean_scores.append( 104 | chosen_reward[c_ind - 1]) #use the end score for reference 105 | rejected_mean_scores.append(rejected_reward[r_ind - 1]) 106 | 107 | if self.compute_fp32_loss: 108 | c_truncated_reward = c_truncated_reward.float() 109 | r_truncated_reward = r_truncated_reward.float() 110 | loss += -torch.nn.functional.logsigmoid(c_truncated_reward - 111 | r_truncated_reward).mean() 112 | 113 | loss = loss / bs 114 | chosen_mean_scores = torch.stack(chosen_mean_scores) 115 | rejected_mean_scores = torch.stack(rejected_mean_scores) 116 | return { 117 | "loss": loss, 118 | "chosen_mean_scores": chosen_mean_scores, 119 | "rejected_mean_scores": rejected_mean_scores, 120 | } 121 | 122 | def forward_value(self, 123 | input_ids=None, 124 | attention_mask=None, 125 | past_key_values=None, 126 | position_ids=None, 127 | head_mask=None, 128 | inputs_embeds=None, 129 | return_value_only=False, 130 | prompt_length=0, 131 | use_cache=False): 132 | 133 | 134 | kwargs = dict() 135 | 136 | 137 | transformer_outputs = self.rwtransformer( 138 | input_ids, 139 | past_key_values=past_key_values, 140 | attention_mask=attention_mask, 141 | inputs_embeds=inputs_embeds, 142 | use_cache=use_cache, 143 | **kwargs) 144 | hidden_states = transformer_outputs[0] 145 | values = self.v_head(hidden_states).squeeze(-1) 146 | if return_value_only: 147 | return values 148 | else: 149 | # [0 0 0 0 prompt, answer, 0 0 0 0 ] for step 3, we have padding at the beginning 150 | # [prompt, answer, 0, 0, 0, 0] this is normal 151 | assert prompt_length > 1, "prompt_length must be greater than 1 to help select the end score" 152 | bs = values.size(0) 153 | seq_len = input_ids.shape[1] 154 | chosen_end_scores = [ 155 | ] # we use this name for consistency with the original forward function 156 | for i in range(bs): 157 | input_id = input_ids[i] 158 | value = values[i] 159 | 160 | c_inds = (input_id[prompt_length:] == self.PAD_ID).nonzero() 161 | # here we only use the answer part of the sequence so we do not need to care about the padding at the beginning 162 | c_ind = c_inds[0].item() + prompt_length if len( 163 | c_inds) > 0 else seq_len 164 | chosen_end_scores.append(value[c_ind - 1]) 165 | return { 166 | "values": values, 167 | "chosen_end_scores": torch.stack(chosen_end_scores), 168 | } 169 | -------------------------------------------------------------------------------- 
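A minimal sketch of how the value head above is typically queried once the reward model is trained (hedged: `rw_model`, `tokenizer`, `prompt_text` and `answer_text` are placeholders; only `forward_value`, `prompt_length` and `chosen_end_scores` come from the class above):

    batch = tokenizer(prompt_text + answer_text, return_tensors="pt")
    out = rw_model.forward_value(input_ids=batch["input_ids"],
                                 attention_mask=batch["attention_mask"],
                                 prompt_length=2)  # must be > 1, see the assertion in forward_value
    score = out["chosen_end_scores"]  # reward at the last non-pad answer token, shape (batch,)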
/rlhf/utils/utils.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | import os 3 | import torch 4 | import random 5 | import numpy as np 6 | from transformers import set_seed, AutoTokenizer 7 | import json 8 | import deepspeed 9 | from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus 10 | from deepspeed.accelerator import get_accelerator 11 | import torch.nn as nn 12 | 13 | def print_rank_0(msg, rank=None): 14 | if rank is not None and rank <= 0: 15 | print(msg) 16 | elif is_rank_0(): 17 | print(msg) 18 | 19 | def is_rank_0(): 20 | """Check whether it is rank 0.""" 21 | if torch.distributed.is_initialized(): 22 | if torch.distributed.get_rank() == 0: 23 | return True 24 | else: 25 | return False 26 | else: 27 | return True 28 | 29 | def load_hf_tokenizer(model_name_or_path): 30 | tokenizer = AutoTokenizer.from_pretrained( 31 | model_name_or_path, trust_remote_code=True) 32 | return tokenizer 33 | 34 | def load_state_dict_into_model(model_to_load=None, 35 | state_dict=None, 36 | start_prefix="", 37 | zero_stage=0): 38 | 39 | # copy state_dict so _load_from_state_dict can modify it 40 | metadata = getattr(state_dict, "_metadata", None) 41 | state_dict = state_dict.copy() 42 | if metadata is not None: 43 | state_dict._metadata = metadata 44 | 45 | error_msgs = [] 46 | 47 | # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants 48 | # so we need to apply the function recursively. 49 | def load(module: nn.Module, state_dict, prefix=""): 50 | local_metadata = {} if metadata is None else metadata.get( 51 | prefix[:-1], {}) 52 | args = (state_dict, prefix, local_metadata, True, [], [], error_msgs) 53 | # Parameters of module and children will start with prefix. We can exit early if there are none in this 54 | # state_dict 55 | if len([key for key in state_dict if key.startswith(prefix)]) > 0: 56 | if zero_stage == 3: 57 | # In sharded models, each shard has only part of the full state_dict, so only gather 58 | # parameters that are in the current state_dict. 59 | named_parameters = dict( 60 | module.named_parameters(prefix=prefix[:-1], recurse=False)) 61 | params_to_gather = [ 62 | named_parameters[k] for k in state_dict.keys() 63 | if k in named_parameters 64 | ] 65 | if len(params_to_gather) > 0: 66 | # because zero3 puts placeholders in model params, this context 67 | # manager gathers (unpartitions) the params of the current layer, then loads from 68 | # the state dict and then re-partitions them again 69 | with deepspeed.zero.GatheredParameters(params_to_gather, 70 | modifier_rank=0): 71 | if torch.distributed.get_rank() == 0: 72 | module._load_from_state_dict(*args) 73 | else: 74 | module._load_from_state_dict(*args) 75 | 76 | for name, child in module._modules.items(): 77 | if child is not None: 78 | load(child, state_dict, prefix + name + ".") 79 | 80 | load(model_to_load, state_dict, prefix=start_prefix) 81 | # Delete `state_dict` so it could be collected by GC earlier. Note that `state_dict` is a copy of the argument, so 82 | # it's safe to delete it. 
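# error_msgs accumulates any copy/size-mismatch problems reported by _load_from_state_dict;
# callers can check that the returned list is empty after loading.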
83 | del state_dict 84 | 85 | return error_msgs 86 | 87 | def save_hf_format(model, tokenizer, args, sub_folder=""): 88 | # used to save huggingface format, so we can use it for hf.from_pretrained 89 | model_to_save = model.module if hasattr(model, 'module') else model 90 | CONFIG_NAME = "config.json" 91 | WEIGHTS_NAME = "pytorch_model.bin" 92 | output_dir = os.path.join(args.output_dir, sub_folder) 93 | os.makedirs(output_dir, exist_ok=True) 94 | output_model_file = os.path.join(output_dir, WEIGHTS_NAME) 95 | output_config_file = os.path.join(output_dir, CONFIG_NAME) 96 | save_dict = model_to_save.state_dict() 97 | for key in list(save_dict.keys()): 98 | if "lora" in key: 99 | del save_dict[key] 100 | torch.save(save_dict, output_model_file) 101 | model_to_save.config.to_json_file(output_config_file) 102 | tokenizer.save_vocabulary(output_dir) 103 | 104 | def to_device(batch, device): 105 | output = {} 106 | for k, v in batch.items(): 107 | try: 108 | output[k] = v.to(device) 109 | except: 110 | output[k] = v 111 | return output 112 | 113 | def set_random_seed(seed): 114 | if seed is not None: 115 | set_seed(seed) 116 | random.seed(seed) 117 | np.random.seed(seed) 118 | torch.manual_seed(seed) 119 | get_accelerator().manual_seed_all(seed) 120 | 121 | def get_all_reduce_mean(tensor): 122 | torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM) 123 | tensor = tensor / torch.distributed.get_world_size() 124 | return tensor 125 | 126 | def get_optimizer_grouped_parameters( 127 | model, 128 | weight_decay, 129 | lora_lr=5e-4, 130 | no_decay_name_list=[ 131 | "bias", "layer_norm.weight", "layernorm.weight", "norm.weight", 132 | "ln_f.weight" 133 | ], 134 | lora_name_list=["lora_right_weight", "lora_left_weight"], 135 | ): 136 | optimizer_grouped_parameters = [ 137 | { 138 | "params": [ 139 | p for n, p in model.named_parameters() 140 | if (not any(nd in n.lower() for nd in no_decay_name_list) 141 | and p.requires_grad and not any(nd in n.lower() 142 | for nd in lora_name_list)) 143 | ], 144 | "weight_decay": 145 | weight_decay, 146 | }, 147 | { 148 | "params": [ 149 | p for n, p in model.named_parameters() 150 | if (not any(nd in n.lower() for nd in no_decay_name_list) 151 | and p.requires_grad and any(nd in n.lower() 152 | for nd in lora_name_list)) 153 | ], 154 | "weight_decay": 155 | weight_decay, 156 | "lr": 157 | lora_lr 158 | }, 159 | { 160 | "params": [ 161 | p for n, p in model.named_parameters() 162 | if (any(nd in n.lower() 163 | for nd in no_decay_name_list) and p.requires_grad) 164 | ], 165 | "weight_decay": 166 | 0.0, 167 | }, 168 | ] 169 | 170 | non_empty_groups = [] 171 | for group in optimizer_grouped_parameters: 172 | if group["params"]: 173 | non_empty_groups.append(group) 174 | return non_empty_groups 175 | 176 | def _z3_params_to_fetch(param_list): 177 | return [ 178 | p for p in param_list 179 | if hasattr(p, 'ds_id') and p.ds_status == ZeroParamStatus.NOT_AVAILABLE 180 | ] 181 | 182 | def moving_average(model, model_ema, beta=0.992, device=None, zero_stage=0): 183 | zero_stage_3 = (zero_stage == 3) 184 | with torch.no_grad(): 185 | for param, param_ema in zip(model.parameters(), 186 | model_ema.parameters()): 187 | # TODO: use prefiltering for efficiency 188 | params_to_fetch = _z3_params_to_fetch([param, param_ema 189 | ]) if zero_stage_3 else [] 190 | should_gather_param = len(params_to_fetch) > 0 191 | with deepspeed.zero.GatheredParameters( 192 | params_to_fetch, enabled=should_gather_param): 193 | data = param.data 194 | if device is not None: 195 | 
data = data.to(device) 196 | param_ema.data.copy_(torch.lerp(data, param_ema.data, beta)) 197 | 198 | 199 | 200 | def save_zero_three_model(model_ema, global_rank, save_dir, zero_stage=0): 201 | zero_stage_3 = (zero_stage == 3) 202 | os.makedirs(save_dir, exist_ok=True) 203 | WEIGHTS_NAME = "pytorch_model.bin" 204 | output_model_file = os.path.join(save_dir, WEIGHTS_NAME) 205 | 206 | model_to_save = model_ema.module if hasattr(model_ema, 207 | 'module') else model_ema 208 | if not zero_stage_3: 209 | if global_rank == 0: 210 | torch.save(model_to_save.state_dict(), output_model_file) 211 | else: 212 | output_state_dict = {} 213 | for k, v in model_to_save.named_parameters(): 214 | 215 | if hasattr(v, 'ds_id'): 216 | with deepspeed.zero.GatheredParameters(_z3_params_to_fetch([v 217 | ]), 218 | enabled=zero_stage_3): 219 | v_p = v.data.cpu() 220 | else: 221 | v_p = v.cpu() 222 | if global_rank == 0 and "lora" not in k: 223 | output_state_dict[k] = v_p 224 | if global_rank == 0: 225 | torch.save(output_state_dict, output_model_file) 226 | del output_state_dict 227 | 228 | class ExponentialMovingAverage: 229 | 230 | def __init__(self, alpha=0.9): 231 | self.alpha = alpha 232 | self.ema = None 233 | 234 | def update(self, num): 235 | prev_ema = num if self.ema is None else self.ema 236 | self.ema = self.alpha * prev_ema + (1.0 - self.alpha) * num 237 | return self.ema 238 | 239 | def get(self): 240 | return self.ema if self.ema is not None else 0. -------------------------------------------------------------------------------- /sft/ds_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "fp16": { 3 | "enabled": "auto", 4 | "loss_scale": 0, 5 | "loss_scale_window": 1000, 6 | "initial_scale_power": 16, 7 | "hysteresis": 2, 8 | "min_loss_scale": 1 9 | }, 10 | 11 | "optimizer": { 12 | "type": "AdamW", 13 | "params": { 14 | "lr": "auto", 15 | "betas": "auto", 16 | "eps": "auto", 17 | "weight_decay": "auto" 18 | } 19 | }, 20 | 21 | "scheduler": { 22 | "type": "WarmupDecayLR", 23 | "params": { 24 | "warmup_min_lr": 1e-5, 25 | "warmup_max_lr": "auto", 26 | "warmup_num_steps": "auto", 27 | "total_num_steps": "auto" 28 | } 29 | }, 30 | 31 | "zero_optimization": { 32 | "stage": 2, 33 | "allgather_partitions": true, 34 | "allgather_bucket_size": 2e8, 35 | "overlap_comm": true, 36 | "reduce_scatter": true, 37 | "reduce_bucket_size": 2e8, 38 | "contiguous_gradients": true 39 | }, 40 | 41 | "gradient_accumulation_steps": "auto", 42 | "gradient_clipping": "auto", 43 | "steps_per_print": 2000, 44 | "train_batch_size": "auto", 45 | "train_micro_batch_size_per_gpu": "auto", 46 | "wall_clock_breakdown": false 47 | } -------------------------------------------------------------------------------- /sft/model/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "model_type": "miaomiao", 3 | "architectures": [ 4 | "MiaomiaoModel" 5 | ], 6 | "auto_map": { 7 | "AutoConfig": "configuration_miaomiao.MiaomiaoConfig", 8 | "AutoModel": "modeling_miaomiao.MiaomiaoModel", 9 | "AutoModelForCausalLM": "modeling_miaomiao.MiaomiaoForCausalLM" 10 | }, 11 | "attention_dropout": 0.0, 12 | "bos_token_id": 32005, 13 | "eos_token_id": 32005, 14 | "hidden_act": "silu", 15 | "hidden_size": 512, 16 | "initializer_range": 0.02, 17 | "intermediate_size": 2752, 18 | "max_position_embeddings": 131072, 19 | "max_window_layers": 28, 20 | "num_attention_heads": 16, 21 | "num_hidden_layers": 24, 22 | "num_key_value_heads": 16, 23 | "rms_norm_eps": 
1e-06, 24 | "rope_theta": 1000000.0, 25 | "sliding_window": 131072, 26 | "tie_word_embeddings": false, 27 | "torch_dtype": "bfloat16", 28 | "transformers_version": "4.37.2", 29 | "use_cache": true, 30 | "use_sliding_window": false, 31 | "vocab_size": 32006 32 | } -------------------------------------------------------------------------------- /sft/model/configuration_miaomiao.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | """ Miaomiao model configuration""" 4 | 5 | from transformers.configuration_utils import PretrainedConfig 6 | from transformers.utils import logging 7 | 8 | 9 | logger = logging.get_logger(__name__) 10 | 11 | 12 | class MiaomiaoConfig(PretrainedConfig): 13 | 14 | model_type = "miaomiao" 15 | keys_to_ignore_at_inference = ["past_key_values"] 16 | 17 | def __init__( 18 | self, 19 | vocab_size=32000, 20 | hidden_size=4096, 21 | intermediate_size=11008, 22 | num_hidden_layers=32, 23 | num_attention_heads=32, 24 | num_key_value_heads=None, 25 | hidden_act="silu", 26 | max_position_embeddings=2048, 27 | initializer_range=0.02, 28 | rms_norm_eps=1e-6, 29 | use_cache=True, 30 | pad_token_id=None, 31 | bos_token_id=1, 32 | eos_token_id=2, 33 | pretraining_tp=1, 34 | tie_word_embeddings=False, 35 | rope_theta=10000.0, 36 | rope_scaling=None, 37 | attention_bias=False, 38 | attention_dropout=0.0, 39 | mlp_bias=False, 40 | _attn_implementation="eager", 41 | **kwargs, 42 | ): 43 | self.vocab_size = vocab_size 44 | self.max_position_embeddings = max_position_embeddings 45 | self.hidden_size = hidden_size 46 | self.intermediate_size = intermediate_size 47 | self.num_hidden_layers = num_hidden_layers 48 | self.num_attention_heads = num_attention_heads 49 | 50 | # for backward compatibility 51 | if num_key_value_heads is None: 52 | num_key_value_heads = num_attention_heads 53 | 54 | self.num_key_value_heads = num_key_value_heads 55 | self.hidden_act = hidden_act 56 | self.initializer_range = initializer_range 57 | self.rms_norm_eps = rms_norm_eps 58 | self.pretraining_tp = pretraining_tp 59 | self.use_cache = use_cache 60 | self.rope_theta = rope_theta 61 | self.rope_scaling = rope_scaling 62 | self._rope_scaling_validation() 63 | self.attention_bias = attention_bias 64 | self.attention_dropout = attention_dropout 65 | self.mlp_bias = mlp_bias 66 | self._attn_implementation = _attn_implementation 67 | super().__init__( 68 | pad_token_id=pad_token_id, 69 | bos_token_id=bos_token_id, 70 | eos_token_id=eos_token_id, 71 | tie_word_embeddings=tie_word_embeddings, 72 | **kwargs, 73 | ) 74 | 75 | def _rope_scaling_validation(self): 76 | """ 77 | Validate the `rope_scaling` configuration. 
78 | """ 79 | if self.rope_scaling is None: 80 | return 81 | 82 | if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2: 83 | raise ValueError( 84 | "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, " f"got {self.rope_scaling}" 85 | ) 86 | rope_scaling_type = self.rope_scaling.get("type", None) 87 | rope_scaling_factor = self.rope_scaling.get("factor", None) 88 | if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]: 89 | raise ValueError( 90 | f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}" 91 | ) 92 | if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0: 93 | raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}") 94 | -------------------------------------------------------------------------------- /sft/model/tokenization_miaomiao.py: -------------------------------------------------------------------------------- 1 | 2 | """Tokenization classes for Miaomiao.""" 3 | 4 | import json 5 | import os 6 | import unicodedata 7 | from functools import lru_cache 8 | from typing import Optional, Tuple 9 | 10 | import regex as re 11 | 12 | from transformers import AddedToken, PreTrainedTokenizer 13 | from transformers.utils import logging 14 | 15 | 16 | logger = logging.get_logger(__name__) 17 | 18 | VOCAB_FILES_NAMES = { 19 | "vocab_file": "vocab.json", 20 | "merges_file": "merges.txt", 21 | } 22 | 23 | 24 | MAX_MODEL_INPUT_SIZES = {"miaomiao/miaomiao-tokenizer": 1024} 25 | 26 | PRETOKENIZE_REGEX = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""" 27 | 28 | 29 | @lru_cache() 30 | # Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode 31 | def bytes_to_unicode(): 32 | 33 | bs = ( 34 | list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) 35 | ) 36 | cs = bs[:] 37 | n = 0 38 | for b in range(2**8): 39 | if b not in bs: 40 | bs.append(b) 41 | cs.append(2**8 + n) 42 | n += 1 43 | cs = [chr(n) for n in cs] 44 | return dict(zip(bs, cs)) 45 | 46 | 47 | # Copied from transformers.models.gpt2.tokenization_gpt2.get_pairs 48 | def get_pairs(word): 49 | """ 50 | Return set of symbol pairs in a word. 51 | 52 | Word is represented as tuple of symbols (symbols being variable-length strings). 
53 | """ 54 | pairs = set() 55 | prev_char = word[0] 56 | for char in word[1:]: 57 | pairs.add((prev_char, char)) 58 | prev_char = char 59 | return pairs 60 | 61 | 62 | class MiaomiaoTokenizer(PreTrainedTokenizer): 63 | vocab_files_names = VOCAB_FILES_NAMES 64 | model_input_names = ["input_ids", "attention_mask"] 65 | 66 | def __init__( 67 | self, 68 | vocab_file, 69 | merges_file, 70 | errors="replace", 71 | unk_token="<|endoftext|>", 72 | bos_token=None, 73 | eos_token="<|im_end|>", 74 | pad_token="<|endoftext|>", 75 | clean_up_tokenization_spaces=False, 76 | split_special_tokens=False, 77 | **kwargs, 78 | ): 79 | bos_token = ( 80 | AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False) 81 | if isinstance(bos_token, str) 82 | else bos_token 83 | ) 84 | eos_token = ( 85 | AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False) 86 | if isinstance(eos_token, str) 87 | else eos_token 88 | ) 89 | unk_token = ( 90 | AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False) 91 | if isinstance(unk_token, str) 92 | else unk_token 93 | ) 94 | pad_token = ( 95 | AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False) 96 | if isinstance(pad_token, str) 97 | else pad_token 98 | ) 99 | 100 | with open(vocab_file, encoding="utf-8") as vocab_handle: 101 | self.encoder = json.load(vocab_handle) 102 | self.decoder = {v: k for k, v in self.encoder.items()} 103 | self.errors = errors # how to handle errors in decoding 104 | self.byte_encoder = bytes_to_unicode() 105 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 106 | bpe_merges = [] 107 | with open(merges_file, encoding="utf-8") as merges_handle: 108 | for i, line in enumerate(merges_handle): 109 | line = line.strip() 110 | if (i == 0 and line.startswith("#version:")) or not line: 111 | continue 112 | bpe_merges.append(tuple(line.split())) 113 | self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) 114 | # NOTE: the cache can grow without bound and will get really large for long running processes 115 | # (esp. for texts of language that do not use space between word, e.g. Chinese); technically 116 | # not a memory leak but appears as one. 117 | # GPT2Tokenizer has the same problem, so let's be consistent. 118 | self.cache = {} 119 | 120 | self.pat = re.compile(PRETOKENIZE_REGEX) 121 | 122 | if kwargs.get("add_prefix_space", False): 123 | logger.warning_once( 124 | f"{self.__class__.__name} does not support `add_prefix_space`, setting it to True has no effect." 
125 | ) 126 | 127 | super().__init__( 128 | errors=errors, 129 | bos_token=bos_token, 130 | eos_token=eos_token, 131 | pad_token=pad_token, 132 | unk_token=unk_token, 133 | clean_up_tokenization_spaces=clean_up_tokenization_spaces, 134 | split_special_tokens=split_special_tokens, 135 | **kwargs, 136 | ) 137 | 138 | @property 139 | def vocab_size(self) -> int: 140 | return len(self.encoder) 141 | 142 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.get_vocab 143 | def get_vocab(self): 144 | return dict(self.encoder, **self.added_tokens_encoder) 145 | 146 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.bpe 147 | @lru_cache(maxsize=100) # 设置缓存大小为100 148 | def bpe(self, token): 149 | # if token in self.cache: 150 | # return self.cache[token] 151 | word = tuple(token) 152 | pairs = get_pairs(word) 153 | 154 | if not pairs: 155 | return token 156 | 157 | while True: 158 | bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) 159 | if bigram not in self.bpe_ranks: 160 | break 161 | first, second = bigram 162 | new_word = [] 163 | i = 0 164 | while i < len(word): 165 | try: 166 | j = word.index(first, i) 167 | except ValueError: 168 | new_word.extend(word[i:]) 169 | break 170 | else: 171 | new_word.extend(word[i:j]) 172 | i = j 173 | 174 | if word[i] == first and i < len(word) - 1 and word[i + 1] == second: 175 | new_word.append(first + second) 176 | i += 2 177 | else: 178 | new_word.append(word[i]) 179 | i += 1 180 | new_word = tuple(new_word) 181 | word = new_word 182 | if len(word) == 1: 183 | break 184 | else: 185 | pairs = get_pairs(word) 186 | word = " ".join(word) 187 | # self.cache[token] = word 188 | return word 189 | 190 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._tokenize 191 | def _tokenize(self, text): 192 | """Tokenize a string.""" 193 | bpe_tokens = [] 194 | for token in re.findall(self.pat, text): 195 | token = "".join( 196 | self.byte_encoder[b] for b in token.encode("utf-8") 197 | ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case) 198 | bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" ")) 199 | return bpe_tokens 200 | 201 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_token_to_id 202 | def _convert_token_to_id(self, token): 203 | """Converts a token (str) in an id using the vocab.""" 204 | return self.encoder.get(token, self.encoder.get(self.unk_token)) 205 | 206 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_id_to_token 207 | def _convert_id_to_token(self, index): 208 | """Converts an index (integer) in a token (str) using the vocab.""" 209 | return self.decoder.get(index) 210 | 211 | 212 | def convert_tokens_to_string(self, tokens): 213 | """Converts a sequence of tokens (string) in a single string.""" 214 | text = "".join(tokens) 215 | text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors) 216 | return text 217 | 218 | def decode( 219 | self, 220 | token_ids, 221 | skip_special_tokens: bool = False, 222 | clean_up_tokenization_spaces: Optional[bool] = False, 223 | spaces_between_special_tokens: bool = False, 224 | **kwargs, 225 | ) -> str: 226 | 227 | return super().decode( 228 | token_ids, 229 | skip_special_tokens=skip_special_tokens, 230 | clean_up_tokenization_spaces=clean_up_tokenization_spaces, 231 | spaces_between_special_tokens=spaces_between_special_tokens, 232 | **kwargs, 233 | ) 234 | 235 | 236 
| def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: 237 | if not os.path.isdir(save_directory): 238 | logger.error(f"Vocabulary path ({save_directory}) should be a directory") 239 | return 240 | vocab_file = os.path.join( 241 | save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] 242 | ) 243 | merge_file = os.path.join( 244 | save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] 245 | ) 246 | 247 | with open(vocab_file, "w", encoding="utf-8") as f: 248 | f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") 249 | 250 | index = 0 251 | with open(merge_file, "w", encoding="utf-8") as writer: 252 | writer.write("#version: 0.2\n") 253 | for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): 254 | if index != token_index: 255 | logger.warning( 256 | f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive." 257 | " Please check that the tokenizer is not corrupted!" 258 | ) 259 | index = token_index 260 | writer.write(" ".join(bpe_tokens) + "\n") 261 | index += 1 262 | 263 | return vocab_file, merge_file 264 | 265 | def prepare_for_tokenization(self, text, **kwargs): 266 | text = unicodedata.normalize("NFC", text) 267 | return (text, kwargs) 268 | -------------------------------------------------------------------------------- /sft/model/tokenizer_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "auto_map": { 3 | "AutoTokenizer": [ 4 | "tokenization_miaomiao.MiaomiaoTokenizer", 5 | null 6 | ] 7 | }, 8 | "add_prefix_space": false, 9 | "added_tokens_decoder": { 10 | "32000": { 11 | "content": "system", 12 | "lstrip": false, 13 | "normalized": false, 14 | "rstrip": false, 15 | "single_word": false, 16 | "special": true 17 | }, 18 | "32001": { 19 | "content": "user", 20 | "lstrip": false, 21 | "normalized": false, 22 | "rstrip": false, 23 | "single_word": false, 24 | "special": true 25 | }, 26 | "32002": { 27 | "content": "assistant", 28 | "lstrip": false, 29 | "normalized": false, 30 | "rstrip": false, 31 | "single_word": false, 32 | "special": true 33 | }, 34 | "32003": { 35 | "content": "<|endoftext|>", 36 | "lstrip": false, 37 | "normalized": false, 38 | "rstrip": false, 39 | "single_word": false, 40 | "special": true 41 | }, 42 | "32004": { 43 | "content": "<|im_start|>", 44 | "lstrip": false, 45 | "normalized": false, 46 | "rstrip": false, 47 | "single_word": false, 48 | "special": true 49 | }, 50 | "32005": { 51 | "content": "<|im_end|>", 52 | "lstrip": false, 53 | "normalized": false, 54 | "rstrip": false, 55 | "single_word": false, 56 | "special": true 57 | } 58 | }, 59 | "additional_special_tokens": [ 60 | "<|im_start|>", 61 | "<|im_end|>" 62 | ], 63 | "bos_token": null, 64 | "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\n你是一个由喵阿姨开发的喵喵小助手<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 65 | "clean_up_tokenization_spaces": false, 66 | "eos_token": "<|im_end|>", 67 | "errors": "replace", 68 | "model_max_length": 32768, 69 | "pad_token": "<|endoftext|>", 70 | "split_special_tokens": false, 71 | "tokenizer_class": "MiaomiaoTokenizer", 72 | "unk_token": null 73 | } 
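The chat_template above is what the SFT dataset and the test scripts rely on through apply_chat_template: it injects the default system prompt when none is given and wraps every turn in <|im_start|>/<|im_end|>. A quick illustration (hedged: the tokenizer path is an assumption; the rendered string follows directly from the template above):

    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained("./sft/model", trust_remote_code=True)  # path is an assumption
    text = tok.apply_chat_template([{"role": "user", "content": "你好"}],
                                   tokenize=False, add_generation_prompt=True)
    # text ==
    # <|im_start|>system
    # 你是一个由喵阿姨开发的喵喵小助手<|im_end|>
    # <|im_start|>user
    # 你好<|im_end|>
    # <|im_start|>assistant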
-------------------------------------------------------------------------------- /sft/sft.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass 2 | from typing import Optional 3 | import transformers 4 | from sft_dataset import SFTDataset 5 | from transformers import ( 6 | AutoModelForCausalLM, 7 | HfArgumentParser, 8 | Trainer, 9 | TrainingArguments, 10 | AutoTokenizer, 11 | set_seed, 12 | ) 13 | from transformers.trainer_callback import TrainerCallback 14 | import torch 15 | import os 16 | import logging 17 | import glob 18 | import random 19 | import numpy as np 20 | from typing import Dict, Optional, Sequence 21 | 22 | IGNORE_INDEX = -100 23 | # 设置随机种子 24 | def set_seed(seed): 25 | random.seed(seed) 26 | np.random.seed(seed) 27 | torch.manual_seed(seed) 28 | if torch.cuda.is_available(): 29 | torch.cuda.manual_seed_all(seed) 30 | 31 | 32 | class LoggingCallback(TrainerCallback): 33 | def __init__(self, logger): 34 | self.logger = logger 35 | 36 | def on_log(self, args, state, control, logs=None, **kwargs): 37 | if logs is not None: 38 | self.logger.info(logs) 39 | 40 | 41 | @dataclass 42 | class ModelArguments: 43 | model_path: Optional[str] = None 44 | torch_dtype: Optional[str] = None 45 | 46 | @dataclass 47 | class DataTrainingArguments: 48 | train_dataset_file: Optional[str] = None 49 | overwrite_cache: bool = False 50 | preprocessing_num_workers: Optional[int] = None 51 | block_size: Optional[int] = None 52 | 53 | 54 | @dataclass 55 | class MyTrainingArguments(TrainingArguments): 56 | modules_to_save: Optional[str] = None 57 | 58 | 59 | # 模型初始化方式 60 | init_from: Optional[str] = "scratch" 61 | use_device: Optional[str] = 'cuda' 62 | use_compile: Optional[bool] = False 63 | log_file: Optional[str] = None 64 | nnodes: Optional[int] = None 65 | nproc_per_node: Optional[int] = None 66 | 67 | 68 | def init_model(model_args): 69 | tokenizer = AutoTokenizer.from_pretrained(model_args.model_path, trust_remote_code=True) 70 | model = AutoModelForCausalLM.from_pretrained( 71 | model_args.model_path, 72 | trust_remote_code=True 73 | ) 74 | return tokenizer, model 75 | 76 | 77 | 78 | @dataclass 79 | class DataCollatorForSFTDataset(object): 80 | """Collate examples for supervised fine-tuning.""" 81 | 82 | tokenizer: transformers.PreTrainedTokenizer 83 | 84 | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: 85 | input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels")) 86 | input_ids = torch.nn.utils.rnn.pad_sequence( 87 | input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id 88 | ) 89 | labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX) 90 | print(f"DataCollatorForSFTDataset:{input_ids}") 91 | return dict( 92 | input_ids=input_ids, 93 | labels=labels, 94 | attention_mask=input_ids.ne(self.tokenizer.pad_token_id), 95 | ) 96 | 97 | 98 | def main(): 99 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, MyTrainingArguments)) 100 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 101 | 102 | # 设置日志记录器 103 | logging.basicConfig(filename=training_args.log_file, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 104 | logger = logging.getLogger(__name__) 105 | # 创建文件处理器,并设置写模式 106 | file_handler = logging.FileHandler(training_args.log_file, mode='w') 107 | file_handler.setLevel(logging.INFO) 108 | file_formatter = 
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') 109 | file_handler.setFormatter(file_formatter) 110 | logger.addHandler(file_handler) 111 | # 输出日志到控制台(可选) 112 | console_handler = logging.StreamHandler() 113 | console_handler.setLevel(logging.INFO) 114 | formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') 115 | console_handler.setFormatter(formatter) 116 | logger.addHandler(console_handler) 117 | 118 | set_seed(training_args.seed) 119 | 120 | tokenizer, model = init_model(model_args) 121 | model.to(training_args.use_device) 122 | 123 | if training_args.use_compile: 124 | model = torch.compile(model) 125 | 126 | 127 | total_params = sum(p.numel() for p in model.parameters()) 128 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 129 | logger.info(f"总参数: {total_params}") 130 | logger.info(f"可训练参数: {trainable_params}") 131 | 132 | logger.info(f"torch_dtype:{model_args.torch_dtype}") 133 | logger.info(f"training_args.bf16: {training_args.bf16}") 134 | 135 | 136 | train_ds = SFTDataset(data_path=data_args.train_dataset_file, tokenizer=tokenizer, max_length=data_args.block_size, prompt_max_len=int(data_args.block_size/2), answer_max_len=int(data_args.block_size/2), seed=training_args.seed) 137 | logger.info(f"Train dataset size: {len(train_ds)}") 138 | 139 | 140 | trainer = Trainer( 141 | model=model, 142 | args=training_args, 143 | train_dataset=train_ds, 144 | callbacks=[LoggingCallback(logger)], # 添加自定义回调 145 | ) 146 | print(training_args.bf16) 147 | 148 | trainer.train() 149 | 150 | 151 | 152 | 153 | if __name__ == "__main__": 154 | main() 155 | -------------------------------------------------------------------------------- /sft/sft.sh: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | lr=1e-4 6 | block_size=1024 7 | 8 | per_device_train_batch_size=24 9 | gradient_accumulation_steps=1 10 | model_path=./model 11 | train_dataset_file=./sft.jsonl 12 | log_file=./log/sft.log 13 | output_dir=./output 14 | deepspeed_config_file=./ds_config.json 15 | random_seed=42 16 | torchrun --nnodes 1 --nproc_per_node 2 sft.py \ 17 | --deepspeed ${deepspeed_config_file} \ 18 | --model_path ${model_path} \ 19 | --train_dataset_file ${train_dataset_file} \ 20 | --per_device_train_batch_size ${per_device_train_batch_size} \ 21 | --do_train \ 22 | --bf16 True \ 23 | --torch_dtype bfloat16 \ 24 | --seed ${random_seed} \ 25 | --num_train_epochs 3 \ 26 | --logging_strategy steps \ 27 | --logging_steps 100 \ 28 | --log_file ${log_file} \ 29 | --logging_first_step True \ 30 | --adam_beta1 0.9 \ 31 | --adam_beta2 0.95 \ 32 | --lr_scheduler_type cosine \ 33 | --learning_rate ${lr} \ 34 | --warmup_ratio 0.05 \ 35 | --weight_decay 0.01 \ 36 | --save_strategy epoch \ 37 | --save_total_limit 3 \ 38 | --save_steps 0.01 \ 39 | --gradient_accumulation_steps ${gradient_accumulation_steps} \ 40 | --block_size ${block_size} \ 41 | --output_dir ${output_dir} \ 42 | --overwrite_output_dir \ 43 | --ddp_timeout 30000 \ 44 | --use_device cuda \ 45 | --use_compile False \ -------------------------------------------------------------------------------- /sft/sft_data_filted.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | import torch 3 | import torch.nn.functional as F 4 | import json 5 | from tqdm import tqdm 6 | 7 | def calculate_perplexity(model, tokenizer, messages, device): 8 | 9 | formatted_messages = [ 10 |
{"role": "user", "content": messages[0]['value']}, 11 | {"role": "assistant", "content": messages[1]['value']} 12 | ] 13 | user_input = [ 14 | {"role": "user", "content": messages[0]['value']} 15 | ] 16 | 17 | # 编码输入 18 | inputs_text = tokenizer.apply_chat_template( 19 | user_input, 20 | tokenize=False, 21 | add_generation_prompt=True 22 | ) 23 | #print(inputs_text) 24 | inputs = tokenizer(inputs_text, return_tensors="pt").to(device) 25 | 26 | # 编码输入 27 | full_text = tokenizer.apply_chat_template( 28 | formatted_messages, 29 | tokenize=False, 30 | add_generation_prompt=False 31 | ) 32 | #print(full_text) 33 | 34 | full_inputs = tokenizer(full_text, return_tensors="pt").to(device) 35 | 36 | # 计算给定用户输入情况下生成助理响应的困惑度 37 | with torch.no_grad(): 38 | outputs = model(**full_inputs) 39 | logits = outputs.logits 40 | 41 | # 只关注助理响应部分的logits 42 | start_pos = inputs.input_ids.size(1) 43 | shift_logits = logits[:, start_pos:-1, :].contiguous() 44 | shift_labels = full_inputs['input_ids'][:, start_pos+1:].contiguous() 45 | 46 | # 计算交叉熵损失 47 | loss_fct = torch.nn.CrossEntropyLoss(reduction='none') 48 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 49 | loss = loss.view(shift_labels.size()) 50 | perplexity_given_user = torch.exp(loss.mean()) 51 | #print(f"给定输入的困惑度:{perplexity_given_user}") 52 | # 计算直接生成助理响应的困惑度 53 | with torch.no_grad(): 54 | assistant_input = messages[1]["value"] 55 | assistant_inputs = tokenizer(assistant_input, return_tensors="pt").to(device) 56 | 57 | outputs = model(**assistant_inputs) 58 | logits = outputs.logits 59 | 60 | # Shift the logits and labels to ignore the first token 61 | shift_logits = logits[:, :-1, :].contiguous() 62 | shift_labels = assistant_inputs['input_ids'][:, 1:].contiguous() 63 | 64 | # Flatten the logits and labels 65 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 66 | loss = loss.view(shift_labels.size()) 67 | 68 | # 计算每个token的困惑度 69 | perplexity_direct = torch.exp(loss.mean()) 70 | #print(f"直接生成的困惑度:{perplexity_direct}") 71 | 72 | return perplexity_given_user.item(), perplexity_direct.item() 73 | 74 | def main(): 75 | model_name = "./Qwen2-0.5B-Instruct" 76 | device = "cuda" # 设备 77 | model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto") 78 | tokenizer = AutoTokenizer.from_pretrained(model_name) 79 | input_data = './depulication_firefly.jsonl' 80 | output_data = './depulication_firefly_ppl.jsonl' 81 | # 打开输入文件 82 | with open(input_data, 'r', encoding='utf-8') as f: 83 | lines = f.readlines() 84 | 85 | with open(output_data, 'w', encoding='utf-8') as out_f: 86 | # 逐行处理输入数据并计算困惑度 87 | for line in tqdm(lines, desc="Processing"): 88 | data = json.loads(line) 89 | messages = data["messages"] 90 | perplexity_given_user, perplexity_direct = calculate_perplexity(model, tokenizer, messages, device) 91 | result = { 92 | "messages": messages, 93 | "ppl_a_q": perplexity_given_user, 94 | "ppl_a": perplexity_direct, 95 | "ifd": perplexity_given_user / perplexity_direct, 96 | } 97 | out_f.write(json.dumps(result, ensure_ascii=False) + '\n') 98 | # print(f"Perplexity given user input: {perplexity_given_user}") 99 | # print(f"Perplexity of direct assistant response: {perplexity_direct}") 100 | 101 | if __name__ == "__main__": 102 | main() 103 | -------------------------------------------------------------------------------- /sft/sft_dataset.py: -------------------------------------------------------------------------------- 1 | import random 2 | import 
pandas as pd 3 | import numpy as np 4 | from torch.utils.data import Dataset,DataLoader 5 | import torch 6 | from sklearn.model_selection import train_test_split 7 | import json 8 | from datasets import load_dataset,Features, Value 9 | import copy 10 | class SFTDataset(Dataset): 11 | def __init__(self, data_path, tokenizer, max_length=1024, prompt_max_len=512, answer_max_len=512, seed=42): 12 | super().__init__() 13 | IGNORE_INDEX = -100 14 | self.max_length = max_length 15 | self.prompt_max_len = prompt_max_len 16 | self.answer_max_len = answer_max_len 17 | self.tokenizer = tokenizer 18 | self.input_ids = [] 19 | self.labels = [] 20 | self.attention_mask = [] 21 | # 指定自定义字段 22 | features = Features({ 23 | 'prompt': Value('string'), 24 | 'answer': Value('string') 25 | }) 26 | sft_dataset = load_dataset('json', data_files=data_path, features=features) 27 | data = [] 28 | # 遍历数据集并取出每个元素 29 | for example in sft_dataset['train']: 30 | prompt = example['prompt'] 31 | answer = example['answer'] 32 | messages = [ 33 | {"role": "user", "content": prompt} 34 | ] 35 | prompt_text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 36 | answer_text = answer + tokenizer.eos_token 37 | 38 | prompt_id = self.tokenizer.encode(prompt_text) 39 | if (len(prompt_id) > self.prompt_max_len): 40 | prompt_id = prompt_id[:self.prompt_max_len] 41 | 42 | answer_id = tokenizer.encode(answer_text) 43 | if (len(answer_id) > self.answer_max_len): 44 | answer_id = answer_id[:self.answer_max_len] 45 | input_id = prompt_id + answer_id 46 | labels = [self.tokenizer.pad_token_id] * len(prompt_id) + answer_id 47 | pad_len = self.max_length - len(input_id) 48 | input_id = input_id + [self.tokenizer.pad_token_id] * pad_len 49 | labels = labels + [self.tokenizer.pad_token_id] * pad_len 50 | labels = [(l if l != self.tokenizer.pad_token_id else IGNORE_INDEX ) for l in labels] 51 | input_id = torch.LongTensor(input_id) 52 | labels = torch.LongTensor(labels) 53 | attention_mask = input_id.ne(self.tokenizer.pad_token_id) 54 | data.append({ 55 | "input_ids": input_id, 56 | "labels": labels, 57 | "attention_mask": attention_mask 58 | }) 59 | 60 | # 打乱数据集 61 | random.seed(seed) 62 | random.shuffle(data) 63 | 64 | for item in data: 65 | self.input_ids.append(item["input_ids"]) 66 | self.labels.append(item["labels"]) 67 | self.attention_mask.append(item["attention_mask"]) 68 | 69 | 70 | def __len__(self): 71 | return len(self.input_ids) 72 | 73 | def __getitem__(self, i: int): 74 | return { 75 | "input_ids": self.input_ids[i], 76 | "labels": self.labels[i], 77 | "attention_mask": self.attention_mask[i], 78 | } 79 | 80 | -------------------------------------------------------------------------------- /sft/test_sft_model.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | 3 | 4 | device = "cuda" # the device to load the model onto 5 | 6 | model = AutoModelForCausalLM.from_pretrained( 7 | './model', 8 | torch_dtype="auto", 9 | device_map="auto", 10 | trust_remote_code=True 11 | ) 12 | tokenizer = AutoTokenizer.from_pretrained('./miaomiao_tokenizer', trust_remote_code=True) 13 | 14 | prompt_list = ["你知道北京吗? 
", 15 | "你知道杭州有哪些美食吗?", 16 | "你知道中国的四大名著吗?", 17 | "你了解美国的历史吗?", 18 | "左手一只鸭,右手一只鸡。交换两次后左右手里各是什么?", 19 | "鸡兔同笼,共35只头,94只脚,问鸡兔各多少?", 20 | "世界上最大的动物是什么?", 21 | "介绍一下刘德华。", 22 | "介绍一下中国。" 23 | ] 24 | for prompt in prompt_list: 25 | messages = [ 26 | {"role": "user", "content": prompt} 27 | ] 28 | text = tokenizer.apply_chat_template( 29 | messages, 30 | tokenize=False, 31 | add_generation_prompt=True 32 | ) 33 | model_inputs = tokenizer([text], return_tensors="pt").to(device) 34 | 35 | generated_ids = model.generate( 36 | **model_inputs, 37 | max_new_tokens=512, 38 | do_sample=True, 39 | temperature = 0.9, 40 | top_k = 30 41 | ) 42 | generated_ids = [ 43 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 44 | ] 45 | 46 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 47 | print(f"question:{prompt}") 48 | print(f"response:{response}") -------------------------------------------------------------------------------- /train_tokenizer/miaomiao_tokenizer/tokenization_miaomiao.py: -------------------------------------------------------------------------------- 1 | 2 | """Tokenization classes for Miaomiao.""" 3 | 4 | import json 5 | import os 6 | import unicodedata 7 | from functools import lru_cache 8 | from typing import Optional, Tuple 9 | 10 | import regex as re 11 | 12 | from transformers import AddedToken, PreTrainedTokenizer 13 | from transformers.utils import logging 14 | 15 | 16 | logger = logging.get_logger(__name__) 17 | 18 | VOCAB_FILES_NAMES = { 19 | "vocab_file": "vocab.json", 20 | "merges_file": "merges.txt", 21 | } 22 | 23 | 24 | MAX_MODEL_INPUT_SIZES = {"miaomiao/miaomiao-tokenizer": 1024} 25 | 26 | PRETOKENIZE_REGEX = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""" 27 | 28 | 29 | @lru_cache() 30 | # Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode 31 | def bytes_to_unicode(): 32 | 33 | bs = ( 34 | list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) 35 | ) 36 | cs = bs[:] 37 | n = 0 38 | for b in range(2**8): 39 | if b not in bs: 40 | bs.append(b) 41 | cs.append(2**8 + n) 42 | n += 1 43 | cs = [chr(n) for n in cs] 44 | return dict(zip(bs, cs)) 45 | 46 | 47 | # Copied from transformers.models.gpt2.tokenization_gpt2.get_pairs 48 | def get_pairs(word): 49 | """ 50 | Return set of symbol pairs in a word. 51 | 52 | Word is represented as tuple of symbols (symbols being variable-length strings). 
53 | """ 54 | pairs = set() 55 | prev_char = word[0] 56 | for char in word[1:]: 57 | pairs.add((prev_char, char)) 58 | prev_char = char 59 | return pairs 60 | 61 | 62 | class MiaomiaoTokenizer(PreTrainedTokenizer): 63 | vocab_files_names = VOCAB_FILES_NAMES 64 | model_input_names = ["input_ids", "attention_mask"] 65 | 66 | def __init__( 67 | self, 68 | vocab_file, 69 | merges_file, 70 | errors="replace", 71 | unk_token="<|endoftext|>", 72 | bos_token=None, 73 | eos_token="<|im_end|>", 74 | pad_token="<|endoftext|>", 75 | clean_up_tokenization_spaces=False, 76 | split_special_tokens=False, 77 | **kwargs, 78 | ): 79 | bos_token = ( 80 | AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False) 81 | if isinstance(bos_token, str) 82 | else bos_token 83 | ) 84 | eos_token = ( 85 | AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False) 86 | if isinstance(eos_token, str) 87 | else eos_token 88 | ) 89 | unk_token = ( 90 | AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False) 91 | if isinstance(unk_token, str) 92 | else unk_token 93 | ) 94 | pad_token = ( 95 | AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False) 96 | if isinstance(pad_token, str) 97 | else pad_token 98 | ) 99 | 100 | with open(vocab_file, encoding="utf-8") as vocab_handle: 101 | self.encoder = json.load(vocab_handle) 102 | self.decoder = {v: k for k, v in self.encoder.items()} 103 | self.errors = errors # how to handle errors in decoding 104 | self.byte_encoder = bytes_to_unicode() 105 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 106 | bpe_merges = [] 107 | with open(merges_file, encoding="utf-8") as merges_handle: 108 | for i, line in enumerate(merges_handle): 109 | line = line.strip() 110 | if (i == 0 and line.startswith("#version:")) or not line: 111 | continue 112 | bpe_merges.append(tuple(line.split())) 113 | self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) 114 | # NOTE: the cache can grow without bound and will get really large for long running processes 115 | # (esp. for texts of language that do not use space between word, e.g. Chinese); technically 116 | # not a memory leak but appears as one. 117 | # GPT2Tokenizer has the same problem, so let's be consistent. 118 | self.cache = {} 119 | 120 | self.pat = re.compile(PRETOKENIZE_REGEX) 121 | 122 | if kwargs.get("add_prefix_space", False): 123 | logger.warning_once( 124 | f"{self.__class__.__name} does not support `add_prefix_space`, setting it to True has no effect." 
125 | ) 126 | 127 | super().__init__( 128 | errors=errors, 129 | bos_token=bos_token, 130 | eos_token=eos_token, 131 | pad_token=pad_token, 132 | unk_token=unk_token, 133 | clean_up_tokenization_spaces=clean_up_tokenization_spaces, 134 | split_special_tokens=split_special_tokens, 135 | **kwargs, 136 | ) 137 | 138 | @property 139 | def vocab_size(self) -> int: 140 | return len(self.encoder) 141 | 142 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.get_vocab 143 | def get_vocab(self): 144 | return dict(self.encoder, **self.added_tokens_encoder) 145 | 146 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.bpe 147 | @lru_cache(maxsize=100) # 设置缓存大小为100 148 | def bpe(self, token): 149 | # if token in self.cache: 150 | # return self.cache[token] 151 | word = tuple(token) 152 | pairs = get_pairs(word) 153 | 154 | if not pairs: 155 | return token 156 | 157 | while True: 158 | bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) 159 | if bigram not in self.bpe_ranks: 160 | break 161 | first, second = bigram 162 | new_word = [] 163 | i = 0 164 | while i < len(word): 165 | try: 166 | j = word.index(first, i) 167 | except ValueError: 168 | new_word.extend(word[i:]) 169 | break 170 | else: 171 | new_word.extend(word[i:j]) 172 | i = j 173 | 174 | if word[i] == first and i < len(word) - 1 and word[i + 1] == second: 175 | new_word.append(first + second) 176 | i += 2 177 | else: 178 | new_word.append(word[i]) 179 | i += 1 180 | new_word = tuple(new_word) 181 | word = new_word 182 | if len(word) == 1: 183 | break 184 | else: 185 | pairs = get_pairs(word) 186 | word = " ".join(word) 187 | # self.cache[token] = word 188 | return word 189 | 190 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._tokenize 191 | def _tokenize(self, text): 192 | """Tokenize a string.""" 193 | bpe_tokens = [] 194 | for token in re.findall(self.pat, text): 195 | token = "".join( 196 | self.byte_encoder[b] for b in token.encode("utf-8") 197 | ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case) 198 | bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" ")) 199 | return bpe_tokens 200 | 201 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_token_to_id 202 | def _convert_token_to_id(self, token): 203 | """Converts a token (str) in an id using the vocab.""" 204 | return self.encoder.get(token, self.encoder.get(self.unk_token)) 205 | 206 | # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_id_to_token 207 | def _convert_id_to_token(self, index): 208 | """Converts an index (integer) in a token (str) using the vocab.""" 209 | return self.decoder.get(index) 210 | 211 | 212 | def convert_tokens_to_string(self, tokens): 213 | """Converts a sequence of tokens (string) in a single string.""" 214 | text = "".join(tokens) 215 | text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors) 216 | return text 217 | 218 | def decode( 219 | self, 220 | token_ids, 221 | skip_special_tokens: bool = False, 222 | clean_up_tokenization_spaces: Optional[bool] = False, 223 | spaces_between_special_tokens: bool = False, 224 | **kwargs, 225 | ) -> str: 226 | 227 | return super().decode( 228 | token_ids, 229 | skip_special_tokens=skip_special_tokens, 230 | clean_up_tokenization_spaces=clean_up_tokenization_spaces, 231 | spaces_between_special_tokens=spaces_between_special_tokens, 232 | **kwargs, 233 | ) 234 | 235 | 236 
| def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: 237 | if not os.path.isdir(save_directory): 238 | logger.error(f"Vocabulary path ({save_directory}) should be a directory") 239 | return 240 | vocab_file = os.path.join( 241 | save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] 242 | ) 243 | merge_file = os.path.join( 244 | save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] 245 | ) 246 | 247 | with open(vocab_file, "w", encoding="utf-8") as f: 248 | f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") 249 | 250 | index = 0 251 | with open(merge_file, "w", encoding="utf-8") as writer: 252 | writer.write("#version: 0.2\n") 253 | for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): 254 | if index != token_index: 255 | logger.warning( 256 | f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive." 257 | " Please check that the tokenizer is not corrupted!" 258 | ) 259 | index = token_index 260 | writer.write(" ".join(bpe_tokens) + "\n") 261 | index += 1 262 | 263 | return vocab_file, merge_file 264 | 265 | def prepare_for_tokenization(self, text, **kwargs): 266 | text = unicodedata.normalize("NFC", text) 267 | return (text, kwargs) 268 | -------------------------------------------------------------------------------- /train_tokenizer/miaomiao_tokenizer/tokenizer_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "auto_map": { 3 | "AutoTokenizer": [ 4 | "tokenization_miaomiao.MiaomiaoTokenizer", 5 | null 6 | ] 7 | }, 8 | "add_prefix_space": false, 9 | "added_tokens_decoder": { 10 | "32000": { 11 | "content": "system", 12 | "lstrip": false, 13 | "normalized": false, 14 | "rstrip": false, 15 | "single_word": false, 16 | "special": true 17 | }, 18 | "32001": { 19 | "content": "user", 20 | "lstrip": false, 21 | "normalized": false, 22 | "rstrip": false, 23 | "single_word": false, 24 | "special": true 25 | }, 26 | "32002": { 27 | "content": "assistant", 28 | "lstrip": false, 29 | "normalized": false, 30 | "rstrip": false, 31 | "single_word": false, 32 | "special": true 33 | }, 34 | "32003": { 35 | "content": "<|endoftext|>", 36 | "lstrip": false, 37 | "normalized": false, 38 | "rstrip": false, 39 | "single_word": false, 40 | "special": true 41 | }, 42 | "32004": { 43 | "content": "<|im_start|>", 44 | "lstrip": false, 45 | "normalized": false, 46 | "rstrip": false, 47 | "single_word": false, 48 | "special": true 49 | }, 50 | "32005": { 51 | "content": "<|im_end|>", 52 | "lstrip": false, 53 | "normalized": false, 54 | "rstrip": false, 55 | "single_word": false, 56 | "special": true 57 | } 58 | }, 59 | "additional_special_tokens": [ 60 | "<|im_start|>", 61 | "<|im_end|>" 62 | ], 63 | "bos_token": null, 64 | "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\n你是一个由喵阿姨开发的喵喵小助手<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 65 | "clean_up_tokenization_spaces": false, 66 | "eos_token": "<|im_end|>", 67 | "errors": "replace", 68 | "model_max_length": 32768, 69 | "pad_token": "<|endoftext|>", 70 | "split_special_tokens": false, 71 | "tokenizer_class": "MiaomiaoTokenizer", 72 | "unk_token": null 73 | } 
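补充示例(示意代码,仅供参考):上面 tokenizer_config.json 中的 chat_template 定义了 ChatML 风格的对话格式,sft 数据拼接和推理脚本都依赖它。下面这段小例子假设分词器相关文件已放在 ./miaomiao_tokenizer 目录下,用 apply_chat_template 渲染一轮对话,便于核对模板的输出格式:

# 示意代码(假设分词器目录为 ./miaomiao_tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./miaomiao_tokenizer", trust_remote_code=True)
messages = [{"role": "user", "content": "你好"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# 按上面的 chat_template,预期渲染结果为:
# <|im_start|>system
# 你是一个由喵阿姨开发的喵喵小助手<|im_end|>
# <|im_start|>user
# 你好<|im_end|>
# <|im_start|>assistant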
-------------------------------------------------------------------------------- /train_tokenizer/train_tokenizer.py: -------------------------------------------------------------------------------- 1 | import random 2 | from tqdm import tqdm 3 | from transformers import AutoTokenizer 4 | import json 5 | from datasets import load_dataset 6 | from tokenizers import ( 7 | decoders, 8 | models, 9 | normalizers, 10 | pre_tokenizers, 11 | processors, 12 | trainers, 13 | Tokenizer, 14 | ) 15 | import os 16 | 17 | random.seed(42) 18 | 19 | def train_tokenizer(): 20 | # 读取JSON文件并提取文本数据 21 | def read_texts_from_json(file_path): 22 | with open(file_path, 'r', encoding='utf-8') as f: 23 | for line in f: 24 | data = json.loads(line) 25 | yield data['text'] 26 | 27 | data_path = './tokenizer_data/tokenizer_data.json' 28 | 29 | # 初始化tokenizer 30 | tokenizer = Tokenizer(models.BPE()) 31 | tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False) 32 | 33 | # 设置训练器 34 | trainer = trainers.BpeTrainer( 35 | vocab_size=32000, 36 | show_progress=True, 37 | initial_alphabet=pre_tokenizers.ByteLevel.alphabet() 38 | ) 39 | 40 | # 读取文本数据 41 | texts = read_texts_from_json(data_path) 42 | 43 | # 训练tokenizer 44 | tokenizer.train_from_iterator(texts, trainer=trainer) 45 | 46 | # 设置解码器 47 | tokenizer.decoder = decoders.ByteLevel() 48 | 49 | # 保存tokenizer 50 | tokenizer_dir = "./miaomiao_tokenizer" 51 | os.makedirs(tokenizer_dir, exist_ok=True) 52 | tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json")) 53 | tokenizer.model.save("./miaomiao_tokenizer") 54 | 55 | # 手动创建配置文件 56 | config = { 57 | "auto_map": { 58 | "AutoTokenizer": [ 59 | "tokenization_miaomiao.MiaomiaoTokenizer", 60 | None 61 | ] 62 | }, 63 | "add_prefix_space": False, 64 | "added_tokens_decoder": { 65 | "32000": { 66 | "content": "system", 67 | "lstrip": False, 68 | "normalized": False, 69 | "rstrip": False, 70 | "single_word": False, 71 | "special": True 72 | }, 73 | "32001": { 74 | "content": "user", 75 | "lstrip": False, 76 | "normalized": False, 77 | "rstrip": False, 78 | "single_word": False, 79 | "special": True 80 | }, 81 | "32002": { 82 | "content": "assistant", 83 | "lstrip": False, 84 | "normalized": False, 85 | "rstrip": False, 86 | "single_word": False, 87 | "special": True 88 | }, 89 | "32003": { 90 | "content": "<|endoftext|>", 91 | "lstrip": False, 92 | "normalized": False, 93 | "rstrip": False, 94 | "single_word": False, 95 | "special": True 96 | }, 97 | "32004": { 98 | "content": "<|im_start|>", 99 | "lstrip": False, 100 | "normalized": False, 101 | "rstrip": False, 102 | "single_word": False, 103 | "special": True 104 | }, 105 | "32005": { 106 | "content": "<|im_end|>", 107 | "lstrip": False, 108 | "normalized": False, 109 | "rstrip": False, 110 | "single_word": False, 111 | "special": True 112 | } 113 | }, 114 | "additional_special_tokens": ["<|im_start|>", "<|im_end|>"], 115 | "bos_token": None, 116 | "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\n你是一个由喵阿姨开发的喵喵小助手<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 117 | "clean_up_tokenization_spaces": False, 118 | "eos_token": "<|im_end|>", 119 | "errors": "replace", 120 | "model_max_length": 32768, 121 | "pad_token": "<|endoftext|>", 122 | "split_special_tokens": False, 123 | "tokenizer_class": "MiaomiaoTokenizer", 124 | "unk_token": 
None 125 | } 126 | 127 | # 保存配置文件 128 | with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w", encoding="utf-8") as config_file: 129 | json.dump(config, config_file, ensure_ascii=False, indent=4) 130 | 131 | print("Tokenizer training completed and saved.") 132 | 133 | def test_tokenizer(): 134 | # 加载保存的分词器 135 | tokenizer = Tokenizer.from_file("./miaomiao_tokenizer/tokenizer.json") 136 | 137 | # 测试分词器 138 | text = "hello world.You are a helpful assistant.今天,我们来训练一个大模型<|im_end|><|endoftext|>" 139 | encoding = tokenizer.encode(text) 140 | 141 | print("Original text:", text) 142 | print("Tokens:", encoding.tokens) 143 | print("Token IDs:", encoding.ids) 144 | # 获取词汇表 145 | vocab = tokenizer.get_vocab() 146 | 147 | # 获取特殊token的ID 148 | special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>", "system", "user", "assistant"] 149 | token_ids = {token: vocab[token] for token in special_tokens if token in vocab} 150 | 151 | print("Special tokens IDs:", token_ids) 152 | eos_token_id = token_ids.get("<|im_end|>", None) 153 | print("EOS token ID:", eos_token_id) 154 | print(vocab.get('<|im_end|>')) 155 | print(tokenizer.token_to_id("<|im_end|>")) 156 | 157 | 158 | 159 | def main(): 160 | 161 | train_tokenizer() 162 | #test_tokenizer() 163 | 164 | if __name__ == '__main__': 165 | main() 166 | --------------------------------------------------------------------------------
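补充示例(示意代码,仅供参考):test_tokenizer() 直接用 tokenizers 库读取 tokenizer.json,此时 system、user、assistant、<|im_start|> 等特殊 token 还没有注册,它们是通过 tokenizer_config.json 的 added_tokens_decoder 追加的(编号 32000-32005)。想做端到端的验证,可以假设 ./miaomiao_tokenizer 目录下已包含 tokenization_miaomiao.py、tokenizer_config.json、vocab.json、merges.txt,再用 AutoTokenizer 加载检查:

# 示意代码(假设分词器目录已按上述方式组织完整)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./miaomiao_tokenizer", trust_remote_code=True)

text = "今天,我们来训练一个大模型"
ids = tokenizer(text)["input_ids"]
print("Token IDs:", ids)
print("Decoded:", tokenizer.decode(ids))

# 特殊 token 由 tokenizer_config.json 的 added_tokens_decoder 注册,预期编号为 32000-32005
for tok in ["system", "user", "assistant", "<|endoftext|>", "<|im_start|>", "<|im_end|>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))
print("eos:", tokenizer.eos_token, tokenizer.eos_token_id)  # 预期 <|im_end|> / 32005
print("pad:", tokenizer.pad_token, tokenizer.pad_token_id)  # 预期 <|endoftext|> / 32003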