├── utils ├── chatglm3_tokenizer │ ├── special_tokens_map.json │ ├── tokenizer.model │ ├── tokenizer_config.json │ └── tokenization_chatglm.py ├── test_tokenizer.py ├── rl_train_process.py ├── sft_train_process.py ├── rm_train_process.py └── pre_train_process.py ├── tokenizer ├── run_train_spm.sh ├── sp_output │ └── chinese_spm_20000.model ├── tinyllm_tokenizer_hf │ ├── tokenizer.model │ ├── added_tokens.json │ ├── special_tokens_map.json │ └── tokenizer_config.json ├── input_dir │ └── llama2_tokenizer │ │ ├── tokenizer.model │ │ ├── special_tokens_map.json │ │ └── tokenizer_config.json ├── train_chinese_sp.py ├── expend_embedding.py ├── expend_tokenizer.py └── README.md ├── doc ├── image │ ├── ppl_gen.png │ ├── web_demo.png │ └── image_w13y9FgYsi.png ├── datasets_download.md ├── README.md ├── data_process.md └── Trainer参数.md ├── requirements.txt ├── .gitignore ├── demo ├── infer_chat.py ├── infer_func.py └── web_demo.py ├── llama.cpp └── README.md ├── script ├── gptq_demo.sh ├── ptm_demo.sh ├── sft_demo.sh ├── dpo_demo.sh └── rm_demo.sh ├── train ├── configuration_tinyllm.py ├── sft_train.py ├── ptm_train.py ├── rm_train.py ├── dpo_train.py └── generation_utils.py ├── quantize └── gptq_quantize.py ├── vllm ├── README.md └── tinyllm.py └── README.md /utils/chatglm3_tokenizer/special_tokens_map.json: -------------------------------------------------------------------------------- 1 | {} 2 | -------------------------------------------------------------------------------- /tokenizer/run_train_spm.sh: -------------------------------------------------------------------------------- 1 | python train_chinese_sp.py > test_train_chinese_sp.log 2>&1 & -------------------------------------------------------------------------------- /doc/image/ppl_gen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/ppl_gen.png -------------------------------------------------------------------------------- /doc/image/web_demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/web_demo.png -------------------------------------------------------------------------------- /doc/image/image_w13y9FgYsi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/image_w13y9FgYsi.png -------------------------------------------------------------------------------- /utils/chatglm3_tokenizer/tokenizer.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/utils/chatglm3_tokenizer/tokenizer.model -------------------------------------------------------------------------------- /tokenizer/sp_output/chinese_spm_20000.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/sp_output/chinese_spm_20000.model -------------------------------------------------------------------------------- /tokenizer/tinyllm_tokenizer_hf/tokenizer.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/tinyllm_tokenizer_hf/tokenizer.model -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | deepspeed==0.10.2 2 | transformers>=4.37.2,<=4.38.2 3 | datasets 4 | accelerate 5 | sentencepiece 6 | streamlit 7 | flask 8 | pandas 9 | numpy -------------------------------------------------------------------------------- /tokenizer/input_dir/llama2_tokenizer/tokenizer.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/input_dir/llama2_tokenizer/tokenizer.model -------------------------------------------------------------------------------- /tokenizer/tinyllm_tokenizer_hf/added_tokens.json: -------------------------------------------------------------------------------- 1 | { 2 | "<|assistant|>": 49955, 3 | "<|im_end|>": 49957, 4 | "<|im_start|>": 49956, 5 | "<|system|>": 49953, 6 | "<|user|>": 49954 7 | } 8 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # 输出文件 2 | __pycache__ 3 | 4 | outputs 5 | data 6 | infer_chat_template.py 7 | infer_base.py 8 | cli_demo.py 9 | infer_test.py 10 | web_demo2.py 11 | demo/fast_api_demo.py 12 | demo/resq.py -------------------------------------------------------------------------------- /doc/datasets_download.md: -------------------------------------------------------------------------------- 1 | ## Tiny LLM Datsets 下载 2 | 3 | ### 1.下载链接 4 | 5 | - 链接:https://pan.baidu.com/s/1I5LYq3tu-_pFl0zu4wMpRg?pwd=tiny 6 | - 提取码:tiny 7 | 8 | ### 2.数据集介绍 9 | 10 | - chatglm3_tokenizer : tokenizer文件夹 11 | - pre_train : 预训练 token 12 | - rl_train : 偏好数据集 13 | - sft_train : 微调数据集 14 | - README.md : 数据集详解 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /doc/README.md: -------------------------------------------------------------------------------- 1 | # Tiny LLM 文档 2 | 3 | 4 | ## 1.Tokenizer 5 | 6 | - [扩充词表](./自定义构造Tokenizer.md) 7 | 8 | ## 2.数据处理 9 | 10 | - [数据下载](./datasets_download.md) 11 | 12 | ## 3.预训练 13 | 14 | ## 4.有监督微调 15 | 16 | ## 5.人类对齐 17 | 18 | ## 6.工具使用 19 | 20 | - [Transformers Trainer参数](./Trainer参数.md) 21 | - [Transformers Generate参数](./Generate参数与解码策略.md) 22 | 23 | -------------------------------------------------------------------------------- /tokenizer/tinyllm_tokenizer_hf/special_tokens_map.json: -------------------------------------------------------------------------------- 1 | { 2 | "bos_token": { 3 | "content": "", 4 | "lstrip": false, 5 | "normalized": false, 6 | "rstrip": false, 7 | "single_word": false 8 | }, 9 | "eos_token": { 10 | "content": "", 11 | "lstrip": false, 12 | "normalized": false, 13 | "rstrip": false, 14 | "single_word": false 15 | }, 16 | "unk_token": { 17 | "content": "", 18 | "lstrip": false, 19 | "normalized": false, 20 | "rstrip": false, 21 | "single_word": false 22 | } 23 | } 24 | -------------------------------------------------------------------------------- /tokenizer/input_dir/llama2_tokenizer/special_tokens_map.json: -------------------------------------------------------------------------------- 1 | { 2 | "bos_token": { 3 | "content": "", 4 | "lstrip": false, 5 | "normalized": true, 6 | "rstrip": false, 7 | "single_word": false 8 | }, 9 | "eos_token": { 10 | "content": "", 11 | "lstrip": false, 12 | "normalized": true, 13 | "rstrip": false, 14 | "single_word": false 15 | }, 16 | "pad_token": "", 17 | "unk_token": { 18 | "content": "", 19 | "lstrip": 
false, 20 | "normalized": true, 21 | "rstrip": false, 22 | "single_word": false 23 | } 24 | } 25 | -------------------------------------------------------------------------------- /tokenizer/input_dir/llama2_tokenizer/tokenizer_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "add_bos_token": true, 3 | "add_eos_token": false, 4 | "bos_token": { 5 | "__type": "AddedToken", 6 | "content": "", 7 | "lstrip": false, 8 | "normalized": true, 9 | "rstrip": false, 10 | "single_word": false 11 | }, 12 | "clean_up_tokenization_spaces": false, 13 | "eos_token": { 14 | "__type": "AddedToken", 15 | "content": "", 16 | "lstrip": false, 17 | "normalized": true, 18 | "rstrip": false, 19 | "single_word": false 20 | }, 21 | "legacy": false, 22 | "model_max_length": 1000000000000000019884624838656, 23 | "pad_token": null, 24 | "sp_model_kwargs": {}, 25 | "tokenizer_class": "LlamaTokenizer", 26 | "unk_token": { 27 | "__type": "AddedToken", 28 | "content": "", 29 | "lstrip": false, 30 | "normalized": true, 31 | "rstrip": false, 32 | "single_word": false 33 | } 34 | } 35 | -------------------------------------------------------------------------------- /demo/infer_chat.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoTokenizer, AutoModelForCausalLM 2 | from transformers.generation import GenerationConfig 3 | 4 | # model_id = "outputs/ckpt/tiny_llm_sft_92m" 5 | model_id = "wdndev/tiny_llm_sft_92m" 6 | model_id = "outputs/tiny_llm_sft_76m_llama" 7 | 8 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) 9 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True) 10 | generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True) 11 | sys_text = "你是由wdndev开发的个人助手。" 12 | # user_text = "世界上最大的动物是什么?" 
13 | # user_text = "介绍一下刘德华。" 14 | user_text = "介绍一下中国。" 15 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 16 | "<|user|>", user_text.strip(), 17 | "<|assistant|>"]).strip() + "\n" 18 | 19 | generation_config.max_new_tokens = 200 20 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device) 21 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config) 22 | generated_ids = [ 23 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 24 | ] 25 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 26 | print(response) 27 | 28 | 29 | -------------------------------------------------------------------------------- /utils/chatglm3_tokenizer/tokenizer_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "added_tokens_decoder": { 3 | "64790": { 4 | "content": "[gMASK]", 5 | "lstrip": false, 6 | "normalized": true, 7 | "rstrip": false, 8 | "single_word": false, 9 | "special": false 10 | }, 11 | "64792": { 12 | "content": "sop", 13 | "lstrip": false, 14 | "normalized": true, 15 | "rstrip": false, 16 | "single_word": false, 17 | "special": false 18 | }, 19 | "64795": { 20 | "content": "<|user|>", 21 | "lstrip": false, 22 | "normalized": true, 23 | "rstrip": false, 24 | "single_word": false, 25 | "special": false 26 | }, 27 | "64796": { 28 | "content": "<|assistant|>", 29 | "lstrip": false, 30 | "normalized": true, 31 | "rstrip": false, 32 | "single_word": false, 33 | "special": false 34 | } 35 | }, 36 | "auto_map": { 37 | "AutoTokenizer": [ 38 | "tokenization_chatglm.ChatGLMTokenizer", 39 | null 40 | ] 41 | }, 42 | "chat_template": "{% for message in messages %}{% if loop.first %}[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}", 43 | "clean_up_tokenization_spaces": false, 44 | "do_lower_case": false, 45 | "eos_token": "", 46 | "model_max_length": 1000000000000000019884624838656, 47 | "pad_token": "", 48 | "padding_side": "left", 49 | "remove_space": false, 50 | "tokenizer_class": "ChatGLMTokenizer", 51 | "unk_token": "" 52 | } 53 | -------------------------------------------------------------------------------- /llama.cpp/README.md: -------------------------------------------------------------------------------- 1 | # Tiny LLM llama.cpp 2 | 3 | ## 1.简介 4 | 5 | Tiny LLM 92M 模型已支持 llama.cpp C++ 推理框架,建议在 linux 环境下测试,windows效果不好; 6 | 7 | 所支持 llama.cpp 为自己修改的版本,仓库链接为: [llama.cpp.tinyllm](https://github.com/wdndev/llama.cpp.tinyllm) 8 | 9 | ### 1.1 llama.cpp 10 | 11 | llama.cpp 是一个C++库,用于简化LLM推理的设置。它使得在本地机器上运行Qwen成为可能。该库是一个纯C/C++实现,不依赖任何外部库,并且针对x86架构提供了AVX、AVX2和AVX512加速支持。此外,它还提供了2、3、4、5、6以及8位量化功能,以加快推理速度并减少内存占用。对于大于总VRAM容量的大规模模型,该库还支持CPU+GPU混合推理模式进行部分加速。本质上,llama.cpp的用途在于运行GGUF(由GPT生成的统一格式)模型。 12 | 13 | ### 1.2 gguf 14 | 15 | GGUF是指一系列经过特定优化,能够在不同硬件上高效运行的大模型格式。这些模型格式包括但不限于原始格式、exl2、finetuned模型(如axolotl、unsloth等)。每种格式都有其特定的应用场景和优化目标,例如加速模型推理、减少模型大小、提高模型准确性等。 16 | 17 | 18 | ## 2.使用 19 | 20 | ### 2.1 准备 21 | 22 | 建议使用 linux 系统 23 | 24 | ```shell 25 | git clone https://github.com/wdndev/llama.cpp.tinyllm 26 | cd llama.cpp.tinyllm 27 | ``` 28 | 29 | 然后运行 make 命令: 30 | 31 | ```shell 32 | make 33 | ``` 34 | 35 | 然后你就能使用 `llama.cpp` 运行GGUF文件。 36 | 37 | ### 2.2 模型转化 38 | 39 | 先需要按照如下所示的方式为fp16模型创建一个GGUF文件: 40 | 41 | ```shell 42 | python convert-hf-to-gguf.py wdndev/tiny_llm_sft_92m 
--outfile models/tinyllm/tinyllm-92m-fp16.gguf 43 | ``` 44 | 45 | 其中,第一个参数指代的是预训练模型所在的路径或者HF模型的名称,第二个参数则指的是想要生成的GGUF文件的路径;在运行命令之前,需要先创建这个目录。 46 | 47 | 下面需要根据实际需求将其量化至低比特位。以下是一个将模型量化至4位的具体示例: 48 | 49 | ```shell 50 | ./llama-quantize models/tinyllm/tinyllm-92m-fp16.gguf models/tiny_llm_92m/tinyllm-92m-q4_0.gguf q4_0 51 | ``` 52 | 53 | 到现在为止,已经完成了将模型量化为4比特,并将其放入GGUF文件中。这里的 q4_0 表示4比特量化。现在,这个量化后的模型可以直接通过llama.cpp运行。 54 | 55 | ### 2.3 推理 56 | 57 | 使用如下命令可以运行模型 58 | 59 | ```shell 60 | ./llama-cli -m ./models/tinyllm/tinyllm-92m-fp16.gguf -p "<|system|>\n你是由wdndev开发的个人助手。\n<|user|>\n请介绍一下北京,你好。\n<|assistant|>\n" -n 128 --repeat-penalty 1.2 --top-p 0.8 --top-k 0 61 | ``` 62 | 63 | `-n` 指的是要生成的最大token数量。这里还有其他超参数供你选择,并且你可以运行 64 | 65 | ```shell 66 | ./llama-cli -h 67 | ``` 68 | 以了解它们 69 | 70 | 71 | -------------------------------------------------------------------------------- /demo/infer_func.py: -------------------------------------------------------------------------------- 1 | import json 2 | from transformers import AutoModelForCausalLM, AutoTokenizer 3 | from transformers.generation.utils import GenerationConfig 4 | 5 | def load_model_tokenizer(model_id: str): 6 | model = AutoModelForCausalLM.from_pretrained( 7 | model_id, 8 | device_map="auto", 9 | trust_remote_code=True 10 | ) 11 | tokenizer = AutoTokenizer.from_pretrained( 12 | model_id, 13 | use_fast=False, 14 | trust_remote_code=True 15 | ) 16 | generation_config = GenerationConfig.from_pretrained(model_id) 17 | return model, tokenizer, generation_config 18 | 19 | def tinyllm_infer(text: str, 20 | model: AutoModelForCausalLM, 21 | tokenizer: AutoTokenizer, 22 | generation_config: GenerationConfig 23 | ): 24 | sys_text = "你是由wdndev开发的个人助手。" 25 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 26 | "<|user|>", text.strip(), 27 | "<|assistant|>"]).strip() + "\n" 28 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device) 29 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config) 30 | generated_ids = [ 31 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 32 | ] 33 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 34 | 35 | return response 36 | 37 | def main(): 38 | model_id = "outputs/ckpt/tiny_llm_sft_92m" 39 | # model_id = "wdndev/tiny_llm_sft_92m" 40 | 41 | model, tokenizer, generation_config = load_model_tokenizer(model_id) 42 | generation_config.max_new_tokens = 200 43 | response = tinyllm_infer("介绍一下中国", model, tokenizer, generation_config) 44 | 45 | print(response) 46 | 47 | if __name__ == "__main__": 48 | main() -------------------------------------------------------------------------------- /tokenizer/train_chinese_sp.py: -------------------------------------------------------------------------------- 1 | import sentencepiece as spm 2 | import os 3 | import glob 4 | 5 | def tain_chinses_spm(input_txt_dir, vocab_size, output_dir="."): 6 | # 保存的模型名称 7 | prefix = os.path.join(output_dir, f"test_chinese_spm_{vocab_size}") 8 | 9 | text_filenames = sorted(glob.glob(os.path.join(input_txt_dir, "*.txt"))) 10 | print("file list: ", text_filenames) 11 | 12 | # 2) train the sentencepiece model 13 | print("Will now train the vocab...") 14 | spm.SentencePieceTrainer.train(input=text_filenames, 15 | model_prefix=prefix, 16 | model_type="bpe", 17 | vocab_size=vocab_size, 18 | self_test_sample_size=0, 19 | input_format="text", 20 | character_coverage=0.9995, 21 | num_threads=os.cpu_count(), 22 | 
split_digits=True, # 是否将数字划分为单个 token, 在 llama 中是这么做的 23 | allow_whitespace_only_pieces=True, 24 | byte_fallback=True, 25 | unk_surface=r" \342\201\207 ", 26 | max_sentence_length=24000) 27 | 28 | 29 | print(f"Trained tokenizer is in {prefix}.model") 30 | print("Done.") 31 | 32 | def test_chinese_spm(spm_model_path): 33 | sp_bpe = spm.SentencePieceProcessor() 34 | sp_bpe.load(spm_model_path) 35 | print('*** BPE ***') 36 | print(sp_bpe.encode_as_pieces('翻译下面的句子为英文:有朋自远方来,不亦乐乎')) 37 | print(len(sp_bpe.encode_as_pieces('翻译下面的句子为英文:有朋自远方来,不亦乐乎'))) 38 | 39 | 40 | if __name__ == "__main__": 41 | input_txt_dir = "baike_txt" 42 | vocab_size = 20000 43 | output_dir = "sp_output" 44 | tain_chinses_spm(input_txt_dir, vocab_size, output_dir) 45 | 46 | # test_chinese_spm("sp_output/chinese_spm_20000.model") -------------------------------------------------------------------------------- /script/gptq_demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | 5 | # GPTQ parameters 6 | BITS=4 # [4, 8] 7 | GROUP_SIZE=128 8 | DAMP_PERCENT=0.1 9 | DESC_ACT=False 10 | STATIC_GROUPS=False 11 | SYM=True 12 | TRUE_SEQUENTIAL=True 13 | 14 | # training parameters 15 | MAX_LEN=8192 16 | BATCH_SIZE=1 17 | CACHE_ON_GPU=False 18 | USR_TRITON=False 19 | 20 | # basic setting 21 | MODEL_PATH="/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mtai/users/wangdongnian/outputs/ckpt/qwen2_7b_sft_v10_qwen2-20240724-173728/iter_0007484/huggingface_format" 22 | OUTPUT_DIR="outputs" 23 | DATASEY_PATH="quant_data/v10_new_prompt_data.jsonl" 24 | N_GPUS=4 25 | GPU_MAX_MEMORY=20 26 | MODEL_NAME="qwen2_7b_v10_new_prompt" 27 | 28 | OUTPUT_MODEL_PATH=${OUTPUT_DIR}/${MODEL_NAME}_gptq_int${BITS} 29 | mkdir -p $OUTPUT_MODEL_PATH 30 | QUANT_LOG="${OUTPUT_MODEL_PATH}/quantize_$(date "+%Y%m%d%H%M").log" 31 | 32 | GPTQ_ARGS=" \ 33 | --bits ${BITS} \ 34 | --group_size ${GROUP_SIZE} \ 35 | --damp_percent ${DAMP_PERCENT} \ 36 | --desc_act ${DESC_ACT} \ 37 | --static_groups ${STATIC_GROUPS} \ 38 | --sym ${SYM} \ 39 | --true_sequential ${TRUE_SEQUENTIAL} \ 40 | " 41 | 42 | TRAIN_ARGS=" \ 43 | --max_len ${MAX_LEN} \ 44 | --batch_size ${BATCH_SIZE} \ 45 | --cache_examples_on_gpu ${CACHE_ON_GPU} \ 46 | --use_triton ${USR_TRITON} \ 47 | " 48 | 49 | SCRIPT_ARGS=" \ 50 | --model_id ${MODEL_PATH} \ 51 | --dataset_dir_or_path ${DATASEY_PATH} \ 52 | --quant_output_dir ${OUTPUT_MODEL_PATH} \ 53 | --ngpus ${N_GPUS} \ 54 | --gpu_max_memory ${GPU_MAX_MEMORY} \ 55 | " 56 | 57 | ALL_ARGS=" $GPTQ_ARGS $TRAIN_ARGS $SCRIPT_ARGS " 58 | 59 | LAUNCHER="python quantize/gptq_quantize.py " 60 | 61 | # Combine all arguments into one command 62 | CMD="$LAUNCHER $ALL_ARGS" 63 | 64 | # Print the command that will be executed for debugging purposes 65 | echo $CMD 66 | 67 | # Execute the quantization process and redirect all output to the log file 68 | nohup $CMD > ${QUANT_LOG} 2>&1 & 69 | 70 | # Notify the user about the location of the log file 71 | echo "Running successfully. 
The logs are saved in ${QUANT_LOG}" -------------------------------------------------------------------------------- /train/configuration_tinyllm.py: -------------------------------------------------------------------------------- 1 | from transformers.configuration_utils import PretrainedConfig 2 | from transformers.utils import logging 3 | 4 | 5 | logger = logging.get_logger(__name__) 6 | 7 | 8 | class TinyllmConfig(PretrainedConfig): 9 | """ TinyLLM 配置文件 10 | """ 11 | 12 | model_type = "tinyllm" 13 | keys_to_ignore_at_inference = ["past_key_values"] 14 | 15 | def __init__( 16 | self, 17 | vocab_size=64797, 18 | hidden_size=4096, 19 | intermediate_size=11008, 20 | num_hidden_layers=32, 21 | num_attention_heads=32, 22 | num_key_value_heads=None, 23 | hidden_act="silu", 24 | max_position_embeddings=2048, 25 | initializer_range=0.02, 26 | rms_norm_eps=1e-6, 27 | use_cache=True, 28 | pad_token_id=None, 29 | bos_token_id=None, 30 | eos_token_id=None, 31 | tie_word_embeddings=False, 32 | rope_theta=10000.0, 33 | attention_dropout=0.0, 34 | **kwargs 35 | ): 36 | self.vocab_size = vocab_size 37 | self.max_position_embeddings = max_position_embeddings 38 | self.hidden_size = hidden_size 39 | self.intermediate_size = intermediate_size 40 | self.num_hidden_layers = num_hidden_layers 41 | self.num_attention_heads = num_attention_heads 42 | 43 | # for backward compatibility 44 | if num_key_value_heads is None: 45 | num_key_value_heads = num_attention_heads 46 | 47 | self.num_key_value_heads = num_key_value_heads 48 | self.hidden_act = hidden_act 49 | self.initializer_range = initializer_range 50 | self.rms_norm_eps = rms_norm_eps 51 | self.use_cache = use_cache 52 | self.rope_theta = rope_theta 53 | self.attention_dropout = attention_dropout 54 | 55 | super().__init__( 56 | pad_token_id=pad_token_id, 57 | bos_token_id=bos_token_id, 58 | eos_token_id=eos_token_id, 59 | tie_word_embeddings=tie_word_embeddings, 60 | **kwargs 61 | ) 62 | -------------------------------------------------------------------------------- /tokenizer/tinyllm_tokenizer_hf/tokenizer_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "add_bos_token": true, 3 | "add_eos_token": false, 4 | "add_prefix_space": true, 5 | "added_tokens_decoder": { 6 | "0": { 7 | "content": "", 8 | "lstrip": false, 9 | "normalized": false, 10 | "rstrip": false, 11 | "single_word": false, 12 | "special": true 13 | }, 14 | "1": { 15 | "content": "", 16 | "lstrip": false, 17 | "normalized": false, 18 | "rstrip": false, 19 | "single_word": false, 20 | "special": true 21 | }, 22 | "2": { 23 | "content": "", 24 | "lstrip": false, 25 | "normalized": false, 26 | "rstrip": false, 27 | "single_word": false, 28 | "special": true 29 | }, 30 | "49953": { 31 | "content": "<|system|>", 32 | "lstrip": false, 33 | "normalized": true, 34 | "rstrip": false, 35 | "single_word": false, 36 | "special": false 37 | }, 38 | "49954": { 39 | "content": "<|user|>", 40 | "lstrip": false, 41 | "normalized": true, 42 | "rstrip": false, 43 | "single_word": false, 44 | "special": false 45 | }, 46 | "49955": { 47 | "content": "<|assistant|>", 48 | "lstrip": false, 49 | "normalized": true, 50 | "rstrip": false, 51 | "single_word": false, 52 | "special": false 53 | }, 54 | "49956": { 55 | "content": "<|im_start|>", 56 | "lstrip": false, 57 | "normalized": true, 58 | "rstrip": false, 59 | "single_word": false, 60 | "special": false 61 | }, 62 | "49957": { 63 | "content": "<|im_end|>", 64 | "lstrip": false, 65 | "normalized": true, 66 | 
"rstrip": false, 67 | "single_word": false, 68 | "special": false 69 | } 70 | }, 71 | "bos_token": "", 72 | "clean_up_tokenization_spaces": false, 73 | "eos_token": "", 74 | "legacy": true, 75 | "model_max_length": 1000000000000000019884624838656, 76 | "pad_token": null, 77 | "sp_model_kwargs": {}, 78 | "spaces_between_special_tokens": false, 79 | "tokenizer_class": "LlamaTokenizer", 80 | "unk_token": "", 81 | "use_default_system_prompt": false 82 | } 83 | -------------------------------------------------------------------------------- /doc/data_process.md: -------------------------------------------------------------------------------- 1 | ## Tiny LLM 数据处理 2 | 3 | 项目所采用的数据,都是开源数据集,大部分来自[Hugging Face](https://huggingface.co/),详细数据集列表如下: 4 | 5 | ## 预训练数据 6 | 7 | 本次训练的预训练预料都来自[Hugging Face](https://huggingface.co/),主要包含以下几个经典的中文数据集,大约有35B左右Token,详细数据集如下: 8 | 9 | | 中文预训练语料 | 链接 | 描述 | 10 | | ----------------- | ------------------------------------------------------------ | ----------------------------------------------- | 11 | | Wiki中文百科 | [wikipedia](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) | 中文Wikipedia的数据 | 12 | | BaiduBaiKe | [baidubaike](https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M) | 中文BaiduBaiKe的数据 | 13 | | zhihu | [zhihu](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | 知乎KOL中截取的数据 | 14 | | 网络小说 | [webnovel](https://huggingface.co/datasets/wdndev/webnovel-chinese) | 个人爬虫数据清洗的数据 | 15 | | TigerBot 部分数据 | [tigerBot](https://huggingface.co/datasets/TigerResearch/pretrain_zh) | TigerBot 模型训练的部分中文数据,原始数据太多了 | 16 | | | | | 17 | 18 | 上述数据处理脚本为,在处理时,Tokenizer后保存为可直接训练的二进制文件(`.bin`)。 19 | 20 | 注意:此处使用二进制文件保存,不需要考虑每个 max_seq_len 的长度,尽可能压缩存储空间。后续的SFT执行微调数据和RLHF数据集是较小,不需要提前保存为二进制文件。 21 | 22 | 23 | ## 微调数据 24 | 25 | SFT指令微调预料都来自[Hugging Face](https://huggingface.co/),主要包含以下几个经典的SFT数据集,大约有400w条,详细数据集如下: 26 | 27 | | SFT微调数据 | 链接 | 描述 | 28 | | ----------- | ------------------------------------------------------------ | ------------------------------------------ | 29 | | Belle | [Belle](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | 包含约200万条由BELLE项目生成的中文指令数据 | 30 | | Firefly | [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | 流萤开源模型SFT数据集 | 31 | | TigerBot | [tigerBot](https://huggingface.co/datasets/TigerResearch/sft_zh) | TigerBot 模型SFT数据集 | 32 | | | | | 33 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /utils/test_tokenizer.py: -------------------------------------------------------------------------------- 1 | from chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer 2 | import json 3 | import torch 4 | 5 | def process_func(prompt_txt:str, user_txt:str, assistant_txt:str, max_length=512): 6 | input_ids, labels = [], [] 7 | prompt = [tokenizer.get_command("<|system|>")] + tokenizer.encode(prompt_txt + "\n", add_special_tokens=False) 8 | instruction_ = [tokenizer.get_command("<|user|>")] + tokenizer.encode(user_txt.strip() + "\n", add_special_tokens=False,max_length=max_length) + [tokenizer.get_command("<|assistant|>")] 9 | instruction = prompt + instruction_ 10 | response = tokenizer.encode(assistant_txt.strip(), add_special_tokens=False) 11 | input_ids = instruction + response + [tokenizer.eos_token_id] 12 | labels = [tokenizer.pad_token_id] * len(instruction) + response + [tokenizer.eos_token_id] 13 | pad_len = max_length - len(input_ids) 14 | # print() 15 | input_ids += [tokenizer.pad_token_id] * pad_len 16 | labels += [tokenizer.pad_token_id] * pad_len 
17 | labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels] 18 | 19 | input_ids = torch.LongTensor(input_ids) 20 | labels = torch.LongTensor(labels) 21 | attention_mask = input_ids.ne(tokenizer.pad_token_id) 22 | 23 | return { 24 | "input_ids": input_ids, 25 | "labels": labels, 26 | "attention_mask": attention_mask, 27 | } 28 | 29 | if __name__=="__main__": 30 | tokenizer = ChatGLMTokenizer(vocab_file='chatglm3_tokenizer/tokenizer.model') 31 | 32 | sys_text = "你是由wdndev开发的个人助手。" 33 | user_text = "介绍一下中国。" 34 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 35 | "<|user|>", user_text.strip(), 36 | "<|assistant|>"]).strip() + "\n" 37 | 38 | model_inputs = tokenizer([input_txt], return_tensors="pt") 39 | 40 | print(tokenizer.batch_decode(model_inputs["input_ids"])) 41 | 42 | messages = [ 43 | {"role": "system", "content": "你是由wdndev开发的个人助手。"}, 44 | {"role": "system", "content": "介绍一下中国。"} 45 | ] 46 | # print(tokenizer.chat_template) 47 | 48 | text = tokenizer.apply_chat_template( 49 | messages, 50 | tokenize=False, 51 | add_generation_prompt=True 52 | ) 53 | model_inputs = tokenizer([text], return_tensors="pt") 54 | print(tokenizer.batch_decode(model_inputs["input_ids"])) 55 | 56 | 57 | -------------------------------------------------------------------------------- /tokenizer/expend_embedding.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | import torch 3 | 4 | model_id = "outputs/ckpt/tiny_llm_sft_92m" 5 | 6 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True) 7 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) 8 | 9 | new_tokenizer = AutoTokenizer.from_pretrained("tokenizer/tinyllm_tokenizer_hf") 10 | print(len(new_tokenizer)) # 49958 11 | print(model) 12 | """ 13 | TinyllmForCausalLM( 14 | (model): TinyllmModel( 15 | (embed_tokens): Embedding(64798, 512) 16 | (layers): ModuleList( 17 | (0-7): 8 x TinyllmDecoderLayer( 18 | (self_attn): TinyllmSdpaAttention( 19 | (q_proj): Linear(in_features=512, out_features=512, bias=True) 20 | (k_proj): Linear(in_features=512, out_features=512, bias=True) 21 | (v_proj): Linear(in_features=512, out_features=512, bias=True) 22 | (o_proj): Linear(in_features=512, out_features=512, bias=False) 23 | (rotary_emb): TinyllmRotaryEmbedding() 24 | ) 25 | (mlp): TinyllmMLP( 26 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False) 27 | (up_proj): Linear(in_features=512, out_features=1408, bias=False) 28 | (down_proj): Linear(in_features=1408, out_features=512, bias=False) 29 | (act_fn): SiLU() 30 | ) 31 | (input_layernorm): TinyllmRMSNorm() 32 | (post_attention_layernorm): TinyllmRMSNorm() 33 | ) 34 | ) 35 | (norm): TinyllmRMSNorm() 36 | ) 37 | (lm_head): Linear(in_features=512, out_features=64798, bias=False) 38 | ) 39 | """ 40 | 41 | embeddings = model.get_input_embeddings() 42 | model.resize_token_embeddings(49958) 43 | model.config.vocab_size = 49958 44 | 45 | print(model) 46 | """ 47 | TinyllmForCausalLM( 48 | (model): TinyllmModel( 49 | (embed_tokens): Embedding(49958, 512) 50 | (layers): ModuleList( 51 | (0-7): 8 x TinyllmDecoderLayer( 52 | (self_attn): TinyllmSdpaAttention( 53 | (q_proj): Linear(in_features=512, out_features=512, bias=True) 54 | (k_proj): Linear(in_features=512, out_features=512, bias=True) 55 | (v_proj): Linear(in_features=512, out_features=512, bias=True) 56 | (o_proj): Linear(in_features=512, out_features=512, bias=False) 
57 | (rotary_emb): TinyllmRotaryEmbedding() 58 | ) 59 | (mlp): TinyllmMLP( 60 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False) 61 | (up_proj): Linear(in_features=512, out_features=1408, bias=False) 62 | (down_proj): Linear(in_features=1408, out_features=512, bias=False) 63 | (act_fn): SiLU() 64 | ) 65 | (input_layernorm): TinyllmRMSNorm() 66 | (post_attention_layernorm): TinyllmRMSNorm() 67 | ) 68 | ) 69 | (norm): TinyllmRMSNorm() 70 | ) 71 | (lm_head): Linear(in_features=512, out_features=49958, bias=False) 72 | ) 73 | """ 74 | 75 | output_dir = "outputs/sft_92m_llama" 76 | 77 | model.save_pretrained(output_dir) 78 | new_tokenizer.save_pretrained(output_dir) 79 | -------------------------------------------------------------------------------- /demo/web_demo.py: -------------------------------------------------------------------------------- 1 | import json 2 | import streamlit as st 3 | from transformers import AutoModelForCausalLM, AutoTokenizer 4 | from transformers.generation.utils import GenerationConfig 5 | 6 | 7 | st.set_page_config(page_title="Tiny LLM 92M Demo") 8 | st.title("Tiny LLM 92M Demo") 9 | 10 | # model_id = "outputs/ckpt/tiny_llm_sft_92m" 11 | model_id = "wdndev/tiny_llm_sft_92m" 12 | 13 | @st.cache_resource 14 | def load_model_tokenizer(): 15 | model = AutoModelForCausalLM.from_pretrained( 16 | model_id, 17 | device_map="auto", 18 | trust_remote_code=True 19 | ) 20 | tokenizer = AutoTokenizer.from_pretrained( 21 | model_id, 22 | use_fast=False, 23 | trust_remote_code=True 24 | ) 25 | generation_config = GenerationConfig.from_pretrained(model_id) 26 | return model, tokenizer, generation_config 27 | 28 | 29 | def clear_chat_messages(): 30 | del st.session_state.messages 31 | 32 | 33 | def init_chat_messages(): 34 | with st.chat_message("assistant", avatar='🤖'): 35 | st.markdown("您好,我是由wdndev开发的个人助手,很高兴为您服务😄") 36 | 37 | if "messages" in st.session_state: 38 | for message in st.session_state.messages: 39 | avatar = "🧑‍💻" if message["role"] == "user" else "🤖" 40 | with st.chat_message(message["role"], avatar=avatar): 41 | st.markdown(message["content"]) 42 | else: 43 | st.session_state.messages = [] 44 | 45 | return st.session_state.messages 46 | 47 | 48 | max_new_tokens = st.sidebar.slider("max_new_tokens", 0, 1024, 512, step=1) 49 | top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01) 50 | top_k = st.sidebar.slider("top_k", 0, 100, 0, step=1) 51 | temperature = st.sidebar.slider("temperature", 0.0, 2.0, 1.0, step=0.01) 52 | do_sample = st.sidebar.checkbox("do_sample", value=True) 53 | 54 | def main(): 55 | model, tokenizer, generation_config = load_model_tokenizer() 56 | messages = init_chat_messages() 57 | 58 | if prompt := st.chat_input("Shift + Enter 换行, Enter 发送"): 59 | with st.chat_message("user", avatar='🧑‍💻'): 60 | st.markdown(prompt) 61 | with st.chat_message("assistant", avatar='🤖'): 62 | placeholder = st.empty() 63 | 64 | generation_config.max_new_tokens = max_new_tokens 65 | generation_config.top_p = top_p 66 | generation_config.top_k = top_k 67 | generation_config.temperature = temperature 68 | generation_config.do_sample = do_sample 69 | print("generation_config: ", generation_config) 70 | 71 | sys_text = "你是由wdndev开发的个人助手。" 72 | messages.append({"role": "user", "content": prompt}) 73 | user_text = prompt 74 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 75 | "<|user|>", user_text.strip(), 76 | "<|assistant|>"]).strip() + "\n" 77 | 78 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device) 79 | 
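            # generate() returns the prompt tokens followed by the newly sampled tokens,
            # so the slice below drops the first len(input_ids) ids from each sequence and
            # keeps only the model's reply before decoding. Sampling behaviour is governed
            # by the generation_config values set from the sidebar sliders above
            # (max_new_tokens, top_p, top_k, temperature, do_sample).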
generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config) 80 | generated_ids = [ 81 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 82 | ] 83 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 84 | placeholder.markdown(response) 85 | 86 | messages.append({"role": "assistant", "content": response}) 87 | print("messages: ", json.dumps(response, ensure_ascii=False), flush=True) 88 | 89 | st.button("清空对话", on_click=clear_chat_messages) 90 | 91 | 92 | if __name__ == "__main__": 93 | main() -------------------------------------------------------------------------------- /utils/rl_train_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import glob 4 | import numpy as np 5 | from tqdm import tqdm 6 | import pandas as pd 7 | import csv 8 | 9 | 10 | def merge_datsets(input_dir): 11 | total_lines = [] 12 | for subdir, dirs, files in os.walk(input_dir): 13 | for idx, file in enumerate(files): 14 | # 只处理txt文件 15 | if file.endswith('.jsonl'): 16 | # 获取当前文件的绝对路径 17 | file_path = os.path.join(subdir, file) 18 | print(file_path) 19 | # 读取jsonl文件 20 | with open(file_path, 'r', encoding='utf-8') as infile: 21 | lines = infile.readlines() 22 | 23 | for line in tqdm(lines): 24 | json_obj = json.loads(line) # 解析json字符串为python对象 25 | 26 | prompt_text = json_obj["prompt"] 27 | chosen_text = json_obj["pos_resp"] 28 | rejected_text = json_obj["neg_resp"] 29 | 30 | data_dict = { 31 | "prompt": prompt_text, 32 | "chosen": chosen_text, 33 | "rejected": rejected_text 34 | } 35 | 36 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 37 | total_lines.append(processed_line) 38 | 39 | if file.endswith('.parquet'): 40 | # 获取当前文件的绝对路径 41 | file_path = os.path.join(subdir, file) 42 | print(file_path) 43 | # 读取jsonl文件 44 | df = pd.read_parquet(file_path) 45 | 46 | for idx, row in tqdm(df.iterrows(), total=len(df)): 47 | prompt_text = row['prompt'] 48 | chosen_text = row['chosen'] 49 | rejected_text = row['rejected'] 50 | 51 | data_dict = { 52 | "prompt": prompt_text, 53 | "chosen": chosen_text, 54 | "rejected": rejected_text 55 | } 56 | 57 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 58 | total_lines.append(processed_line) 59 | 60 | if file.endswith('.tsv'): 61 | # 获取当前文件的绝对路径 62 | file_path = os.path.join(subdir, file) 63 | print(file_path) 64 | # 读取jsonl文件 65 | df = pd.read_csv(file_path, sep='\t') 66 | 67 | for idx, row in tqdm(df.iterrows(), total=len(df)): 68 | prompt_text = row['prompt'] 69 | chosen_text = row['chosen'] 70 | rejected_text = row['rejected'] 71 | 72 | data_dict = { 73 | "prompt": prompt_text, 74 | "chosen": chosen_text, 75 | "rejected": rejected_text 76 | } 77 | 78 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 79 | total_lines.append(processed_line) 80 | 81 | # 如果输出子文件夹不存在,则创建它 82 | output_subfolder = "data/rl_train" 83 | if not os.path.exists(output_subfolder): 84 | os.makedirs(output_subfolder) 85 | 86 | # 保存处理后的csv文件到对应的输出子文件夹 87 | output_file_path = os.path.join(output_subfolder, "rl_data.jsonl") 88 | # 将处理后的json对象写入新的jsonl文件 89 | with open(output_file_path, 'w') as outfile: 90 | for line in total_lines: 91 | outfile.write(line) 92 | 93 | 94 | if __name__=="__main__": 95 | merge_datsets("corpus/rm_train") 96 | 97 | -------------------------------------------------------------------------------- /tokenizer/expend_tokenizer.py: 
-------------------------------------------------------------------------------- 1 | 2 | import os 3 | # 设置环境变量,指定protobuf的Python实现为纯Python版本 4 | os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python" 5 | from transformers import LlamaTokenizer 6 | from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model 7 | import sentencepiece as spm 8 | import argparse 9 | import json 10 | 11 | def merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir="tinyllm_tokenizer_hf"): 12 | # 加载LlamaTokenizer 13 | llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir) 14 | # 中文sentencepiece模型 15 | chinese_sp_model = spm.SentencePieceProcessor() 16 | chinese_sp_model.Load(chinese_sp_model_file) 17 | 18 | # 将LlamaTokenizer加载为protobuf模型对象 19 | llama_spm = sp_pb2_model.ModelProto() 20 | llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto()) 21 | # 将中文模型加载为protobuf模型对象 22 | chinese_spm = sp_pb2_model.ModelProto() 23 | chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto()) 24 | 25 | # 打印基本信息 26 | print("llama token nums: ", len(llama_tokenizer)) 27 | print("chinese sp nums: ", len(chinese_sp_model)) 28 | 29 | # 向 LLaMA 的 tokenizer 中添加中文 tokens 30 | ## 1.首先创建一个set包含所有LLaMA的tokens以加速查找 31 | llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces) 32 | ## 2.遍历中文模型的tokens,如果不在 LLaMA 的 tokens 集合中,则添加至 LLaMA 模型 33 | for p in chinese_spm.pieces: 34 | piece = p.piece 35 | if piece not in llama_spm_tokens_set: 36 | # 创建新的SentencePiece对象 37 | new_p = sp_pb2_model.ModelProto().SentencePiece() 38 | new_p.piece = piece # 设置token内容 39 | new_p.score = 0 # 设置默认的分数 40 | llama_spm.pieces.append(new_p) # 添加到LLaMA的模型pieces中 41 | 42 | 43 | # 保存合并后的模型 44 | output_sp_dir = 'tmp_tinyllm_tokenizer_sp' # 保存sentencepiece模型的目录 45 | os.makedirs(output_sp_dir, exist_ok=True) # 确保目录存在 46 | # 保存sentencepiece模型到文件 47 | with open(output_sp_dir + '/tokenizer.model', 'wb') as f: 48 | f.write(llama_spm.SerializeToString()) 49 | 50 | # 使用新生成的vocab文件初始化LlamaTokenizer,并保存为 Hugging Face 格式 51 | tokenizer = LlamaTokenizer(vocab_file = output_sp_dir + '/tokenizer.model', legacy=True) 52 | ## 添加特殊 token 53 | custom_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|im_start|>", "<|im_end|>"] 54 | for token in custom_special_tokens: 55 | tokenizer.add_tokens(token) 56 | 57 | # vocab_dict = tokenizer.get_vocab() 58 | # with open('vocab_utf8.txt', 'w', encoding='utf-8') as f: 59 | # json.dump(vocab_dict, f, indent=4) 60 | tokenizer.save_pretrained(output_hf_dir) 61 | print(f"tinyllm token num: {len(tokenizer)}") 62 | print(f"Tiny LLM tokenizer has been saved to {output_hf_dir}") 63 | 64 | def test_tokenizer(hf_tokenizer_dir): 65 | tinyllm_tokenizer = LlamaTokenizer.from_pretrained(hf_tokenizer_dir) 66 | print("tinyllm tokenizer nums: ", len(tinyllm_tokenizer)) 67 | 68 | sys_text = "你是由wdndev开发的个人助手。" 69 | user_text = "翻译下面的句子为英文:有朋自远方来,不亦乐乎" 70 | answer_text = "It is always a pleasure to greet a friend from afar." 
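    # The prompt assembled below follows the project's chat layout:
    # <|system|>\n{system}\n<|user|>\n{question}\n<|assistant|>\n{answer}
    # The role markers were registered with add_tokens() in merge_tokenizer(), so they should
    # encode as single ids; the encode/decode round trip below is a quick sanity check that
    # the merged tokenizer handles mixed Chinese/English text and these markers without loss.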
71 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 72 | "<|user|>", user_text.strip(), 73 | "<|assistant|>"]).strip() + "\n" + answer_text.strip() 74 | 75 | print("-----input text: \n", input_txt) 76 | 77 | encode_ids = tinyllm_tokenizer.encode(input_txt, add_special_tokens=False) 78 | print("-----encode ids: \n", encode_ids) 79 | 80 | decode_ids = tinyllm_tokenizer.decode(encode_ids) 81 | print("-----dencode ids: \n", decode_ids) 82 | 83 | 84 | if __name__ == "__main__": 85 | 86 | llama_tokenizer_dir = "input_dir/llama2_tokenizer" 87 | chinese_sp_model_file = "sp_output/chinese_spm_20000.model" 88 | output_hf_dir = "tinyllm_tokenizer_hf" 89 | 90 | # merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir) 91 | 92 | test_tokenizer(output_hf_dir) -------------------------------------------------------------------------------- /utils/sft_train_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import glob 4 | import numpy as np 5 | from tqdm import tqdm 6 | import pandas as pd 7 | import csv 8 | import json 9 | 10 | #from zhconv import convert 11 | 12 | def process_bell_2m(file_path): 13 | """ https://huggingface.co/datasets/BelleGroup/train_2M_CN 14 | """ 15 | 16 | total_lines = [] 17 | with open(file_path, 'r', encoding='utf-8') as infile: 18 | lines = infile.readlines() 19 | for line in tqdm(lines): 20 | json_obj = json.loads(line) # 解析json字符串为python对象 21 | 22 | instruction = json_obj["instruction"] 23 | input_str = json_obj["input"] 24 | answer = json_obj["output"] 25 | 26 | question = instruction + input_str 27 | 28 | data_dict = { 29 | "question": question, 30 | "answer": answer 31 | } 32 | 33 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 34 | total_lines.append(processed_line) 35 | 36 | return total_lines 37 | 38 | def process_nlp(file_path): 39 | """ https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M 40 | """ 41 | total_lines = [] 42 | with open(file_path, 'r', encoding='utf-8') as infile: 43 | lines = infile.readlines() 44 | for line in tqdm(lines): 45 | json_obj = json.loads(line) # 解析json字符串为python对象 46 | 47 | # instruction = json_obj["instruction"] 48 | question = json_obj["input"] 49 | answer = json_obj["target"] 50 | 51 | data_dict = { 52 | "question": question, 53 | "answer": answer 54 | } 55 | 56 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 57 | total_lines.append(processed_line) 58 | 59 | return total_lines 60 | 61 | def process_tigerbot_sft(input_dir): 62 | """ https://huggingface.co/datasets/TigerResearch/sft_zh 63 | """ 64 | total_lines = [] 65 | for subdir, dirs, files in os.walk(input_dir): 66 | for idx, file in enumerate(files): 67 | # 只处理txt文件 68 | if file.endswith('.json'): 69 | # 获取当前文件的绝对路径 70 | file_path = os.path.join(subdir, file) 71 | print(file_path) 72 | # 读取jsonl文件 73 | with open(file_path, 'r', encoding='utf-8') as infile: 74 | lines = infile.readlines() 75 | 76 | for line in tqdm(lines): 77 | json_obj = json.loads(line) # 解析json字符串为python对象 78 | 79 | instruction = json_obj["instruction"] 80 | input_str = json_obj["input"] 81 | answer = json_obj["output"] 82 | 83 | question = instruction + input_str 84 | 85 | data_dict = { 86 | "question": question, 87 | "answer": answer 88 | } 89 | 90 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 91 | total_lines.append(processed_line) 92 | 93 | return total_lines 94 | 95 | 96 | if __name__=="__main__": 97 | 98 | total_lines = 
process_bell_2m("corpus/sft_train/bell_2m/train_2M_CN.json") 99 | print("bell 2m: ", len(total_lines)) 100 | nlp_total_lines = process_nlp("corpus/sft_train/nlp/firefly-train-1.1M.jsonl") 101 | print("nlp: ", len(nlp_total_lines)) 102 | 103 | total_lines.extend(nlp_total_lines) 104 | 105 | tigerbot_total_lines = process_tigerbot_sft("corpus/sft_train/tigerbot") 106 | print("tigerbot: ", len(tigerbot_total_lines)) 107 | 108 | total_lines.extend(tigerbot_total_lines) 109 | 110 | print("all: ", len(total_lines)) 111 | 112 | # 如果输出子文件夹不存在,则创建它 113 | output_subfolder = "data/sft_train" 114 | if not os.path.exists(output_subfolder): 115 | os.makedirs(output_subfolder) 116 | 117 | # 保存处理后的csv文件到对应的输出子文件夹 118 | output_file_path = os.path.join(output_subfolder, "sft_data_test.jsonl") 119 | # 将处理后的json对象写入新的jsonl文件 120 | with open(output_file_path, 'w') as outfile: 121 | for line in total_lines: 122 | outfile.write(line) 123 | 124 | 125 | -------------------------------------------------------------------------------- /utils/rm_train_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import glob 4 | import numpy as np 5 | from tqdm import tqdm 6 | import pandas as pd 7 | import csv 8 | import random 9 | 10 | 11 | def merge_datsets(input_dir): 12 | total_lines = [] 13 | for subdir, dirs, files in os.walk(input_dir): 14 | for idx, file in enumerate(files): 15 | # 只处理txt文件 16 | if file.endswith('.jsonl'): 17 | # https://www.modelscope.cn/datasets/iic/CValues-Comparison/summary 18 | # 获取当前文件的绝对路径 19 | file_path = os.path.join(subdir, file) 20 | print(file_path) 21 | # 读取jsonl文件 22 | with open(file_path, 'r', encoding='utf-8') as infile: 23 | lines = infile.readlines() 24 | 25 | for line in tqdm(lines): 26 | json_obj = json.loads(line) # 解析json字符串为python对象 27 | 28 | prompt_text = json_obj["prompt"] 29 | chosen_text = json_obj["pos_resp"] 30 | rejected_text = json_obj["neg_resp"] 31 | 32 | data_dict = { 33 | "prompt": prompt_text, 34 | "chosen": chosen_text, 35 | "rejected": rejected_text 36 | } 37 | 38 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 39 | total_lines.append(processed_line) 40 | 41 | if file.endswith('.parquet'): 42 | # https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese 43 | # 获取当前文件的绝对路径 44 | file_path = os.path.join(subdir, file) 45 | print(file_path) 46 | # 读取jsonl文件 47 | df = pd.read_parquet(file_path) 48 | 49 | for idx, row in tqdm(df.iterrows(), total=len(df)): 50 | prompt_text = row['prompt'] 51 | chosen_text = row['chosen'] 52 | rejected_text = row['rejected'] 53 | 54 | data_dict = { 55 | "prompt": prompt_text, 56 | "chosen": chosen_text, 57 | "rejected": rejected_text 58 | } 59 | 60 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 61 | total_lines.append(processed_line) 62 | 63 | if file.endswith('.tsv'): 64 | # https://huggingface.co/datasets/liyucheng/zhihu_rlhf_3k 65 | # 获取当前文件的绝对路径 66 | file_path = os.path.join(subdir, file) 67 | print(file_path) 68 | # 读取jsonl文件 69 | df = pd.read_csv(file_path, sep='\t') 70 | 71 | for idx, row in tqdm(df.iterrows(), total=len(df)): 72 | prompt_text = row['prompt'] 73 | chosen_text = row['chosen'] 74 | rejected_text = row['rejected'] 75 | 76 | data_dict = { 77 | "prompt": prompt_text, 78 | "chosen": chosen_text, 79 | "rejected": rejected_text 80 | } 81 | 82 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n' 83 | total_lines.append(processed_line) 84 | print("total len: ", 
len(total_lines)) 85 | # 拆分训练集和验证集 86 | # 随机抽取2000条数据 87 | eval_dat = random.sample(total_lines, 2000) 88 | # 剩余的8000条数据 89 | train_data = [item for item in total_lines if item not in eval_dat] 90 | # assert len(eval_dat) + len(train_data) == len(total_lines) 91 | print("eval len: ", len(eval_dat)) 92 | print("train len: ", len(train_data)) 93 | 94 | # 保存 95 | # 如果输出子文件夹不存在,则创建它 96 | output_subfolder = "data/rl_train" 97 | if not os.path.exists(output_subfolder): 98 | os.makedirs(output_subfolder) 99 | 100 | # 保存处理后的csv文件到对应的输出子文件夹 101 | eval_file_path = os.path.join(output_subfolder, "rl_eval_data.jsonl") 102 | train_file_path = os.path.join(output_subfolder, "rl_train_data.jsonl") 103 | # 将处理后的json对象写入新的jsonl文件 104 | with open(eval_file_path, 'w') as outfile: 105 | for line in eval_dat: 106 | outfile.write(line) 107 | with open(train_file_path, 'w') as outfile: 108 | for line in train_data: 109 | outfile.write(line) 110 | 111 | 112 | if __name__=="__main__": 113 | merge_datsets("corpus/rm_train") 114 | 115 | -------------------------------------------------------------------------------- /quantize/gptq_quantize.py: -------------------------------------------------------------------------------- 1 | """ 2 | 量化问题: 3 | https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/discussions/5 4 | https://github.com/AutoGPTQ/AutoGPTQ/issues/657 5 | """ 6 | import os 7 | import torch 8 | from dataclasses import dataclass, field 9 | from typing import Dict, Optional 10 | 11 | from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig 12 | from transformers import AutoTokenizer, HfArgumentParser 13 | 14 | # Assuming 'read_jsonl_file' is a function defined in 'utilis' module 15 | from utilis import read_jsonl_file 16 | 17 | import logging 18 | 19 | @dataclass 20 | class ScriptArguments: 21 | """ 22 | The arguments for the DPO training script. 
23 | """ 24 | # Basic settings 25 | model_id: Optional[str] = field(default="", metadata={"help": "The location of the SFT model name or path."}) 26 | # {"input": question, "target": answer} 27 | dataset_dir_or_path: Optional[str] = field(default="", metadata={"help": "The location of the dataset directory or path."}) 28 | quant_output_dir: Optional[str] = field(default="./results", metadata={"help": "The output directory for the quantized model."}) 29 | ngpus: Optional[int] = field(default=1, metadata={"help": "Number of GPUs for quantization."}) 30 | gpu_max_memory: Optional[int] = field(default=20, metadata={"help": "Max memory per GPU for quantization (in GB)."}) 31 | 32 | # GPTQ parameters 33 | bits: Optional[int] = field(default=4, metadata={"help": "Quantization bits (4 or 8)."}) 34 | group_size: Optional[int] = field(default=128, metadata={"help": "Group size for quantization (32, 64, 128)."}) 35 | damp_percent: Optional[float] = field(default=0.1, metadata={"help": "Damping percentage for quantization (0.1, 0.01)."}) 36 | desc_act: Optional[bool] = field(default=False, metadata={"help": "Whether to use descending activation (False speeds up inference but may affect perplexity)."}) 37 | static_groups: Optional[bool] = field(default=False, metadata={"help": "Whether to use static groups for quantization."}) 38 | sym: Optional[bool] = field(default=True, metadata={"help": "Whether to use symmetric quantization."}) 39 | true_sequential: Optional[bool] = field(default=True, metadata={"help": "Whether to use true sequential quantization."}) 40 | 41 | # Training parameters 42 | max_len: Optional[int] = field(default=8192, metadata={"help": "Maximum length of input data."}) 43 | batch_size: Optional[int] = field(default=1, metadata={"help": "Batch size for quantization training."}) 44 | cache_examples_on_gpu: Optional[bool] = field(default=False, metadata={"help": "Whether to cache examples on GPU during quantization."}) 45 | use_triton: Optional[bool] = field(default=False, metadata={"help": "Whether to use Triton for quantization."}) 46 | 47 | def data_process(data_list, max_len, tokenizer: AutoTokenizer): 48 | def qwen_process(item): 49 | input_text = item["input"] 50 | target_text = item["target"] 51 | messages = [ 52 | {"role": "system", "content": "You are a helpful assistant."}, 53 | {"role": "user", "content": input_text}, 54 | {"role": "assistant", "content": target_text} 55 | ] 56 | text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) 57 | model_inputs = tokenizer(text, truncation=True, padding='max_length', max_length=max_len) 58 | input_ids = torch.tensor(model_inputs['input_ids'], dtype=torch.long) 59 | attention_mask = torch.tensor(model_inputs['attention_mask'], dtype=torch.long) 60 | return { 61 | "input_ids": input_ids, 62 | "attention_mask": attention_mask 63 | } 64 | 65 | return [qwen_process(item) for item in data_list] 66 | 67 | def main(): 68 | parser = HfArgumentParser(ScriptArguments) 69 | script_args = parser.parse_args_into_dataclasses()[0] 70 | 71 | logging.basicConfig( 72 | format="%(asctime)s %(levelname)s [%(name)s] %(message)s", 73 | level=logging.INFO, 74 | datefmt="%Y-%m-%d %H:%M:%S" 75 | ) 76 | 77 | quantize_config = BaseQuantizeConfig( 78 | bits=script_args.bits, # 4 or 8 79 | group_size=script_args.group_size, 80 | damp_percent=script_args.damp_percent, 81 | desc_act=script_args.desc_act, # set to False can significantly speed up inference but the perplexity may slightly bad 82 | 
static_groups=script_args.static_groups, 83 | sym=script_args.sym, 84 | true_sequential=script_args.true_sequential 85 | ) 86 | 87 | tokenizer = AutoTokenizer.from_pretrained( 88 | script_args.model_id, 89 | trust_remote_code=True 90 | ) 91 | 92 | model = AutoGPTQForCausalLM.from_pretrained( 93 | script_args.model_id, 94 | quantize_config, 95 | max_memory={i: f"{script_args.gpu_max_memory}GB" for i in range(script_args.ngpus)} 96 | ) 97 | 98 | data_list = read_jsonl_file(script_args.dataset_dir_or_path) 99 | quant_data = data_process( 100 | data_list, 101 | script_args.max_len, 102 | tokenizer 103 | ) 104 | 105 | model.quantize( 106 | quant_data, 107 | cache_examples_on_gpu=script_args.cache_examples_on_gpu, 108 | batch_size=script_args.batch_size, 109 | use_triton=script_args.use_triton 110 | ) 111 | 112 | model.save_quantized(script_args.quant_output_dir, use_safetensors=True) 113 | tokenizer.save_pretrained(script_args.quant_output_dir) 114 | 115 | if __name__ == "__main__": 116 | main() -------------------------------------------------------------------------------- /script/ptm_demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | 5 | # export CUDA_VISIBLE_DEVICES="4,5,6,7" 6 | 7 | source /home/.bashrc 8 | source /home/miniconda3/etc/profile.d/conda.sh 9 | conda activate md_llm 10 | which python 11 | 12 | function killall { 13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'` 14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9 15 | } 16 | 17 | WORK_DIR="/personal/tiny-llm-zh" 18 | cd ${WORK_DIR} 19 | 20 | 21 | # 常见参数 22 | N_NODES=1 23 | N_GPUS=8 24 | MBS=32 # 单卡bs 25 | GAS=1 # 梯度累积 26 | GRAD_CLIP=1 # 梯度裁剪 27 | RANK=0 28 | MASTER_ADDR=`hostname -i` 29 | MASTER_PORT=9902 30 | 31 | LR=3e-4 # 初始学习率 32 | LR_SCHEDULER_TYPE="cosine" 33 | WARMUP_RATION=0.05 34 | 35 | TRAIN_EPOCHS=5 # 训练轮次 36 | LOGGING_STEPS=100 # 记录日志步数 37 | CKPT_SAVE_STEPS=10000 # ckpt保存步数 38 | 39 | SEED=12 40 | DS_DTYPE="fp16" # [fp16, bf16] 41 | RESUME="False" 42 | 43 | # 数据 44 | MODE="ptm" # [ptm, sft, rm, rl] 45 | DATASET_DIR_OR_PATH="data/pre_train" 46 | BASE_MODEL_PATH="test" 47 | 48 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m] 49 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}" 50 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 51 | mkdir -p $OUTPUT_DIR 52 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log" 53 | # tensorboard输出路径 54 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 55 | mkdir -p $TB_DIR 56 | 57 | TRAIN_ARGS="" 58 | 59 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json 60 | ZERO_STAGE=2 61 | 62 | if [ $DS_DTYPE = "fp16" ];then 63 | TRAIN_ARGS+=" \ 64 | --fp16 \ 65 | " 66 | DS_FP16=true 67 | DS_BF16=false 68 | GAS_DTYPE=$DS_DTYPE 69 | elif [ $DS_DTYPE = "bf16" ];then 70 | TRAIN_ARGS+=" \ 71 | --bf16 \ 72 | --embedding-weights-in-fp32 \ 73 | " 74 | DS_FP16=false 75 | DS_BF16=true 76 | GAS_DTYPE="fp32" 77 | 78 | fi 79 | 80 | cat < $DS_CONFIG_JSON 81 | { 82 | "train_micro_batch_size_per_gpu": $MBS, 83 | "train_batch_size": "auto", 84 | "gradient_clipping": ${GRAD_CLIP}, 85 | "zero_optimization": { 86 | "stage": $ZERO_STAGE 87 | }, 88 | "bf16": { 89 | "enabled": ${DS_BF16} 90 | }, 91 | "data_types": { 92 | "grad_accum_dtype": "${GAS_DTYPE}" 93 | }, 94 | "fp16": { 95 | "enabled": ${DS_FP16}, 96 | "loss_scale": 0, 97 | "loss_scale_window": 200, 98 | "hysteresis": 5, 99 | "min_loss_scale": 1, 100 | "initial_scale_power": 12 101 | }, 102 | "steps_per_print": 10, 103 
| "wall_clock_breakdown": true, 104 | "comms_logger": { 105 | "enabled": true, 106 | "verbose": false, 107 | "prof_all": false, 108 | "debug": false 109 | }, 110 | "flops_profiler": { 111 | "enabled": false, 112 | "profile_step": 30, 113 | "module_depth": -1, 114 | "top_modules": 1, 115 | "detailed": true, 116 | "output_file": null 117 | } 118 | } 119 | EOT 120 | 121 | 122 | TRAIN_ARGS+=" \ 123 | --seed ${SEED} \ 124 | --output_dir ${OUTPUT_DIR} \ 125 | --overwrite_output_dir \ 126 | --deepspeed ${DS_CONFIG_JSON} \ 127 | --per_device_train_batch_size ${MBS} \ 128 | --gradient_accumulation_steps ${GAS} \ 129 | --do_train \ 130 | --num_train_epochs ${TRAIN_EPOCHS} \ 131 | --logging_dir ${TB_DIR} \ 132 | --logging_strategy steps \ 133 | --logging_steps ${LOGGING_STEPS} \ 134 | --weight_decay 0.01 \ 135 | --adam_beta1 0.9 \ 136 | --adam_beta1 0.95 \ 137 | --max_grad_norm ${GRAD_CLIP} \ 138 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \ 139 | --learning_rate ${LR} \ 140 | --warmup_ratio ${WARMUP_RATION} \ 141 | --weight_decay 0.01 \ 142 | --save_strategy steps \ 143 | --save_total_limit 3 \ 144 | --save_steps ${CKPT_SAVE_STEPS} \ 145 | --ddp_timeout 30000 \ 146 | --logging_first_step True \ 147 | --save_safetensors False \ 148 | --ddp_find_unused_parameters False \ 149 | " 150 | 151 | if [[ $MODEL_SIZE == "16m" ]];then 152 | HIDDEN_SIZE=120 153 | NUM_HIDDEN_LAYERS=6 154 | NUM_ATTENTION_HEADS=6 155 | INTERMEDIATE_SIZE=384 156 | ROPE_THETA=10000.0 157 | MAX_POSITION_EMBEDDINGS=512 158 | VOCAB_SIZE=64798 159 | elif [[ $MODEL_SIZE == "42m" ]];then 160 | HIDDEN_SIZE=288 161 | NUM_HIDDEN_LAYERS=6 162 | NUM_ATTENTION_HEADS=6 163 | INTERMEDIATE_SIZE=768 164 | ROPE_THETA=10000.0 165 | MAX_POSITION_EMBEDDINGS=512 166 | VOCAB_SIZE=64798 167 | elif [[ $MODEL_SIZE == "92m" ]];then 168 | HIDDEN_SIZE=512 169 | NUM_HIDDEN_LAYERS=8 170 | NUM_ATTENTION_HEADS=8 171 | INTERMEDIATE_SIZE=1408 172 | ROPE_THETA=10000.0 173 | MAX_POSITION_EMBEDDINGS=1024 174 | VOCAB_SIZE=64798 175 | elif [[ $MODEL_SIZE == "210m" ]];then 176 | HIDDEN_SIZE=768 177 | NUM_HIDDEN_LAYERS=16 178 | NUM_ATTENTION_HEADS=12 179 | INTERMEDIATE_SIZE=2048 180 | ROPE_THETA=10000.0 181 | MAX_POSITION_EMBEDDINGS=1024 182 | VOCAB_SIZE=64798 183 | elif [[ $MODEL_SIZE == "440m" ]];then 184 | HIDDEN_SIZE=1024 185 | NUM_HIDDEN_LAYERS=24 186 | NUM_ATTENTION_HEADS=16 187 | INTERMEDIATE_SIZE=2816 188 | ROPE_THETA=10000.0 189 | MAX_POSITION_EMBEDDINGS=1024 190 | VOCAB_SIZE=64798 191 | fi 192 | 193 | GPT_ARGS=" \ 194 | --hidden_size ${HIDDEN_SIZE} \ 195 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \ 196 | --num_attention_heads ${NUM_ATTENTION_HEADS} \ 197 | --intermediate_size ${INTERMEDIATE_SIZE} \ 198 | --rope_theta ${ROPE_THETA} \ 199 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \ 200 | --vocab_size ${VOCAB_SIZE} \ 201 | " 202 | SCRIPT_ARGS=" \ 203 | --mode ${MODE} \ 204 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \ 205 | --resume ${RESUME} \ 206 | --base_model_path ${BASE_MODEL_PATH} \ 207 | " 208 | 209 | DISTRIBUTED_ARGS=" \ 210 | --nnodes $N_NODES \ 211 | --nproc_per_node $N_GPUS \ 212 | " 213 | 214 | # 检查num是否大于1 215 | if [ "$N_NODES" -ge 2 ]; then 216 | DISTRIBUTED_ARGS+=" \ 217 | --node_rank $RANK \ 218 | --master_addr $MASTER_ADDR \ 219 | --master_port $MASTER_PORT \ 220 | " 221 | fi 222 | 223 | # 所有参数 224 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS " 225 | 226 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/ptm_train.py " 227 | 228 | export CMD="$LAUNCHER $ALL_ARGS" 229 | echo $CMD 230 | 231 | killall ptm_train.py 232 | 233 | # 执行训练 234 | 
$CMD 2>&1 | tee ${TRAIN_LOG} 235 | 236 | killall ptm_train.py 237 | 238 | echo "train end : ${OUTPUT_DIR}" 239 | 240 | 241 | -------------------------------------------------------------------------------- /script/sft_demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | 5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" 6 | 7 | source /home/.bashrc 8 | source /home/miniconda3/etc/profile.d/conda.sh 9 | conda activate md_llm 10 | which python 11 | 12 | function killall { 13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'` 14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9 15 | } 16 | 17 | WORK_DIR="/personal/tiny-llm-zh" 18 | cd ${WORK_DIR} 19 | 20 | # 常见参数 21 | N_NODES=1 22 | N_GPUS=8 23 | MBS=32 # 单卡bs 24 | GAS=1 # 梯度累积 25 | GRAD_CLIP=1 # 梯度裁剪 26 | RANK=0 27 | MASTER_ADDR=`hostname -i` 28 | MASTER_PORT=9902 29 | 30 | LR=3e-4 # 初始学习率 31 | LR_SCHEDULER_TYPE="cosine" 32 | WARMUP_RATION=0.03 33 | 34 | TRAIN_EPOCHS=5 # 训练轮次 35 | LOGGING_STEPS=100 # 记录日志步数 36 | CKPT_SAVE_STEPS=15000 # ckpt保存步数 37 | 38 | SEED=12 39 | DS_DTYPE="fp16" # [fp16, bf16] 40 | RESUME="False" 41 | 42 | # 数据 43 | MODE="sft" # [ptm, sft, rm, rl] 44 | DATASET_DIR_OR_PATH="data/sft_train/sft_data.jsonl" 45 | BASE_MODEL_PATH="outputs/ckpt/ptm_tiny_llm_92m_epoch5/last_ptm_model" 46 | 47 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m] 48 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}" 49 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 50 | mkdir -p $OUTPUT_DIR 51 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log" 52 | # tensorboard输出路径 53 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 54 | mkdir -p $TB_DIR 55 | 56 | TRAIN_ARGS="" 57 | 58 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json 59 | ZERO_STAGE=2 60 | 61 | if [ $DS_DTYPE = "fp16" ];then 62 | TRAIN_ARGS+=" \ 63 | --fp16 \ 64 | " 65 | DS_FP16=true 66 | DS_BF16=false 67 | GAS_DTYPE=$DS_DTYPE 68 | elif [ $DS_DTYPE = "bf16" ];then 69 | TRAIN_ARGS+=" \ 70 | --bf16 \ 71 | " 72 | DS_FP16=false 73 | DS_BF16=true 74 | GAS_DTYPE="fp32" 75 | 76 | fi 77 | 78 | cat < $DS_CONFIG_JSON 79 | { 80 | "train_micro_batch_size_per_gpu": $MBS, 81 | "train_batch_size": "auto", 82 | "gradient_clipping": ${GRAD_CLIP}, 83 | "zero_optimization": { 84 | "stage": $ZERO_STAGE 85 | }, 86 | "bf16": { 87 | "enabled": ${DS_BF16} 88 | }, 89 | "data_types": { 90 | "grad_accum_dtype": "${GAS_DTYPE}" 91 | }, 92 | "fp16": { 93 | "enabled": ${DS_FP16}, 94 | "loss_scale": 0, 95 | "loss_scale_window": 200, 96 | "hysteresis": 5, 97 | "min_loss_scale": 1, 98 | "initial_scale_power": 12 99 | }, 100 | "steps_per_print": 10, 101 | "wall_clock_breakdown": true, 102 | "comms_logger": { 103 | "enabled": true, 104 | "verbose": false, 105 | "prof_all": false, 106 | "debug": false 107 | }, 108 | "flops_profiler": { 109 | "enabled": false, 110 | "profile_step": 30, 111 | "module_depth": -1, 112 | "top_modules": 1, 113 | "detailed": true, 114 | "output_file": null 115 | } 116 | } 117 | EOT 118 | 119 | 120 | TRAIN_ARGS+=" \ 121 | --seed ${SEED} \ 122 | --output_dir ${OUTPUT_DIR} \ 123 | --overwrite_output_dir \ 124 | --deepspeed ${DS_CONFIG_JSON} \ 125 | --per_device_train_batch_size ${MBS} \ 126 | --gradient_accumulation_steps ${GAS} \ 127 | --do_train \ 128 | --num_train_epochs ${TRAIN_EPOCHS} \ 129 | --logging_dir ${TB_DIR} \ 130 | --logging_strategy steps \ 131 | --logging_steps ${LOGGING_STEPS} \ 132 | --weight_decay 0.01 \ 133 | --adam_beta1 0.9 \ 134 | --adam_beta1 
0.95 \ 135 | --max_grad_norm ${GRAD_CLIP} \ 136 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \ 137 | --learning_rate ${LR} \ 138 | --warmup_ratio ${WARMUP_RATION} \ 139 | --weight_decay 0.01 \ 140 | --save_strategy steps \ 141 | --save_total_limit 3 \ 142 | --save_steps ${CKPT_SAVE_STEPS} \ 143 | --ddp_timeout 30000 \ 144 | --logging_first_step True \ 145 | --save_safetensors False \ 146 | --ddp_find_unused_parameters False \ 147 | " 148 | 149 | if [[ $MODEL_SIZE == "16m" ]];then 150 | HIDDEN_SIZE=120 151 | NUM_HIDDEN_LAYERS=6 152 | NUM_ATTENTION_HEADS=6 153 | INTERMEDIATE_SIZE=384 154 | ROPE_THETA=10000.0 155 | MAX_POSITION_EMBEDDINGS=512 156 | VOCAB_SIZE=64798 157 | elif [[ $MODEL_SIZE == "42m" ]];then 158 | HIDDEN_SIZE=288 159 | NUM_HIDDEN_LAYERS=6 160 | NUM_ATTENTION_HEADS=6 161 | INTERMEDIATE_SIZE=768 162 | ROPE_THETA=10000.0 163 | MAX_POSITION_EMBEDDINGS=512 164 | VOCAB_SIZE=64798 165 | elif [[ $MODEL_SIZE == "92m" ]];then 166 | HIDDEN_SIZE=512 167 | NUM_HIDDEN_LAYERS=8 168 | NUM_ATTENTION_HEADS=8 169 | INTERMEDIATE_SIZE=1408 170 | ROPE_THETA=10000.0 171 | MAX_POSITION_EMBEDDINGS=1024 172 | VOCAB_SIZE=64798 173 | elif [[ $MODEL_SIZE == "210m" ]];then 174 | HIDDEN_SIZE=768 175 | NUM_HIDDEN_LAYERS=16 176 | NUM_ATTENTION_HEADS=12 177 | INTERMEDIATE_SIZE=2048 178 | ROPE_THETA=10000.0 179 | MAX_POSITION_EMBEDDINGS=1024 180 | VOCAB_SIZE=64798 181 | elif [[ $MODEL_SIZE == "440m" ]];then 182 | HIDDEN_SIZE=1024 183 | NUM_HIDDEN_LAYERS=24 184 | NUM_ATTENTION_HEADS=16 185 | INTERMEDIATE_SIZE=2816 186 | ROPE_THETA=10000.0 187 | MAX_POSITION_EMBEDDINGS=1024 188 | VOCAB_SIZE=64798 189 | fi 190 | 191 | GPT_ARGS=" \ 192 | --hidden_size ${HIDDEN_SIZE} \ 193 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \ 194 | --num_attention_heads ${NUM_ATTENTION_HEADS} \ 195 | --intermediate_size ${INTERMEDIATE_SIZE} \ 196 | --rope_theta ${ROPE_THETA} \ 197 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \ 198 | --vocab_size ${VOCAB_SIZE} \ 199 | " 200 | SCRIPT_ARGS=" \ 201 | --mode ${MODE} \ 202 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \ 203 | --resume ${RESUME} \ 204 | --base_model_path ${BASE_MODEL_PATH} \ 205 | " 206 | 207 | DISTRIBUTED_ARGS=" \ 208 | --nnodes $N_NODES \ 209 | --nproc_per_node $N_GPUS \ 210 | " 211 | 212 | # 检查num是否大于1 213 | if [ "$N_NODES" -ge 2 ]; then 214 | DISTRIBUTED_ARGS+=" \ 215 | --node_rank $RANK \ 216 | --master_addr $MASTER_ADDR \ 217 | --master_port $MASTER_PORT \ 218 | " 219 | fi 220 | 221 | # 所有参数 222 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS " 223 | 224 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/sft_train.py " 225 | 226 | export CMD="$LAUNCHER $ALL_ARGS" 227 | echo $CMD 228 | 229 | killall sft_train.py 230 | 231 | # 执行训练 232 | $CMD 2>&1 | tee ${TRAIN_LOG} 233 | 234 | killall sft_train.py 235 | 236 | echo "train end : ${OUTPUT_DIR}" 237 | -------------------------------------------------------------------------------- /script/dpo_demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | 5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" 6 | 7 | source /home/.bashrc 8 | source /home/miniconda3/etc/profile.d/conda.sh 9 | conda activate md_llm 10 | which python 11 | 12 | function killall { 13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'` 14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9 15 | } 16 | 17 | WORK_DIR="/personal/tiny-llm-zh" 18 | cd ${WORK_DIR} 19 | 20 | # 常见参数 21 | N_NODES=1 22 | N_GPUS=8 23 | MBS=32 # 单卡bs 24 | GAS=1 # 梯度累积 25 | GRAD_CLIP=1 # 梯度裁剪 26 | 
RANK=0 27 | MASTER_ADDR=`hostname -i` 28 | MASTER_PORT=9902 29 | 30 | LR=3e-4 # 初始学习率 31 | LR_SCHEDULER_TYPE="cosine" 32 | WARMUP_RATION=0.03 33 | 34 | TRAIN_EPOCHS=5 # 训练轮次 35 | LOGGING_STEPS=100 # 记录日志步数 36 | CKPT_SAVE_STEPS=15000 # ckpt保存步数 37 | 38 | SEED=12 39 | DS_DTYPE="fp16" # [fp16, bf16] 40 | RESUME="False" 41 | 42 | # 数据 43 | MODE="rl" # [ptm, sft, rm, rl] 44 | DATASET_DIR_OR_PATH="data/rl_train/rl_train_data.jsonl" 45 | BASE_MODEL_PATH="outputs/ckpt/sft_tiny_llm_92m_epoch5/last_sft_model" 46 | 47 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m] 48 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}" 49 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 50 | mkdir -p $OUTPUT_DIR 51 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log" 52 | # tensorboard输出路径 53 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 54 | mkdir -p $TB_DIR 55 | 56 | TRAIN_ARGS="" 57 | 58 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json 59 | ZERO_STAGE=2 60 | 61 | if [ $DS_DTYPE = "fp16" ];then 62 | TRAIN_ARGS+=" \ 63 | --fp16 \ 64 | " 65 | DS_FP16=true 66 | DS_BF16=false 67 | GAS_DTYPE=$DS_DTYPE 68 | elif [ $DS_DTYPE = "bf16" ];then 69 | TRAIN_ARGS+=" \ 70 | --bf16 \ 71 | " 72 | DS_FP16=false 73 | DS_BF16=true 74 | GAS_DTYPE="fp32" 75 | 76 | fi 77 | 78 | cat < $DS_CONFIG_JSON 79 | { 80 | "train_micro_batch_size_per_gpu": $MBS, 81 | "train_batch_size": "auto", 82 | "gradient_clipping": ${GRAD_CLIP}, 83 | "zero_optimization": { 84 | "stage": $ZERO_STAGE 85 | }, 86 | "bf16": { 87 | "enabled": ${DS_BF16} 88 | }, 89 | "data_types": { 90 | "grad_accum_dtype": "${GAS_DTYPE}" 91 | }, 92 | "fp16": { 93 | "enabled": ${DS_FP16}, 94 | "loss_scale": 0, 95 | "loss_scale_window": 200, 96 | "hysteresis": 5, 97 | "min_loss_scale": 1, 98 | "initial_scale_power": 12 99 | }, 100 | "steps_per_print": 10, 101 | "wall_clock_breakdown": true, 102 | "comms_logger": { 103 | "enabled": true, 104 | "verbose": false, 105 | "prof_all": false, 106 | "debug": false 107 | }, 108 | "flops_profiler": { 109 | "enabled": false, 110 | "profile_step": 30, 111 | "module_depth": -1, 112 | "top_modules": 1, 113 | "detailed": true, 114 | "output_file": null 115 | } 116 | } 117 | EOT 118 | 119 | 120 | TRAIN_ARGS+=" \ 121 | --seed ${SEED} \ 122 | --output_dir ${OUTPUT_DIR} \ 123 | --overwrite_output_dir \ 124 | --deepspeed ${DS_CONFIG_JSON} \ 125 | --per_device_train_batch_size ${MBS} \ 126 | --gradient_accumulation_steps ${GAS} \ 127 | --do_train \ 128 | --num_train_epochs ${TRAIN_EPOCHS} \ 129 | --logging_dir ${TB_DIR} \ 130 | --logging_strategy steps \ 131 | --logging_steps ${LOGGING_STEPS} \ 132 | --weight_decay 0.01 \ 133 | --adam_beta1 0.9 \ 134 | --adam_beta1 0.95 \ 135 | --max_grad_norm ${GRAD_CLIP} \ 136 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \ 137 | --learning_rate ${LR} \ 138 | --warmup_ratio ${WARMUP_RATION} \ 139 | --weight_decay 0.01 \ 140 | --save_strategy steps \ 141 | --save_total_limit 3 \ 142 | --save_steps ${CKPT_SAVE_STEPS} \ 143 | --ddp_timeout 30000 \ 144 | --logging_first_step True \ 145 | --save_safetensors False \ 146 | --ddp_find_unused_parameters False \ 147 | " 148 | 149 | if [[ $MODEL_SIZE == "16m" ]];then 150 | HIDDEN_SIZE=120 151 | NUM_HIDDEN_LAYERS=6 152 | NUM_ATTENTION_HEADS=6 153 | INTERMEDIATE_SIZE=384 154 | ROPE_THETA=10000.0 155 | MAX_POSITION_EMBEDDINGS=512 156 | VOCAB_SIZE=64798 157 | elif [[ $MODEL_SIZE == "42m" ]];then 158 | HIDDEN_SIZE=288 159 | NUM_HIDDEN_LAYERS=6 160 | NUM_ATTENTION_HEADS=6 161 | INTERMEDIATE_SIZE=768 162 | ROPE_THETA=10000.0 163 | 
MAX_POSITION_EMBEDDINGS=512 164 | VOCAB_SIZE=64798 165 | elif [[ $MODEL_SIZE == "92m" ]];then 166 | HIDDEN_SIZE=512 167 | NUM_HIDDEN_LAYERS=8 168 | NUM_ATTENTION_HEADS=8 169 | INTERMEDIATE_SIZE=1408 170 | ROPE_THETA=10000.0 171 | MAX_POSITION_EMBEDDINGS=1024 172 | VOCAB_SIZE=64798 173 | elif [[ $MODEL_SIZE == "210m" ]];then 174 | HIDDEN_SIZE=768 175 | NUM_HIDDEN_LAYERS=16 176 | NUM_ATTENTION_HEADS=12 177 | INTERMEDIATE_SIZE=2048 178 | ROPE_THETA=10000.0 179 | MAX_POSITION_EMBEDDINGS=1024 180 | VOCAB_SIZE=64798 181 | elif [[ $MODEL_SIZE == "440m" ]];then 182 | HIDDEN_SIZE=1024 183 | NUM_HIDDEN_LAYERS=24 184 | NUM_ATTENTION_HEADS=16 185 | INTERMEDIATE_SIZE=2816 186 | ROPE_THETA=10000.0 187 | MAX_POSITION_EMBEDDINGS=1024 188 | VOCAB_SIZE=64798 189 | fi 190 | 191 | GPT_ARGS=" \ 192 | --hidden_size ${HIDDEN_SIZE} \ 193 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \ 194 | --num_attention_heads ${NUM_ATTENTION_HEADS} \ 195 | --intermediate_size ${INTERMEDIATE_SIZE} \ 196 | --rope_theta ${ROPE_THETA} \ 197 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \ 198 | --vocab_size ${VOCAB_SIZE} \ 199 | " 200 | SCRIPT_ARGS=" \ 201 | --mode ${MODE} \ 202 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \ 203 | --resume ${RESUME} \ 204 | --base_model_path ${BASE_MODEL_PATH} \ 205 | " 206 | 207 | DISTRIBUTED_ARGS=" \ 208 | --nnodes $N_NODES \ 209 | --nproc_per_node $N_GPUS \ 210 | " 211 | 212 | # 检查num是否大于1 213 | if [ "$N_NODES" -ge 2 ]; then 214 | DISTRIBUTED_ARGS+=" \ 215 | --node_rank $RANK \ 216 | --master_addr $MASTER_ADDR \ 217 | --master_port $MASTER_PORT \ 218 | " 219 | fi 220 | 221 | # 所有参数 222 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS " 223 | 224 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/dpo_train.py " 225 | 226 | export CMD="$LAUNCHER $ALL_ARGS" 227 | echo $CMD 228 | 229 | killall dpo_train.py 230 | 231 | # 执行训练 232 | $CMD 2>&1 | tee ${TRAIN_LOG} 233 | 234 | killall dpo_train.py 235 | 236 | echo "train end : ${OUTPUT_DIR}" 237 | -------------------------------------------------------------------------------- /train/sft_train.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import numpy as np 3 | import os 4 | import sys 5 | from dataclasses import dataclass, field 6 | from typing import Optional, List, Dict, Any, Mapping 7 | import datasets 8 | import torch 9 | import torch.nn as nn 10 | import transformers 11 | from transformers import ( 12 | CONFIG_MAPPING, 13 | MODEL_FOR_CAUSAL_LM_MAPPING, 14 | AutoConfig, 15 | AutoModelForCausalLM, 16 | HfArgumentParser, 17 | Trainer, 18 | TrainingArguments, 19 | is_torch_tpu_available, 20 | set_seed, 21 | ) 22 | from transformers.utils.versions import require_version 23 | from sklearn.metrics import accuracy_score 24 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR 25 | 26 | from configuration_tinyllm import TinyllmConfig 27 | from modeling_tinyllm import TinyllmForCausalLM 28 | from tinyllm_dataset import SFTDataset 29 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer 30 | 31 | MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys()) 32 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) 33 | 34 | @dataclass 35 | class ModelArguments: 36 | """ 模型相关参数 37 | """ 38 | hidden_size : Optional[int] = field( 39 | default=512, 40 | metadata={"help": "hidden_size"} 41 | ) 42 | 43 | num_hidden_layers : Optional[int] = field( 44 | default=8, 45 | metadata={"help": "num_hidden_layers"} 46 | ) 47 | 48 | num_attention_heads : 
Optional[int] = field( 49 | default=8, 50 | metadata={"help": "transformer num_attention_heads"} 51 | ) 52 | 53 | intermediate_size : Optional[int] = field( 54 | default=1408, 55 | metadata={"help": "intermediate_size"} 56 | ) 57 | 58 | rope_theta : Optional[float] = field( 59 | default=10000.0, 60 | metadata={"help": "rope_theta"} 61 | ) 62 | 63 | max_position_embeddings : Optional[int] = field( 64 | default=1024, 65 | metadata={"help": "max_position_embeddings"} 66 | ) 67 | 68 | vocab_size : Optional[int] = field( 69 | default=64798, 70 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"} 71 | ) 72 | 73 | @dataclass 74 | class ScriptArguments: 75 | """ 其他相关参数 76 | """ 77 | mode : Optional[str] = field( 78 | default="ptm", 79 | metadata={"help": "save pretrain *bin file dir"} 80 | ) 81 | 82 | dataset_dir_or_path : Optional[str] = field( 83 | default="data/pre_train", 84 | metadata={"help": "save pretrain file dir"} 85 | ) 86 | 87 | resume : Optional[bool] = field( 88 | default=False, 89 | metadata={"help": "use PyTorch 2.0 to compile the model to be faster"} 90 | ) 91 | 92 | base_model_path : Optional[str] = field( 93 | default=" ", 94 | metadata={"help": "SFT train, the base model path"} 95 | ) 96 | 97 | def data_collator_fn(examples): 98 | # 将所有样本的输入 (`X`) 和标签 (`Y`) 分别堆叠 99 | input_ids = torch.stack([example[0] for example in examples]) 100 | labels = torch.stack([example[1] for example in examples]) 101 | 102 | # 返回一个字典,包含模型需要的键和值 103 | data_dict = { 104 | "input_ids": input_ids, 105 | "labels": labels 106 | } 107 | return data_dict 108 | 109 | logger = logging.getLogger(__name__) 110 | 111 | def main(): 112 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments)) 113 | model_args, script_args, training_args = parser.parse_args_into_dataclasses() 114 | 115 | # logger format 116 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S", 117 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN, 118 | handlers = [logging.StreamHandler(sys.stdout)],) 119 | if training_args.should_log: 120 | # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
121 | transformers.utils.logging.set_verbosity_info() 122 | 123 | log_level = training_args.get_process_log_level() 124 | logger.setLevel(log_level) 125 | datasets.utils.logging.set_verbosity(log_level) 126 | transformers.utils.logging.set_verbosity(log_level) 127 | transformers.utils.logging.enable_default_handler() 128 | transformers.utils.logging.enable_explicit_format() 129 | 130 | # Log on each process the small summary: 131 | logger.warning( 132 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 133 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 134 | ) 135 | 136 | set_seed(training_args.seed) 137 | 138 | device = "cuda" if torch.cuda.is_available() else "cpu" 139 | 140 | # init model 141 | tokenizer = transformers.AutoTokenizer.from_pretrained( 142 | script_args.base_model_path, 143 | use_fast=False, 144 | trust_remote_code=True, 145 | model_max_length=model_args.max_position_embeddings 146 | ) 147 | 148 | config = transformers.AutoConfig.from_pretrained( 149 | script_args.base_model_path, 150 | trust_remote_code=True 151 | ) 152 | config.use_cache = False 153 | 154 | model = transformers.AutoModelForCausalLM.from_pretrained( 155 | script_args.base_model_path, 156 | config=config, 157 | trust_remote_code=True 158 | ) 159 | 160 | model.to(device) 161 | 162 | ################ 163 | total_params = sum(p.numel() for p in model.parameters()) 164 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 165 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params") 166 | logger.info(f"可训练参数: {trainable_params}") 167 | ############## 168 | 169 | sft_dataset = SFTDataset( 170 | script_args.dataset_dir_or_path, 171 | tokenizer, 172 | model_args.max_position_embeddings 173 | ) 174 | 175 | trainer = Trainer( 176 | model = model, 177 | args = training_args, 178 | train_dataset = sft_dataset, 179 | # eval_dataset = None, 180 | # data_collator = data_collator_fn, 181 | ) 182 | # Training 183 | trainer.train(script_args.resume) 184 | # torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir)) 185 | last_model_dir = os.path.join(training_args.output_dir, 'last_sft_model') 186 | os.makedirs(last_model_dir, exist_ok=True) 187 | tokenizer.save_pretrained(last_model_dir) 188 | # # https://github.com/huggingface/transformers/issues/28630 189 | # model.save_pretrained(last_model_dir, safe_serialization=False) 190 | trainer.save_model(output_dir=last_model_dir) 191 | 192 | if __name__ == "__main__": 193 | main() 194 | 195 | -------------------------------------------------------------------------------- /tokenizer/README.md: -------------------------------------------------------------------------------- 1 | ## Tiny LLM Tokenizer 2 | 3 | ## 1.简介 4 | 5 | 采用扩充 LLaMA2 词表的方式构建 Tiny LLM 词表。 6 | 7 | 由于原版 LLaMA2 对中文的支持非常有限,本项目在原版 LLaMA 的基础上进一步扩充了中文词表。 8 | 9 | 在通用中文语料上训练了基于 sentencepiece 的 20K 中文词表并与原版LLaMA模型的 32K 词表进行合并,排除重复的token后,并添加特殊 token 后,最终得到的最终中文LLaMA词表大小为 49958 10 | 11 | 注意:预训练用的是ChatGLM3的词表,并未使用扩充的词表 12 | 13 | ## 2.词表扩种 14 | 15 | ### 2.1 训练中文分词 16 | 17 | 准备一份中文训练语料保存为按照每一行保存为 `.txt`文件,选用百科的所有语料,大约8G左右语料,存储为txt文本,其中划分句子代码如下: 18 | 19 | ```python 20 | def split_sentences(text): 21 | """ 22 | 分割文本为句子列表 23 | """ 24 | # 正则表达式匹配中英文句子结尾标点 25 | endings_pattern = r'(?", "<|user|>", "<|assistant|>", "<|im_start|>", "<|im_end|>"] 161 | for token in custom_special_tokens: 162 | tokenizer.add_tokens(token) 163 | 164 | 
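    # Adding the five chat control tokens above should bring len(tokenizer) to the 49958
    # mentioned earlier (49953 merged sentencepiece entries plus 5 special tokens). Any model
    # that reuses an existing embedding matrix must have it resized to this vocabulary size
    # as well (presumably the job of expend_embedding.py in this directory).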
tokenizer.save_pretrained(output_hf_dir) 165 | print(f"tinyllm token num: {len(tokenizer)}") 166 | print(f"Tiny LLM tokenizer has been saved to {output_hf_dir}") 167 | 168 | def test_tokenizer(hf_tokenizer_dir): 169 | tinyllm_tokenizer = LlamaTokenizer.from_pretrained(hf_tokenizer_dir) 170 | print("tinyllm tokenizer nums: ", len(tinyllm_tokenizer)) 171 | 172 | sys_text = "你是由wdndev开发的个人助手。" 173 | user_text = "翻译下面的句子为英文:有朋自远方来,不亦乐乎" 174 | answer_text = "It is always a pleasure to greet a friend from afar." 175 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 176 | "<|user|>", user_text.strip(), 177 | "<|assistant|>"]).strip() + "\n" + answer_text.strip() 178 | 179 | print("-----input text: \n", input_txt) 180 | 181 | encode_ids = tinyllm_tokenizer.encode(input_txt, add_special_tokens=False) 182 | print("-----encode ids: \n", encode_ids) 183 | 184 | decode_ids = tinyllm_tokenizer.decode(encode_ids) 185 | print("-----dencode ids: \n", decode_ids) 186 | 187 | 188 | if __name__ == "__main__": 189 | 190 | llama_tokenizer_dir = "input_dir/llama2_tokenizer" 191 | chinese_sp_model_file = "sp_output/chinese_spm_20000.model" 192 | output_hf_dir = "tinyllm_tokenizer_hf" 193 | 194 | merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir) 195 | 196 | test_tokenizer(output_hf_dir) 197 | ``` 198 | 199 | 至此,完成了LLaMa中文词表的扩充,扩充垂直领域词表也是如此,要准备垂直领域的训练语料,最好和通用领域的训练语料混合一下。 200 | 201 | -------------------------------------------------------------------------------- /vllm/README.md: -------------------------------------------------------------------------------- 1 | # Tiny LLM vLLM 模型部署 2 | 3 | ## 1.vLLM 环境 4 | 5 | 注意:测试环境为 vllm=0.4.0 6 | 7 | 如果使用**CUDA 12 以上和PyTorch 2.1 以上**,可以直接使用以下命令安装vLLM。 8 | 9 | ```shell 10 | pip install vllm==0.4.0 11 | ``` 12 | 13 | 否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。 14 | 15 | 安装完成后,还需要以下操作~ 16 | 17 | 1. 把 `vllm/tinyllm.py` 文件复制到env环境对应的 `vllm/model_executor/models` 目录下。 18 | 2. 
然后在 `vllm/model_executor/models/__init__.py` 文件增加一行代码 19 | 20 | ```shell 21 | "TinyllmForCausalLM": ("tinyllm", "TinyllmForCausalLM"), 22 | ``` 23 | 24 | > 由于模型结构是自己定义的,vllm官方未实现,需要自己手动加入 25 | 26 | ## 2.vLLM OpenAI API 接口 27 | 28 | vLLM 部署实现 OpenAI API 协议的服务器非常方便。默认会在 http://localhost:8000 启动服务器。服务器当前一次托管一个模型,并实现列表模型、completions 和 chat completions 端口。 29 | 30 | - completions:是基本的文本生成任务,模型会在给定的提示后生成一段文本。这种类型的任务通常用于生成文章、故事、邮件等。 31 | - chat completions:是面向对话的任务,模型需要理解和生成对话。这种类型的任务通常用于构建聊天机器人或者对话系统。 32 | 33 | 在创建服务器时,可以指定模型名称、模型路径、聊天模板等参数。 34 | 35 | - --host 和 --port 参数指定地址。 36 | - --model 参数指定模型名称。 37 | - --chat-template 参数指定聊天模板。 38 | - --served-model-name 指定服务模型的名称。 39 | - --max-model-len 指定模型的最大长度。 40 | 41 | #### 启动服务 42 | 43 | ```shell 44 | python -m vllm.entrypoints.openai.api_server \ 45 | --served-model-name tinyllm_92m \ 46 | --model wdn/tiny_llm_sft_92m \ 47 | --trust-remote-code \ 48 | --tensor-parallel-size 1 \ 49 | --max-model-len 1024 \ 50 | ``` 51 | #### 查看当前模型列表 52 | 53 | ```shell 54 | curl http://localhost:8000/v1/models 55 | ``` 56 | 57 | 得到的返回值如下所示 58 | 59 | ```json 60 | { 61 | "object": "list", 62 | "data": [ 63 | { 64 | "id": "tinyllm_92m", 65 | "object": "model", 66 | "created": 1717735884, 67 | "owned_by": "vllm", 68 | "root": "tiny_llm_sft_92m", 69 | "parent": null, 70 | "permission": [ 71 | { 72 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18", 73 | "object": "model_permission", 74 | "created": 1720594920, 75 | "allow_create_engine": false, 76 | "allow_sampling": true, 77 | "allow_logprobs": true, 78 | "allow_search_indices": false, 79 | "allow_view": true, 80 | "allow_fine_tuning": false, 81 | "organization": "*", 82 | "group": null, 83 | "is_blocking": false 84 | } 85 | ] 86 | } 87 | ] 88 | } 89 | ``` 90 | #### 测试OpenAI Completions API 91 | 92 | ```shell 93 | curl http://localhost:8000/v1/completions \ 94 | -H "Content-Type: application/json" \ 95 | -d '{ 96 | "model": "tinyllm_92m", 97 | "prompt": "你好", 98 | "max_tokens": 50, 99 | "temperature": 0 100 | }' 101 | ``` 102 | 103 | 得到返回值 104 | 105 | ```json 106 | { 107 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18", 108 | "object": "text_completion", 109 | "created": 1720594920, 110 | "model": "tinyllm_92m", 111 | "choices": [ 112 | { 113 | "index": 0, 114 | "text": "你好,我是TinyLLM,一个由wdndev开发的人工智能助手。我可以回答各种问题、提供信息、执行任务和提供帮助。", 115 | "logprobs": null, 116 | "finish_reason": "length", 117 | "stop_reason": null 118 | } 119 | ], 120 | "usage": { 121 | "prompt_tokens": 1, 122 | "total_tokens": 51, 123 | "completion_tokens": 50 124 | } 125 | } 126 | ``` 127 | 128 | #### 使用Python脚本请求 OpenAI Completions API 129 | 130 | ```python 131 | from openai import OpenAI 132 | client = OpenAI( 133 | base_url="http://localhost:8000/v1", 134 | api_key="sk-xxx", # 随便填写,只是为了通过接口参数校验 135 | ) 136 | 137 | completion = client.chat.completions.create( 138 | model="tinyllm_92m", 139 | messages=[ 140 | {"role": "user", "content": "你好"} 141 | ] 142 | ) 143 | 144 | print(completion.choices[0].message) 145 | ``` 146 | 147 | 返回值 148 | 149 | ```shell 150 | ChatCompletionMessage(content=' 151 | 你好,我是TinyLLM,一个由wdndev开发的人工智能助手。我可以回答各种问题、提供信息、执行任务和提供帮助。', role='assistant', function_call=None, tool_calls=None) 152 | ``` 153 | 154 | #### 使用curl测试 OpenAI Chat Completions API 155 | 156 | ```shell 157 | curl http://localhost:8000/v1/chat/completions \ 158 | -H "Content-Type: application/json" \ 159 | -d '{ 160 | "model": "tinyllm_92m", 161 | "messages": [ 162 | {"role": "system", "content": "You are a helpful assistant."}, 163 | {"role": "user", "content": 
"请介绍一下北京"} 164 | ] 165 | }' 166 | 167 | ``` 168 | 返回结果 169 | ```json 170 | { 171 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18", 172 | "object": "chat.completion", 173 | "created": 1720594920, 174 | "model": "tinyllm_92m", 175 | "choices": [ 176 | { 177 | "index": 0, 178 | "message": { 179 | "role": "assistant", 180 | "content": ":北京是中国的首都,也是中国改革开放的前沿城市之一,也是中国的首都。首都有着丰富的历史和文化底蕴,是中国的重要首都之一。" 181 | }, 182 | "logprobs": null, 183 | "finish_reason": "stop", 184 | "stop_reason": null 185 | } 186 | ], 187 | "usage": { 188 | "prompt_tokens": 24, 189 | "total_tokens": 55, 190 | "completion_tokens": 31 191 | } 192 | } 193 | ``` 194 | 195 | #### 使用 python 测试OpenAI Chat Completions API 196 | 197 | ```python 198 | # vllm_openai_chat_completions.py 199 | from openai import OpenAI 200 | openai_api_key = "sk-xxx" # 随便填写,只是为了通过接口参数校验 201 | openai_api_base = "http://localhost:8000/v1" 202 | 203 | client = OpenAI( 204 | api_key=openai_api_key, 205 | base_url=openai_api_base, 206 | ) 207 | 208 | chat_outputs = client.chat.completions.create( 209 | model="tinyllm_92m", 210 | messages=[ 211 | {"role": "system", "content": "You are a helpful assistant."}, 212 | {"role": "user", "content": "你好"}, 213 | ] 214 | ) 215 | print(chat_outputs) 216 | ``` 217 | 218 | ## 3.vLLM python调用 219 | 220 | 首先从 vLLM 库中导入 LLM 和 SamplingParams 类。LLM 类是使用 vLLM 引擎运行离线推理的主要类。SamplingParams 类指定采样过程的参数,用于控制和调整生成文本的随机性和多样性。 221 | 222 | vLLM 提供了非常方便的封装,直接传入模型名称或模型路径即可,不必手动初始化模型和分词器。 223 | 224 | ```python 225 | # vllm_model.py 226 | from vllm import LLM, SamplingParams 227 | from transformers import AutoTokenizer 228 | import os 229 | import json 230 | 231 | # 自动下载模型时,指定使用modelscope。不设置的话,会从 huggingface 下载 232 | os.environ['VLLM_USE_MODELSCOPE']='True' 233 | 234 | def get_completion(prompts, model, tokenizer=None, max_tokens=512, temperature=0.8, top_p=0.95, max_model_len=2048): 235 | stop_token_ids = [151329, 151336, 151338] 236 | # 创建采样参数。temperature 控制生成文本的多样性,top_p 控制核心采样的概率 237 | sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids) 238 | # 初始化 vLLM 推理引擎 239 | llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len,trust_remote_code=True) 240 | outputs = llm.generate(prompts, sampling_params) 241 | return outputs 242 | 243 | 244 | if __name__ == "__main__": 245 | # 初始化 vLLM 推理引擎 246 | model='/personal/wdn/tiny_llm_sft_92m' # 指定模型路径 247 | # model="wdn/tiny_llm_sft_92m" # 指定模型名称,自动下载模型 248 | tokenizer = None 249 | # 加载分词器后传入vLLM 模型,但不是必要的。 250 | # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False) 251 | 252 | text = ["你好。", 253 | "请介绍一下北京。"] 254 | 255 | outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=512, temperature=1, top_p=1, max_model_len=2048) 256 | 257 | # 输出是一个包含 prompt、生成文本和其他信息的 RequestOutput 对象列表。 258 | # 打印输出。 259 | for output in outputs: 260 | prompt = output.prompt 261 | generated_text = output.outputs[0].text 262 | print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") 263 | ``` 264 | 265 | 266 | -------------------------------------------------------------------------------- /train/ptm_train.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import numpy as np 3 | import os 4 | import glob 5 | import sys 6 | import math 7 | import json 8 | from dataclasses import dataclass, field 9 | # from itertools import chain 10 | from typing import Optional, List, Dict, Any, Mapping 11 | # from pathlib import Path 12 | import datasets 13 | import 
torch 14 | import torch.nn as nn 15 | # from torch.optim import AdamW 16 | # from torch.optim.lr_scheduler import LambdaLR 17 | # from datasets import load_dataset, concatenate_datasets, Dataset 18 | from datetime import datetime, timezone 19 | import transformers 20 | from transformers import ( 21 | CONFIG_MAPPING, 22 | MODEL_FOR_CAUSAL_LM_MAPPING, 23 | AutoConfig, 24 | AutoModelForCausalLM, 25 | HfArgumentParser, 26 | Trainer, 27 | TrainingArguments, 28 | is_torch_tpu_available, 29 | set_seed, 30 | ) 31 | from transformers.utils.versions import require_version 32 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR 33 | 34 | from configuration_tinyllm import TinyllmConfig 35 | from modeling_tinyllm import TinyllmForCausalLM 36 | from tinyllm_dataset import PTMDataset 37 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer 38 | 39 | MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys()) 40 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) 41 | 42 | @dataclass 43 | class ModelArguments: 44 | """ 模型相关参数 45 | """ 46 | hidden_size : Optional[int] = field( 47 | default=512, 48 | metadata={"help": "hidden_size"} 49 | ) 50 | 51 | num_hidden_layers : Optional[int] = field( 52 | default=8, 53 | metadata={"help": "num_hidden_layers"} 54 | ) 55 | 56 | num_attention_heads : Optional[int] = field( 57 | default=8, 58 | metadata={"help": "transformer num_attention_heads"} 59 | ) 60 | 61 | intermediate_size : Optional[int] = field( 62 | default=1408, 63 | metadata={"help": "intermediate_size"} 64 | ) 65 | 66 | rope_theta : Optional[float] = field( 67 | default=10000.0, 68 | metadata={"help": "rope_theta"} 69 | ) 70 | 71 | max_position_embeddings : Optional[int] = field( 72 | default=1024, 73 | metadata={"help": "max_position_embeddings"} 74 | ) 75 | 76 | vocab_size : Optional[int] = field( 77 | default=64798, 78 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"} 79 | ) 80 | 81 | @dataclass 82 | class ScriptArguments: 83 | """ 其他相关参数 84 | """ 85 | mode : Optional[str] = field( 86 | default="ptm", 87 | metadata={"help": "save pretrain *bin file dir"} 88 | ) 89 | 90 | dataset_dir_or_path : Optional[str] = field( 91 | default="data/pre_train", 92 | metadata={"help": "save pretrain *bin file dir"} 93 | ) 94 | 95 | resume : Optional[bool] = field( 96 | default=False, 97 | metadata={"help": "use PyTorch 2.0 to compile the model to be faster"} 98 | ) 99 | 100 | base_model_path : Optional[str] = field( 101 | default=" ", 102 | metadata={"help": "SFT train, the base model path"} 103 | ) 104 | 105 | def data_collator_fn(examples): 106 | # 将所有样本的输入 (`X`) 和标签 (`Y`) 分别堆叠 107 | input_ids = torch.stack([example[0] for example in examples]) 108 | labels = torch.stack([example[1] for example in examples]) 109 | 110 | # 返回一个字典,包含模型需要的键和值 111 | data_dict = { 112 | "input_ids": input_ids, 113 | "labels": labels 114 | } 115 | return data_dict 116 | 117 | logger = logging.getLogger(__name__) 118 | 119 | def main(): 120 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments)) 121 | model_args, script_args, training_args = parser.parse_args_into_dataclasses() 122 | 123 | # logger format 124 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S", 125 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN, 126 | handlers = [logging.StreamHandler(sys.stdout)],) 127 | if training_args.should_log: 128 | # The default of 
training_args.log_level is passive, so we set log level at info here to have that default. 129 | transformers.utils.logging.set_verbosity_info() 130 | 131 | log_level = training_args.get_process_log_level() 132 | logger.setLevel(log_level) 133 | datasets.utils.logging.set_verbosity(log_level) 134 | transformers.utils.logging.set_verbosity(log_level) 135 | transformers.utils.logging.enable_default_handler() 136 | transformers.utils.logging.enable_explicit_format() 137 | 138 | # Log on each process the small summary: 139 | logger.warning( 140 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 141 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 142 | ) 143 | 144 | set_seed(training_args.seed) 145 | 146 | device = "cuda" if torch.cuda.is_available() else "cpu" 147 | 148 | # init model 149 | gpt_args = dict( 150 | hidden_size = model_args.hidden_size, 151 | num_hidden_layers = model_args.num_hidden_layers, 152 | num_attention_heads = model_args.num_attention_heads, 153 | intermediate_size = model_args.intermediate_size, 154 | rope_theta = model_args.rope_theta, 155 | max_position_embeddings = model_args.max_position_embeddings, 156 | vocab_size = model_args.vocab_size, # 64798 157 | ) 158 | gpt_conf = TinyllmConfig(**gpt_args) 159 | model = TinyllmForCausalLM(gpt_conf) 160 | model.to(device) 161 | 162 | ################ 163 | total_params = sum(p.numel() for p in model.parameters()) 164 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 165 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params") 166 | logger.info(f"可训练参数: {trainable_params}") 167 | ############## 168 | 169 | def get_bin_files_abs_paths(directory): 170 | bin_files_paths = [] 171 | for root, dirs, files in os.walk(directory): 172 | for file in files: 173 | if file.endswith('.bin'): 174 | bin_files_paths.append(os.path.abspath(os.path.join(root, file))) 175 | return bin_files_paths 176 | # data_path_list = glob.glob(os.path.join(script_args.dataset_dir_or_path, '*.bin')) 177 | data_path_list = get_bin_files_abs_paths(script_args.dataset_dir_or_path) 178 | if len(data_path_list) == 0: 179 | logger.error("***************NO INPUT DATA********************") 180 | 181 | train_ds = PTMDataset(data_path_list, max_length = model_args.max_position_embeddings, memmap=False) 182 | 183 | trainer = Trainer( 184 | model = model, 185 | args = training_args, 186 | train_dataset = train_ds, 187 | # eval_dataset = None, 188 | # data_collator = data_collator_fn, 189 | ) 190 | # Training 191 | trainer.train(script_args.resume) 192 | torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir)) 193 | last_model_dir = os.path.join(training_args.output_dir, 'last_ptm_model') 194 | os.makedirs(last_model_dir, exist_ok=True) 195 | # https://github.com/huggingface/transformers/issues/28630 196 | model.save_pretrained(last_model_dir, safe_serialization=False) 197 | 198 | 199 | if __name__ == "__main__": 200 | main() 201 | 202 | -------------------------------------------------------------------------------- /script/rm_demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | 5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" 6 | 7 | source /home/.bashrc 8 | source /home/miniconda3/etc/profile.d/conda.sh 9 | conda activate md_llm 10 | which python 11 | 12 | function killall { 13 | echo `ps -ef | grep $1 | grep 
-v grep | awk '{print $2}'` 14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9 15 | } 16 | 17 | WORK_DIR="/personal/tiny-llm-zh" 18 | cd ${WORK_DIR} 19 | 20 | # 常见参数 21 | N_NODES=1 22 | N_GPUS=8 23 | MBS=16 # 单卡bs 24 | GAS=1 # 梯度累积 25 | GRAD_CLIP=1 # 梯度裁剪 26 | RANK=0 27 | MASTER_ADDR=`hostname -i` 28 | MASTER_PORT=2345 29 | 30 | LR=1e-4 # 初始学习率 31 | LR_SCHEDULER_TYPE="cosine" 32 | WARMUP_RATION=0.00 33 | 34 | TRAIN_EPOCHS=5 # 训练轮次 35 | LOGGING_STEPS=50 # 记录日志步数 36 | CKPT_SAVE_STEPS=5000 # ckpt保存步数 37 | 38 | SEED=12 39 | DS_DTYPE="bf16" # [fp16, bf16] 40 | RESUME="False" 41 | 42 | IS_EVAL="False" 43 | EVAL_STEP=1000 44 | EVAL_MBS=16 45 | 46 | # 数据 47 | MODE="rm" # [ptm, sft, rm, rl] 48 | DATASET_DIR_OR_PATH="data/rm_train/rm_data.jsonl" 49 | BASE_MODEL_PATH="outputs/ckpt/ptm_tiny_llm_92m_epoch5_2/last_ptm_model" 50 | 51 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m] 52 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}" 53 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 54 | mkdir -p $OUTPUT_DIR 55 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log" 56 | # tensorboard输出路径 57 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}" 58 | mkdir -p $TB_DIR 59 | 60 | TRAIN_ARGS="" 61 | 62 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json 63 | ZERO_STAGE=2 64 | 65 | if [ $DS_DTYPE = "fp16" ];then 66 | TRAIN_ARGS+=" \ 67 | --fp16 \ 68 | " 69 | DS_FP16=true 70 | DS_BF16=false 71 | GAS_DTYPE=$DS_DTYPE 72 | elif [ $DS_DTYPE = "bf16" ];then 73 | TRAIN_ARGS+=" \ 74 | --bf16 \ 75 | " 76 | DS_FP16=false 77 | DS_BF16=true 78 | GAS_DTYPE="fp32" 79 | 80 | fi 81 | 82 | cat < $DS_CONFIG_JSON 83 | { 84 | "train_micro_batch_size_per_gpu": $MBS, 85 | "train_batch_size": "auto", 86 | "gradient_clipping": ${GRAD_CLIP}, 87 | "zero_optimization": { 88 | "stage": $ZERO_STAGE 89 | }, 90 | "bf16": { 91 | "enabled": ${DS_BF16} 92 | }, 93 | "data_types": { 94 | "grad_accum_dtype": "${GAS_DTYPE}" 95 | }, 96 | "fp16": { 97 | "enabled": ${DS_FP16}, 98 | "loss_scale": 0, 99 | "loss_scale_window": 200, 100 | "hysteresis": 5, 101 | "min_loss_scale": 1, 102 | "initial_scale_power": 12 103 | }, 104 | "steps_per_print": 10, 105 | "wall_clock_breakdown": true, 106 | "comms_logger": { 107 | "enabled": true, 108 | "verbose": false, 109 | "prof_all": false, 110 | "debug": false 111 | }, 112 | "flops_profiler": { 113 | "enabled": false, 114 | "profile_step": 30, 115 | "module_depth": -1, 116 | "top_modules": 1, 117 | "detailed": true, 118 | "output_file": null 119 | } 120 | } 121 | EOT 122 | 123 | 124 | TRAIN_ARGS+=" \ 125 | --seed ${SEED} \ 126 | --output_dir ${OUTPUT_DIR} \ 127 | --overwrite_output_dir \ 128 | --deepspeed ${DS_CONFIG_JSON} \ 129 | --per_device_train_batch_size ${MBS} \ 130 | --gradient_accumulation_steps ${GAS} \ 131 | --do_train \ 132 | --num_train_epochs ${TRAIN_EPOCHS} \ 133 | --logging_dir ${TB_DIR} \ 134 | --logging_strategy steps \ 135 | --logging_steps ${LOGGING_STEPS} \ 136 | --weight_decay 0.01 \ 137 | --adam_beta1 0.9 \ 138 | --adam_beta1 0.95 \ 139 | --max_grad_norm ${GRAD_CLIP} \ 140 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \ 141 | --learning_rate ${LR} \ 142 | --warmup_ratio ${WARMUP_RATION} \ 143 | --weight_decay 0.01 \ 144 | --save_strategy steps \ 145 | --save_total_limit 3 \ 146 | --save_steps ${CKPT_SAVE_STEPS} \ 147 | --ddp_timeout 30000 \ 148 | --logging_first_step True \ 149 | --save_safetensors False \ 150 | --ddp_find_unused_parameters False \ 151 | --remove_unused_columns False \ 152 | " 153 | 154 | if [ $IS_EVAL = "True" ];then 155 
| TRAIN_ARGS+=" \ 156 | --per_device_eval_batch_size ${EVAL_MBS} \ 157 | --evaluation_strategy steps \ 158 | --eval_steps ${EVAL_STEP} \ 159 | " 160 | fi 161 | 162 | if [[ $MODEL_SIZE == "16m" ]];then 163 | HIDDEN_SIZE=120 164 | NUM_HIDDEN_LAYERS=6 165 | NUM_ATTENTION_HEADS=6 166 | INTERMEDIATE_SIZE=384 167 | ROPE_THETA=10000.0 168 | MAX_POSITION_EMBEDDINGS=512 169 | VOCAB_SIZE=64798 170 | elif [[ $MODEL_SIZE == "42m" ]];then 171 | HIDDEN_SIZE=288 172 | NUM_HIDDEN_LAYERS=6 173 | NUM_ATTENTION_HEADS=6 174 | INTERMEDIATE_SIZE=768 175 | ROPE_THETA=10000.0 176 | MAX_POSITION_EMBEDDINGS=512 177 | VOCAB_SIZE=64798 178 | elif [[ $MODEL_SIZE == "92m" ]];then 179 | HIDDEN_SIZE=512 180 | NUM_HIDDEN_LAYERS=8 181 | NUM_ATTENTION_HEADS=8 182 | INTERMEDIATE_SIZE=1408 183 | ROPE_THETA=10000.0 184 | MAX_POSITION_EMBEDDINGS=1024 185 | VOCAB_SIZE=64798 186 | elif [[ $MODEL_SIZE == "210m" ]];then 187 | HIDDEN_SIZE=768 188 | NUM_HIDDEN_LAYERS=16 189 | NUM_ATTENTION_HEADS=12 190 | INTERMEDIATE_SIZE=2048 191 | ROPE_THETA=10000.0 192 | MAX_POSITION_EMBEDDINGS=1024 193 | VOCAB_SIZE=64798 194 | elif [[ $MODEL_SIZE == "440m" ]];then 195 | HIDDEN_SIZE=1024 196 | NUM_HIDDEN_LAYERS=24 197 | NUM_ATTENTION_HEADS=16 198 | INTERMEDIATE_SIZE=2816 199 | ROPE_THETA=10000.0 200 | MAX_POSITION_EMBEDDINGS=1024 201 | VOCAB_SIZE=64798 202 | fi 203 | 204 | GPT_ARGS=" \ 205 | --hidden_size ${HIDDEN_SIZE} \ 206 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \ 207 | --num_attention_heads ${NUM_ATTENTION_HEADS} \ 208 | --intermediate_size ${INTERMEDIATE_SIZE} \ 209 | --rope_theta ${ROPE_THETA} \ 210 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \ 211 | --vocab_size ${VOCAB_SIZE} \ 212 | " 213 | SCRIPT_ARGS=" \ 214 | --mode ${MODE} \ 215 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \ 216 | --resume ${RESUME} \ 217 | --base_model_path ${BASE_MODEL_PATH} \ 218 | " 219 | 220 | DISTRIBUTED_ARGS=" \ 221 | --nnodes $N_NODES \ 222 | --nproc_per_node $N_GPUS \ 223 | --node_rank $RANK \ 224 | --master_addr $MASTER_ADDR \ 225 | --master_port $MASTER_PORT \ 226 | " 227 | 228 | # 检查num是否大于1 229 | if [ "$N_NODES" -ge 2 ]; then 230 | DISTRIBUTED_ARGS+=" \ 231 | --node_rank $RANK \ 232 | --master_addr $MASTER_ADDR \ 233 | --master_port $MASTER_PORT \ 234 | " 235 | fi 236 | 237 | # 所有参数 238 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS " 239 | 240 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/rm_train.py " 241 | 242 | export CMD="$LAUNCHER $ALL_ARGS" 243 | echo $CMD 244 | 245 | killall train/rm_train.py 246 | 247 | # 执行训练 248 | $CMD 2>&1 | tee ${TRAIN_LOG} 249 | 250 | killall train/rm_train.py 251 | 252 | echo "train end : ${OUTPUT_DIR}" 253 | # nohup torchrun --standalone --nproc_per_node=$N_GPUS pretrain.py \ 254 | # --out_dir="$OUTPUT_DIR/$MODEL_NAME" \ 255 | # --vocab_size=$VOCAB_SIZE \ 256 | # --max_seq_len=$VOCAB_SIZE \ 257 | # --dim=$DIM \ 258 | # --n_layers=$N_LAYERS \ 259 | # --n_heads=$N_HEADS \ 260 | # --n_kv_heads=$N_KV_HEADS \ 261 | # --multiple_of=$MULTIPLE_OF \ 262 | # --dropout=$DROPOUT \ 263 | # --batch_size=$BATCH_SIZE \ 264 | # >> $log_file 2>&1 & 265 | 266 | 267 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tiny LLM zh 2 | 3 | ## 1.简介 4 | 5 | 本项目旨在构建一个小参数量的中文语言大模型,用于快速入门学习大模型相关知识,如果此项目对你有用,可以点一下start,谢谢! 
6 | 7 | 模型架构:整体模型架构采用开源通用架构,包括:RMSNorm,RoPE,MHA等 8 | 9 | 实现细节:实现大模型两阶段训练及后续人类对齐,即:分词(Tokenizer) -> 预训练(PTM) -> 指令微调(SFT) -> 人类对齐(RLHF, DPO) -> 测评 -> 量化 -> 部署。 10 | 11 | 项目已部署,可以在如下网站上体验。 12 | 13 | - [ModeScope Tiny LLM](https://www.modelscope.cn/studios/wdndev/tiny_llm_92m_demo/summary) 14 | 15 | 项目特点: 16 | 17 | - 公开全部数据及代码,包括预训练数据,tokenizer等;([Tiny LLM Datasets](doc/datasets_download.md)) 18 | - 走通大模型整个流程:分词(Tokenizer) -> 预训练(PTM) -> 指令微调(SFT) -> 人类对齐(RLHF, DPO) -> 测评 -> 部署; 19 | - 公开预训练token 42B,SFT数据400w条,RL数据 17w条; 20 | - 训练 Tokenizer:10G 中文百科文本训练 20K 中文词表,与 Llama2 词表合并,构建Tiny LLM词表; 21 | - 使用 Transformers deepspeed 进行训练,支持多机多卡,支持 Zero 等优化技术; 22 | - 所有代码 `Bash` 脚本启动,支持不同大小的模型,如16m, 42m, 92m, 210m, 440m等; 23 | - 支持 MoE 架构,在 [tiny_llm_moe](https://github.com/wdndev/tiny-llm-zh/tree/tiny_llm_moe) 支持最新共享专家,平衡专家等技术; 24 | - 支持 vLLM 推理框架; 25 | - 支持 llama.cpp 推理框架; 26 | 27 | 28 | 本项目主要有三个分支,推荐学习 主分支,具体区别如下: 29 | 30 | - [llama2_torch](https://github.com/wdndev/tiny-llm-zh/tree/llama2_torch) : 模型架构采用原版 Llama2 架构,只是将部分的输入输出修改为适合训练的格式; 31 | - `main` `tiny_llm` : 对齐开源社区模型,使用Transformers库构建底层模型,也使用Transformers库进行多卡多机训练; 32 | - [tiny_llm_moe](https://github.com/wdndev/tiny-llm-zh/tree/tiny_llm_moe) : 在`tiny_llm`的基础上,修改 `MLP`层为MoE模型,使用Transformers库进行多卡多机训练。 33 | 34 | 注意: 35 | 36 | 1. 因资源限制,本项目的第一要务是走通大模型整个流程,而不是调教比较好的效果,故评测结果分数较低,部分生成错误。 37 | 2. 详细的数据处理,训练过程见 `doc` 文件夹(正在整理。。。) 38 | 39 | 40 | ## 2.快速开始 41 | 42 | 模型已托管在 [Huggingface](https://huggingface.co/wdndev/tiny_llm_sft_92m) 和 [ModeScope](https://www.modelscope.cn/models/wdndev/tiny_llm_sft_92m) 中,可运行代码自动下载。 43 | 44 | 建议使用 Huggingface 在线加载模型,如果运行不了,在试 ModeScope ;如果需要本地运行,修改`model_id`中的路径为本地目录,即可运行。 45 | 46 | #### 依赖安装 47 | 48 | - python 3.8 and above 49 | - pytorch 2.0 and above 50 | - transformers 4.37.2 and above 51 | - CUDA 11.4 and above are recommended. (if training) 52 | 53 | ```bash 54 | pip install -r requirements.txt 55 | ``` 56 | 57 | 58 | #### 🤗 HuggingFace 59 | 60 | ```python 61 | from transformers import AutoTokenizer, AutoModelForCausalLM 62 | from transformers.generation import GenerationConfig 63 | 64 | model_id = "wdndev/tiny_llm_sft_92m" 65 | 66 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) 67 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True) 68 | generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True) 69 | sys_text = "你是由wdndev开发的个人助手。" 70 | # user_text = "世界上最大的动物是什么?" 
71 | # user_text = "介绍一下刘德华。" 72 | user_text = "介绍一下中国。" 73 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 74 | "<|user|>", user_text.strip(), 75 | "<|assistant|>"]).strip() + "\n" 76 | 77 | generation_config.max_new_tokens = 200 78 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device) 79 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config) 80 | generated_ids = [ 81 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 82 | ] 83 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 84 | print(response) 85 | ``` 86 | 87 | #### 🤖 ModeScope 88 | 89 | ```python 90 | from modelscope import AutoModelForCausalLM, AutoTokenizer 91 | 92 | model_id = "wdndev/tiny_llm_sft_92m" 93 | 94 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) 95 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True) 96 | 97 | sys_text = "你是由wdndev开发的个人助手。" 98 | # user_text = "世界上最大的动物是什么?" 99 | # user_text = "介绍一下刘德华。" 100 | user_text = "介绍一下中国。" 101 | input_txt = "\n".join(["<|system|>", sys_text.strip(), 102 | "<|user|>", user_text.strip(), 103 | "<|assistant|>"]).strip() + "\n" 104 | 105 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device) 106 | generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=200) 107 | generated_ids = [ 108 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 109 | ] 110 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 111 | print(response) 112 | ``` 113 | 114 | 115 | 生成效果 116 | ```bash 117 | 问:世界上最大的动物是什么? 118 | 答:目前已知最大的动物是蓝鲸(Balaenoptera musculus),这是一个庞大的哺乳动物,属于须鲸亚目、须鲸科中的最大物种。蓝鲸的身长可达30米以上,体重可达175吨。它们在海洋中生活,主要以浮游生物为食,如甲壳类动物和小型鱼类等。由于其巨大的体型和复杂的生态群落,蓝鲸成为海洋旅游的热门景点之一。 119 | 120 | 问:介绍一下刘德华。 121 | 答:刘德华是一位香港流行歌手、演员和导演,他在音乐界的贡献非常巨大。他是华语乐坛历史上最伟大的艺人之一,代表作品包括《爱我身体》和《肥皂泡》。他也经常参演电影和电视剧,并在电视上受到好评。 122 | 123 | 问:介绍一下中国。 124 | 答:中国是位于东亚的大陆,被欧洲以及亚洲和其他大陆所包围。它是中国第二大文明和世界上最大的经济体之一。中国的历史可以追溯到公元前5000年左右,从古至今都有其独特的文化和语言传承者。 125 | 126 | ``` 127 | 128 | ## 3.模型 129 | 130 | ### 3.1 Tokenizer 131 | 132 | LLM分词器的构建方式有两种:一种是自己构造词表,训练一个分词器;另一种是选择开源模型训练好的分词器。 133 | 134 | 本项目为了方便,从优秀的开源项目中选择词表,考虑到训练的模型较小,且词表大小影响模型大小,故优先选择词表较小的开源项目;经过比较,最终选择 [ChatGLM3](https://huggingface.co/THUDM/chatglm3-6b) 的词表,该词表大小为 64798 。 135 | 136 | 自己构造词表方式见 [tokenizer](tokenizer/),扩充 LLaMA2的32K词表为50K,增加20K中文词表,详细扩充方式见[文档](./doc/)或[tokenizer/README.md](./tokenizer/README.md). 
137 | 138 | 注意:本项目使用的ChatGLM3的词表。 139 | 140 | ### 3.2 模型结构 141 | 142 | 模型结构采用类Llama2的结构,具体包括:RMSNorm,RoPE,MHA等; 143 | 144 | 145 | ### 3.3 模型尺寸 146 | 147 | 具体参数细节如下所示: 148 | 149 | | model | hidden size | intermediate size | n_layers | n_heads | max context length | params | vocab size | 150 | | ---------------- | ----------- | ----------------- | -------- | ------- | ------------------ | ------ | ---------- | 151 | | tiny-llm-16m | 120 | 384 | 6 | 6 | 512 | 16M | 64798 | 152 | | tiny-llm-42m | 288 | 768 | 6 | 6 | 512 | 42M | 64798 | 153 | | tiny-llm-92m | 512 | 1024 | 8 | 8 | 1024 | 92M | 64798 | 154 | | tiny-llm-210m | 768 | 2048 | 16 | 12 | 1024 | 210M | 64798 | 155 | | tiny-llm-440m | 1024 | 2816 | 24 | 16 | 1024 | 440M | 64798 | 156 | | tiny-llm-1_5b | 2048 | 5504 | 24 | 16 | 1024 | 1.5B | 64798 | 157 | 158 | 159 | ### 3.4 模型评估 160 | 161 | 因训练数据和微调数据,大部分都是中文数据,所以在`C-Eval`和`CMMLU`这两个数据集上进行模型的评估;使用[OpenCompass](https://github.com/open-compass/opencompass)工具,进行模型评估,评估分数如下所示: 162 | 163 | | model | Type | C-Eval | CMMLU | 164 | | ---------------- | ----- | ------- | ------- | 165 | | tiny-llm-92m | Base | 23.48 | 25.02 | 166 | | tiny-llm-92m | Chat | 26.79 | 26.59 | 167 | 168 | Base模型,采用评测方式 ppl 方式进行评测;Chat模型,采用 gen 方式评测。具体区别如下图所示: 169 | 170 | ![ppl gen](doc/image/ppl_gen.png) 171 | 172 | > 来源:[ppl和gen模式有什么区别](https://github.com/open-compass/opencompass/discussions/597) 173 | 174 | 注意:只对常用的两个模型进行了评测,分数较低,其余模型评测意义不大。 175 | 176 | 177 | ## 4.模型部署 178 | 179 | ### 4.1 网页Demo 180 | 181 | 网页Demo已部署,可以在如下网站上体验:[ModeScope Tiny LLM](https://www.modelscope.cn/studios/wdndev/tiny_llm_92m_demo/summary) 182 | 183 | 如果想在本地运行网页Demo,注意修改 `web_demo.py` 文件中模型的路径`model_id`,输入如下命令即可运行: 184 | 185 | ```shell 186 | streamlit run web_demo.py 187 | ``` 188 | 189 | ![web demo](doc/image/web_demo.png) 190 | 191 | ### 4.2 Transformers 192 | 193 | Transfomers 框架部署,位于 `demo/infer_chat.py` 和 `demo/infer_func.py` 文件中,和其他LLM运行无太大区别,注意输入的拼接即可。 194 | 195 | 196 | ### 4.3 FastAPI 197 | 198 | 199 | 200 | ### 4.4 vllm 201 | 202 | 详细vllm部署见 [vllm](vllm/README.md) 203 | 204 | 如果使用**CUDA 12 以上和PyTorch 2.1 以上**,可以直接使用以下命令安装vLLM。 205 | 206 | ```shell 207 | pip install vllm==0.4.0 208 | ``` 209 | 210 | 否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。 211 | 212 | 安装完成后,还需要以下操作~ 213 | 214 | 1. 把 `vllm/tinyllm.py` 文件复制到env环境对应的 `vllm/model_executor/models` 目录下。 215 | 2. 
然后在vllm/model_executor/models/\_\_init\_\_.py文件增加一行代码 216 | 217 | ```shell 218 | "TinyllmForCausalLM": ("tinyllm", "TinyllmForCausalLM"), 219 | ``` 220 | 221 | > 由于模型结构是自己定义的,vllm官方未实现,需要自己手动加入 222 | 223 | ### 4.5 llama.cpp 224 | 225 | 详细 llama.cpp 部署见 [llama.cpp](llama.cpp/README.md) 226 | 227 | Tiny LLM 92M 模型已支持 llama.cpp C++ 推理框架,建议在 linux 环境下测试,windows效果不好; 228 | 229 | 所支持 llama.cpp 为自己修改的版本,仓库链接为: [llama.cpp.tinyllm](https://github.com/wdndev/llama.cpp.tinyllm) 230 | -------------------------------------------------------------------------------- /utils/pre_train_process.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import glob 4 | import numpy as np 5 | from tqdm import tqdm 6 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer 7 | import pandas as pd 8 | 9 | 10 | def process_wiki_clean(file_path, tokenizer): 11 | """ https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered 12 | """ 13 | with open(file_path, 'r', encoding='utf-8') as f: 14 | data = json.load(f) 15 | all_tokens = [] 16 | for line in tqdm(data): 17 | text = line['completion'] 18 | tokens = tokenizer.encode(text, add_special_tokens=False) 19 | tokens.append(tokenizer.special_tokens['']) 20 | if len(tokens) > 5: 21 | all_tokens += tokens 22 | arr = np.array(all_tokens, dtype=np.uint16) 23 | base_name, ext = os.path.splitext(file_path) 24 | output_file_path = base_name + '.bin' 25 | with open(output_file_path, 'wb') as f: 26 | f.write(arr.tobytes()) 27 | 28 | def process_webnovel(input_dir, tokenizer): 29 | for subdir, dirs, files in os.walk(input_dir): 30 | for idx, file in enumerate(files): 31 | # 只处理txt文件 32 | if file.endswith('.jsonl'): 33 | # 获取当前文件的绝对路径 34 | file_path = os.path.join(subdir, file) 35 | all_tokens = [] 36 | # 读取jsonl文件 37 | with open(file_path, 'r', encoding='utf-8') as infile: 38 | lines = infile.readlines() 39 | 40 | for line in tqdm(lines): 41 | json_obj = json.loads(line) # 解析json字符串为python对象 42 | text = json_obj['text'] 43 | tokens = tokenizer.encode(text, add_special_tokens=False) 44 | tokens.append(tokenizer.special_tokens['']) 45 | if len(tokens) > 5: 46 | all_tokens += tokens 47 | 48 | arr = np.array(all_tokens, dtype = np.uint16) 49 | base_name, ext = os.path.splitext(file_path) 50 | output_file_path = base_name + '.bin' 51 | with open(output_file_path, 'wb') as f: 52 | f.write(arr.tobytes()) 53 | 54 | def process_tigerbot_wiki(input_dir, tokenizer): 55 | """ https://huggingface.co/datasets/TigerResearch/tigerbot-wiki-plugin 56 | """ 57 | for subdir, dirs, files in os.walk(input_dir): 58 | for idx, file in enumerate(files): 59 | # 只处理txt文件 60 | if file.endswith('.json'): 61 | # 获取当前文件的绝对路径 62 | file_path = os.path.join(subdir, file) 63 | all_tokens = [] 64 | # 读取jsonl文件 65 | with open(file_path, 'r', encoding='utf-8') as infile: 66 | lines = infile.readlines() 67 | 68 | for line in tqdm(lines): 69 | json_obj = json.loads(line) # 解析json字符串为python对象 70 | text = json_obj['text'] 71 | tokens = tokenizer.encode(text, add_special_tokens=False) 72 | tokens.append(tokenizer.special_tokens['']) 73 | if len(tokens) > 5: 74 | all_tokens += tokens 75 | 76 | arr = np.array(all_tokens, dtype = np.uint16) 77 | base_name, ext = os.path.splitext(file_path) 78 | output_file_path = base_name + '.bin' 79 | with open(output_file_path, 'wb') as f: 80 | f.write(arr.tobytes()) 81 | 82 | def process_tigerbot_part(input_dir, tokenizer): 83 | """ https://huggingface.co/datasets/TigerResearch/pretrain_zh 84 
| """ 85 | # df = pd.read_parquet("zhizhu/train-00000-of-00005-a1278ede4e8c5cdb.parquet") 86 | # responses = df['RESPONSE'] 87 | # print(len(responses)) 88 | # print(responses[4000]) 89 | all_tokens = [] 90 | total_len = 0 91 | file_idx = 7 92 | # 使用glob找出文件夹下所有的.parquet 93 | for file in glob.glob(os.path.join(input_dir, '*.parquet')): 94 | # 读取jsonl文件 95 | print(file) 96 | # 读取parquet文件 97 | df = pd.read_parquet(file) 98 | 99 | # 提取RESPONSE列 100 | responses = df['content'] 101 | 102 | for text in tqdm(responses): 103 | tokens = tokenizer.encode(text, add_special_tokens=False) 104 | tokens.append(tokenizer.special_tokens['']) 105 | if len(tokens) > 5: 106 | all_tokens += tokens 107 | 108 | total_len += len(df) 109 | if total_len > 600000: 110 | arr = np.array(all_tokens, dtype=np.uint16) 111 | output_file_path = "tigerbot_part_" + str(file_idx) + '.bin' 112 | with open(output_file_path, 'wb') as f: 113 | f.write(arr.tobytes()) 114 | 115 | all_tokens = [] 116 | total_len = 0 117 | file_idx += 1 118 | 119 | if len(all_tokens) > 0: 120 | arr = np.array(all_tokens, dtype=np.uint16) 121 | output_file_path = "tigerbot_part_" + str(file_idx) + '.bin' 122 | with open(output_file_path, 'wb') as f: 123 | f.write(arr.tobytes()) 124 | 125 | def process_zhihu(input_dir, tokenizer): 126 | """ https://huggingface.co/datasets/wangrui6/Zhihu-KOL 127 | """ 128 | # df = pd.read_parquet("zhizhu/train-00000-of-00005-a1278ede4e8c5cdb.parquet") 129 | # responses = df['RESPONSE'] 130 | # print(len(responses)) 131 | # print(responses[4000]) 132 | all_tokens = [] 133 | # 使用glob找出文件夹下所有的.parquet 134 | for file in glob.glob(os.path.join(input_dir, '*.parquet')): 135 | # 读取jsonl文件 136 | print(file) 137 | # 读取parquet文件 138 | df = pd.read_parquet(file) 139 | 140 | # 提取RESPONSE列 141 | responses = df['RESPONSE'] 142 | 143 | for text in tqdm(responses): 144 | tokens = tokenizer.encode(text, add_special_tokens=False) 145 | tokens.append(tokenizer.special_tokens['']) 146 | if len(tokens) > 5: 147 | all_tokens += tokens 148 | arr = np.array(all_tokens, dtype=np.uint16) 149 | # base_name, ext = os.path.splitext(file_path) 150 | output_file_path = "zhihu" + '.bin' 151 | with open(output_file_path, 'wb') as f: 152 | f.write(arr.tobytes()) 153 | 154 | def process_baidu_baike(input_path, tokenizer): 155 | """ https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M 156 | """ 157 | BATCH_SIZE = 1000000 158 | 159 | cnt = 0 160 | batch_cnt = 0 161 | token = 0 162 | doc_ids = [] 163 | 164 | f1 = open(input_path, 'r', encoding='utf-8') 165 | 166 | while True: 167 | line = f1.readline() 168 | if not line: 169 | break 170 | line = json.loads(line) 171 | text = '' 172 | try: 173 | text += line['title']+':' + line['summary'] 174 | except: 175 | pass 176 | for per in line['sections']: 177 | text += per['title']+':'+per['content']+'。' 178 | text_id = tokenizer.encode(text, add_special_tokens=False) 179 | text_id.append(tokenizer.special_tokens['']) 180 | if len(text_id) > 5: 181 | doc_ids += text_id 182 | cnt += 1 183 | if cnt % BATCH_SIZE==0: 184 | batch_cnt += 1 185 | arr = np.array(doc_ids, dtype=np.uint16) 186 | doc_ids=[] 187 | print('cnt:',cnt,'arr_shape:',arr.shape) 188 | with open('./baidubaike_563w_{}.bin'.format(batch_cnt),'wb') as f2: 189 | f2.write(arr.tobytes()) 190 | del arr 191 | 192 | if not doc_ids: 193 | batch_cnt += 1 194 | arr = np.array(doc_ids, dtype=np.uint16) 195 | print('cnt:',cnt,'arr_shape:',arr.shape) 196 | with open('./baidubaike_563w_{}.bin'.format(batch_cnt),'wb') as f: 197 | f.write(arr.tobytes()) 198 | 199 
| def merge_bin(data_path_list : list): 200 | """ 合并所有bin文件 201 | """ 202 | data_arr = [] 203 | for data_path in tqdm(data_path_list): 204 | with open(data_path,'rb') as f: 205 | data = np.fromfile(f,dtype = np.uint16) 206 | data_arr.append(data) 207 | arr = np.concatenate(data_arr) 208 | print(arr.shape) 209 | with open('./data/pretrain_data.bin','wb') as f: 210 | f.write(arr.tobytes()) 211 | 212 | if __name__=="__main__": 213 | tokenizer = ChatGLMTokenizer(vocab_file='utils/tokenizer/tokenizer.model') 214 | 215 | # process_webnovel("webnovel-chinese/data", tokenizer) 216 | # process_wiki_clean("corpus/pre_train/wiki_cn/wikipedia-cn.json", tokenizer) 217 | # process_zhihu("corpus/pre_train/zhihu", tokenizer) 218 | process_tigerbot_part("corpus/pre_train/tigerbot2", tokenizer) 219 | # process_baidu_baike('corpus/pre_train/baidubaike/563w_baidubaike.json', tokenizer) -------------------------------------------------------------------------------- /train/rm_train.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import numpy as np 3 | import os 4 | import evaluate 5 | import glob 6 | import sys 7 | import math 8 | import json 9 | from dataclasses import dataclass, field 10 | # from itertools import chain 11 | from typing import Optional, List, Dict, Any, Mapping 12 | # from pathlib import Path 13 | import datasets 14 | import torch 15 | import torch.nn as nn 16 | # from torch.optim import AdamW 17 | # from torch.optim.lr_scheduler import LambdaLR 18 | # from datasets import load_dataset, concatenate_datasets, Dataset 19 | from torch.utils.data import Dataset, DataLoader, random_split 20 | from datetime import datetime, timezone 21 | import transformers 22 | from transformers import ( 23 | CONFIG_MAPPING, 24 | MODEL_FOR_CAUSAL_LM_MAPPING, 25 | AutoConfig, 26 | AutoModelForCausalLM, 27 | HfArgumentParser, 28 | Trainer, 29 | TrainingArguments, 30 | is_torch_tpu_available, 31 | set_seed, 32 | ) 33 | from transformers.utils.versions import require_version 34 | from sklearn.metrics import accuracy_score 35 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR 36 | 37 | from configuration_tinyllm import TinyllmConfig 38 | from modeling_tinyllm import TinyllmForCausalLM, TinyllmForSequenceClassification 39 | from tinyllm_dataset import RMDataset 40 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer 41 | 42 | @dataclass 43 | class ModelArguments: 44 | """ 模型相关参数 45 | """ 46 | hidden_size : Optional[int] = field( 47 | default=512, 48 | metadata={"help": "hidden_size"} 49 | ) 50 | 51 | num_hidden_layers : Optional[int] = field( 52 | default=8, 53 | metadata={"help": "num_hidden_layers"} 54 | ) 55 | 56 | num_attention_heads : Optional[int] = field( 57 | default=8, 58 | metadata={"help": "transformer num_attention_heads"} 59 | ) 60 | 61 | intermediate_size : Optional[int] = field( 62 | default=1408, 63 | metadata={"help": "intermediate_size"} 64 | ) 65 | 66 | rope_theta : Optional[float] = field( 67 | default=10000.0, 68 | metadata={"help": "rope_theta"} 69 | ) 70 | 71 | max_position_embeddings : Optional[int] = field( 72 | default=1024, 73 | metadata={"help": "max_position_embeddings"} 74 | ) 75 | 76 | vocab_size : Optional[int] = field( 77 | default=64798, 78 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"} 79 | ) 80 | 81 | @dataclass 82 | class ScriptArguments: 83 | """ 其他相关参数 84 | """ 85 | mode : Optional[str] = field( 86 | default="rm", 87 | metadata={"help": "save sft *bin 
file dir"} 88 | ) 89 | 90 | dataset_dir_or_path : Optional[str] = field( 91 | default="data/rm_train", 92 | metadata={"help": "save rmtrain *bin file dir"} 93 | ) 94 | 95 | resume : Optional[bool] = field( 96 | default=False, 97 | metadata={"help": "use PyTorch 2.0 to compile the model to be faster"} 98 | ) 99 | 100 | base_model_path : Optional[str] = field( 101 | default=" ", 102 | metadata={"help": "SFT train, the base model path"} 103 | ) 104 | 105 | class RMTrainer(Trainer): 106 | def compute_loss(self, model, inputs, return_outputs=False): 107 | """ Define how to compute the reward loss. 108 | We use the InstructGPT pairwise logloss: https://arxiv.org/abs/2203.02155 109 | """ 110 | rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0] 111 | rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0] 112 | loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean() 113 | if return_outputs: 114 | return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k} 115 | return loss 116 | 117 | # Define the metric that we'll use for validation. 118 | # accuracy = evaluate.load("accuracy") 119 | # def compute_metrics(eval_pred): 120 | # predictions, _ = eval_pred 121 | # # Here, predictions is rewards_j and rewards_k. 122 | # # We want to see how much of the time rewards_j > rewards_k. 123 | # # 是这么计算的: 124 | # # 通过 argmax,得到最大值的 index,当 rewards_j 最大时,返回 0,rewards_k 最大时,返回 1 125 | # # 正确标签应该是全部为 0(index都在 0 这里) 126 | 127 | # # Q: model的输出不是一个score吗,为什么这里可以使用argmax? 128 | # # A: 下面的 compute_loss 中定义了新的model forward 方法,即会接受两个输入产生两个输出 129 | # # Trainer 中会把这种两个输出拼起来,从而得到一个在axis=0维度上有两项的形式,因此argmax就是看哪一项更大 130 | # # 具体可以参考 Trainer 中对 涉及到 compute_loss/logits/training_step/prediction_step 的部分,以及 _gather_and_numpify 方法 131 | # predictions = np.argmax(predictions, axis=0) 132 | # labels = np.zeros(predictions.shape) 133 | # return accuracy.compute(predictions=predictions, references=labels) 134 | 135 | from sklearn.metrics import accuracy_score 136 | def compute_metrics(eval_preds): 137 | predictions = eval_preds.predictions 138 | preds = np.argmax(predictions, axis=1).reshape(-1) 139 | labels = np.zeros(preds.shape) 140 | metric = { 141 | "accuracy": float( 142 | accuracy_score(labels, preds, normalize=True) 143 | ), 144 | } 145 | return metric 146 | 147 | logger = logging.getLogger(__name__) 148 | 149 | def main(): 150 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments)) 151 | model_args, script_args, training_args = parser.parse_args_into_dataclasses() 152 | 153 | # logger format 154 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S", 155 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN, 156 | handlers = [logging.StreamHandler(sys.stdout)],) 157 | if training_args.should_log: 158 | # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
159 | transformers.utils.logging.set_verbosity_info() 160 | 161 | log_level = training_args.get_process_log_level() 162 | logger.setLevel(log_level) 163 | datasets.utils.logging.set_verbosity(log_level) 164 | transformers.utils.logging.set_verbosity(log_level) 165 | transformers.utils.logging.enable_default_handler() 166 | transformers.utils.logging.enable_explicit_format() 167 | 168 | # Log on each process the small summary: 169 | logger.warning( 170 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 171 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 172 | ) 173 | 174 | set_seed(training_args.seed) 175 | 176 | device = "cuda" if torch.cuda.is_available() else "cpu" 177 | 178 | # init model 179 | config = transformers.AutoConfig.from_pretrained( 180 | script_args.base_model_path, 181 | trust_remote_code=True 182 | ) 183 | 184 | tokenizer = transformers.AutoTokenizer.from_pretrained( 185 | script_args.base_model_path, 186 | use_fast=False, 187 | trust_remote_code=True, 188 | model_max_length=config.max_position_embeddings 189 | ) 190 | 191 | config.use_cache = False 192 | config.num_labels = 1 193 | config.pad_token_id = tokenizer.eos_token_id 194 | 195 | model = TinyllmForSequenceClassification.from_pretrained( 196 | script_args.base_model_path, 197 | config=config, 198 | trust_remote_code=True 199 | ) 200 | 201 | model.to(device) 202 | 203 | ################ 204 | total_params = sum(p.numel() for p in model.parameters()) 205 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 206 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params") 207 | logger.info(f"可训练参数: {trainable_params}") 208 | ############## 209 | 210 | rm_dataset = RMDataset( 211 | script_args.dataset_dir_or_path, 212 | tokenizer, 213 | config.max_position_embeddings 214 | ) 215 | total_len = len(rm_dataset) 216 | eval_size = int(0.01 * total_len) 217 | # 划分训练集和验证集 218 | train_ds, eval_ds = random_split(rm_dataset, [total_len - eval_size, eval_size]) 219 | 220 | trainer = RMTrainer( 221 | model = model, 222 | args = training_args, 223 | train_dataset = train_ds, 224 | eval_dataset = eval_ds, 225 | compute_metrics=compute_metrics, 226 | ) 227 | 228 | # Training 229 | trainer.train(script_args.resume) 230 | torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir)) 231 | last_model_dir = os.path.join(training_args.output_dir, 'last_rm_model') 232 | os.makedirs(last_model_dir, exist_ok=True) 233 | tokenizer.save_pretrained(last_model_dir) 234 | # https://github.com/huggingface/transformers/issues/28630 235 | model.save_pretrained(last_model_dir, safe_serialization=False) 236 | 237 | if __name__ == "__main__": 238 | main() -------------------------------------------------------------------------------- /train/dpo_train.py: -------------------------------------------------------------------------------- 1 | # https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py 2 | # https://huggingface.co/docs/trl/main/en/dpo_trainer 3 | # https://huggingface.co/datasets/lvwerra/stack-exchange-paired 4 | # https://huggingface.co/blog/zh/dpo-trl 5 | 6 | # https://github.dev/RUCAIBox/LLMBox 7 | 8 | # 0. 
imports 9 | import os 10 | from dataclasses import dataclass, field 11 | from typing import Dict, Optional 12 | 13 | import torch 14 | from accelerate import Accelerator 15 | from datasets import Dataset, load_dataset 16 | from peft import LoraConfig 17 | from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, set_seed 18 | from torch.utils.data import Dataset, DataLoader, random_split 19 | from trl import DPOTrainer 20 | from tinyllm_dataset import load_dpo_dataset 21 | 22 | 23 | # Define and parse arguments. 24 | @dataclass 25 | class ScriptArguments: 26 | """ 27 | The arguments for the DPO training script. 28 | """ 29 | 30 | # data parameters 31 | beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"}) 32 | 33 | # training parameters 34 | model_name: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"}) 35 | dataset_dir_or_path: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"}) 36 | eval_dataset_dir_or_path: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"}) 37 | resume: Optional[bool] = field(default=False,metadata={"help": "the location of the SFT model name or path"}) 38 | base_model_path: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"}) 39 | 40 | learning_rate: Optional[float] = field(default=5e-4, metadata={"help": "optimizer learning rate"}) 41 | lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"}) 42 | warmup_ratio: Optional[float] = field(default=0.01, metadata={"help": "the number of warmup steps"}) 43 | weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"}) 44 | optimizer_type: Optional[str] = field(default="adamw_torch", metadata={"help": "the optimizer type"}) 45 | 46 | per_device_train_batch_size: Optional[int] = field(default=1, metadata={"help": "train batch size per device"}) 47 | per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"}) 48 | gradient_accumulation_steps: Optional[int] = field( 49 | default=4, metadata={"help": "the number of gradient accumulation steps"} 50 | ) 51 | gradient_checkpointing: Optional[bool] = field( 52 | default=True, metadata={"help": "whether to use gradient checkpointing"} 53 | ) 54 | 55 | gradient_checkpointing_use_reentrant: Optional[bool] = field( 56 | default=False, metadata={"help": "whether to use reentrant for gradient checkpointing"} 57 | ) 58 | 59 | bf16: bool = field( 60 | default=False, 61 | metadata={ 62 | "help": ( 63 | "Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA" 64 | " architecture or using CPU (use_cpu) or Ascend NPU. This is an experimental API and it may change." 
65 | ) 66 | }, 67 | ) 68 | fp16: bool = field( 69 | default=False, 70 | metadata={"help": "Whether to use fp16 (mixed) precision instead of 32-bit"}, 71 | ) 72 | 73 | max_prompt_length: Optional[int] = field(default=512, metadata={"help": "the maximum prompt length"}) 74 | max_length: Optional[int] = field(default=1024, metadata={"help": "the maximum sequence length"}) 75 | num_train_epochs: Optional[int] = field(default=5, metadata={"help": "epoch of training steps"}) 76 | logging_strategy: Optional[str] = field(default="steps", metadata={"help": "logging_strategy"}) 77 | logging_dir: Optional[str] = field(default="", metadata={"help": "logging_dir"}) 78 | logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"}) 79 | save_steps: Optional[int] = field(default=100, metadata={"help": "the saving frequency"}) 80 | eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"}) 81 | 82 | output_dir: Optional[str] = field(default="./results", metadata={"help": "the output directory"}) 83 | 84 | # instrumentation 85 | sanity_check: Optional[bool] = field(default=False, metadata={"help": "only train on 1000 samples"}) 86 | report_to: Optional[str] = field( 87 | default="tensorboard", 88 | metadata={ 89 | "help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,' 90 | '`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. ' 91 | 'Use `"all"` to report to all integrations installed, `"none"` for no integrations.' 92 | }, 93 | ) 94 | # debug argument for distributed training 95 | ignore_bias_buffers: Optional[bool] = field( 96 | default=False, 97 | metadata={ 98 | "help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See" 99 | "https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992" 100 | }, 101 | ) 102 | seed: Optional[int] = field( 103 | default=0, metadata={"help": "Random seed that will be set at the beginning of training."} 104 | ) 105 | 106 | 107 | if __name__ == "__main__": 108 | parser = HfArgumentParser(ScriptArguments) 109 | script_args = parser.parse_args_into_dataclasses()[0] 110 | 111 | set_seed(script_args.seed) 112 | 113 | # 1. load a pretrained model 114 | model = AutoModelForCausalLM.from_pretrained( 115 | script_args.base_model_path, 116 | trust_remote_code=True, 117 | ) 118 | model.config.use_cache = False 119 | 120 | if script_args.ignore_bias_buffers: 121 | # torch distributed hack 122 | model._ddp_params_and_buffers_to_ignore = [ 123 | name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool 124 | ] 125 | 126 | tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_path, trust_remote_code=True) 127 | tokenizer.pad_token = tokenizer.eos_token 128 | 129 | tokenizer.add_special_tokens({"bos_token": tokenizer.eos_token}) 130 | tokenizer.bos_token_id = tokenizer.eos_token_id 131 | 132 | # 2. 
Load the Stack-exchange paired dataset 133 | # dpo_dataset = get_stack_exchange_paired(data_dir="data/rl", sanity_check=script_args.sanity_check) 134 | # dpo_dataset = dpo_dataset.filter( 135 | # lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length 136 | # and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length 137 | # ) 138 | 139 | data_path = "/mnt/cephfs-xiongzhuang/wangdongnian/tiny-llm-zh/data/rm_train/rm_data.jsonl" 140 | dpo_dataset = load_dpo_dataset(script_args.dataset_dir_or_path, max_length=script_args.max_length, sanity_check=script_args.sanity_check) 141 | 142 | train_loader = torch.utils.data.DataLoader( 143 | dpo_dataset, 144 | batch_size=2, 145 | pin_memory=False, 146 | drop_last=False, 147 | shuffle=False, 148 | num_workers=8, 149 | ) 150 | for i, item in enumerate(train_loader): 151 | print(item) 152 | break 153 | 154 | eval_dataset = None  # stays None when no eval set is given 155 | # 3. Load evaluation dataset 156 | if script_args.eval_dataset_dir_or_path == "": 157 | evaluation_strategy = "no" 158 | else: 159 | evaluation_strategy = "steps" 160 | eval_dataset = load_dpo_dataset(script_args.eval_dataset_dir_or_path, max_length=script_args.max_length, sanity_check=script_args.sanity_check) 161 | 162 | 163 | # 4. initialize training arguments: 164 | training_args = TrainingArguments( 165 | per_device_train_batch_size=script_args.per_device_train_batch_size, 166 | per_device_eval_batch_size=script_args.per_device_eval_batch_size, 167 | num_train_epochs=script_args.num_train_epochs, 168 | logging_dir=script_args.logging_dir, 169 | logging_strategy=script_args.logging_strategy, 170 | logging_steps=script_args.logging_steps, 171 | save_steps=script_args.save_steps, 172 | gradient_accumulation_steps=script_args.gradient_accumulation_steps, 173 | gradient_checkpointing=script_args.gradient_checkpointing, 174 | learning_rate=script_args.learning_rate, 175 | evaluation_strategy=evaluation_strategy, 176 | eval_steps=script_args.eval_steps, 177 | output_dir=script_args.output_dir, 178 | report_to=script_args.report_to, 179 | lr_scheduler_type=script_args.lr_scheduler_type, 180 | warmup_ratio=script_args.warmup_ratio, 181 | optim=script_args.optimizer_type, 182 | bf16=script_args.bf16, 183 | fp16=script_args.fp16, 184 | remove_unused_columns=False, 185 | run_name=script_args.model_name, 186 | gradient_checkpointing_kwargs=dict(use_reentrant=script_args.gradient_checkpointing_use_reentrant), 187 | seed=script_args.seed, 188 | # project_kwargs={"logging_dir": script_args.output_dir}, 189 | ) 190 | 191 | # peft_config = LoraConfig( 192 | # r=script_args.lora_r, 193 | # lora_alpha=script_args.lora_alpha, 194 | # lora_dropout=script_args.lora_dropout, 195 | # target_modules=[ 196 | # "q_proj", 197 | # "v_proj", 198 | # "k_proj", 199 | # "out_proj", 200 | # "fc_in", 201 | # "fc_out", 202 | # "wte", 203 | # ], 204 | # bias="none", 205 | # task_type="CAUSAL_LM", 206 | # ) 207 | 208 | # 5. initialize the DPO trainer 209 | dpo_trainer = DPOTrainer( 210 | model, 211 | ref_model=None, 212 | args=training_args, 213 | beta=script_args.beta, 214 | train_dataset=dpo_dataset, 215 | eval_dataset=eval_dataset, 216 | tokenizer=tokenizer, 217 | # peft_config=peft_config, 218 | max_prompt_length=script_args.max_prompt_length, 219 | max_length=script_args.max_length, 220 | # data_collator=collator_fn, 221 | ) 222 | 223 | # 6. train 224 | dpo_trainer.train(script_args.resume) 225 | 226 | 227 | # 7.
save 228 | output_dir = os.path.join(script_args.output_dir, "last_dpo_model") 229 | tokenizer.save_pretrained(output_dir) 230 | dpo_trainer.save_model(output_dir) 231 | # dpo_trainer.model.save_pretrained(output_dir) -------------------------------------------------------------------------------- /train/generation_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import torch 3 | import numpy as np 4 | from queue import Queue 5 | from typing import Tuple, List, Union, Iterable 6 | from transformers.utils import logging, add_start_docstrings 7 | from transformers.generation.logits_process import LogitsProcessor, LOGITS_PROCESSOR_INPUTS_DOCSTRING, LogitsProcessorList 8 | 9 | def make_context(model, tokenizer, 10 | messages: List[dict], 11 | system: str = "You are a helpful assistant.", 12 | max_new_tokens: int=0, 13 | ): 14 | # 确定新生成的token数量,优先使用传入参数,否则使用模型配置中的默认值 15 | max_new_tokens = max_new_tokens or model.generation_config.max_new_tokens 16 | # 计算模型允许的最大输入长度(模型最大长度减去新生成的token数) 17 | max_input_length = model.config.max_position_embeddings - max_new_tokens 18 | 19 | nl_tokens = tokenizer.encode("\n", add_special_tokens=False) 20 | 21 | def _parse_messages(messages): 22 | """ 解析消息列表,分离系统消息、查询和对话历史 23 | """ 24 | system, query, history = "", "", [] 25 | ## system 26 | if messages[0]["role"] == "system": 27 | system = messages[0]["content"] 28 | messages = messages[1:] 29 | ## query 30 | ### 确保最后一项是用户消息 31 | assert messages[-1]["role"] == "user" 32 | query = messages[-1]["content"] 33 | messages = messages[:-1] 34 | ## history 35 | assert len(messages) % 2 == 0 36 | for i in range(0, len(messages), 2): 37 | assert messages[i]["role"] == "user" and messages[i+1]["role"] == "assistant" 38 | history.append([messages[i]["content"], messages[i+1]["content"]]) 39 | 40 | return system, query, history 41 | 42 | # 调用_parse_messages解析消息 43 | _system, query, history = _parse_messages(messages) 44 | 45 | ## system 46 | system_text = _system if _system != "" else system 47 | system_tokens = [] 48 | if system_text: 49 | # system_tokens = tokenizer.build_single_message("system", "", system_text.strip()) 50 | system_tokens = tokenizer.encode(text=("<|system|>\n"+system_text.strip()), add_special_tokens=True, truncation=True) + nl_tokens 51 | ## query 52 | # query_tokens = tokenizer.build_single_message("user", "", query.strip()) 53 | query_tokens = tokenizer.encode(text=("<|user|>\n"+query.strip()), add_special_tokens=False, truncation=True) + nl_tokens 54 | ## final assistant 55 | # final_tokens = tokenizer.build_single_message("assistant", "", "") 56 | final_tokens = tokenizer.encode("<|assistant|>", add_special_tokens=False, truncation=True) + nl_tokens 57 | 58 | ## max_history_tokens 59 | max_history_length = max_input_length - len(system_tokens) - len(query_tokens) - len(final_tokens) 60 | 61 | ## history 62 | ## 逆序遍历对话历史,构建token序列 63 | context_tokens = [] 64 | for turn_query, turn_response in reversed(history): 65 | ## query tokens 66 | history_query_tokens = tokenizer.encode("<|user|>\n"+turn_query.strip(), add_special_tokens=False, truncation=True) + nl_tokens 67 | ## answer tokens 68 | histroy_response_tokens = tokenizer.encode("<|assistant|>\n"+turn_response.strip(), add_special_tokens=False, truncation=True) + nl_tokens 69 | ## this round tokens 70 | next_context_tokens = history_query_tokens + histroy_response_tokens 71 | ## concat 72 | ## 确保加入这些token后总长度不超过允许的最大历史长度 73 | current_context_size = len(next_context_tokens) + 
len(context_tokens) 74 | if current_context_size < max_history_length: 75 | context_tokens = next_context_tokens + context_tokens 76 | else: 77 | break 78 | input_tokens = system_tokens + context_tokens + query_tokens + final_tokens 79 | 80 | return torch.LongTensor([input_tokens]).to(model.device) 81 | 82 | def parse_pot_no_stream(inputs): 83 | """ 解析并处理输入字符串中特定格式(形如 <<...>>)的代码片段。 84 | 这些代码片段可以是简单的数学表达式赋值,也可以是定义和调用函数。 85 | 1. 对于包含 "func" 的代码片段,它会识别函数定义,执行该函数, 86 | 并将函数返回的结果替换到原始字符串中的相应位置。 87 | 如果函数涉及到 sympy(一个符号计算库), 88 | 则还会做一些特定的字符串替换处理。 89 | 2. 对于不包含 "func" 的代码片段,它会直接计算等号右边的表达式, 90 | 并将计算结果替换到原始字符串中,同时也会进行一些类型转换 91 | (如将浮点数转为整数)。 92 | """ 93 | try: 94 | # 尝试从输入字符串中找到形如 "<<...>>" 的模式 95 | s = re.findall(r'<<(.*?)>>', inputs, re.DOTALL) 96 | # 如果没有找到匹配项,则直接返回原始输入 97 | if not s: 98 | #print("err inputs: ", origin_inputs, flush=True) 99 | return inputs 100 | 101 | index = 0 102 | # 遍历所有匹配到的模式 103 | for k in s: 104 | try: 105 | # 检查模式内是否包含 "func" 106 | if "func" in k: 107 | # 分割并处理函数定义 108 | var = k.split("=", 1) 109 | try: 110 | # 去除空白字符并执行函数定义 111 | var[1] = var[1].strip(" ") 112 | exec(var[1], globals()) 113 | # 调用函数获取结果 114 | ans = func() 115 | except: 116 | # 特殊处理包含 'sympy' 的情况 117 | if 'sympy' in var[1]: 118 | var[1] = var[1].replace('res[x]', 'res[0][0]').replace('res[y]', 'res[0][1]') 119 | exec(var[1], globals()) 120 | ans = func() 121 | pass 122 | var_list = [c.strip(" ") for c in var[0].split(",")] 123 | # 如果只有一个变量名,则将结果放入列表 124 | if len(var_list) == 1: 125 | ans = [ans] 126 | 127 | # 将结果转换为浮点数或整数形式,并替换到输入字符串中 128 | for i in range(len(ans)): 129 | try: 130 | ans[i] = float(ans[i]) 131 | if abs(ans[i] - int(ans[i])) < 1e-10: 132 | ans[i] = str(int(ans[i])) 133 | except: 134 | pass 135 | 136 | # 替换原字符串中的模式和变量名 137 | inputs = inputs.replace("<<"+k+">>", "") 138 | for i in range(len(var_list)): 139 | inputs = inputs.replace(var_list[i], str(ans[i])) 140 | index += 1 141 | # 更新后续模式中的变量值 142 | for c in range(index, len(s)): 143 | for i in range(len(var_list)): 144 | s[c] = s[c].replace(var_list[i], str(ans[i])) 145 | else: 146 | # 处理非函数的情况,直接计算并替换 147 | var = k.replace(" ", "").split("=") 148 | var[1] = var[1].replace("eval", "") 149 | ans = round(eval(var[1]), 10) 150 | ans = float(ans) 151 | if abs(ans - int(ans)) < 1e-10: 152 | ans = str(int(ans)) 153 | # 替换原字符串中的模式和变量名 154 | inputs = inputs.replace("<<"+k+">>", "").replace(var[0], str(ans)) 155 | index += 1 156 | # 更新后续模式中的变量值 157 | for c in range(index, len(s)): 158 | s[c] = s[c].replace(var[0], str(ans)) 159 | except: 160 | return inputs 161 | except Exception as e: 162 | return inputs 163 | 164 | return inputs 165 | 166 | 167 | class TextIterStreamer: 168 | """ 实现文本的流式处理 169 | 能够逐个或逐段生成和输出文本,而不是一次性输出全部内容 170 | """ 171 | def __init__(self, tokenizer, skip_prompt=False, skip_special_tokens=False, use_pot=True): 172 | self.tokenizer = tokenizer 173 | self.skip_prompt = skip_prompt 174 | self.skip_special_tokens = skip_special_tokens 175 | self.tokens = [] 176 | # 使用队列来缓存生成的文本片段,以便于逐块输出 177 | self.text_queue = Queue() 178 | self.next_tokens_are_prompt = True 179 | # 是否使用特定的后处理技术(例如翻译或优化),默认为True 180 | self.use_pot = use_pot 181 | 182 | def put(self, value): 183 | # 接收并处理生成的token值 184 | if self.skip_prompt and self.next_tokens_are_prompt: 185 | self.next_tokens_are_prompt = False 186 | else: 187 | if len(value.shape) > 1: 188 | value = value[0] 189 | self.tokens.extend(value.tolist()) 190 | tokens_str = self.tokenizer.decode(self.tokens, skip_special_tokens=self.skip_special_tokens, errors='ignore') 191 | if self.use_pot: 192 | 
tokens_str = parse_pot_no_stream(tokens_str) 193 | self.text_queue.put(tokens_str) 194 | 195 | def end(self): 196 | self.text_queue.put(None) 197 | 198 | def __iter__(self): 199 | return self 200 | 201 | def __next__(self): 202 | # 实现迭代器的下一步方法,从队列中获取并返回文本, 203 | # 或在无更多内容时抛出StopIteration异常 204 | value = self.text_queue.get() 205 | if value is None: 206 | raise StopIteration() 207 | else: 208 | return value 209 | 210 | 211 | class OutputRepetitionPenaltyLogitsProcessor(LogitsProcessor): 212 | r""" 213 | [`OutputLogitsProcessor`] that prevents the repetition of previous tokens through a penalty. This penalty is applied at 214 | most once per token. Note that, for decoder-only models like most LLMs, the considered tokens include the prompt. 215 | 216 | In the original [paper](https://arxiv.org/pdf/1909.05858.pdf), the authors suggest the use of a penalty of around 217 | 1.2 to achieve a good balance between truthful generation and lack of repetition. To penalize and reduce 218 | repetition, use `penalty` values above 1.0, where a higher value penalizes more strongly. To reward and encourage 219 | repetition, use `penalty` values between 0.0 and 1.0, where a lower value rewards more strongly. 220 | 221 | Args: 222 | penalty (`float`): 223 | The parameter for repetition penalty. 1.0 means no penalty. Above 1.0 penalizes previously generated 224 | tokens. Between 0.0 and 1.0 rewards previously generated tokens. 225 | """ 226 | 227 | def __init__(self, input_length: int, 228 | presence_penalties: float = 1.0, 229 | frequency_penalties: float = 0, 230 | repetition_penalties: float = 0): 231 | if not (repetition_penalties > 0): 232 | raise ValueError(f"`repetition_penalties` has to be a strictly positive float, but is {repetition_penalties}") 233 | if not ( (frequency_penalties >= -2) and (frequency_penalties <= 2) ): 234 | raise ValueError(f"`frequency_penalties` has to be [-2, 2], but is {frequency_penalties}") 235 | if not ( (presence_penalties >= -2) and (presence_penalties <= 2) ): 236 | raise ValueError(f"`presence_penalties` has to be [-2, 2], but is {presence_penalties}") 237 | 238 | self.repetition_penalties = repetition_penalties 239 | self.frequency_penalties = frequency_penalties 240 | self.presence_penalties = presence_penalties 241 | self.input_length = input_length 242 | 243 | def _get_bin_counts_and_mask( 244 | self, 245 | tokens: torch.Tensor, 246 | vocab_size: int, 247 | num_seqs: int, 248 | ) -> Tuple[torch.Tensor, torch.Tensor]: 249 | # Compute the bin counts for the tokens. 250 | # vocab_size + 1 for padding. 
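# scatter_add_ counts how many times each token id occurs in every sequence;
# the extra final bin exists so that an id equal to vocab_size (a common pad
# value for batched token tensors, e.g. in vLLM's samplers) lands in a scratch
# slot that the [:, :vocab_size] slice throws away, keeping padding unpenalized.
# e.g. tokens = [[5, 7, 7]], vocab_size = 10:
#   bin_counts[0, 5] == 1, bin_counts[0, 7] == 2, everything else 0,
#   and mask is True only at those two positions.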
251 | bin_counts = torch.zeros((num_seqs, vocab_size + 1), 252 | dtype=torch.long, 253 | device=tokens.device) 254 | bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens)) 255 | bin_counts = bin_counts[:, :vocab_size] 256 | mask = bin_counts > 0 257 | 258 | return bin_counts, mask 259 | 260 | @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING) 261 | def __call__(self, input_ids: torch.LongTensor, logits: torch.FloatTensor) -> torch.FloatTensor: 262 | prompt_tokens_tensor = input_ids[:, :self.input_length+1] 263 | output_tokens_tensor = input_ids[:, self.input_length+1:] 264 | 265 | num_seqs, vocab_size = logits.shape 266 | _, prompt_mask = self._get_bin_counts_and_mask( 267 | prompt_tokens_tensor, vocab_size, num_seqs) 268 | output_bin_counts, output_mask = self._get_bin_counts_and_mask( 269 | output_tokens_tensor, vocab_size, num_seqs) 270 | 271 | repetition_penalties = torch.Tensor([self.repetition_penalties]).to(logits.device) 272 | frequency_penalties = torch.Tensor([self.frequency_penalties]).to(logits.device) 273 | presence_penalties = torch.Tensor([self.presence_penalties]).to(logits.device) 274 | 275 | repetition_penalties = repetition_penalties[:, None].repeat(1, vocab_size) 276 | repetition_penalties[~(prompt_mask | output_mask)] = 1.0 277 | logits = torch.where(logits > 0, logits / repetition_penalties, 278 | logits * repetition_penalties) 279 | 280 | # We follow the definition in OpenAI API. 281 | # Refer to https://platform.openai.com/docs/api-reference/parameter-details 282 | logits -= frequency_penalties.unsqueeze_(dim=1) * output_bin_counts 283 | logits -= presence_penalties.unsqueeze_(dim=1) * output_mask 284 | 285 | return logits -------------------------------------------------------------------------------- /vllm/tinyllm.py: -------------------------------------------------------------------------------- 1 | """ Model Architecture 2 | TinyllmForCausalLM( 3 | (model): TinyllmModel( 4 | (embed_tokens): Embedding(64798, 512) 5 | (layers): ModuleList( 6 | (0-7): 8 x TinyllmDecoderLayer( 7 | (self_attn): TinyllmSdpaAttention( 8 | (q_proj): Linear(in_features=512, out_features=512, bias=True) 9 | (k_proj): Linear(in_features=512, out_features=512, bias=True) 10 | (v_proj): Linear(in_features=512, out_features=512, bias=True) 11 | (o_proj): Linear(in_features=512, out_features=512, bias=False) 12 | (rotary_emb): TinyllmRotaryEmbedding() 13 | ) 14 | (mlp): TinyllmMLP( 15 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False) 16 | (up_proj): Linear(in_features=512, out_features=1408, bias=False) 17 | (down_proj): Linear(in_features=1408, out_features=512, bias=False) 18 | (act_fn): SiLU() 19 | ) 20 | (input_layernorm): TinyllmRMSNorm() 21 | (post_attention_layernorm): TinyllmRMSNorm() 22 | ) 23 | ) 24 | (norm): TinyllmRMSNorm() 25 | ) 26 | (lm_head): Linear(in_features=512, out_features=64798, bias=False) 27 | ) 28 | """ 29 | 30 | # tiny llm model vllm implement 31 | 32 | from typing import List, Optional, Tuple 33 | 34 | import torch 35 | from torch import nn 36 | from transformers import LlamaConfig 37 | 38 | from vllm.attention import Attention, AttentionMetadata 39 | from vllm.config import LoRAConfig 40 | from vllm.model_executor.parallel_utils.parallel_state import ( 41 | get_tensor_model_parallel_world_size) 42 | from vllm.model_executor.layers.activation import SiluAndMul 43 | from vllm.model_executor.layers.layernorm import RMSNorm 44 | from vllm.model_executor.layers.linear import (LinearMethodBase, 45 | MergedColumnParallelLinear, 46 | 
QKVParallelLinear, 47 | RowParallelLinear) 48 | from vllm.model_executor.layers.logits_processor import LogitsProcessor 49 | from vllm.model_executor.layers.rotary_embedding import get_rope 50 | from vllm.model_executor.layers.sampler import Sampler 51 | from vllm.model_executor.layers.vocab_parallel_embedding import ( 52 | ParallelLMHead, VocabParallelEmbedding) 53 | from vllm.model_executor.sampling_metadata import SamplingMetadata 54 | from vllm.model_executor.weight_utils import (default_weight_loader, 55 | hf_model_weights_iterator) 56 | from vllm.sequence import SamplerOutput 57 | 58 | class TinyllmMLP(nn.Module): 59 | def __init__( 60 | self, 61 | hidden_size: int, 62 | intermediate_size: int, 63 | hidden_act: str, 64 | linear_method: Optional[LinearMethodBase] = None, 65 | ) -> None: 66 | super().__init__() 67 | self.gate_up_proj = MergedColumnParallelLinear( 68 | hidden_size, [intermediate_size] * 2, 69 | bias=False, 70 | linear_method=linear_method) 71 | self.down_proj = RowParallelLinear(intermediate_size, 72 | hidden_size, 73 | bias=False, 74 | linear_method=linear_method) 75 | if hidden_act != "silu": 76 | raise ValueError(f"Unsupported activation: {hidden_act}. " 77 | "Only silu is supported for now.") 78 | self.act_fn = SiluAndMul() 79 | 80 | def forward(self, x): 81 | gate_up, _ = self.gate_up_proj(x) 82 | x = self.act_fn(gate_up) 83 | x, _ = self.down_proj(x) 84 | return x 85 | 86 | class TinyllmAttention(nn.Module): 87 | 88 | def __init__(self, 89 | hidden_size: int, 90 | num_heads: int, 91 | num_kv_heads: int, 92 | max_position: int = 1024, 93 | rope_theta: float = 10000, 94 | linear_method: Optional[LinearMethodBase] = None, 95 | sliding_window: Optional[int] = None) -> None: 96 | super().__init__() 97 | self.hidden_size = hidden_size 98 | tp_size = get_tensor_model_parallel_world_size() 99 | self.total_num_heads = num_heads 100 | assert self.total_num_heads % tp_size == 0 101 | self.num_heads = self.total_num_heads // tp_size 102 | self.total_num_kv_heads = num_kv_heads 103 | if self.total_num_kv_heads >= tp_size: 104 | # Number of KV heads is greater than TP size, so we partition 105 | # the KV heads across multiple tensor parallel GPUs. 106 | assert self.total_num_kv_heads % tp_size == 0 107 | else: 108 | # Number of KV heads is less than TP size, so we replicate 109 | # the KV heads across multiple tensor parallel GPUs. 
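# Worked example for the config in the module docstring (hidden_size=512,
# 8 attention heads, 8 KV heads, head_dim = 512 // 8 = 64):
#   tp_size = 2 -> num_heads = 4 and num_kv_heads = 4 per rank,
#                  q_size = kv_size = 4 * 64 = 256.
# A GQA/MQA variant with total_num_kv_heads = 1 and tp_size = 2 would take the
# branch documented above and keep the single KV head on every rank:
#   num_kv_heads = max(1, 1 // 2) = 1.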
110 | assert tp_size % self.total_num_kv_heads == 0 111 | self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size) 112 | self.head_dim = hidden_size // self.total_num_heads 113 | self.q_size = self.num_heads * self.head_dim 114 | self.kv_size = self.num_kv_heads * self.head_dim 115 | self.scaling = self.head_dim**-0.5 116 | self.rope_theta = rope_theta 117 | 118 | self.qkv_proj = QKVParallelLinear( 119 | hidden_size, 120 | self.head_dim, 121 | self.total_num_heads, 122 | self.total_num_kv_heads, 123 | bias=True, 124 | linear_method=linear_method, 125 | ) 126 | self.o_proj = RowParallelLinear( 127 | self.total_num_heads * self.head_dim, 128 | hidden_size, 129 | bias=False, 130 | linear_method=linear_method, 131 | ) 132 | 133 | self.rotary_emb = get_rope( 134 | self.head_dim, 135 | rotary_dim=self.head_dim, 136 | max_position=max_position, 137 | base=self.rope_theta, 138 | ) 139 | self.attn = Attention(self.num_heads, 140 | self.head_dim, 141 | self.scaling, 142 | num_kv_heads=self.num_kv_heads, 143 | sliding_window=sliding_window) 144 | 145 | def forward( 146 | self, 147 | positions: torch.Tensor, 148 | hidden_states: torch.Tensor, 149 | kv_cache: torch.Tensor, 150 | attn_metadata: AttentionMetadata, 151 | ) -> torch.Tensor: 152 | qkv, _ = self.qkv_proj(hidden_states) 153 | q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1) 154 | q, k = self.rotary_emb(positions, q, k) 155 | attn_output = self.attn(q, k, v, kv_cache, attn_metadata) 156 | output, _ = self.o_proj(attn_output) 157 | return output 158 | 159 | class TinyllmDecoderLayer(nn.Module): 160 | def __init__( 161 | self, 162 | config: LlamaConfig, 163 | linear_method: Optional[LinearMethodBase] = None, 164 | ) -> None: 165 | super().__init__() 166 | 167 | self.hidden_size = config.hidden_size 168 | rope_theta = getattr(config, "rope_theta", 10000) 169 | max_position_embeddings = getattr(config, "max_position_embeddings", 1024) 170 | sliding_window = getattr(config, "sliding_window", None) 171 | 172 | self.self_attn = TinyllmAttention( 173 | hidden_size=self.hidden_size, 174 | num_heads=config.num_attention_heads, 175 | num_kv_heads=getattr(config, "num_key_value_heads", config.num_attention_heads), 176 | max_position=max_position_embeddings, 177 | rope_theta=rope_theta, 178 | linear_method=linear_method, 179 | sliding_window=sliding_window 180 | ) 181 | 182 | self.mlp = TinyllmMLP( 183 | hidden_size=self.hidden_size, 184 | intermediate_size=config.intermediate_size, 185 | hidden_act=config.hidden_act, 186 | linear_method=linear_method, 187 | ) 188 | 189 | self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) 190 | self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) 191 | 192 | def forward( 193 | self, 194 | positions: torch.Tensor, 195 | hidden_states: torch.Tensor, 196 | kv_cache: torch.Tensor, 197 | attn_metadata: AttentionMetadata, 198 | residual: Optional[torch.Tensor], 199 | ) -> Tuple[torch.Tensor, torch.Tensor]: 200 | # Self Attention 201 | if residual is None: 202 | residual = hidden_states 203 | hidden_states = self.input_layernorm(hidden_states) 204 | else: 205 | hidden_states, residual = self.input_layernorm( 206 | hidden_states, residual) 207 | hidden_states = self.self_attn( 208 | positions=positions, 209 | hidden_states=hidden_states, 210 | kv_cache=kv_cache, 211 | attn_metadata=attn_metadata, 212 | ) 213 | 214 | # Fully Connected 215 | hidden_states, residual = self.post_attention_layernorm( 216 | hidden_states, residual) 217 | hidden_states = 
self.mlp(hidden_states) 218 | 219 | return hidden_states, residual 220 | 221 | class TinyllmModel(nn.Module): 222 | def __init__( 223 | self, 224 | config: LlamaConfig, 225 | linear_method: Optional[LinearMethodBase] = None, 226 | ) -> None: 227 | super().__init__() 228 | 229 | self.config = config 230 | self.padding_idx = config.pad_token_id 231 | self.vocab_size = config.vocab_size 232 | 233 | self.embed_tokens = VocabParallelEmbedding( 234 | config.vocab_size, 235 | config.hidden_size 236 | ) 237 | self.layers = nn.ModuleList([ 238 | TinyllmDecoderLayer(config, linear_method) 239 | for _ in range(config.num_hidden_layers) 240 | ]) 241 | self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) 242 | 243 | def forward( 244 | self, 245 | input_ids: torch.Tensor, 246 | positions: torch.Tensor, 247 | kv_caches: List[torch.Tensor], 248 | attn_metadata: AttentionMetadata, 249 | ) -> torch.Tensor: 250 | hidden_states = self.embed_tokens(input_ids) 251 | residual = None 252 | for i in range(len(self.layers)): 253 | layer = self.layers[i] 254 | hidden_states, residual = layer( 255 | positions, 256 | hidden_states, 257 | kv_caches[i], 258 | attn_metadata, 259 | residual, 260 | ) 261 | hidden_states, _ = self.norm(hidden_states, residual) 262 | return hidden_states 263 | 264 | class TinyllmForCausalLM(nn.Module): 265 | packed_modules_mapping = { 266 | "qkv_proj": [ 267 | "q_proj", 268 | "k_proj", 269 | "v_proj", 270 | ], 271 | "gate_up_proj": [ 272 | "gate_proj", 273 | "up_proj", 274 | ], 275 | } 276 | 277 | # LoRA specific attributes 278 | supported_lora_modules = [ 279 | "qkv_proj", 280 | "o_proj", 281 | "gate_up_proj", 282 | "down_proj", 283 | ] 284 | embedding_modules = {} 285 | embedding_padding_modules = [] 286 | 287 | def __init__( 288 | self, 289 | config: LlamaConfig, 290 | linear_method: Optional[LinearMethodBase] = None, 291 | lora_config: Optional[LoRAConfig] = None, 292 | ) -> None: 293 | del lora_config 294 | super().__init__() 295 | self.config = config 296 | self.linear_method = linear_method 297 | self.model = TinyllmModel(config, linear_method) 298 | 299 | if config.tie_word_embeddings: 300 | self.lm_head_weight = self.model.embed_tokens.weight 301 | else: 302 | self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size) 303 | self.lm_head_weight = self.lm_head.weight 304 | 305 | self.logits_processor = LogitsProcessor(config.vocab_size) 306 | self.sampler = Sampler() 307 | 308 | def forward( 309 | self, 310 | input_ids: torch.Tensor, 311 | positions: torch.Tensor, 312 | kv_caches: List[torch.Tensor], 313 | attn_metadata: AttentionMetadata, 314 | ) -> torch.Tensor: 315 | hidden_states = self.model(input_ids, positions, kv_caches, attn_metadata) 316 | return hidden_states 317 | 318 | def compute_logits(self, hidden_states: torch.Tensor, 319 | sampling_metadata: SamplingMetadata) -> torch.Tensor: 320 | logits = self.logits_processor(self.lm_head_weight, hidden_states, sampling_metadata) 321 | return logits 322 | 323 | def sample( 324 | self, 325 | logits: torch.Tensor, 326 | sampling_metadata: SamplingMetadata, 327 | ) -> Optional[SamplerOutput]: 328 | next_tokens = self.sampler(logits, sampling_metadata) 329 | return next_tokens 330 | 331 | def load_weights(self, 332 | model_name_or_path: str, 333 | cache_dir: Optional[str] = None, 334 | load_format: str = "auto", 335 | revision: Optional[str] = None): 336 | stacked_params_mapping = [ 337 | # (param_name, shard_name, shard_id) 338 | ("qkv_proj", "q_proj", "q"), 339 | ("qkv_proj", "k_proj", "k"), 340 | ("qkv_proj", 
"v_proj", "v"), 341 | ("gate_up_proj", "gate_proj", 0), 342 | ("gate_up_proj", "up_proj", 1), 343 | ] 344 | params_dict = dict(self.named_parameters(remove_duplicate=False)) 345 | for name, loaded_weight in hf_model_weights_iterator( 346 | model_name_or_path, cache_dir, load_format, revision): 347 | if "rotary_emb.inv_freq" in name: 348 | continue 349 | if self.config.tie_word_embeddings and "lm_head.weight" in name: 350 | continue 351 | for (param_name, weight_name, shard_id) in stacked_params_mapping: 352 | if weight_name not in name: 353 | continue 354 | name = name.replace(weight_name, param_name) 355 | # Skip loading extra bias for GPTQ models. 356 | if name.endswith(".bias") and name not in params_dict: 357 | continue 358 | param = params_dict[name] 359 | weight_loader = param.weight_loader 360 | weight_loader(param, loaded_weight, shard_id) 361 | break 362 | else: 363 | # Skip loading extra bias for GPTQ models. 364 | if name.endswith(".bias") and name not in params_dict: 365 | continue 366 | param = params_dict[name] 367 | weight_loader = getattr(param, "weight_loader", 368 | default_weight_loader) 369 | weight_loader(param, loaded_weight) 370 | 371 | -------------------------------------------------------------------------------- /utils/chatglm3_tokenizer/tokenization_chatglm.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import re 4 | from typing import List, Optional, Union, Dict 5 | from sentencepiece import SentencePieceProcessor 6 | from transformers import PreTrainedTokenizer 7 | from transformers.utils import logging, PaddingStrategy 8 | from transformers.tokenization_utils_base import EncodedInput, BatchEncoding 9 | 10 | 11 | logger = logging.get_logger(__name__) 12 | 13 | 14 | class SPTokenizer: 15 | def __init__(self, model_path: str): 16 | # reload tokenizer 17 | assert os.path.isfile(model_path), model_path 18 | self.sp_model = SentencePieceProcessor(model_file=model_path) 19 | 20 | # BOS / EOS token IDs 21 | self.n_words: int = self.sp_model.vocab_size() 22 | self.bos_id: int = self.sp_model.bos_id() 23 | self.eos_id: int = self.sp_model.eos_id() 24 | self.pad_id: int = self.sp_model.unk_id() 25 | # 确保vocab_size与piece数量一致 26 | assert self.sp_model.vocab_size() == self.sp_model.get_piece_size() 27 | 28 | # 定义聊天角色相关的特殊token 29 | role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"] 30 | # 添加额外的通用特殊token 31 | special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"] + role_special_tokens 32 | # 创建特殊token与ID之间的映射关系 33 | self.special_tokens = {} 34 | self.index_special_tokens = {} 35 | for token in special_tokens: 36 | # 分配新的词汇表ID给特殊token 37 | self.special_tokens[token] = self.n_words 38 | self.index_special_tokens[self.n_words] = token 39 | self.n_words += 1 40 | # 生成正则表达式,用于在apply_chat_template方法中查找特殊token 41 | self.role_special_token_expression = "|".join([re.escape(token) for token in special_tokens]) # for apply_chat_template 42 | 43 | def tokenize(self, s: str, encode_special_tokens=False): 44 | """ 对输入字符串进行分词操作,可选择是否编码特殊token 45 | """ 46 | if encode_special_tokens: 47 | # 对特殊字符进行处理 48 | last_index = 0 49 | t = [] 50 | for match in re.finditer(self.role_special_token_expression, s): 51 | # 查找并保留非特殊token部分的分词结果 52 | if last_index < match.start(): 53 | t.extend(self.sp_model.EncodeAsPieces(s[last_index:match.start()])) 54 | # 直接添加特殊token 55 | t.append(s[match.start():match.end()]) 56 | last_index = match.end() 57 | # 处理剩余非特殊token部分 58 | if last_index < len(s): 
59 | t.extend(self.sp_model.EncodeAsPieces(s[last_index:])) 60 | return t 61 | else: 62 | # 当encode_special_tokens为False时,直接调用SentencePiece模型进行分词 63 | return self.sp_model.EncodeAsPieces(s) 64 | 65 | def encode(self, s: str, bos: bool = False, eos: bool = False) -> List[int]: 66 | """ 将字符串转化为ID列表,可选择是否添加BOS/EOS token 67 | """ 68 | assert type(s) is str 69 | t = self.sp_model.encode(s) 70 | if bos: 71 | t = [self.bos_id] + t 72 | if eos: 73 | t = t + [self.eos_id] 74 | return t 75 | 76 | def decode(self, t: List[int]) -> str: 77 | """ 将ID列表解码为字符串 78 | """ 79 | text, buffer = "", [] 80 | for token in t: 81 | # 处理特殊tokenID转字符串 82 | if token in self.index_special_tokens: 83 | if buffer: 84 | text += self.sp_model.decode(buffer) 85 | buffer = [] 86 | text += self.index_special_tokens[token] 87 | else: 88 | buffer.append(token) 89 | # 解码剩余普通tokenID 90 | if buffer: 91 | text += self.sp_model.decode(buffer) 92 | return text 93 | 94 | def decode_tokens(self, tokens: List[str]) -> str: 95 | """ 将分词结果(List[str])解码为字符串 96 | """ 97 | text = self.sp_model.DecodePieces(tokens) 98 | return text 99 | 100 | def convert_token_to_id(self, token): 101 | """ 将给定的token字符串转化为对应的ID 102 | """ 103 | if token in self.special_tokens: 104 | return self.special_tokens[token] 105 | return self.sp_model.PieceToId(token) 106 | 107 | def convert_id_to_token(self, index): 108 | """ 将给定的ID转化为对应的token字符串 109 | """ 110 | # 处理特殊tokenID 111 | if index in self.index_special_tokens: 112 | return self.index_special_tokens[index] 113 | # 处理边界情况和其他特殊ID 114 | if index in [self.eos_id, self.bos_id, self.pad_id] or index < 0 or index > self.sp_model.vocab_size(): 115 | return "" 116 | # 将普通ID转换为token 117 | return self.sp_model.IdToPiece(index) 118 | 119 | 120 | class ChatGLMTokenizer(PreTrainedTokenizer): 121 | # 预训练模型所需的文件名配置,这里指向tokenizer的model文件 122 | vocab_files_names = {"vocab_file": "tokenizer.model"} 123 | # 模型输入的特征名称列表 124 | model_input_names = ["input_ids", "attention_mask", "position_ids"] 125 | 126 | def __init__( 127 | self, 128 | vocab_file, 129 | padding_side="left", 130 | clean_up_tokenization_spaces=False, 131 | encode_special_tokens=False, 132 | **kwargs 133 | ): 134 | # 设置tokenizer的名称 135 | self.name = "GLMTokenizer" 136 | # 存储vocab文件路径 137 | self.vocab_file = vocab_file 138 | # 使用SPTokenizer作为基础分词器 139 | self.tokenizer = SPTokenizer(vocab_file) 140 | # 定义特殊token及其对应的ID 141 | self.special_tokens = { 142 | "": self.tokenizer.bos_id, 143 | "": self.tokenizer.eos_id, 144 | "": self.tokenizer.pad_id, 145 | "": self.tokenizer.pad_id 146 | } 147 | self.encode_special_tokens = encode_special_tokens 148 | 149 | super().__init__( 150 | padding_side=padding_side, 151 | clean_up_tokenization_spaces=clean_up_tokenization_spaces, 152 | **kwargs 153 | ) 154 | 155 | # self.chat_template = "{% for message in messages %}{% if loop.first %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}" 156 | 157 | def get_command(self, token): 158 | """ 获取指定特殊 token 对应的 id 159 | """ 160 | if token in self.special_tokens: 161 | return self.special_tokens[token] 162 | # 如果不在自定义特殊 token 中,则从基础SPTokenizer的特殊 token 中查找 163 | assert token in self.tokenizer.special_tokens, f"{token} is not a special token for {self.name}" 164 | return self.tokenizer.special_tokens[token] 165 | 166 | @property 167 | def unk_token(self) -> str: 168 | """ 通过ID获取未登录词、填充符和结束符的字符串形式 169 | """ 170 | return 
self.tokenizer.sp_model.IdToPiece(self.get_command("")) 171 | 172 | @property 173 | def pad_token(self) -> str: 174 | return self.tokenizer.sp_model.IdToPiece(self.get_command("")) 175 | 176 | @property 177 | def eos_token(self) -> str: 178 | return self.tokenizer.sp_model.IdToPiece(self.get_command("")) 179 | 180 | @property 181 | def unk_token_id(self) -> int: 182 | """ 获取未登录词、填充符和结束符的ID形式 183 | """ 184 | return self.get_command("") 185 | 186 | @property 187 | def pad_token_id(self) -> int: 188 | return self.get_command("") 189 | 190 | @property 191 | def eos_token_id(self): 192 | return self.get_command("") 193 | 194 | @unk_token.setter 195 | def unk_token(self, value): 196 | """ 不支持设置未登录词、填充符和结束符,输出警告信息 197 | """ 198 | logger.warning("Setting unk_token is not supported, use the default one.") 199 | 200 | @pad_token.setter 201 | def pad_token(self, value): 202 | logger.warning("Setting pad_token is not supported, use the default one.") 203 | 204 | @eos_token.setter 205 | def eos_token(self, value): 206 | logger.warning("Setting eos_token is not supported, use the default one.") 207 | 208 | @property 209 | def vocab_size(self): 210 | """ 返回整个词汇表的大小 211 | """ 212 | return self.tokenizer.n_words 213 | 214 | def get_vocab(self): 215 | """ 获取词汇表字典,其中键是token,值是其对应的ID 216 | """ 217 | vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)} 218 | vocab.update(self.added_tokens_encoder) 219 | return vocab 220 | 221 | def _tokenize(self, text, **kwargs): 222 | """ 实现分词功能,利用SPTokenizer进行分词操作 223 | """ 224 | return self.tokenizer.tokenize(text, encode_special_tokens=self.encode_special_tokens) 225 | 226 | def _convert_token_to_id(self, token): 227 | """ 将token字符串转化为ID 228 | """ 229 | return self.tokenizer.convert_token_to_id(token) 230 | 231 | def _convert_id_to_token(self, index): 232 | """ 将ID转化为token字符串 233 | """ 234 | return self.tokenizer.convert_id_to_token(index) 235 | 236 | def convert_tokens_to_string(self, tokens: List[str]) -> str: 237 | """ 将分词结果的tokens列表还原为字符串 238 | """ 239 | return self.tokenizer.decode_tokens(tokens) 240 | 241 | def save_vocabulary(self, save_directory, filename_prefix=None): 242 | """ 将词汇表和特殊令牌token保存到指定目录。 243 | 244 | Args: 245 | save_directory (`str`): 将词汇表和特殊令牌文件保存到指定目录。 246 | filename_prefix (`str`, *optional*): 可选添加到保存文件名前的前缀。 247 | 248 | Returns: 249 | `Tuple(str)`: 保存文件的路径 250 | """ 251 | if os.path.isdir(save_directory): 252 | vocab_file = os.path.join( 253 | save_directory, self.vocab_files_names["vocab_file"] 254 | ) 255 | else: 256 | vocab_file = save_directory 257 | 258 | with open(self.vocab_file, 'rb') as fin: 259 | proto_str = fin.read() 260 | 261 | with open(vocab_file, "wb") as writer: 262 | writer.write(proto_str) 263 | 264 | return (vocab_file,) 265 | 266 | def get_prefix_tokens(self): 267 | """ 获取用于模型输入的前缀 token 268 | """ 269 | prefix_tokens = [self.get_command("[gMASK]"), self.get_command("sop")] 270 | return prefix_tokens 271 | 272 | def build_single_message(self, role, metadata, message): 273 | """ 构建单条消息的 token 序列 274 | """ 275 | assert role in ["system", "user", "assistant", "observation"], role 276 | # 构建角色标识Token序列 277 | role_tokens = [self.get_command(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n") 278 | # 构建消息正文Token序列 279 | message_tokens = self.tokenizer.encode(message) 280 | # 合并角色标识Token与消息正文Token 281 | tokens = role_tokens + message_tokens 282 | return tokens 283 | 284 | def build_chat_input(self, query, history=None, role="user"): 285 | """ 根据对话历史及当前query构建模型输入 286 | """ 287 | if history is None: 288 | 
history = [] 289 | input_ids = [] 290 | # 遍历对话历史 291 | for item in history: 292 | # 获取内容 293 | content = item["content"] 294 | # 若为系统消息且包含工具信息,将其加入内容 295 | if item["role"] == "system" and "tools" in item: 296 | content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False) 297 | # 构建单条历史消息的Token序列并加入到模型输入ID列表 298 | input_ids.extend(self.build_single_message(item["role"], item.get("metadata", ""), content)) 299 | # 构建当前query的Token序列并加入到模型输入ID列表 300 | input_ids.extend(self.build_single_message(role, "", query)) 301 | # 添加表示回复的assistant标记 302 | input_ids.extend([self.get_command("<|assistant|>")]) 303 | # 调用tokenizer批量编码方法,返回PyTorch张量形式的模型输入 304 | return self.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True) 305 | 306 | def build_inputs_with_special_tokens( 307 | self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None 308 | ) -> List[int]: 309 | """ 通过拼接和添加特殊标记,从一个或两个序列构建用于序列分类任务的模型输入。 310 | 311 | BERT序列格式如下: 312 | - 单一序列:`[CLS] X [SEP]` 313 | - 序列对:`[CLS] A [SEP] B [SEP]` 314 | 315 | Args: 316 | token_ids_0 (`List[int]`): 将添加特殊token的IDs列表 317 | token_ids_1 (`List[int]`, *optional*): 可选的第二个序列的IDs列表,用于序列对。 318 | 319 | Returns: 320 | `List[int]`: 包含适当特殊标记的[输入IDs](../glossary#input-ids)列表。 321 | """ 322 | # 获取前缀标记 323 | prefix_tokens = self.get_prefix_tokens() 324 | # 在token_ids_0前添加前缀标记 325 | token_ids_0 = prefix_tokens + token_ids_0 326 | # 若存在token_ids_1,将token_ids_0、token_ids_1连接,并添加结束标记,然后返回 327 | if token_ids_1 is not None: 328 | token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("")] 329 | return token_ids_0 330 | 331 | def _pad( 332 | self, 333 | encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding], 334 | max_length: Optional[int] = None, 335 | padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, 336 | pad_to_multiple_of: Optional[int] = None, 337 | return_attention_mask: Optional[bool] = None, 338 | ) -> dict: 339 | """ 此方法用于对编码后的输入进行填充(左右两侧填充,直至达到预设长度或批次中的最大长度) 340 | 341 | Args: 342 | encoded_inputs: 字典形式的编码后输入,键为特征名称,值为整数列表(例如,`List[int]`),或者一批编码后的输入(例如,`List[List[int]]`)。 343 | max_length: 返回列表的最大长度,也可作为填充长度 344 | padding_strategy: 填充策略,有以下选项: 345 | - PaddingStrategy.LONGEST : 根据批次中最长序列进行填充 346 | - PaddingStrategy.MAX_LENGTH: 默认策略,填充至最大长度 347 | - PaddingStrategy.DO_NOT_PAD: 不进行填充 348 | 本tokenizer的填充方向由self.padding_side属性决定: 349 | - 'left': 在序列左侧填充 350 | - 'right': 在序列右侧填充 351 | pad_to_multiple_of: (可选)若设置,则将序列填充至给定值的倍数。这对于在NVIDIA硬件上启用具有计算能力`>= 7.5`(Volta及以上)的Tensor Core非常有用。 352 | return_attention_mask:(可选)若设置为False,则避免返回注意力掩码(默认:根据模型特性设置 353 | """ 354 | # 从模型默认设置中加载填充侧信息 355 | assert self.padding_side == "left" 356 | 357 | # 获取必要的输入特征,这里假设第一个特征为主要输入特征 358 | required_input = encoded_inputs[self.model_input_names[0]] 359 | seq_length = len(required_input) 360 | 361 | # 如果填充策略为最长序列,则将最大长度设置为当前序列长度 362 | if padding_strategy == PaddingStrategy.LONGEST: 363 | max_length = len(required_input) 364 | 365 | # 计算实际最大长度,确保满足pad_to_multiple_of的要求 366 | if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0): 367 | max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of 368 | 369 | # 判断是否需要填充 370 | needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length 371 | 372 | # 若不存在注意力掩码,则初始化 373 | if "attention_mask" not in encoded_inputs: 374 | encoded_inputs["attention_mask"] = [1] * seq_length 375 | 376 | if "position_ids" not in encoded_inputs: 377 | encoded_inputs["position_ids"] = 
list(range(seq_length)) 378 | 379 | # 若需要填充,则执行填充操作 380 | if needs_to_be_padded: 381 | difference = max_length - len(required_input) 382 | # 对注意力掩码进行填充 383 | if "attention_mask" in encoded_inputs: 384 | encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"] 385 | # 对位置标识进行填充 386 | if "position_ids" in encoded_inputs: 387 | encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"] 388 | # 对主要输入特征进行填充 389 | encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input 390 | 391 | return encoded_inputs 392 | -------------------------------------------------------------------------------- /doc/Trainer参数.md: -------------------------------------------------------------------------------- 1 | # Trainer参数 2 | 3 | 本文主要记录一下**Transformers的Trainer 以及其训练参数**,主要参考的就是官网的文档,版本为4.34.0,这些参数的顺序也是按照官网的顺序来的,简单的参数就直接翻译了一下。 4 | 5 | ## **1.Transformers的Trainer类接受的参数:** 6 | 7 | 1. **`model`**\*\* (****`PreTrainedModel`**** 或 ****`torch.nn.Module`****, 可选) \*\*:要进行训练、评估或预测的实例化后模型,如果不提供,必须传递一个`model_init`来初始化一个模型。 8 | 2. **`args`**\*\* (****`TrainingArguments`****, 可选) \*\*:训练的参数,如果不提供,就会使用默认的`TrainingArguments` 里面的参数,其中 `output_dir` 设置为当前目录中的名为 "tmp\_trainer" 的目录。 9 | 3. **`data_collator`**\*\* (****`DataCollator`****, 可选) \*\*:用于从`train_dataset` 或 `eval_dataset` 中构成batch的函数,如果未提供tokenizer,将默认使用 `default_data_collator()`;如果提供,将使用 `DataCollatorWithPadding` 。 10 | 4. **`train_dataset`**\*\* (****`torch.utils.data.Dataset`**** 或 ****`torch.utils.data.IterableDataset`****, 可选) \*\*:用于训练的数据集,如果是torch.utils.data.Dataset,则会自动删除模型的`forward()` 方法不接受的列。 11 | 5. **`eval_dataset`**\*\* (Union\[torch.utils.data.Dataset, Dict\[str, torch.utils.data.Dataset]), 可选)\*\*:同上,用于评估的数据集,如果是字典,将对每个数据集进行评估,并在指标名称前附加字典的键值。 12 | 6. **`tokenizer`**\*\* (PreTrainedTokenizerBase, 可选)\*\*:用于预处理数据的分词器,如果提供,将在批量输入时自动对输入进行填充到最大长度,并会保存在模型目录下中,为了重新运行中断的训练或重复微调模型时更容易进行操作。 13 | 7. **`model_init`**\*\* (Callable\[\[], PreTrainedModel], 可选)\*\*:用于实例化要使用的模型的函数,如果提供,每次调用 `train()` 时都会从此函数给出的模型的新实例开始。 14 | 8. **`compute_metrics`**\*\* (Callable\[\[EvalPrediction], Dict], 可选)\*\*:用于在评估时计算指标的函数,必须接受 `EvalPrediction` 作为入参,并返回一个字典,其中包含了不同性能指标的名称和相应的数值,一般是准确度、精确度、召回率、F1 分数等。 15 | 9. **`callbacks`**\*\* (TrainerCallback 列表, 可选)\*\*:自定义回调函数,如果要删除使用的默认回调函数,要使用 `Trainer.remove_callback()` 方法。 16 | 10. **`optimizers`**\*\* (Tuple\[torch.optim.Optimizer, torch.optim.lr\_scheduler.LambdaLR], 可选) \*\*:用于指定一个包含优化器和学习率调度器的元组(Tuple),这个元组的两个元素分别是优化器 17 | 18 | (`torch.optim.Optimizer`)和学习率调度器(`torch.optim.lr_scheduler.LambdaLR`),默认会创建一个基于AdamW优化器的实例,并使用 `get_linear_schedule_with_warmup()` 函数创建一个学习率调度器。 19 | 11. **`preprocess_logits_for_metrics`**\*\* (Callable\[\[torch.Tensor, torch.Tensor], torch.Tensor], 可选)\*\*:用于指定一个函数,这个函数在每次评估步骤(evaluation step)前,其实就是在进入compute\_metrics函数前对模型的输出 logits 进行预处理。接受两个张量(tensors)作为参数,一个是模型的输出 logits,另一个是真实标签(labels)。然后返回一个经过预处理后的 logits 张量,给到compute\_metrics函数作为参数。 20 | 21 | ## **2.TrainingArguments的参数** 22 | 23 | 1. **`output_dir`**\*\* (str)\*\*:用于指定模型checkpoint和最终结果的输出目录。 24 | 2. **`overwrite_output_dir`**\*\* (bool, 可选,默认为 False)**:如果设置为True,将**覆盖输出目录\*\*中已存在的内容,在想要继续训练模型并且输出目录指向一个checkpoint目录时还是比较有用的。 25 | 3. **`do_train`**\*\* (bool, 可选,默认为 False)\*\*:是否执行训练,其实Trainer是不直接使用此参数,主要是用于在写脚本时,作为if的条件来判断是否执行接下来的代码。 26 | 4. **`do_eval`**\*\* (bool, 可选)\*\*:是否在验证集上进行评估,如果评估策略(evaluation\_strategy)不是"no",将自动设置为True。与do\_train类似,也不是直接由Trainer使用的,主要是用于我们写训练脚本。 27 | 5. **`do_predict`**\*\* (bool, 可选,默认为 False)\*\*:是否在测试集上进行预测。 28 | 6. 
**`evaluation_strategy `(str, 可选,默认为 "no")**:用于指定训练期间采用的评估策略,可选值包括: 29 | - "no":在训练期间不进行任何评估。 30 | - "steps":每eval\_steps步进行一次评估。 31 | - "epoch":在每个训练周期结束时进行评估。 32 | 7. **`prediction_loss_only `(bool, 可选, 默认为 False)**:如果设置为True,当进行评估和预测时,只返回损失值,而不返回其他评估指标。 33 | 8. **`per_device_train_batch_size`**\*\* (int, 可选, 默认为 8)\*\*:用于指定训练时每个GPU/XPU/TPU/MPS/NPU/CPU上的batch,即每个训练步骤中每个硬件上的样本数量。 34 | 9. **`per_device_eval_batch_size`**\*\* (int, 可选, 默认为 8)\*\*:用于指定评估时每个GPU/XPU/TPU/MPS/NPU/CPU上的batch,即每个评估步骤中每个硬件上的样本数量。 35 | 10. **`gradient_accumulation_steps`**\*\* (int, 可选, 默认为 1)\*\*:用于指定在每次更新模型参数之前,梯度累积的更新步数。梯度累积可以先在多个batch上累积梯度,然后再更新模型参数,从而在显存不够的情况下等效执行大batch的反向传播。 36 | 37 | 假设有4张卡,每张卡的batch size为8,那么一个step的batch size就是32,如果把这个参数设置为4,那么相当于一次参数更新对应的训练样本数量就是128。**好处:显存不够时增大此参数**。 38 | 11. **`eval_accumulation_steps`**\*\* (int, 可选)\*\*:指定在执行评估时,模型会累积多少个预测步骤的输出张量,然后才将它们从GPU/NPU/TPU移动到CPU上,默认是整个评估的输出结果都在GPU/NPU/TPU上累积,然后一次性传输到CPU,速度更快,但占显存。 39 | 12. **`eval_delay`**\*\* (float, 可选)\*\*:指定等待执行第一次评估的轮数或步数。如果evaluation\_strategy为"steps",设置此参数为10,则10个steps后才进行首次评估。 40 | 13. **`learning_rate`**\*\* (float, 可选, 默认为 5e-5)\*\*:指定AdamW优化器的初始学习率。 41 | 14. **`weight_decay`**\*\* (float, 可选, 默认为 0)\*\*:指定权重衰减的值,会应用在 AdamW 优化器的所有层上,除了偏置(bias)和 Layer Normalization 层(LayerNorm)的权重上。 42 | 43 | 简单解释一下,权重衰减是一种正则化手段,通过向损失函数添加一个额外的项来惩罚较大的权重值,有助于防止模型过拟合训练数据。 44 | 15. **`adam_beta1`**\*\* (float, 可选, 默认为 0.9)\*\*:指定AdamW优化器的beta1超参数,详细的解释可以看其论文。 45 | 16. **`adam_beta2`**\*\* (float, 可选, 默认为 0.999)\*\*:指定AdamW优化器的beta2超参数,详细的解释可以看其论文。 46 | 17. **`adam_epsilon`**\*\* (float, 可选, 默认为 1e-8)\*\*:指定AdamW优化器的epsilon超参数,详细的解释可以看其论文。 47 | 18. **`max_grad_norm`**\*\* (float, 可选, 默认为 1.0)\*\*:指定梯度裁剪的最大梯度范数,可以防止梯度爆炸,一般都是1,如果某一步梯度的L2范数超过了此参数,那么梯度将被重新缩放,确保它的大小不超过此参数。 48 | 19. **`num_train_epochs`**\*\* (float, 可选, 默认为 3.0)\*\*:训练的总epochs数。 49 | 20. **`max_steps`**\*\* (int, 可选, 默认为 -1)\*\*:如果设置为正数,就是执行的总训练步数,**会覆盖num\_train\_epochs**。注意对于有限大小的数据集,如果数据提前跑完,会重新迭代数据继续训练,直到达到设定的步数才停止。 50 | 21. **`lr_scheduler_type`**\*\* (str, 可选, 默认为"linear")\*\*:用于指定学习率scheduler的类型,根据训练的进程来自动调整学习率。可选类型包括: 51 | - **"linear"**:线性学习率scheduler,学习率以线性方式改变。 52 | - **"cosine"**:余弦学习率scheduler,学习率以余弦形状的方式改变。 53 | - **"constant"**:常数学习率,学习率在整个训练过程中保持不变。 54 | - **"polynomial"**:多项式学习率scheduler,学习率按多项式函数的方式变化。 55 | - **"cosine\_with\_restarts"**:带重启的余弦scheduler,学习率按余弦方式下降并周期性重启。 56 | - **"constant\_with\_warmup"**:先线性热身到设定学习率,之后保持不变。 57 | 22. **`warmup_ratio`**\*\* (float, 可选, 默认为0.0)\*\*:用于指定线性热身占总训练步骤的比例,线性热身是一种训练策略,学习率在开始阶段从0逐渐增加到其最大值(通常是设定的学习率),然后在随后的训练中保持不变或者按照其他调度策略进行调整。如果设置为0.0,表示没有热身。 58 | 23. **`warmup_steps`**\*\* (int,可选, 默认为0)\*\*:这个是直接指定线性热身的步骤数,这个参数会覆盖warmup\_ratio,如果设置了warmup\_steps,将会忽略warmup\_ratio。 59 | 24. **`log_level`**\*\* (str, 可选, 默认为passive)\*\*:用于指定主进程上要使用的日志级别, 60 | - debug:最详细的日志级别。 61 | - info:用于一般的信息性消息。 62 | - warning:用于警告信息。 63 | - error:用于错误信息。 64 | - critical:用于严重错误信息。 65 | - passive:不设置任何内容,将会使用Transformers库当前的日志级别(默认为"warning")。 66 | 建议训练时使用info级别。 67 | 25. **`log_level_replica`**\*\* (str, 可选, 默认为warning)\*\*:副本上要使用的日志级别,与log\_level相同。 68 | 26. **`log_on_each_node`**\*\* (bool, 可选, 默认为 True)\*\*:在多节点分布式训练中,是否在每个节点上使用log\_level进行日志记录。 69 | 27. **`logging_dir`**\*\* (str, 可选)\*\*:TensorBoard日志目录。默认为output\_dir/runs/CURRENT\_DATETIME\_HOSTNAME。 70 | 28. **`logging_strategy`**\*\* (str, 可选, 默认为"steps")\*\*:训练过程中采用的日志记录策略。可选包括: 71 | - "no":在训练过程中不记录任何日志。 72 | - "epoch":在每个epoch结束时记录日志。 73 | - "steps":根据logging\_steps参数记录日志。 74 | 29. **`logging_steps`**\*\* (int or float,可选, 默认为500)\*\*:如果logging\_strategy="steps",则此参数为每多少步记录一次日志。 75 | 30. 
**`logging_nan_inf_filter`**\*\* (bool, 可选, 默认为 True)\*\*:是否过滤日志记录中为nan和inf的loss,如果设置为True,将过滤每个步骤的loss,如果出现nan或inf,将取当前日志窗口的平均损失值。 76 | 31. **`save_strategy`**\*\* (str , 可选, 默认为 "steps")\*\*:训练过程中保存checkpoint的策略,包括: 77 | - "no":在训练过程中不保存checkpoint。 78 | - "epoch":在每个epoch结束时保存checkpoint。 79 | - "steps":根据save\_steps参数保存checkpoint。 80 | 32. **`save_steps`**\*\* (int or float, 可选, 默认为500)\*\*:如果save\_strategy="steps",就是指两次checkpoint保存之间的更新步骤数。如果是在\[0, 1)的浮点数,则就会当做与总训练步骤数的比例。 81 | 33. **`save_total_limit`**\*\* (int, 可选)\*\*:如果给定了参数,将限制checkpoint的总数,因为checkpoint也是很占硬盘的,将会删除输出目录中旧的checkpoint。当启用load\_best\_model\_at\_end时,会根据metric\_for\_best\_model保留最好的checkpoint,以及最近的checkpoint。 82 | 83 | 举个例子,当`save_total_limit=5`和指定`load_best_model_at_end`时,将始终保留最近的四个checkpoint以及最好的checkpoint;当`save_total_limit=1`和指定`load_best_model_at_end`时,会保存两个checkpoint:最后一个和最好的一个(如果它们不是同一个)。 84 | 34. **`load_best_model_at_end `(bool, 可选, 默认为False)**:用于指定是否在训练结束时加载在训练过程中最好的checkpoint,设置为 True 时,就是帮你找到在验证集上指标最好的checkpoint并且保存,然后还会保存最后一个checkpoint,在普通的多epoch训练中,最好设置为True,但在大模型训练中,一般是一个epoch,使用的就是最后一个checkpoint。 85 | 35. **`save_safetensors`**\*\* (bool, 可选, 默认为False)\*\*:用于指定是否在保存和加载模型参数时使用 "safetensors","safetensors" 就是更好地处理了不同 PyTorch 版本之间的模型参数加载的兼容性问题。 86 | 36. **`save_on_each_node`**\*\* (bool, 可选, 默认为 False)\*\*:在进行多节点分布式训练时,是否在每个节点上保存checkpoint,还是仅在主节点上保存。注意如果多节点使用的是同一套存储设备,比如都是外挂的同一个nas,开启后会报错,因为文件名称都一样。 87 | 37. **`use_cpu`**\*\* (bool, 可选, 默认为 False)\*\*:是否使用CPU训练。如果设置为False,将使用CUDA或其他可用设备。 88 | 38. **`seed`**\*\* (int, 可选, 默认为42)\*\*:用于指定训练过程的随机种子,可以确保训练的可重现性,主要用于model\_init,随机初始化权重参数。 89 | 39. **`data_seed`**\*\* (int, 可选)\*\*:用于指定数据采样的随机种子,如果没有设置将使用与seed相同的种子,可以确保数据采样的可重现性。 90 | 40. **`jit_mode_eval `(bool, 可选, 默认为False)**:用于指定是否在推理(inference)过程中使用 PyTorch 的 JIT(Just-In-Time)跟踪功能,PyTorch JIT 是 PyTorch 的一个功能,用于将模型的前向传播计算编译成高性能的机器代码,会加速模型的推理。 91 | 41. **`use_ipex `(bool, 可选, 默认为 False)**:用于指定是否使用英特尔扩展(Intel extension)来优化 PyTorch,需要安装IPEX,IPEX是一组用于优化深度学习框架的工具和库,可以提高训练和推理的性能,特别针对英特尔的处理器做了优化。 92 | 42. **`bf16 `(bool, 可选, 默认为False)**:用于指定是否使用bf16进行混合精度训练,而不是fp32训练,需要安培架构或者更高的NVIDIA架构。 93 | 94 | 再简单解释一下混合精度训练:模型训练时将模型参数和梯度存储为fp32,但在前向和后向传播计算中使用fp16,这样可以减少内存使用和计算时间,并提高训练速度,这个只是简单的解释,关于混合精度训练,这篇文章讲的比较好 [点这里](https://mp.weixin.qq.com/s%3F__biz%3DMzI4MDYzNzg4Mw%3D%3D%26mid%3D2247550159%26idx%3D5%26sn%3Df5db2afa547970bc429112e32d2e7daf%26chksm%3Debb73c1bdcc0b50d0e85039bd5d8349a23330e3e0f138a7dd2da218a20174d0965837682dd14%26scene%3D27 "点这里")。 95 | 43. **`fp16 `(bool,** 可选, 默认为False):用于指定是否使用fp16进行混合精度训练,而不是fp32训练。 96 | 44. **`fp16_opt_level `(str, 可选, 默认为 'O1')**:对于fp16训练,选择的Apex AMP的优化级别,可选值有 \['O0', 'O1', 'O2', 'O3']。详细信息可以看Apex文档。 97 | 45. **`half_precision_backend`**\*\* (str, 可选, 默认为"auto")\*\*:用于指定混合精度训练(Mixed Precision Training)时要使用的后端,必须是 "auto"、"cuda\_amp"、"apex"、"cpu\_amp" 中的一个。"auto"将根据检测到的PyTorch版本来使用后端,而其他选项将会强制使用请求的后端。使用默认就行。 98 | 46. **`bf16_full_eval`**\*\* (bool, 可选, 默认为 False)\*\*:用于指定是否使用完全的bf16进行评估,而不是fp32。这样更快且省内存,但因为精度的问题指标可能会下降。 99 | 47. **`fp16_full_eval`**\*\* (bool, 可选, 默认为 False)\*\*:同上,不过将使用fp16。 100 | 48. **`tf32`**\*\* (bool, 可选)\*\*:用于指定是否启用tf32精度模式,适用于安培架构或者更高的NVIDIA架构,默认值取决于PyTorch的版本torch.backends.cuda.matmul.allow\_tf32的默认值。 101 | 49. **`local_rank`**\*\* (int, 可选, 默认为 -1)\*\*:用于指定在分布式训练中的当前进程(本地排名)的排名,这个不需要我们设置,使用PyTorch分布式训练时会自动设置,默认为自动设置。 102 | 50. 
**`ddp_backend`**\*\* (str, 可选)\*\*:用于指定处理分布式计算的后端框架,这些框架主要用于让多个计算节点协同工作以加速训练,处理模型参数和梯度的同步、通信等操作,可选值如下: 103 | - **"nccl"**:这是 NVIDIA Collective Communications Library (NCCL) 的后端。 104 | - **"mpi"**:Message Passing Interface (MPI) 后端, 是一种用于不同计算节点之间通信的标准协议。 105 | - **"ccl"**:这是 Intel的oneCCL (oneAPI Collective Communications Library) 的后端。 106 | - **"gloo"**:这是Facebook开发的分布式通信后端。 107 | - **"hccl"**:这是Huawei Collective Communications Library (HCCL) 的后端,用于华为昇腾NPU的系统上进行分布式训练。 108 | 默认会根据系统自动设置,一般是nccl。 109 | 51. **`tpu_num_cores `(int, 可选)**:指定在TPU上训练时,TPU核心的数量。 110 | 52. **`dataloader_drop_last `(bool, 可选, 默认为False)**:用于指定是否丢弃最后一个不完整的batch,发生在数据集的样本数量不是batch\_size的整数倍的时候。 111 | 53. **`eval_steps `(int or float, 可选)**:如果evaluation\_strategy="steps",就是指两次评估之间的更新步数,如果未设置,默认设置为和logging\_steps相同的值,如果是在\[0, 1)的浮点数,则会当做与总训练步骤数的比例。 112 | 54. **`dataloader_num_workers `(int, 可选, 默认为 0)**:用于指定数据加载时的子进程数量(仅用于PyTorch),其实就是PyTorch的num\_workers参数,0表示数据将在主进程中加载。 113 | 55. **`past_index `(int, 可选, 默认为 -1)**:一些模型(如TransformerXL或XLNet)可以利用过去的隐藏状态进行预测,如果将此参数设置为正整数,Trainer将使用相应的输出(通常索引为2)作为过去状态,并将其在下一个训练步骤中作为mems关键字参数提供给模型,只针对一些特定模型。 114 | 56. **`run_name`**\*\* (str, 可选)\*\*:用于指定训练运行(run)的字符串参数,与日志记录工具(例如wandb和mlflow)一起使用,不影响训练过程,就是给其他的日志记录工具开了一个接口,个人还是比较推荐wandb,比较好用。 115 | 57. **`disable_tqdm `(bool, 可选)**:是否禁用Jupyter笔记本中的\~notebook.NotebookTrainingTracker生成的tqdm进度条,如果日志级别设置为warn或更低,则将默认为True,否则为False。 116 | 58. **`remove_unused_columns `(bool, 可选, 默认为True)**:是否自动删除模型在训练时,没有用到的数据列,默认会删除,比如你的数据有两列分别是content和id,如果没有用到id这一列,训练时就会被删除。 117 | 59. **`label_names `(List\[str], 可选)**:用于指定在模型的输入字典中对应于标签(labels)的键,默认情况下不需要显式指定。 118 | 60. **`metric_for_best_model`**\*\* (str, 可选)\*\*:与 load\_best\_model\_at\_end 结合使用,用于指定比较不同模型的度量标准,默认情况下,如果未指定,将使用验证集的 "loss" 作为度量标准,可使用accuracy、F1、loss等。 119 | 61. **`greater_is_better `(bool, 可选)**:与 load\_best\_model\_at\_end 和 metric\_for\_best\_model 结合使用,这个和上面的那个参数是对应的,是指上面的那个指标是越大越好还是越小越好,如果是loss就是越小越好,这个参数就会被设置为False;如果是accuracy,你需要把这个值设为True。 120 | 62. **`ignore_data_skip `(bool, 可选,默认为False)**:用于指定断点续训时是否跳过之前已经训练过的数据。默认(False)会跳过已训练过的batch,使数据加载状态与中断前保持一致;设置为True则不跳过,恢复速度更快,但训练结果与中断前的那次训练不完全一致。 121 | 63. **`resume_from_checkpoint `(str, 可选)**:用于指定从checkpoint恢复训练的路径。 122 | 64. **`sharded_ddp `(bool, str 或 ShardedDDPOption 列表, 可选, 默认为'')**:是否在分布式训练中使用 Sharded DDP(Sharded Data Parallelism),这是由 FairScale提供的,默认不使用,简单解释一下: FairScale 是Meta开发的一个用于高性能和大规模训练的 PyTorch 扩展库。这个库扩展了基本的 PyTorch 功能,同时引入了最新的先进规模化技术,通过可组合的模块和易于使用的API,提供了最新的分布式训练技术。详细的可以看其官网。 123 | 65. **`fsdp `(bool, str 或 FSDPOption 列表, 可选, 默认为'')**:用于指定是否要启用 PyTorch 的 FSDP(Fully Sharded Data Parallel Training),以及如何配置分布式并行训练。 124 | 66. **`fsdp_config `(str 或 dict, 可选)**:用于配置 PyTorch 的 FSDP(Fully Sharded Data Parallel Training)的配置文件。 125 | 67. **`deepspeed `(str 或 dict, 可选)**:用于指定是否要启用 DeepSpeed,以及如何配置 DeepSpeed。也是目前分布式训练使用最多的框架,比上面pytorch原生分布式训练以及FairScale用的范围更广,详细的可以看其官网。 126 | 68. **`label_smoothing_factor`**\*\* (float, 可选,默认为0.0)\*\*:用于指定标签平滑的因子。 127 | 69. **`debug`**\*\* (str 或 DebugOption 列表, 可选, 默认为'')\*\*:用于启用一个或多个调试功能。 128 | 129 | 支持的选项: 130 | - "underflow\_overflow":此选项用于检测模型输入/输出中的溢出。 131 | - "tpu\_metrics\_debug":此选项用于在 TPU 上打印调试指标。 132 | 70. **`optim`**\*\* (str 或 training\_args.OptimizerNames, 可选, 默认为 "adamw\_torch")\*\*:指定要使用的优化器。 133 | 134 | 可选项: 135 | - "adamw\_hf" 136 | - "adamw\_torch" 137 | - "adamw\_torch\_fused" 138 | - "adamw\_apex\_fused" 139 | - "adamw\_anyprecision" 140 | - "adafactor" 141 | 71. **`optim_args`**\*\* (str, 可选)\*\*:用于向特定类型的优化器(如adamw\_anyprecision)提供额外的参数或自定义配置。 142 | 72. 
**`group_by_length`**\*\* (bool, 可选, 默认为 False)\*\*:是否在训练数据集中对大致相同长度的样本进行分组然后放在一个batch里,目的是尽量减少在训练过程中进行的padding,提高训练效率。 143 | 73. **`length_column_name`**\*\* (str, 可选, 默认为 "length")\*\*:当上一个参数设置为True时,可以给训练数据预先计算好并增加一列“长度”,用这一列来加快分组的速度,此参数即该列的列名,默认是length。 144 | 74. **`report_to`**\*\* (str 或 str 列表, 可选, 默认为 "all")\*\*:用于指定要将训练结果和日志上报到哪些日志集成平台,可选项很多,如"azure\_ml", "clearml", "codecarbon", "comet\_ml", "dagshub", "flyte", "mlflow", "neptune", "tensorboard" 和 "wandb"。直接使用默认值即可,会全部上报。 145 | 75. **`ddp_find_unused_parameters`**\*\* (bool, 可选)\*\*:当你使用分布式训练时,这个参数用于控制是否查找并处理那些在计算中没有被使用的参数,如果启用了梯度检查点(gradient checkpointing),表示部分参数是惰性加载的,这时默认值为 False,因为梯度检查点本身已经考虑了未使用的参数,如果没有启用梯度检查点,默认值为 True,表示要查找并处理所有参数,以确保它们的梯度被正确传播。 146 | 76. **`ddp_bucket_cap_mb`**\*\* (int, 可选)\*\*:在分布式训练中,数据通常分成小块进行处理,这些小块称为"桶",这个参数用于指定每个桶的最大内存占用大小,一般自动分配即可。 147 | 77. **`ddp_broadcast_buffers`**\*\* (bool, 可选)\*\*:在分布式训练中,模型的某些部分可能包含缓冲区,如 Batch Normalization 层的统计信息,这个参数用于控制是否将这些缓冲区广播到所有计算设备,以确保模型在不同设备上保持同步,如果启用了梯度检查点,表示不需要广播缓冲区,因为它们不会被使用,如果没有启用梯度检查点,默认值为 True,表示要广播缓冲区,以确保模型的不同部分在所有设备上都一致。 148 | 78. **`gradient_checkpointing`**\*\* (bool, 可选, 默认为False)\*\*:是否开启梯度检查点,简单解释一下:训练大型模型时需要大量的内存,其中在反向传播过程中,需要保存前向传播的中间计算结果以计算梯度,但是这些中间结果占用大量内存,可能会导致内存不足,梯度检查点会在训练期间释放不再需要的中间结果以减小内存占用,但它会使训练变慢。 149 | 79. **`dataloader_pin_memory`**\*\* (bool, 可选, 默认为 True)\*\*:用于指定dataloader加载数据时,是否启用“pin memory”功能。“Pin memory” 用于将数据加载到GPU内存之前,将数据复制到GPU的锁页内存(pinned memory)中,锁页内存是一种特殊的内存,可以更快地传输数据到GPU,从而加速训练过程,但是会占用额外的CPU内存,可能会导致内存不足的问题,如果数据量特别大(百GB以上)建议设为False。 150 | 80. **`skip_memory_metrics`**\*\* (bool, 可选, 默认为 True)\*\*:用于控制是否将内存分析报告添加到性能指标中,默认情况下跳过这一步,以提高训练和评估的速度;如果需要排查显存/内存问题,可以设置为False,以便更清晰地知道每一步的内存使用。 151 | 81. **`include_inputs_for_metrics`**\*\* (bool, 可选, 默认为 False)\*\*:是否将输入传递给 `compute_metrics` 函数,一般计算metrics用的是模型预测的结果和我们提供的标签,但是有的指标需要输入,比如cv的IoU(Intersection over Union)指标。 152 | 82. **`auto_find_batch_size`**\*\* (bool, 可选, 默认为 False)\*\*:是否使用自动寻找适合内存的batch size大小,以避免 CUDA 内存溢出错误,需要安装 `accelerate`(使用 `pip install accelerate`),这个功能还是很实用的。 153 | 83. **`full_determinism`**\*\* (bool, 可选, 默认为 False)\*\*:如果设置为 `True`,将调用 `enable_full_determinism()` 而不是 `set_seed()`,训练过程将启用完全确定性(full determinism),在训练过程中,所有的随机性因素都将被消除,确保每次运行训练过程都会得到相同的结果,注意:会对性能产生负面影响,因此仅在调试时使用。 154 | 84. **`torchdynamo`**\*\* (str, 可选)\*\*:用于选择 TorchDynamo 的后端编译器,TorchDynamo 是 PyTorch 的一个库,用于提高模型性能和部署效率,可选项包括 "eager"、"aot\_eager"、"inductor"、"nvfuser"、"aot\_nvfuser"、"aot\_cudagraphs"、"ofi"、"fx2trt"、"onnxrt" 和 "ipex"。默认就行,自动会选。 155 | 85. **`ray_scope`**\*\* (str, 可选, 默认为 "last")\*\*:用于使用 Ray 进行超参数搜索时,指定要使用的范围,默认情况下,使用 "last",Ray 将使用所有试验的最后一个检查点,比较它们并选择最佳的。详细的可以看一下它的文档。 156 | 86. **`ddp_timeout`**\*\* (int, 可选, 默认为 1800)\*\*:用于 torch.distributed.init\_process\_group 调用的超时时间,在分布式运行中执行较慢操作时,用于避免超时,具体的可以看 [PyTorch 文档](https://link.zhihu.com/?target=https%3A//pytorch.org/docs/stable/distributed.html%23torch.distributed.init_process_group "PyTorch 文档") 。 157 | 87. **`torch_compile`**\*\* (bool, 可选, 默认为 False)\*\*:是否使用 PyTorch 2.0 及以上的 torch.compile 编译模型,具体的可以看 [PyTorch 文档](https://link.zhihu.com/?target=https%3A//pytorch.org/docs/stable/distributed.html%23torch.distributed.init_process_group "PyTorch 文档") 。 158 | 88. **`torch_compile_backend`**\*\* (str, 可选)\*\*:指定在 torch.compile 中使用的后端,如果设置为任何值,将启用 torch\_compile。 159 | 89. **`torch_compile_mode`**\*\* (str, 可选)\*\*:指定在 torch.compile 中使用的模式,如果设置为任何值,将启用 torch\_compile。 160 | 90. **`include_tokens_per_second`**\*\* (bool, 可选)\*\*:确定是否计算每个设备的每秒token数以获取训练速度指标,开启后会在训练开始前先把整个训练数据加载器迭代一遍,因此会稍微减慢整个训练过程,建议打开。 161 | 91. 
**`push_to_hub`**\*\* (bool, 可选, 默认为 False)\*\*:指定是否在每次保存模型时将模型推送到Huggingface Hub。 162 | 92. **`hub_model_id`**\*\* (str, 可选)\*\*:指定要与本地 output\_dir 同步的存储库的名称。 163 | 93. **`hub_strategy`**\*\* (str 或 HubStrategy, 可选, 默认为 "every\_save") \*\*:指定怎么推送到Huggingface Hub。 164 | 94. **`hub_token`**\*\* (str, 可选)\*\*:指定推送模型到Huggingface Hub 的token。 165 | 95. **`hub_private_repo`**\*\* (bool, 可选, 默认为 False)\*\*:如果设置为 True,Huggingface Hub 存储库将设置为私有。 166 | 96. **`hub_always_push`**\*\* (bool, 可选, 默认为 False)\*\*:是否每次都推送模型。 167 | --------------------------------------------------------------------------------
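
下面给出一段最小示意代码,把上文 Trainer参数.md 中提到的若干常用参数串起来(evaluation\_strategy、save\_strategy、save\_total\_limit、load\_best\_model\_at\_end、gradient\_accumulation\_steps、warmup\_ratio、bf16、gradient\_checkpointing、group\_by\_length、report\_to 等)。其中模型路径、数据路径和字段名("text")均为占位假设,并非本仓库 train/ 目录下的实际训练脚本,仅用于说明这些参数如何组合使用:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 以下路径与字段名均为示例假设,请替换为实际工程中的配置
model_path = "path/to/your/model"
data_path = "path/to/your/dataset.jsonl"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # 保证 collator 可以做 padding
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

raw_datasets = load_dataset("json", data_files=data_path)["train"].train_test_split(test_size=0.01)

def tokenize_fn(examples):
    # 假设每条数据有一个 "text" 字段
    return tokenizer(examples["text"], truncation=True, max_length=1024)

tokenized = raw_datasets.map(
    tokenize_fn, batched=True, remove_columns=raw_datasets["train"].column_names
)

args = TrainingArguments(
    output_dir="outputs/tiny_llm_demo",
    evaluation_strategy="steps",     # 按步评估
    eval_steps=500,
    save_strategy="steps",           # 按步保存 checkpoint
    save_steps=500,
    save_total_limit=2,              # 限制 checkpoint 数量,节省磁盘
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,         # loss 越小越好
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # 显存不足时增大此参数
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    bf16=True,                       # 需要安培及以上架构
    gradient_checkpointing=True,     # 用训练速度换显存
    group_by_length=True,            # 按长度分组,减少 padding
    logging_steps=50,
    report_to=["tensorboard"],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()   # 断点续训可改为 trainer.train(resume_from_checkpoint=True)
trainer.save_model()
```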
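
针对第一节中的 compute\_metrics、preprocess\_logits\_for\_metrics 以及上文 include\_inputs\_for\_metrics 的说明,下面是一个计算因果语言模型 token 级准确率的简化示意:先在每个评估步把 logits 压缩成预测 id,避免在显存/内存中堆积完整的 logits,再在 compute\_metrics 中按 -100 掩码计算准确率。函数名与返回的指标名均为示例写法,并非本仓库源码:

```python
import numpy as np
import torch

def preprocess_logits_for_metrics(logits, labels):
    # 评估时只保留每个位置的预测 token id,大幅降低显存/内存占用
    if isinstance(logits, tuple):  # 某些模型会额外返回 past_key_values 等
        logits = logits[0]
    return torch.argmax(logits, dim=-1)

def compute_metrics(eval_pred):
    # eval_pred.predictions 即上面返回的预测 id,eval_pred.label_ids 为标签(numpy 数组)
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    # 因果语言模型:第 i 个位置的 logits 预测第 i+1 个 token,需要错位对齐
    preds = preds[:, :-1]
    labels = labels[:, 1:]
    mask = labels != -100          # -100 为被忽略的位置(padding / prompt 部分)
    correct = (preds == labels) & mask
    accuracy = correct.sum() / mask.sum()
    return {"token_accuracy": float(accuracy)}

# 使用方式(示意):
# trainer = Trainer(..., compute_metrics=compute_metrics,
#                   preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```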
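
另外,针对前文 utils/chatglm3_tokenizer/tokenization_chatglm.py 中的 build\_chat\_input 与 \_pad:下面是一个调用示意,展示 history 的结构(role/content,system 消息可附带 tools)以及左填充的要求(\_pad 中 assert self.padding\_side == "left")。加载路径与 auto\_map 配置为假设写法,实际以仓库脚本为准:

```python
from transformers import AutoTokenizer

# 假设 tokenizer_config.json 已通过 auto_map 指向 tokenization_chatglm.py
tokenizer = AutoTokenizer.from_pretrained(
    "utils/chatglm3_tokenizer", trust_remote_code=True, padding_side="left"
)

history = [
    {"role": "system", "content": "你是一个乐于助人的助手。",
     "tools": [{"name": "get_weather", "description": "查询天气"}]},
    {"role": "user", "content": "北京今天天气怎么样?"},
    {"role": "assistant", "metadata": "", "content": "正在为你查询。"},
]

# build_chat_input 会把历史消息与当前 query 逐条编码,
# 末尾追加 <|assistant|> 标记,返回可直接喂给模型的 PyTorch 张量
inputs = tokenizer.build_chat_input("适合出门跑步吗?", history=history, role="user")
print(inputs["input_ids"].shape)
```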