├── utils
│   ├── chatglm3_tokenizer
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.model
│   │   ├── tokenizer_config.json
│   │   └── tokenization_chatglm.py
│   ├── test_tokenizer.py
│   ├── rl_train_process.py
│   ├── sft_train_process.py
│   ├── rm_train_process.py
│   └── pre_train_process.py
├── tokenizer
│   ├── run_train_spm.sh
│   ├── sp_output
│   │   └── chinese_spm_20000.model
│   ├── tinyllm_tokenizer_hf
│   │   ├── tokenizer.model
│   │   ├── added_tokens.json
│   │   ├── special_tokens_map.json
│   │   └── tokenizer_config.json
│   ├── input_dir
│   │   └── llama2_tokenizer
│   │       ├── tokenizer.model
│   │       ├── special_tokens_map.json
│   │       └── tokenizer_config.json
│   ├── train_chinese_sp.py
│   ├── expend_embedding.py
│   ├── expend_tokenizer.py
│   └── README.md
├── doc
│   ├── image
│   │   ├── ppl_gen.png
│   │   ├── web_demo.png
│   │   └── image_w13y9FgYsi.png
│   ├── datasets_download.md
│   ├── README.md
│   ├── data_process.md
│   └── Trainer参数.md
├── requirements.txt
├── .gitignore
├── demo
│   ├── infer_chat.py
│   ├── infer_func.py
│   └── web_demo.py
├── llama.cpp
│   └── README.md
├── script
│   ├── gptq_demo.sh
│   ├── ptm_demo.sh
│   ├── sft_demo.sh
│   ├── dpo_demo.sh
│   └── rm_demo.sh
├── train
│   ├── configuration_tinyllm.py
│   ├── sft_train.py
│   ├── ptm_train.py
│   ├── rm_train.py
│   ├── dpo_train.py
│   └── generation_utils.py
├── quantize
│   └── gptq_quantize.py
├── vllm
│   ├── README.md
│   └── tinyllm.py
└── README.md
/utils/chatglm3_tokenizer/special_tokens_map.json:
--------------------------------------------------------------------------------
1 | {}
2 |
--------------------------------------------------------------------------------
/tokenizer/run_train_spm.sh:
--------------------------------------------------------------------------------
1 | python train_chinese_sp.py > test_train_chinese_sp.log 2>&1 &
--------------------------------------------------------------------------------
/doc/image/ppl_gen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/ppl_gen.png
--------------------------------------------------------------------------------
/doc/image/web_demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/web_demo.png
--------------------------------------------------------------------------------
/doc/image/image_w13y9FgYsi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/doc/image/image_w13y9FgYsi.png
--------------------------------------------------------------------------------
/utils/chatglm3_tokenizer/tokenizer.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/utils/chatglm3_tokenizer/tokenizer.model
--------------------------------------------------------------------------------
/tokenizer/sp_output/chinese_spm_20000.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/sp_output/chinese_spm_20000.model
--------------------------------------------------------------------------------
/tokenizer/tinyllm_tokenizer_hf/tokenizer.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/tinyllm_tokenizer_hf/tokenizer.model
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | deepspeed==0.10.2
2 | transformers>=4.37.2,<=4.38.2
3 | datasets
4 | accelerate
5 | sentencepiece
6 | streamlit
7 | flask
8 | pandas
9 | numpy
--------------------------------------------------------------------------------
/tokenizer/input_dir/llama2_tokenizer/tokenizer.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wdndev/tiny-llm-zh/HEAD/tokenizer/input_dir/llama2_tokenizer/tokenizer.model
--------------------------------------------------------------------------------
/tokenizer/tinyllm_tokenizer_hf/added_tokens.json:
--------------------------------------------------------------------------------
1 | {
2 | "<|assistant|>": 49955,
3 | "<|im_end|>": 49957,
4 | "<|im_start|>": 49956,
5 | "<|system|>": 49953,
6 | "<|user|>": 49954
7 | }
8 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # output files
2 | __pycache__
3 |
4 | outputs
5 | data
6 | infer_chat_template.py
7 | infer_base.py
8 | cli_demo.py
9 | infer_test.py
10 | web_demo2.py
11 | demo/fast_api_demo.py
12 | demo/resq.py
--------------------------------------------------------------------------------
/doc/datasets_download.md:
--------------------------------------------------------------------------------
1 | ## Tiny LLM Datasets Download
2 |
3 | ### 1.Download Link
4 |
5 | - Link: https://pan.baidu.com/s/1I5LYq3tu-_pFl0zu4wMpRg?pwd=tiny
6 | - Extraction code: tiny
7 |
8 | ### 2.Dataset Contents
9 |
10 | - chatglm3_tokenizer : tokenizer directory
11 | - pre_train : pre-training tokens
12 | - rl_train : preference datasets
13 | - sft_train : fine-tuning datasets
14 | - README.md : detailed dataset description
15 |
16 |
17 |
18 |
--------------------------------------------------------------------------------
/doc/README.md:
--------------------------------------------------------------------------------
1 | # Tiny LLM Documentation
2 |
3 |
4 | ## 1.Tokenizer
5 |
6 | - [Vocabulary Expansion](./自定义构造Tokenizer.md)
7 |
8 | ## 2.Data Processing
9 |
10 | - [Dataset Download](./datasets_download.md)
11 |
12 | ## 3.Pre-training
13 |
14 | ## 4.Supervised Fine-tuning
15 |
16 | ## 5.Human Alignment
17 |
18 | ## 6.Tools
19 |
20 | - [Transformers Trainer Parameters](./Trainer参数.md)
21 | - [Transformers Generate Parameters](./Generate参数与解码策略.md)
22 |
23 |
--------------------------------------------------------------------------------
/tokenizer/tinyllm_tokenizer_hf/special_tokens_map.json:
--------------------------------------------------------------------------------
1 | {
2 | "bos_token": {
3 | "content": "",
4 | "lstrip": false,
5 | "normalized": false,
6 | "rstrip": false,
7 | "single_word": false
8 | },
9 | "eos_token": {
10 | "content": "",
11 | "lstrip": false,
12 | "normalized": false,
13 | "rstrip": false,
14 | "single_word": false
15 | },
16 | "unk_token": {
17 | "content": "",
18 | "lstrip": false,
19 | "normalized": false,
20 | "rstrip": false,
21 | "single_word": false
22 | }
23 | }
24 |
--------------------------------------------------------------------------------
/tokenizer/input_dir/llama2_tokenizer/special_tokens_map.json:
--------------------------------------------------------------------------------
1 | {
2 | "bos_token": {
3 | "content": "",
4 | "lstrip": false,
5 | "normalized": true,
6 | "rstrip": false,
7 | "single_word": false
8 | },
9 | "eos_token": {
10 | "content": "",
11 | "lstrip": false,
12 | "normalized": true,
13 | "rstrip": false,
14 | "single_word": false
15 | },
16 | "pad_token": "",
17 | "unk_token": {
18 | "content": "",
19 | "lstrip": false,
20 | "normalized": true,
21 | "rstrip": false,
22 | "single_word": false
23 | }
24 | }
25 |
--------------------------------------------------------------------------------
/tokenizer/input_dir/llama2_tokenizer/tokenizer_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "add_bos_token": true,
3 | "add_eos_token": false,
4 | "bos_token": {
5 | "__type": "AddedToken",
6 | "content": "",
7 | "lstrip": false,
8 | "normalized": true,
9 | "rstrip": false,
10 | "single_word": false
11 | },
12 | "clean_up_tokenization_spaces": false,
13 | "eos_token": {
14 | "__type": "AddedToken",
15 | "content": "",
16 | "lstrip": false,
17 | "normalized": true,
18 | "rstrip": false,
19 | "single_word": false
20 | },
21 | "legacy": false,
22 | "model_max_length": 1000000000000000019884624838656,
23 | "pad_token": null,
24 | "sp_model_kwargs": {},
25 | "tokenizer_class": "LlamaTokenizer",
26 | "unk_token": {
27 | "__type": "AddedToken",
28 | "content": "",
29 | "lstrip": false,
30 | "normalized": true,
31 | "rstrip": false,
32 | "single_word": false
33 | }
34 | }
35 |
--------------------------------------------------------------------------------
/demo/infer_chat.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModelForCausalLM
2 | from transformers.generation import GenerationConfig
3 |
4 | # model_id = "outputs/ckpt/tiny_llm_sft_92m"
5 | model_id = "wdndev/tiny_llm_sft_92m"
6 | model_id = "outputs/tiny_llm_sft_76m_llama"
7 |
8 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
9 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
10 | generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True)
11 | sys_text = "你是由wdndev开发的个人助手。"
12 | # user_text = "世界上最大的动物是什么?"
13 | # user_text = "介绍一下刘德华。"
14 | user_text = "介绍一下中国。"
15 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
16 | "<|user|>", user_text.strip(),
17 | "<|assistant|>"]).strip() + "\n"
18 |
19 | generation_config.max_new_tokens = 200
20 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
21 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config)
22 | generated_ids = [
23 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
24 | ]
25 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
26 | print(response)
27 |
28 |
29 |
--------------------------------------------------------------------------------
/utils/chatglm3_tokenizer/tokenizer_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "added_tokens_decoder": {
3 | "64790": {
4 | "content": "[gMASK]",
5 | "lstrip": false,
6 | "normalized": true,
7 | "rstrip": false,
8 | "single_word": false,
9 | "special": false
10 | },
11 | "64792": {
12 | "content": "sop",
13 | "lstrip": false,
14 | "normalized": true,
15 | "rstrip": false,
16 | "single_word": false,
17 | "special": false
18 | },
19 | "64795": {
20 | "content": "<|user|>",
21 | "lstrip": false,
22 | "normalized": true,
23 | "rstrip": false,
24 | "single_word": false,
25 | "special": false
26 | },
27 | "64796": {
28 | "content": "<|assistant|>",
29 | "lstrip": false,
30 | "normalized": true,
31 | "rstrip": false,
32 | "single_word": false,
33 | "special": false
34 | }
35 | },
36 | "auto_map": {
37 | "AutoTokenizer": [
38 | "tokenization_chatglm.ChatGLMTokenizer",
39 | null
40 | ]
41 | },
42 | "chat_template": "{% for message in messages %}{% if loop.first %}[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",
43 | "clean_up_tokenization_spaces": false,
44 | "do_lower_case": false,
45 | "eos_token": "",
46 | "model_max_length": 1000000000000000019884624838656,
47 | "pad_token": "",
48 | "padding_side": "left",
49 | "remove_space": false,
50 | "tokenizer_class": "ChatGLMTokenizer",
51 | "unk_token": ""
52 | }
53 |
--------------------------------------------------------------------------------
/llama.cpp/README.md:
--------------------------------------------------------------------------------
1 | # Tiny LLM llama.cpp
2 |
3 | ## 1.Introduction
4 |
5 | The Tiny LLM 92M model is supported by the llama.cpp C++ inference framework. Testing under Linux is recommended; results on Windows are poor.
6 |
7 | The supported llama.cpp is a personally modified fork; repository link: [llama.cpp.tinyllm](https://github.com/wdndev/llama.cpp.tinyllm)
8 |
9 | ### 1.1 llama.cpp
10 |
11 | llama.cpp is a C++ library that simplifies the setup of LLM inference and makes it possible to run models such as Tiny LLM on a local machine. The library is a pure C/C++ implementation with no external dependencies and provides AVX, AVX2 and AVX512 acceleration on x86. It also offers 2-, 3-, 4-, 5-, 6- and 8-bit quantization to speed up inference and reduce memory usage. For models larger than the available VRAM, it supports mixed CPU+GPU inference for partial acceleration. In essence, llama.cpp is used to run models in the GGUF (GPT-Generated Unified Format) format.
12 |
13 | ### 1.2 GGUF
14 |
15 | GGUF is the binary model format that llama.cpp runs. Models in other formats (original Hugging Face checkpoints, exl2, fine-tuned checkpoints from tools such as axolotl or unsloth, etc.) are converted into GGUF before use. The format is optimized to run efficiently on different hardware, with goals such as faster loading and inference, smaller model files, and preserving inference accuracy.
16 |
17 |
18 | ## 2.Usage
19 |
20 | ### 2.1 Preparation
21 |
22 | A Linux system is recommended.
23 |
24 | ```shell
25 | git clone https://github.com/wdndev/llama.cpp.tinyllm
26 | cd llama.cpp.tinyllm
27 | ```
28 |
29 | Then run the make command:
30 |
31 | ```shell
32 | make
33 | ```
34 |
35 | After that you can run GGUF files with `llama.cpp`.
36 |
37 | ### 2.2 Model Conversion
38 |
39 | First, create a GGUF file for the fp16 model as shown below:
40 |
41 | ```shell
42 | python convert-hf-to-gguf.py wdndev/tiny_llm_sft_92m --outfile models/tinyllm/tinyllm-92m-fp16.gguf
43 | ```
44 |
45 | The first argument is the path to the pretrained model (or the name of the HF model), and the second is the path of the GGUF file to be generated; create that directory before running the command.
46 |
47 | Next, quantize the model to a lower bit width as needed. Here is a concrete example of quantizing the model to 4 bits:
48 |
49 | ```shell
50 | ./llama-quantize models/tinyllm/tinyllm-92m-fp16.gguf models/tiny_llm_92m/tinyllm-92m-q4_0.gguf q4_0
51 | ```
52 |
53 | At this point the model has been quantized to 4 bits and stored in a GGUF file, where q4_0 denotes 4-bit quantization. The quantized model can now be run directly with llama.cpp.
54 |
55 | ### 2.3 Inference
56 |
57 | The model can be run with the following command:
58 |
59 | ```shell
60 | ./llama-cli -m ./models/tinyllm/tinyllm-92m-fp16.gguf -p "<|system|>\n你是由wdndev开发的个人助手。\n<|user|>\n请介绍一下北京,你好。\n<|assistant|>\n" -n 128 --repeat-penalty 1.2 --top-p 0.8 --top-k 0
61 | ```
62 |
63 | `-n` is the maximum number of tokens to generate. Other hyperparameters are also available; you can run
64 |
65 | ```shell
66 | ./llama-cli -h
67 | ```
68 | to see them.
69 |
70 |
71 |
--------------------------------------------------------------------------------
/demo/infer_func.py:
--------------------------------------------------------------------------------
1 | import json
2 | from transformers import AutoModelForCausalLM, AutoTokenizer
3 | from transformers.generation.utils import GenerationConfig
4 |
5 | def load_model_tokenizer(model_id: str):
6 | model = AutoModelForCausalLM.from_pretrained(
7 | model_id,
8 | device_map="auto",
9 | trust_remote_code=True
10 | )
11 | tokenizer = AutoTokenizer.from_pretrained(
12 | model_id,
13 | use_fast=False,
14 | trust_remote_code=True
15 | )
16 | generation_config = GenerationConfig.from_pretrained(model_id)
17 | return model, tokenizer, generation_config
18 |
19 | def tinyllm_infer(text: str,
20 | model: AutoModelForCausalLM,
21 | tokenizer: AutoTokenizer,
22 | generation_config: GenerationConfig
23 | ):
24 | sys_text = "你是由wdndev开发的个人助手。"
25 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
26 | "<|user|>", text.strip(),
27 | "<|assistant|>"]).strip() + "\n"
28 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
29 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config)
30 | generated_ids = [
31 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
32 | ]
33 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
34 |
35 | return response
36 |
37 | def main():
38 | model_id = "outputs/ckpt/tiny_llm_sft_92m"
39 | # model_id = "wdndev/tiny_llm_sft_92m"
40 |
41 | model, tokenizer, generation_config = load_model_tokenizer(model_id)
42 | generation_config.max_new_tokens = 200
43 | response = tinyllm_infer("介绍一下中国", model, tokenizer, generation_config)
44 |
45 | print(response)
46 |
47 | if __name__ == "__main__":
48 | main()
--------------------------------------------------------------------------------
/tokenizer/train_chinese_sp.py:
--------------------------------------------------------------------------------
1 | import sentencepiece as spm
2 | import os
3 | import glob
4 |
5 | def train_chinese_spm(input_txt_dir, vocab_size, output_dir="."):
6 | # file name prefix for the saved model
7 | prefix = os.path.join(output_dir, f"test_chinese_spm_{vocab_size}")
8 |
9 | text_filenames = sorted(glob.glob(os.path.join(input_txt_dir, "*.txt")))
10 | print("file list: ", text_filenames)
11 |
12 | # 2) train the sentencepiece model
13 | print("Will now train the vocab...")
14 | spm.SentencePieceTrainer.train(input=text_filenames,
15 | model_prefix=prefix,
16 | model_type="bpe",
17 | vocab_size=vocab_size,
18 | self_test_sample_size=0,
19 | input_format="text",
20 | character_coverage=0.9995,
21 | num_threads=os.cpu_count(),
22 | split_digits=True, # split digits into individual tokens, as done in llama
23 | allow_whitespace_only_pieces=True,
24 | byte_fallback=True,
25 | unk_surface=r" \342\201\207 ",
26 | max_sentence_length=24000)
27 |
28 |
29 | print(f"Trained tokenizer is in {prefix}.model")
30 | print("Done.")
31 |
32 | def test_chinese_spm(spm_model_path):
33 | sp_bpe = spm.SentencePieceProcessor()
34 | sp_bpe.load(spm_model_path)
35 | print('*** BPE ***')
36 | print(sp_bpe.encode_as_pieces('翻译下面的句子为英文:有朋自远方来,不亦乐乎'))
37 | print(len(sp_bpe.encode_as_pieces('翻译下面的句子为英文:有朋自远方来,不亦乐乎')))
38 |
39 |
40 | if __name__ == "__main__":
41 | input_txt_dir = "baike_txt"
42 | vocab_size = 20000
43 | output_dir = "sp_output"
44 | train_chinese_spm(input_txt_dir, vocab_size, output_dir)
45 |
46 | # test_chinese_spm("sp_output/chinese_spm_20000.model")
--------------------------------------------------------------------------------
/script/gptq_demo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 |
5 | # GPTQ parameters
6 | BITS=4 # [4, 8]
7 | GROUP_SIZE=128
8 | DAMP_PERCENT=0.1
9 | DESC_ACT=False
10 | STATIC_GROUPS=False
11 | SYM=True
12 | TRUE_SEQUENTIAL=True
13 |
14 | # training parameters
15 | MAX_LEN=8192
16 | BATCH_SIZE=1
17 | CACHE_ON_GPU=False
18 | USR_TRITON=False
19 |
20 | # basic setting
21 | MODEL_PATH="/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mtai/users/wangdongnian/outputs/ckpt/qwen2_7b_sft_v10_qwen2-20240724-173728/iter_0007484/huggingface_format"
22 | OUTPUT_DIR="outputs"
23 | DATASEY_PATH="quant_data/v10_new_prompt_data.jsonl"
24 | N_GPUS=4
25 | GPU_MAX_MEMORY=20
26 | MODEL_NAME="qwen2_7b_v10_new_prompt"
27 |
28 | OUTPUT_MODEL_PATH=${OUTPUT_DIR}/${MODEL_NAME}_gptq_int${BITS}
29 | mkdir -p $OUTPUT_MODEL_PATH
30 | QUANT_LOG="${OUTPUT_MODEL_PATH}/quantize_$(date "+%Y%m%d%H%M").log"
31 |
32 | GPTQ_ARGS=" \
33 | --bits ${BITS} \
34 | --group_size ${GROUP_SIZE} \
35 | --damp_percent ${DAMP_PERCENT} \
36 | --desc_act ${DESC_ACT} \
37 | --static_groups ${STATIC_GROUPS} \
38 | --sym ${SYM} \
39 | --true_sequential ${TRUE_SEQUENTIAL} \
40 | "
41 |
42 | TRAIN_ARGS=" \
43 | --max_len ${MAX_LEN} \
44 | --batch_size ${BATCH_SIZE} \
45 | --cache_examples_on_gpu ${CACHE_ON_GPU} \
46 | --use_triton ${USR_TRITON} \
47 | "
48 |
49 | SCRIPT_ARGS=" \
50 | --model_id ${MODEL_PATH} \
51 | --dataset_dir_or_path ${DATASEY_PATH} \
52 | --quant_output_dir ${OUTPUT_MODEL_PATH} \
53 | --ngpus ${N_GPUS} \
54 | --gpu_max_memory ${GPU_MAX_MEMORY} \
55 | "
56 |
57 | ALL_ARGS=" $GPTQ_ARGS $TRAIN_ARGS $SCRIPT_ARGS "
58 |
59 | LAUNCHER="python quantize/gptq_quantize.py "
60 |
61 | # Combine all arguments into one command
62 | CMD="$LAUNCHER $ALL_ARGS"
63 |
64 | # Print the command that will be executed for debugging purposes
65 | echo $CMD
66 |
67 | # Execute the quantization process and redirect all output to the log file
68 | nohup $CMD > ${QUANT_LOG} 2>&1 &
69 |
70 | # Notify the user about the location of the log file
71 | echo "Running successfully. The logs are saved in ${QUANT_LOG}"
--------------------------------------------------------------------------------
/train/configuration_tinyllm.py:
--------------------------------------------------------------------------------
1 | from transformers.configuration_utils import PretrainedConfig
2 | from transformers.utils import logging
3 |
4 |
5 | logger = logging.get_logger(__name__)
6 |
7 |
8 | class TinyllmConfig(PretrainedConfig):
9 | """ TinyLLM 配置文件
10 | """
11 |
12 | model_type = "tinyllm"
13 | keys_to_ignore_at_inference = ["past_key_values"]
14 |
15 | def __init__(
16 | self,
17 | vocab_size=64797,
18 | hidden_size=4096,
19 | intermediate_size=11008,
20 | num_hidden_layers=32,
21 | num_attention_heads=32,
22 | num_key_value_heads=None,
23 | hidden_act="silu",
24 | max_position_embeddings=2048,
25 | initializer_range=0.02,
26 | rms_norm_eps=1e-6,
27 | use_cache=True,
28 | pad_token_id=None,
29 | bos_token_id=None,
30 | eos_token_id=None,
31 | tie_word_embeddings=False,
32 | rope_theta=10000.0,
33 | attention_dropout=0.0,
34 | **kwargs
35 | ):
36 | self.vocab_size = vocab_size
37 | self.max_position_embeddings = max_position_embeddings
38 | self.hidden_size = hidden_size
39 | self.intermediate_size = intermediate_size
40 | self.num_hidden_layers = num_hidden_layers
41 | self.num_attention_heads = num_attention_heads
42 |
43 | # for backward compatibility
44 | if num_key_value_heads is None:
45 | num_key_value_heads = num_attention_heads
46 |
47 | self.num_key_value_heads = num_key_value_heads
48 | self.hidden_act = hidden_act
49 | self.initializer_range = initializer_range
50 | self.rms_norm_eps = rms_norm_eps
51 | self.use_cache = use_cache
52 | self.rope_theta = rope_theta
53 | self.attention_dropout = attention_dropout
54 |
55 | super().__init__(
56 | pad_token_id=pad_token_id,
57 | bos_token_id=bos_token_id,
58 | eos_token_id=eos_token_id,
59 | tie_word_embeddings=tie_word_embeddings,
60 | **kwargs
61 | )
62 |
--------------------------------------------------------------------------------
/tokenizer/tinyllm_tokenizer_hf/tokenizer_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "add_bos_token": true,
3 | "add_eos_token": false,
4 | "add_prefix_space": true,
5 | "added_tokens_decoder": {
6 | "0": {
7 | "content": "",
8 | "lstrip": false,
9 | "normalized": false,
10 | "rstrip": false,
11 | "single_word": false,
12 | "special": true
13 | },
14 | "1": {
15 | "content": "",
16 | "lstrip": false,
17 | "normalized": false,
18 | "rstrip": false,
19 | "single_word": false,
20 | "special": true
21 | },
22 | "2": {
23 | "content": "",
24 | "lstrip": false,
25 | "normalized": false,
26 | "rstrip": false,
27 | "single_word": false,
28 | "special": true
29 | },
30 | "49953": {
31 | "content": "<|system|>",
32 | "lstrip": false,
33 | "normalized": true,
34 | "rstrip": false,
35 | "single_word": false,
36 | "special": false
37 | },
38 | "49954": {
39 | "content": "<|user|>",
40 | "lstrip": false,
41 | "normalized": true,
42 | "rstrip": false,
43 | "single_word": false,
44 | "special": false
45 | },
46 | "49955": {
47 | "content": "<|assistant|>",
48 | "lstrip": false,
49 | "normalized": true,
50 | "rstrip": false,
51 | "single_word": false,
52 | "special": false
53 | },
54 | "49956": {
55 | "content": "<|im_start|>",
56 | "lstrip": false,
57 | "normalized": true,
58 | "rstrip": false,
59 | "single_word": false,
60 | "special": false
61 | },
62 | "49957": {
63 | "content": "<|im_end|>",
64 | "lstrip": false,
65 | "normalized": true,
66 | "rstrip": false,
67 | "single_word": false,
68 | "special": false
69 | }
70 | },
71 | "bos_token": "",
72 | "clean_up_tokenization_spaces": false,
73 | "eos_token": "",
74 | "legacy": true,
75 | "model_max_length": 1000000000000000019884624838656,
76 | "pad_token": null,
77 | "sp_model_kwargs": {},
78 | "spaces_between_special_tokens": false,
79 | "tokenizer_class": "LlamaTokenizer",
80 | "unk_token": "",
81 | "use_default_system_prompt": false
82 | }
83 |
--------------------------------------------------------------------------------
/doc/data_process.md:
--------------------------------------------------------------------------------
1 | ## Tiny LLM Data Processing
2 |
3 | All data used in this project comes from open-source datasets, most of them from [Hugging Face](https://huggingface.co/); the detailed dataset list is given below.
4 |
5 | ## Pre-training Data
6 |
7 | The pre-training corpus comes from [Hugging Face](https://huggingface.co/) and mainly consists of the following classic Chinese datasets, roughly 35B tokens in total:
8 |
9 | | Chinese pre-training corpus | Link | Description |
10 | | ----------------- | ------------------------------------------------------------ | ----------------------------------------------- |
11 | | Chinese Wikipedia | [wikipedia](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) | Data from the Chinese Wikipedia |
12 | | BaiduBaiKe | [baidubaike](https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M) | Chinese Baidu Baike data |
13 | | zhihu | [zhihu](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) | Data extracted from Zhihu KOL |
14 | | Web novels | [webnovel](https://huggingface.co/datasets/wdndev/webnovel-chinese) | Personally crawled and cleaned data |
15 | | TigerBot (partial) | [tigerBot](https://huggingface.co/datasets/TigerResearch/pretrain_zh) | Part of the Chinese data used to train TigerBot; the full corpus is too large |
16 | | | | |
17 |
18 | The processing scripts tokenize the corpora above and save the results as binary files (`.bin`) that can be fed directly into training.
19 |
20 | Note: because the data are stored as flat binary files, there is no need to pad every sample to max_seq_len, which keeps the storage as compact as possible. The later SFT instruction data and RLHF datasets are small and do not need to be saved as binary files in advance (a minimal packing sketch is given in the appendix at the end of this document).
21 |
22 |
23 | ## Fine-tuning Data
24 |
25 | The SFT instruction-tuning corpus also comes from [Hugging Face](https://huggingface.co/) and mainly consists of the following classic SFT datasets, about 4 million samples in total:
26 |
27 | | SFT data | Link | Description |
28 | | ----------- | ------------------------------------------------------------ | ------------------------------------------ |
29 | | Belle | [Belle](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | About 2 million Chinese instruction samples generated by the BELLE project |
30 | | Firefly | [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | SFT dataset of the Firefly open-source model |
31 | | TigerBot | [tigerBot](https://huggingface.co/datasets/TigerResearch/sft_zh) | TigerBot SFT dataset |
32 | | | | |
33 |
34 |
35 |
36 |
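37 | ## Appendix: Pre-training Data Packing Sketch
38 |
39 | The snippet below is a minimal, hypothetical sketch of how tokenized pre-training text can be packed into a flat `.bin` file of token ids. It is not the project's `pre_train_process.py`; the tokenizer is loaded the same way as in `utils/test_tokenizer.py`, and `uint16` is assumed only because the 64798-entry ChatGLM3 vocabulary fits in 16 bits.
40 |
41 | ```python
42 | import numpy as np
43 | from chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
44 |
45 | tokenizer = ChatGLMTokenizer(vocab_file="chatglm3_tokenizer/tokenizer.model")
46 |
47 | def append_texts_to_bin(texts, bin_path):
48 |     """Tokenize each document and append the ids to a flat binary file."""
49 |     ids = []
50 |     for text in texts:
51 |         ids.extend(tokenizer.encode(text, add_special_tokens=False))
52 |         ids.append(tokenizer.eos_token_id)  # document separator
53 |     arr = np.array(ids, dtype=np.uint16)    # 64798 ids fit in uint16 (assumption)
54 |     with open(bin_path, "ab") as f:
55 |         arr.tofile(f)
56 |
57 | # At training time the file can be memory-mapped and sliced into
58 | # max_seq_len windows without any padding, e.g.:
59 | #   data = np.memmap("data/pre_train/pretrain_data.bin", dtype=np.uint16, mode="r")
60 | ```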
--------------------------------------------------------------------------------
/utils/test_tokenizer.py:
--------------------------------------------------------------------------------
1 | from chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
2 | import json
3 | import torch
4 |
5 | def process_func(prompt_txt:str, user_txt:str, assistant_txt:str, max_length=512):
6 | input_ids, labels = [], []
7 | prompt = [tokenizer.get_command("<|system|>")] + tokenizer.encode(prompt_txt + "\n", add_special_tokens=False)
8 | instruction_ = [tokenizer.get_command("<|user|>")] + tokenizer.encode(user_txt.strip() + "\n", add_special_tokens=False,max_length=max_length) + [tokenizer.get_command("<|assistant|>")]
9 | instruction = prompt + instruction_
10 | response = tokenizer.encode(assistant_txt.strip(), add_special_tokens=False)
11 | input_ids = instruction + response + [tokenizer.eos_token_id]
12 | labels = [tokenizer.pad_token_id] * len(instruction) + response + [tokenizer.eos_token_id]
13 | pad_len = max_length - len(input_ids)
14 | # print()
15 | input_ids += [tokenizer.pad_token_id] * pad_len
16 | labels += [tokenizer.pad_token_id] * pad_len
17 | labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
18 |
19 | input_ids = torch.LongTensor(input_ids)
20 | labels = torch.LongTensor(labels)
21 | attention_mask = input_ids.ne(tokenizer.pad_token_id)
22 |
23 | return {
24 | "input_ids": input_ids,
25 | "labels": labels,
26 | "attention_mask": attention_mask,
27 | }
28 |
29 | if __name__=="__main__":
30 | tokenizer = ChatGLMTokenizer(vocab_file='chatglm3_tokenizer/tokenizer.model')
31 |
32 | sys_text = "你是由wdndev开发的个人助手。"
33 | user_text = "介绍一下中国。"
34 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
35 | "<|user|>", user_text.strip(),
36 | "<|assistant|>"]).strip() + "\n"
37 |
38 | model_inputs = tokenizer([input_txt], return_tensors="pt")
39 |
40 | print(tokenizer.batch_decode(model_inputs["input_ids"]))
41 |
42 | messages = [
43 | {"role": "system", "content": "你是由wdndev开发的个人助手。"},
44 | {"role": "system", "content": "介绍一下中国。"}
45 | ]
46 | # print(tokenizer.chat_template)
47 |
48 | text = tokenizer.apply_chat_template(
49 | messages,
50 | tokenize=False,
51 | add_generation_prompt=True
52 | )
53 | model_inputs = tokenizer([text], return_tensors="pt")
54 | print(tokenizer.batch_decode(model_inputs["input_ids"]))
55 |
56 |
57 |
--------------------------------------------------------------------------------
/tokenizer/expend_embedding.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoModelForCausalLM, AutoTokenizer
2 | import torch
3 |
4 | model_id = "outputs/ckpt/tiny_llm_sft_92m"
5 |
6 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
7 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
8 |
9 | new_tokenizer = AutoTokenizer.from_pretrained("tokenizer/tinyllm_tokenizer_hf")
10 | print(len(new_tokenizer)) # 49958
11 | print(model)
12 | """
13 | TinyllmForCausalLM(
14 | (model): TinyllmModel(
15 | (embed_tokens): Embedding(64798, 512)
16 | (layers): ModuleList(
17 | (0-7): 8 x TinyllmDecoderLayer(
18 | (self_attn): TinyllmSdpaAttention(
19 | (q_proj): Linear(in_features=512, out_features=512, bias=True)
20 | (k_proj): Linear(in_features=512, out_features=512, bias=True)
21 | (v_proj): Linear(in_features=512, out_features=512, bias=True)
22 | (o_proj): Linear(in_features=512, out_features=512, bias=False)
23 | (rotary_emb): TinyllmRotaryEmbedding()
24 | )
25 | (mlp): TinyllmMLP(
26 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False)
27 | (up_proj): Linear(in_features=512, out_features=1408, bias=False)
28 | (down_proj): Linear(in_features=1408, out_features=512, bias=False)
29 | (act_fn): SiLU()
30 | )
31 | (input_layernorm): TinyllmRMSNorm()
32 | (post_attention_layernorm): TinyllmRMSNorm()
33 | )
34 | )
35 | (norm): TinyllmRMSNorm()
36 | )
37 | (lm_head): Linear(in_features=512, out_features=64798, bias=False)
38 | )
39 | """
40 |
41 | embeddings = model.get_input_embeddings()
42 | model.resize_token_embeddings(49958)
43 | model.config.vocab_size = 49958
44 |
45 | print(model)
46 | """
47 | TinyllmForCausalLM(
48 | (model): TinyllmModel(
49 | (embed_tokens): Embedding(49958, 512)
50 | (layers): ModuleList(
51 | (0-7): 8 x TinyllmDecoderLayer(
52 | (self_attn): TinyllmSdpaAttention(
53 | (q_proj): Linear(in_features=512, out_features=512, bias=True)
54 | (k_proj): Linear(in_features=512, out_features=512, bias=True)
55 | (v_proj): Linear(in_features=512, out_features=512, bias=True)
56 | (o_proj): Linear(in_features=512, out_features=512, bias=False)
57 | (rotary_emb): TinyllmRotaryEmbedding()
58 | )
59 | (mlp): TinyllmMLP(
60 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False)
61 | (up_proj): Linear(in_features=512, out_features=1408, bias=False)
62 | (down_proj): Linear(in_features=1408, out_features=512, bias=False)
63 | (act_fn): SiLU()
64 | )
65 | (input_layernorm): TinyllmRMSNorm()
66 | (post_attention_layernorm): TinyllmRMSNorm()
67 | )
68 | )
69 | (norm): TinyllmRMSNorm()
70 | )
71 | (lm_head): Linear(in_features=512, out_features=49958, bias=False)
72 | )
73 | """
74 |
75 | output_dir = "outputs/sft_92m_llama"
76 |
77 | model.save_pretrained(output_dir)
78 | new_tokenizer.save_pretrained(output_dir)
79 |
--------------------------------------------------------------------------------
/demo/web_demo.py:
--------------------------------------------------------------------------------
1 | import json
2 | import streamlit as st
3 | from transformers import AutoModelForCausalLM, AutoTokenizer
4 | from transformers.generation.utils import GenerationConfig
5 |
6 |
7 | st.set_page_config(page_title="Tiny LLM 92M Demo")
8 | st.title("Tiny LLM 92M Demo")
9 |
10 | # model_id = "outputs/ckpt/tiny_llm_sft_92m"
11 | model_id = "wdndev/tiny_llm_sft_92m"
12 |
13 | @st.cache_resource
14 | def load_model_tokenizer():
15 | model = AutoModelForCausalLM.from_pretrained(
16 | model_id,
17 | device_map="auto",
18 | trust_remote_code=True
19 | )
20 | tokenizer = AutoTokenizer.from_pretrained(
21 | model_id,
22 | use_fast=False,
23 | trust_remote_code=True
24 | )
25 | generation_config = GenerationConfig.from_pretrained(model_id)
26 | return model, tokenizer, generation_config
27 |
28 |
29 | def clear_chat_messages():
30 | del st.session_state.messages
31 |
32 |
33 | def init_chat_messages():
34 | with st.chat_message("assistant", avatar='🤖'):
35 | st.markdown("您好,我是由wdndev开发的个人助手,很高兴为您服务😄")
36 |
37 | if "messages" in st.session_state:
38 | for message in st.session_state.messages:
39 | avatar = "🧑💻" if message["role"] == "user" else "🤖"
40 | with st.chat_message(message["role"], avatar=avatar):
41 | st.markdown(message["content"])
42 | else:
43 | st.session_state.messages = []
44 |
45 | return st.session_state.messages
46 |
47 |
48 | max_new_tokens = st.sidebar.slider("max_new_tokens", 0, 1024, 512, step=1)
49 | top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
50 | top_k = st.sidebar.slider("top_k", 0, 100, 0, step=1)
51 | temperature = st.sidebar.slider("temperature", 0.0, 2.0, 1.0, step=0.01)
52 | do_sample = st.sidebar.checkbox("do_sample", value=True)
53 |
54 | def main():
55 | model, tokenizer, generation_config = load_model_tokenizer()
56 | messages = init_chat_messages()
57 |
58 | if prompt := st.chat_input("Shift + Enter 换行, Enter 发送"):
59 | with st.chat_message("user", avatar='🧑💻'):
60 | st.markdown(prompt)
61 | with st.chat_message("assistant", avatar='🤖'):
62 | placeholder = st.empty()
63 |
64 | generation_config.max_new_tokens = max_new_tokens
65 | generation_config.top_p = top_p
66 | generation_config.top_k = top_k
67 | generation_config.temperature = temperature
68 | generation_config.do_sample = do_sample
69 | print("generation_config: ", generation_config)
70 |
71 | sys_text = "你是由wdndev开发的个人助手。"
72 | messages.append({"role": "user", "content": prompt})
73 | user_text = prompt
74 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
75 | "<|user|>", user_text.strip(),
76 | "<|assistant|>"]).strip() + "\n"
77 |
78 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
79 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config)
80 | generated_ids = [
81 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
82 | ]
83 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
84 | placeholder.markdown(response)
85 |
86 | messages.append({"role": "assistant", "content": response})
87 | print("messages: ", json.dumps(response, ensure_ascii=False), flush=True)
88 |
89 | st.button("清空对话", on_click=clear_chat_messages)
90 |
91 |
92 | if __name__ == "__main__":
93 | main()
--------------------------------------------------------------------------------
/utils/rl_train_process.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import glob
4 | import numpy as np
5 | from tqdm import tqdm
6 | import pandas as pd
7 | import csv
8 |
9 |
10 | def merge_datasets(input_dir):
11 | total_lines = []
12 | for subdir, dirs, files in os.walk(input_dir):
13 | for idx, file in enumerate(files):
14 | # only process .jsonl files
15 | if file.endswith('.jsonl'):
16 | # absolute path of the current file
17 | file_path = os.path.join(subdir, file)
18 | print(file_path)
19 | # read the jsonl file
20 | with open(file_path, 'r', encoding='utf-8') as infile:
21 | lines = infile.readlines()
22 |
23 | for line in tqdm(lines):
24 | json_obj = json.loads(line) # parse the json string into a python object
25 |
26 | prompt_text = json_obj["prompt"]
27 | chosen_text = json_obj["pos_resp"]
28 | rejected_text = json_obj["neg_resp"]
29 |
30 | data_dict = {
31 | "prompt": prompt_text,
32 | "chosen": chosen_text,
33 | "rejected": rejected_text
34 | }
35 |
36 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
37 | total_lines.append(processed_line)
38 |
39 | if file.endswith('.parquet'):
40 | # absolute path of the current file
41 | file_path = os.path.join(subdir, file)
42 | print(file_path)
43 | # read the parquet file
44 | df = pd.read_parquet(file_path)
45 |
46 | for idx, row in tqdm(df.iterrows(), total=len(df)):
47 | prompt_text = row['prompt']
48 | chosen_text = row['chosen']
49 | rejected_text = row['rejected']
50 |
51 | data_dict = {
52 | "prompt": prompt_text,
53 | "chosen": chosen_text,
54 | "rejected": rejected_text
55 | }
56 |
57 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
58 | total_lines.append(processed_line)
59 |
60 | if file.endswith('.tsv'):
61 | # absolute path of the current file
62 | file_path = os.path.join(subdir, file)
63 | print(file_path)
64 | # read the tsv file
65 | df = pd.read_csv(file_path, sep='\t')
66 |
67 | for idx, row in tqdm(df.iterrows(), total=len(df)):
68 | prompt_text = row['prompt']
69 | chosen_text = row['chosen']
70 | rejected_text = row['rejected']
71 |
72 | data_dict = {
73 | "prompt": prompt_text,
74 | "chosen": chosen_text,
75 | "rejected": rejected_text
76 | }
77 |
78 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
79 | total_lines.append(processed_line)
80 |
81 | # create the output subfolder if it does not exist
82 | output_subfolder = "data/rl_train"
83 | if not os.path.exists(output_subfolder):
84 | os.makedirs(output_subfolder)
85 |
86 | # save the processed data to the output subfolder
87 | output_file_path = os.path.join(output_subfolder, "rl_data.jsonl")
88 | # write the processed json objects to a new jsonl file
89 | with open(output_file_path, 'w') as outfile:
90 | for line in total_lines:
91 | outfile.write(line)
92 |
93 |
94 | if __name__=="__main__":
95 | merge_datsets("corpus/rm_train")
96 |
97 |
--------------------------------------------------------------------------------
/tokenizer/expend_tokenizer.py:
--------------------------------------------------------------------------------
1 |
2 | import os
3 | # set environment variable to use the pure-Python protobuf implementation
4 | os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python"
5 | from transformers import LlamaTokenizer
6 | from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
7 | import sentencepiece as spm
8 | import argparse
9 | import json
10 |
11 | def merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir="tinyllm_tokenizer_hf"):
12 | # load the LLaMA tokenizer
13 | llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
14 | # Chinese sentencepiece model
15 | chinese_sp_model = spm.SentencePieceProcessor()
16 | chinese_sp_model.Load(chinese_sp_model_file)
17 |
18 | # load the LLaMA tokenizer as a protobuf model object
19 | llama_spm = sp_pb2_model.ModelProto()
20 | llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
21 | # load the Chinese model as a protobuf model object
22 | chinese_spm = sp_pb2_model.ModelProto()
23 | chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())
24 |
25 | # print basic information
26 | print("llama token nums: ", len(llama_tokenizer))
27 | print("chinese sp nums: ", len(chinese_sp_model))
28 |
29 | # add the Chinese tokens to the LLaMA tokenizer
30 | ## 1. build a set of all LLaMA tokens to speed up lookups
31 | llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)
32 | ## 2. iterate over the Chinese tokens and add any that are not already in the LLaMA set
33 | for p in chinese_spm.pieces:
34 | piece = p.piece
35 | if piece not in llama_spm_tokens_set:
36 | # create a new SentencePiece entry
37 | new_p = sp_pb2_model.ModelProto().SentencePiece()
38 | new_p.piece = piece # token content
39 | new_p.score = 0 # default score
40 | llama_spm.pieces.append(new_p) # append to the LLaMA model pieces
41 |
42 |
43 | # save the merged model
44 | output_sp_dir = 'tmp_tinyllm_tokenizer_sp' # directory for the merged sentencepiece model
45 | os.makedirs(output_sp_dir, exist_ok=True) # make sure the directory exists
46 | # write the sentencepiece model to a file
47 | with open(output_sp_dir + '/tokenizer.model', 'wb') as f:
48 | f.write(llama_spm.SerializeToString())
49 |
50 | # initialize a LlamaTokenizer from the new vocab file and save it in Hugging Face format
51 | tokenizer = LlamaTokenizer(vocab_file = output_sp_dir + '/tokenizer.model', legacy=True)
52 | ## add custom special tokens
53 | custom_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|im_start|>", "<|im_end|>"]
54 | for token in custom_special_tokens:
55 | tokenizer.add_tokens(token)
56 |
57 | # vocab_dict = tokenizer.get_vocab()
58 | # with open('vocab_utf8.txt', 'w', encoding='utf-8') as f:
59 | # json.dump(vocab_dict, f, indent=4)
60 | tokenizer.save_pretrained(output_hf_dir)
61 | print(f"tinyllm token num: {len(tokenizer)}")
62 | print(f"Tiny LLM tokenizer has been saved to {output_hf_dir}")
63 |
64 | def test_tokenizer(hf_tokenizer_dir):
65 | tinyllm_tokenizer = LlamaTokenizer.from_pretrained(hf_tokenizer_dir)
66 | print("tinyllm tokenizer nums: ", len(tinyllm_tokenizer))
67 |
68 | sys_text = "你是由wdndev开发的个人助手。"
69 | user_text = "翻译下面的句子为英文:有朋自远方来,不亦乐乎"
70 | answer_text = "It is always a pleasure to greet a friend from afar."
71 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
72 | "<|user|>", user_text.strip(),
73 | "<|assistant|>"]).strip() + "\n" + answer_text.strip()
74 |
75 | print("-----input text: \n", input_txt)
76 |
77 | encode_ids = tinyllm_tokenizer.encode(input_txt, add_special_tokens=False)
78 | print("-----encode ids: \n", encode_ids)
79 |
80 | decode_ids = tinyllm_tokenizer.decode(encode_ids)
81 | print("-----dencode ids: \n", decode_ids)
82 |
83 |
84 | if __name__ == "__main__":
85 |
86 | llama_tokenizer_dir = "input_dir/llama2_tokenizer"
87 | chinese_sp_model_file = "sp_output/chinese_spm_20000.model"
88 | output_hf_dir = "tinyllm_tokenizer_hf"
89 |
90 | # merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir)
91 |
92 | test_tokenizer(output_hf_dir)
--------------------------------------------------------------------------------
/utils/sft_train_process.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import glob
4 | import numpy as np
5 | from tqdm import tqdm
6 | import pandas as pd
7 | import csv
8 | import json
9 |
10 | #from zhconv import convert
11 |
12 | def process_bell_2m(file_path):
13 | """ https://huggingface.co/datasets/BelleGroup/train_2M_CN
14 | """
15 |
16 | total_lines = []
17 | with open(file_path, 'r', encoding='utf-8') as infile:
18 | lines = infile.readlines()
19 | for line in tqdm(lines):
20 | json_obj = json.loads(line) # parse the json string into a python object
21 |
22 | instruction = json_obj["instruction"]
23 | input_str = json_obj["input"]
24 | answer = json_obj["output"]
25 |
26 | question = instruction + input_str
27 |
28 | data_dict = {
29 | "question": question,
30 | "answer": answer
31 | }
32 |
33 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
34 | total_lines.append(processed_line)
35 |
36 | return total_lines
37 |
38 | def process_nlp(file_path):
39 | """ https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
40 | """
41 | total_lines = []
42 | with open(file_path, 'r', encoding='utf-8') as infile:
43 | lines = infile.readlines()
44 | for line in tqdm(lines):
45 | json_obj = json.loads(line) # parse the json string into a python object
46 |
47 | # instruction = json_obj["instruction"]
48 | question = json_obj["input"]
49 | answer = json_obj["target"]
50 |
51 | data_dict = {
52 | "question": question,
53 | "answer": answer
54 | }
55 |
56 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
57 | total_lines.append(processed_line)
58 |
59 | return total_lines
60 |
61 | def process_tigerbot_sft(input_dir):
62 | """ https://huggingface.co/datasets/TigerResearch/sft_zh
63 | """
64 | total_lines = []
65 | for subdir, dirs, files in os.walk(input_dir):
66 | for idx, file in enumerate(files):
67 | # only process .json files
68 | if file.endswith('.json'):
69 | # absolute path of the current file
70 | file_path = os.path.join(subdir, file)
71 | print(file_path)
72 | # read the file line by line (jsonl-style)
73 | with open(file_path, 'r', encoding='utf-8') as infile:
74 | lines = infile.readlines()
75 |
76 | for line in tqdm(lines):
77 | json_obj = json.loads(line) # parse the json string into a python object
78 |
79 | instruction = json_obj["instruction"]
80 | input_str = json_obj["input"]
81 | answer = json_obj["output"]
82 |
83 | question = instruction + input_str
84 |
85 | data_dict = {
86 | "question": question,
87 | "answer": answer
88 | }
89 |
90 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
91 | total_lines.append(processed_line)
92 |
93 | return total_lines
94 |
95 |
96 | if __name__=="__main__":
97 |
98 | total_lines = process_bell_2m("corpus/sft_train/bell_2m/train_2M_CN.json")
99 | print("bell 2m: ", len(total_lines))
100 | nlp_total_lines = process_nlp("corpus/sft_train/nlp/firefly-train-1.1M.jsonl")
101 | print("nlp: ", len(nlp_total_lines))
102 |
103 | total_lines.extend(nlp_total_lines)
104 |
105 | tigerbot_total_lines = process_tigerbot_sft("corpus/sft_train/tigerbot")
106 | print("tigerbot: ", len(tigerbot_total_lines))
107 |
108 | total_lines.extend(tigerbot_total_lines)
109 |
110 | print("all: ", len(total_lines))
111 |
112 | # create the output subfolder if it does not exist
113 | output_subfolder = "data/sft_train"
114 | if not os.path.exists(output_subfolder):
115 | os.makedirs(output_subfolder)
116 |
117 | # save the processed data to the output subfolder
118 | output_file_path = os.path.join(output_subfolder, "sft_data_test.jsonl")
119 | # write the processed json objects to a new jsonl file
120 | with open(output_file_path, 'w') as outfile:
121 | for line in total_lines:
122 | outfile.write(line)
123 |
124 |
125 |
--------------------------------------------------------------------------------
/utils/rm_train_process.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import glob
4 | import numpy as np
5 | from tqdm import tqdm
6 | import pandas as pd
7 | import csv
8 | import random
9 |
10 |
11 | def merge_datasets(input_dir):
12 | total_lines = []
13 | for subdir, dirs, files in os.walk(input_dir):
14 | for idx, file in enumerate(files):
15 | # only process .jsonl files
16 | if file.endswith('.jsonl'):
17 | # https://www.modelscope.cn/datasets/iic/CValues-Comparison/summary
18 | # absolute path of the current file
19 | file_path = os.path.join(subdir, file)
20 | print(file_path)
21 | # read the jsonl file
22 | with open(file_path, 'r', encoding='utf-8') as infile:
23 | lines = infile.readlines()
24 |
25 | for line in tqdm(lines):
26 | json_obj = json.loads(line) # parse the json string into a python object
27 |
28 | prompt_text = json_obj["prompt"]
29 | chosen_text = json_obj["pos_resp"]
30 | rejected_text = json_obj["neg_resp"]
31 |
32 | data_dict = {
33 | "prompt": prompt_text,
34 | "chosen": chosen_text,
35 | "rejected": rejected_text
36 | }
37 |
38 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
39 | total_lines.append(processed_line)
40 |
41 | if file.endswith('.parquet'):
42 | # https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese
43 | # absolute path of the current file
44 | file_path = os.path.join(subdir, file)
45 | print(file_path)
46 | # read the parquet file
47 | df = pd.read_parquet(file_path)
48 |
49 | for idx, row in tqdm(df.iterrows(), total=len(df)):
50 | prompt_text = row['prompt']
51 | chosen_text = row['chosen']
52 | rejected_text = row['rejected']
53 |
54 | data_dict = {
55 | "prompt": prompt_text,
56 | "chosen": chosen_text,
57 | "rejected": rejected_text
58 | }
59 |
60 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
61 | total_lines.append(processed_line)
62 |
63 | if file.endswith('.tsv'):
64 | # https://huggingface.co/datasets/liyucheng/zhihu_rlhf_3k
65 | # absolute path of the current file
66 | file_path = os.path.join(subdir, file)
67 | print(file_path)
68 | # read the tsv file
69 | df = pd.read_csv(file_path, sep='\t')
70 |
71 | for idx, row in tqdm(df.iterrows(), total=len(df)):
72 | prompt_text = row['prompt']
73 | chosen_text = row['chosen']
74 | rejected_text = row['rejected']
75 |
76 | data_dict = {
77 | "prompt": prompt_text,
78 | "chosen": chosen_text,
79 | "rejected": rejected_text
80 | }
81 |
82 | processed_line = json.dumps(data_dict, ensure_ascii=False) + '\n'
83 | total_lines.append(processed_line)
84 | print("total len: ", len(total_lines))
85 | # split into training and evaluation sets
86 | # randomly sample 2000 examples for evaluation
87 | eval_dat = random.sample(total_lines, 2000)
88 | # use the remaining examples for training
89 | train_data = [item for item in total_lines if item not in eval_dat]
90 | # assert len(eval_dat) + len(train_data) == len(total_lines)
91 | print("eval len: ", len(eval_dat))
92 | print("train len: ", len(train_data))
93 |
94 | # save the results
95 | # create the output subfolder if it does not exist
96 | output_subfolder = "data/rl_train"
97 | if not os.path.exists(output_subfolder):
98 | os.makedirs(output_subfolder)
99 |
100 | # save the processed data to the output subfolder
101 | eval_file_path = os.path.join(output_subfolder, "rl_eval_data.jsonl")
102 | train_file_path = os.path.join(output_subfolder, "rl_train_data.jsonl")
103 | # write the processed json objects to new jsonl files
104 | with open(eval_file_path, 'w') as outfile:
105 | for line in eval_dat:
106 | outfile.write(line)
107 | with open(train_file_path, 'w') as outfile:
108 | for line in train_data:
109 | outfile.write(line)
110 |
111 |
112 | if __name__=="__main__":
113 | merge_datsets("corpus/rm_train")
114 |
115 |
--------------------------------------------------------------------------------
/quantize/gptq_quantize.py:
--------------------------------------------------------------------------------
1 | """
2 | Quantization issues:
3 | https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/discussions/5
4 | https://github.com/AutoGPTQ/AutoGPTQ/issues/657
5 | """
6 | import os
7 | import torch
8 | from dataclasses import dataclass, field
9 | from typing import Dict, Optional
10 |
11 | from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
12 | from transformers import AutoTokenizer, HfArgumentParser
13 |
14 | # Assuming 'read_jsonl_file' is a function defined in 'utilis' module
15 | from utilis import read_jsonl_file
16 |
17 | import logging
18 |
19 | @dataclass
20 | class ScriptArguments:
21 | """
22 | The arguments for the GPTQ quantization script.
23 | """
24 | # Basic settings
25 | model_id: Optional[str] = field(default="", metadata={"help": "The location of the SFT model name or path."})
26 | # {"input": question, "target": answer}
27 | dataset_dir_or_path: Optional[str] = field(default="", metadata={"help": "The location of the dataset directory or path."})
28 | quant_output_dir: Optional[str] = field(default="./results", metadata={"help": "The output directory for the quantized model."})
29 | ngpus: Optional[int] = field(default=1, metadata={"help": "Number of GPUs for quantization."})
30 | gpu_max_memory: Optional[int] = field(default=20, metadata={"help": "Max memory per GPU for quantization (in GB)."})
31 |
32 | # GPTQ parameters
33 | bits: Optional[int] = field(default=4, metadata={"help": "Quantization bits (4 or 8)."})
34 | group_size: Optional[int] = field(default=128, metadata={"help": "Group size for quantization (32, 64, 128)."})
35 | damp_percent: Optional[float] = field(default=0.1, metadata={"help": "Damping percentage for quantization (0.1, 0.01)."})
36 | desc_act: Optional[bool] = field(default=False, metadata={"help": "Whether to use descending activation (False speeds up inference but may affect perplexity)."})
37 | static_groups: Optional[bool] = field(default=False, metadata={"help": "Whether to use static groups for quantization."})
38 | sym: Optional[bool] = field(default=True, metadata={"help": "Whether to use symmetric quantization."})
39 | true_sequential: Optional[bool] = field(default=True, metadata={"help": "Whether to use true sequential quantization."})
40 |
41 | # Training parameters
42 | max_len: Optional[int] = field(default=8192, metadata={"help": "Maximum length of input data."})
43 | batch_size: Optional[int] = field(default=1, metadata={"help": "Batch size for quantization training."})
44 | cache_examples_on_gpu: Optional[bool] = field(default=False, metadata={"help": "Whether to cache examples on GPU during quantization."})
45 | use_triton: Optional[bool] = field(default=False, metadata={"help": "Whether to use Triton for quantization."})
46 |
47 | def data_process(data_list, max_len, tokenizer: AutoTokenizer):
48 | def qwen_process(item):
49 | input_text = item["input"]
50 | target_text = item["target"]
51 | messages = [
52 | {"role": "system", "content": "You are a helpful assistant."},
53 | {"role": "user", "content": input_text},
54 | {"role": "assistant", "content": target_text}
55 | ]
56 | text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
57 | model_inputs = tokenizer(text, truncation=True, padding='max_length', max_length=max_len)
58 | input_ids = torch.tensor(model_inputs['input_ids'], dtype=torch.long)
59 | attention_mask = torch.tensor(model_inputs['attention_mask'], dtype=torch.long)
60 | return {
61 | "input_ids": input_ids,
62 | "attention_mask": attention_mask
63 | }
64 |
65 | return [qwen_process(item) for item in data_list]
66 |
67 | def main():
68 | parser = HfArgumentParser(ScriptArguments)
69 | script_args = parser.parse_args_into_dataclasses()[0]
70 |
71 | logging.basicConfig(
72 | format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
73 | level=logging.INFO,
74 | datefmt="%Y-%m-%d %H:%M:%S"
75 | )
76 |
77 | quantize_config = BaseQuantizeConfig(
78 | bits=script_args.bits, # 4 or 8
79 | group_size=script_args.group_size,
80 | damp_percent=script_args.damp_percent,
81 | desc_act=script_args.desc_act, # set to False can significantly speed up inference but the perplexity may slightly bad
82 | static_groups=script_args.static_groups,
83 | sym=script_args.sym,
84 | true_sequential=script_args.true_sequential
85 | )
86 |
87 | tokenizer = AutoTokenizer.from_pretrained(
88 | script_args.model_id,
89 | trust_remote_code=True
90 | )
91 |
92 | model = AutoGPTQForCausalLM.from_pretrained(
93 | script_args.model_id,
94 | quantize_config,
95 | max_memory={i: f"{script_args.gpu_max_memory}GB" for i in range(script_args.ngpus)}
96 | )
97 |
98 | data_list = read_jsonl_file(script_args.dataset_dir_or_path)
99 | quant_data = data_process(
100 | data_list,
101 | script_args.max_len,
102 | tokenizer
103 | )
104 |
105 | model.quantize(
106 | quant_data,
107 | cache_examples_on_gpu=script_args.cache_examples_on_gpu,
108 | batch_size=script_args.batch_size,
109 | use_triton=script_args.use_triton
110 | )
111 |
112 | model.save_quantized(script_args.quant_output_dir, use_safetensors=True)
113 | tokenizer.save_pretrained(script_args.quant_output_dir)
114 |
115 | if __name__ == "__main__":
116 | main()
--------------------------------------------------------------------------------
/script/ptm_demo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 |
5 | # export CUDA_VISIBLE_DEVICES="4,5,6,7"
6 |
7 | source /home/.bashrc
8 | source /home/miniconda3/etc/profile.d/conda.sh
9 | conda activate md_llm
10 | which python
11 |
12 | function killall {
13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'`
14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9
15 | }
16 |
17 | WORK_DIR="/personal/tiny-llm-zh"
18 | cd ${WORK_DIR}
19 |
20 |
21 | # Common parameters
22 | N_NODES=1
23 | N_GPUS=8
24 | MBS=32 # per-GPU batch size
25 | GAS=1 # gradient accumulation steps
26 | GRAD_CLIP=1 # gradient clipping
27 | RANK=0
28 | MASTER_ADDR=`hostname -i`
29 | MASTER_PORT=9902
30 |
31 | LR=3e-4 # initial learning rate
32 | LR_SCHEDULER_TYPE="cosine"
33 | WARMUP_RATION=0.05
34 |
35 | TRAIN_EPOCHS=5 # number of training epochs
36 | LOGGING_STEPS=100 # logging interval (steps)
37 | CKPT_SAVE_STEPS=10000 # checkpoint save interval (steps)
38 |
39 | SEED=12
40 | DS_DTYPE="fp16" # [fp16, bf16]
41 | RESUME="False"
42 |
43 | # Data
44 | MODE="ptm" # [ptm, sft, rm, rl]
45 | DATASET_DIR_OR_PATH="data/pre_train"
46 | BASE_MODEL_PATH="test"
47 |
48 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m]
49 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}"
50 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
51 | mkdir -p $OUTPUT_DIR
52 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log"
53 | # tensorboard output directory
54 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
55 | mkdir -p $TB_DIR
56 |
57 | TRAIN_ARGS=""
58 |
59 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json
60 | ZERO_STAGE=2
61 |
62 | if [ $DS_DTYPE = "fp16" ];then
63 | TRAIN_ARGS+=" \
64 | --fp16 \
65 | "
66 | DS_FP16=true
67 | DS_BF16=false
68 | GAS_DTYPE=$DS_DTYPE
69 | elif [ $DS_DTYPE = "bf16" ];then
70 | TRAIN_ARGS+=" \
71 | --bf16 \
72 | --embedding-weights-in-fp32 \
73 | "
74 | DS_FP16=false
75 | DS_BF16=true
76 | GAS_DTYPE="fp32"
77 |
78 | fi
79 |
80 | cat <<EOT > $DS_CONFIG_JSON
81 | {
82 | "train_micro_batch_size_per_gpu": $MBS,
83 | "train_batch_size": "auto",
84 | "gradient_clipping": ${GRAD_CLIP},
85 | "zero_optimization": {
86 | "stage": $ZERO_STAGE
87 | },
88 | "bf16": {
89 | "enabled": ${DS_BF16}
90 | },
91 | "data_types": {
92 | "grad_accum_dtype": "${GAS_DTYPE}"
93 | },
94 | "fp16": {
95 | "enabled": ${DS_FP16},
96 | "loss_scale": 0,
97 | "loss_scale_window": 200,
98 | "hysteresis": 5,
99 | "min_loss_scale": 1,
100 | "initial_scale_power": 12
101 | },
102 | "steps_per_print": 10,
103 | "wall_clock_breakdown": true,
104 | "comms_logger": {
105 | "enabled": true,
106 | "verbose": false,
107 | "prof_all": false,
108 | "debug": false
109 | },
110 | "flops_profiler": {
111 | "enabled": false,
112 | "profile_step": 30,
113 | "module_depth": -1,
114 | "top_modules": 1,
115 | "detailed": true,
116 | "output_file": null
117 | }
118 | }
119 | EOT
120 |
121 |
122 | TRAIN_ARGS+=" \
123 | --seed ${SEED} \
124 | --output_dir ${OUTPUT_DIR} \
125 | --overwrite_output_dir \
126 | --deepspeed ${DS_CONFIG_JSON} \
127 | --per_device_train_batch_size ${MBS} \
128 | --gradient_accumulation_steps ${GAS} \
129 | --do_train \
130 | --num_train_epochs ${TRAIN_EPOCHS} \
131 | --logging_dir ${TB_DIR} \
132 | --logging_strategy steps \
133 | --logging_steps ${LOGGING_STEPS} \
134 | --weight_decay 0.01 \
135 | --adam_beta1 0.9 \
136 | --adam_beta2 0.95 \
137 | --max_grad_norm ${GRAD_CLIP} \
138 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \
139 | --learning_rate ${LR} \
140 | --warmup_ratio ${WARMUP_RATION} \
141 | --weight_decay 0.01 \
142 | --save_strategy steps \
143 | --save_total_limit 3 \
144 | --save_steps ${CKPT_SAVE_STEPS} \
145 | --ddp_timeout 30000 \
146 | --logging_first_step True \
147 | --save_safetensors False \
148 | --ddp_find_unused_parameters False \
149 | "
150 |
151 | if [[ $MODEL_SIZE == "16m" ]];then
152 | HIDDEN_SIZE=120
153 | NUM_HIDDEN_LAYERS=6
154 | NUM_ATTENTION_HEADS=6
155 | INTERMEDIATE_SIZE=384
156 | ROPE_THETA=10000.0
157 | MAX_POSITION_EMBEDDINGS=512
158 | VOCAB_SIZE=64798
159 | elif [[ $MODEL_SIZE == "42m" ]];then
160 | HIDDEN_SIZE=288
161 | NUM_HIDDEN_LAYERS=6
162 | NUM_ATTENTION_HEADS=6
163 | INTERMEDIATE_SIZE=768
164 | ROPE_THETA=10000.0
165 | MAX_POSITION_EMBEDDINGS=512
166 | VOCAB_SIZE=64798
167 | elif [[ $MODEL_SIZE == "92m" ]];then
168 | HIDDEN_SIZE=512
169 | NUM_HIDDEN_LAYERS=8
170 | NUM_ATTENTION_HEADS=8
171 | INTERMEDIATE_SIZE=1408
172 | ROPE_THETA=10000.0
173 | MAX_POSITION_EMBEDDINGS=1024
174 | VOCAB_SIZE=64798
175 | elif [[ $MODEL_SIZE == "210m" ]];then
176 | HIDDEN_SIZE=768
177 | NUM_HIDDEN_LAYERS=16
178 | NUM_ATTENTION_HEADS=12
179 | INTERMEDIATE_SIZE=2048
180 | ROPE_THETA=10000.0
181 | MAX_POSITION_EMBEDDINGS=1024
182 | VOCAB_SIZE=64798
183 | elif [[ $MODEL_SIZE == "440m" ]];then
184 | HIDDEN_SIZE=1024
185 | NUM_HIDDEN_LAYERS=24
186 | NUM_ATTENTION_HEADS=16
187 | INTERMEDIATE_SIZE=2816
188 | ROPE_THETA=10000.0
189 | MAX_POSITION_EMBEDDINGS=1024
190 | VOCAB_SIZE=64798
191 | fi
192 |
193 | GPT_ARGS=" \
194 | --hidden_size ${HIDDEN_SIZE} \
195 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \
196 | --num_attention_heads ${NUM_ATTENTION_HEADS} \
197 | --intermediate_size ${INTERMEDIATE_SIZE} \
198 | --rope_theta ${ROPE_THETA} \
199 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \
200 | --vocab_size ${VOCAB_SIZE} \
201 | "
202 | SCRIPT_ARGS=" \
203 | --mode ${MODE} \
204 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \
205 | --resume ${RESUME} \
206 | --base_model_path ${BASE_MODEL_PATH} \
207 | "
208 |
209 | DISTRIBUTED_ARGS=" \
210 | --nnodes $N_NODES \
211 | --nproc_per_node $N_GPUS \
212 | "
213 |
214 | # 检查num是否大于1
215 | if [ "$N_NODES" -ge 2 ]; then
216 | DISTRIBUTED_ARGS+=" \
217 | --node_rank $RANK \
218 | --master_addr $MASTER_ADDR \
219 | --master_port $MASTER_PORT \
220 | "
221 | fi
222 |
223 | # 所有参数
224 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS "
225 |
226 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/ptm_train.py "
227 |
228 | export CMD="$LAUNCHER $ALL_ARGS"
229 | echo $CMD
230 |
231 | killall ptm_train.py
232 |
233 | # 执行训练
234 | $CMD 2>&1 | tee ${TRAIN_LOG}
235 |
236 | killall ptm_train.py
237 |
238 | echo "train end : ${OUTPUT_DIR}"
239 |
240 |
241 |
--------------------------------------------------------------------------------
/script/sft_demo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 |
5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7"
6 |
7 | source /home/.bashrc
8 | source /home/miniconda3/etc/profile.d/conda.sh
9 | conda activate md_llm
10 | which python
11 |
12 | function killall {
13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'`
14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9
15 | }
16 |
17 | WORK_DIR="/personal/tiny-llm-zh"
18 | cd ${WORK_DIR}
19 |
20 | # 常见参数
21 | N_NODES=1
22 | N_GPUS=8
23 | MBS=32 # 单卡bs
24 | GAS=1 # 梯度累积
25 | GRAD_CLIP=1 # 梯度裁剪
26 | RANK=0
27 | MASTER_ADDR=`hostname -i`
28 | MASTER_PORT=9902
29 |
30 | LR=3e-4 # 初始学习率
31 | LR_SCHEDULER_TYPE="cosine"
32 | WARMUP_RATIO=0.03
33 |
34 | TRAIN_EPOCHS=5 # 训练轮次
35 | LOGGING_STEPS=100 # 记录日志步数
36 | CKPT_SAVE_STEPS=15000 # ckpt保存步数
37 |
38 | SEED=12
39 | DS_DTYPE="fp16" # [fp16, bf16]
40 | RESUME="False"
41 |
42 | # 数据
43 | MODE="sft" # [ptm, sft, rm, rl]
44 | DATASET_DIR_OR_PATH="data/sft_train/sft_data.jsonl"
45 | BASE_MODEL_PATH="outputs/ckpt/ptm_tiny_llm_92m_epoch5/last_ptm_model"
46 |
47 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m]
48 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}"
49 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
50 | mkdir -p $OUTPUT_DIR
51 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log"
52 | # tensorboard输出路径
53 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
54 | mkdir -p $TB_DIR
55 |
56 | TRAIN_ARGS=""
57 |
58 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json
59 | ZERO_STAGE=2
60 |
61 | if [ $DS_DTYPE = "fp16" ];then
62 | TRAIN_ARGS+=" \
63 | --fp16 \
64 | "
65 | DS_FP16=true
66 | DS_BF16=false
67 | GAS_DTYPE=$DS_DTYPE
68 | elif [ $DS_DTYPE = "bf16" ];then
69 | TRAIN_ARGS+=" \
70 | --bf16 \
71 | "
72 | DS_FP16=false
73 | DS_BF16=true
74 | GAS_DTYPE="fp32"
75 |
76 | fi
77 |
78 | cat <<EOT > $DS_CONFIG_JSON
79 | {
80 | "train_micro_batch_size_per_gpu": $MBS,
81 | "train_batch_size": "auto",
82 | "gradient_clipping": ${GRAD_CLIP},
83 | "zero_optimization": {
84 | "stage": $ZERO_STAGE
85 | },
86 | "bf16": {
87 | "enabled": ${DS_BF16}
88 | },
89 | "data_types": {
90 | "grad_accum_dtype": "${GAS_DTYPE}"
91 | },
92 | "fp16": {
93 | "enabled": ${DS_FP16},
94 | "loss_scale": 0,
95 | "loss_scale_window": 200,
96 | "hysteresis": 5,
97 | "min_loss_scale": 1,
98 | "initial_scale_power": 12
99 | },
100 | "steps_per_print": 10,
101 | "wall_clock_breakdown": true,
102 | "comms_logger": {
103 | "enabled": true,
104 | "verbose": false,
105 | "prof_all": false,
106 | "debug": false
107 | },
108 | "flops_profiler": {
109 | "enabled": false,
110 | "profile_step": 30,
111 | "module_depth": -1,
112 | "top_modules": 1,
113 | "detailed": true,
114 | "output_file": null
115 | }
116 | }
117 | EOT
118 |
119 |
120 | TRAIN_ARGS+=" \
121 | --seed ${SEED} \
122 | --output_dir ${OUTPUT_DIR} \
123 | --overwrite_output_dir \
124 | --deepspeed ${DS_CONFIG_JSON} \
125 | --per_device_train_batch_size ${MBS} \
126 | --gradient_accumulation_steps ${GAS} \
127 | --do_train \
128 | --num_train_epochs ${TRAIN_EPOCHS} \
129 | --logging_dir ${TB_DIR} \
130 | --logging_strategy steps \
131 | --logging_steps ${LOGGING_STEPS} \
132 | --weight_decay 0.01 \
133 | --adam_beta1 0.9 \
134 |     --adam_beta2 0.95 \
135 | --max_grad_norm ${GRAD_CLIP} \
136 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \
137 | --learning_rate ${LR} \
138 |     --warmup_ratio ${WARMUP_RATIO} \
139 | --weight_decay 0.01 \
140 | --save_strategy steps \
141 | --save_total_limit 3 \
142 | --save_steps ${CKPT_SAVE_STEPS} \
143 | --ddp_timeout 30000 \
144 | --logging_first_step True \
145 | --save_safetensors False \
146 | --ddp_find_unused_parameters False \
147 | "
148 |
149 | if [[ $MODEL_SIZE == "16m" ]];then
150 | HIDDEN_SIZE=120
151 | NUM_HIDDEN_LAYERS=6
152 | NUM_ATTENTION_HEADS=6
153 | INTERMEDIATE_SIZE=384
154 | ROPE_THETA=10000.0
155 | MAX_POSITION_EMBEDDINGS=512
156 | VOCAB_SIZE=64798
157 | elif [[ $MODEL_SIZE == "42m" ]];then
158 | HIDDEN_SIZE=288
159 | NUM_HIDDEN_LAYERS=6
160 | NUM_ATTENTION_HEADS=6
161 | INTERMEDIATE_SIZE=768
162 | ROPE_THETA=10000.0
163 | MAX_POSITION_EMBEDDINGS=512
164 | VOCAB_SIZE=64798
165 | elif [[ $MODEL_SIZE == "92m" ]];then
166 | HIDDEN_SIZE=512
167 | NUM_HIDDEN_LAYERS=8
168 | NUM_ATTENTION_HEADS=8
169 | INTERMEDIATE_SIZE=1408
170 | ROPE_THETA=10000.0
171 | MAX_POSITION_EMBEDDINGS=1024
172 | VOCAB_SIZE=64798
173 | elif [[ $MODEL_SIZE == "210m" ]];then
174 | HIDDEN_SIZE=768
175 | NUM_HIDDEN_LAYERS=16
176 | NUM_ATTENTION_HEADS=12
177 | INTERMEDIATE_SIZE=2048
178 | ROPE_THETA=10000.0
179 | MAX_POSITION_EMBEDDINGS=1024
180 | VOCAB_SIZE=64798
181 | elif [[ $MODEL_SIZE == "440m" ]];then
182 | HIDDEN_SIZE=1024
183 | NUM_HIDDEN_LAYERS=24
184 | NUM_ATTENTION_HEADS=16
185 | INTERMEDIATE_SIZE=2816
186 | ROPE_THETA=10000.0
187 | MAX_POSITION_EMBEDDINGS=1024
188 | VOCAB_SIZE=64798
189 | fi
190 |
191 | GPT_ARGS=" \
192 | --hidden_size ${HIDDEN_SIZE} \
193 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \
194 | --num_attention_heads ${NUM_ATTENTION_HEADS} \
195 | --intermediate_size ${INTERMEDIATE_SIZE} \
196 | --rope_theta ${ROPE_THETA} \
197 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \
198 | --vocab_size ${VOCAB_SIZE} \
199 | "
200 | SCRIPT_ARGS=" \
201 | --mode ${MODE} \
202 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \
203 | --resume ${RESUME} \
204 | --base_model_path ${BASE_MODEL_PATH} \
205 | "
206 |
207 | DISTRIBUTED_ARGS=" \
208 | --nnodes $N_NODES \
209 | --nproc_per_node $N_GPUS \
210 | "
211 |
212 | # 检查num是否大于1
213 | if [ "$N_NODES" -ge 2 ]; then
214 | DISTRIBUTED_ARGS+=" \
215 | --node_rank $RANK \
216 | --master_addr $MASTER_ADDR \
217 | --master_port $MASTER_PORT \
218 | "
219 | fi
220 |
221 | # 所有参数
222 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS "
223 |
224 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/sft_train.py "
225 |
226 | export CMD="$LAUNCHER $ALL_ARGS"
227 | echo $CMD
228 |
229 | killall sft_train.py
230 |
231 | # 执行训练
232 | $CMD 2>&1 | tee ${TRAIN_LOG}
233 |
234 | killall sft_train.py
235 |
236 | echo "train end : ${OUTPUT_DIR}"
237 |
--------------------------------------------------------------------------------
/script/dpo_demo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 |
5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7"
6 |
7 | source /home/.bashrc
8 | source /home/miniconda3/etc/profile.d/conda.sh
9 | conda activate md_llm
10 | which python
11 |
12 | function killall {
13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'`
14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9
15 | }
16 |
17 | WORK_DIR="/personal/tiny-llm-zh"
18 | cd ${WORK_DIR}
19 |
20 | # 常见参数
21 | N_NODES=1
22 | N_GPUS=8
23 | MBS=32 # 单卡bs
24 | GAS=1 # 梯度累积
25 | GRAD_CLIP=1 # 梯度裁剪
26 | RANK=0
27 | MASTER_ADDR=`hostname -i`
28 | MASTER_PORT=9902
29 |
30 | LR=3e-4 # 初始学习率
31 | LR_SCHEDULER_TYPE="cosine"
32 | WARMUP_RATIO=0.03
33 |
34 | TRAIN_EPOCHS=5 # 训练轮次
35 | LOGGING_STEPS=100 # 记录日志步数
36 | CKPT_SAVE_STEPS=15000 # ckpt保存步数
37 |
38 | SEED=12
39 | DS_DTYPE="fp16" # [fp16, bf16]
40 | RESUME="False"
41 |
42 | # 数据
43 | MODE="rl" # [ptm, sft, rm, rl]
44 | DATASET_DIR_OR_PATH="data/rl_train/rl_train_data.jsonl"
45 | BASE_MODEL_PATH="outputs/ckpt/sft_tiny_llm_92m_epoch5/last_sft_model"
46 |
47 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m]
48 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}"
49 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
50 | mkdir -p $OUTPUT_DIR
51 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log"
52 | # tensorboard输出路径
53 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
54 | mkdir -p $TB_DIR
55 |
56 | TRAIN_ARGS=""
57 |
58 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json
59 | ZERO_STAGE=2
60 |
61 | if [ $DS_DTYPE = "fp16" ];then
62 | TRAIN_ARGS+=" \
63 | --fp16 \
64 | "
65 | DS_FP16=true
66 | DS_BF16=false
67 | GAS_DTYPE=$DS_DTYPE
68 | elif [ $DS_DTYPE = "bf16" ];then
69 | TRAIN_ARGS+=" \
70 | --bf16 \
71 | "
72 | DS_FP16=false
73 | DS_BF16=true
74 | GAS_DTYPE="fp32"
75 |
76 | fi
77 |
78 | cat <<EOT > $DS_CONFIG_JSON
79 | {
80 | "train_micro_batch_size_per_gpu": $MBS,
81 | "train_batch_size": "auto",
82 | "gradient_clipping": ${GRAD_CLIP},
83 | "zero_optimization": {
84 | "stage": $ZERO_STAGE
85 | },
86 | "bf16": {
87 | "enabled": ${DS_BF16}
88 | },
89 | "data_types": {
90 | "grad_accum_dtype": "${GAS_DTYPE}"
91 | },
92 | "fp16": {
93 | "enabled": ${DS_FP16},
94 | "loss_scale": 0,
95 | "loss_scale_window": 200,
96 | "hysteresis": 5,
97 | "min_loss_scale": 1,
98 | "initial_scale_power": 12
99 | },
100 | "steps_per_print": 10,
101 | "wall_clock_breakdown": true,
102 | "comms_logger": {
103 | "enabled": true,
104 | "verbose": false,
105 | "prof_all": false,
106 | "debug": false
107 | },
108 | "flops_profiler": {
109 | "enabled": false,
110 | "profile_step": 30,
111 | "module_depth": -1,
112 | "top_modules": 1,
113 | "detailed": true,
114 | "output_file": null
115 | }
116 | }
117 | EOT
118 |
119 |
120 | TRAIN_ARGS+=" \
121 | --seed ${SEED} \
122 | --output_dir ${OUTPUT_DIR} \
123 | --overwrite_output_dir \
124 | --deepspeed ${DS_CONFIG_JSON} \
125 | --per_device_train_batch_size ${MBS} \
126 | --gradient_accumulation_steps ${GAS} \
127 | --do_train \
128 | --num_train_epochs ${TRAIN_EPOCHS} \
129 | --logging_dir ${TB_DIR} \
130 | --logging_strategy steps \
131 | --logging_steps ${LOGGING_STEPS} \
132 | --weight_decay 0.01 \
133 | --adam_beta1 0.9 \
134 |     --adam_beta2 0.95 \
135 | --max_grad_norm ${GRAD_CLIP} \
136 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \
137 | --learning_rate ${LR} \
138 |     --warmup_ratio ${WARMUP_RATIO} \
139 | --weight_decay 0.01 \
140 | --save_strategy steps \
141 | --save_total_limit 3 \
142 | --save_steps ${CKPT_SAVE_STEPS} \
143 | --ddp_timeout 30000 \
144 | --logging_first_step True \
145 | --save_safetensors False \
146 | --ddp_find_unused_parameters False \
147 | "
148 |
149 | if [[ $MODEL_SIZE == "16m" ]];then
150 | HIDDEN_SIZE=120
151 | NUM_HIDDEN_LAYERS=6
152 | NUM_ATTENTION_HEADS=6
153 | INTERMEDIATE_SIZE=384
154 | ROPE_THETA=10000.0
155 | MAX_POSITION_EMBEDDINGS=512
156 | VOCAB_SIZE=64798
157 | elif [[ $MODEL_SIZE == "42m" ]];then
158 | HIDDEN_SIZE=288
159 | NUM_HIDDEN_LAYERS=6
160 | NUM_ATTENTION_HEADS=6
161 | INTERMEDIATE_SIZE=768
162 | ROPE_THETA=10000.0
163 | MAX_POSITION_EMBEDDINGS=512
164 | VOCAB_SIZE=64798
165 | elif [[ $MODEL_SIZE == "92m" ]];then
166 | HIDDEN_SIZE=512
167 | NUM_HIDDEN_LAYERS=8
168 | NUM_ATTENTION_HEADS=8
169 | INTERMEDIATE_SIZE=1408
170 | ROPE_THETA=10000.0
171 | MAX_POSITION_EMBEDDINGS=1024
172 | VOCAB_SIZE=64798
173 | elif [[ $MODEL_SIZE == "210m" ]];then
174 | HIDDEN_SIZE=768
175 | NUM_HIDDEN_LAYERS=16
176 | NUM_ATTENTION_HEADS=12
177 | INTERMEDIATE_SIZE=2048
178 | ROPE_THETA=10000.0
179 | MAX_POSITION_EMBEDDINGS=1024
180 | VOCAB_SIZE=64798
181 | elif [[ $MODEL_SIZE == "440m" ]];then
182 | HIDDEN_SIZE=1024
183 | NUM_HIDDEN_LAYERS=24
184 | NUM_ATTENTION_HEADS=16
185 | INTERMEDIATE_SIZE=2816
186 | ROPE_THETA=10000.0
187 | MAX_POSITION_EMBEDDINGS=1024
188 | VOCAB_SIZE=64798
189 | fi
190 |
191 | GPT_ARGS=" \
192 | --hidden_size ${HIDDEN_SIZE} \
193 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \
194 | --num_attention_heads ${NUM_ATTENTION_HEADS} \
195 | --intermediate_size ${INTERMEDIATE_SIZE} \
196 | --rope_theta ${ROPE_THETA} \
197 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \
198 | --vocab_size ${VOCAB_SIZE} \
199 | "
200 | SCRIPT_ARGS=" \
201 | --mode ${MODE} \
202 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \
203 | --resume ${RESUME} \
204 | --base_model_path ${BASE_MODEL_PATH} \
205 | "
206 |
207 | DISTRIBUTED_ARGS=" \
208 | --nnodes $N_NODES \
209 | --nproc_per_node $N_GPUS \
210 | "
211 |
212 | # 检查num是否大于1
213 | if [ "$N_NODES" -ge 2 ]; then
214 | DISTRIBUTED_ARGS+=" \
215 | --node_rank $RANK \
216 | --master_addr $MASTER_ADDR \
217 | --master_port $MASTER_PORT \
218 | "
219 | fi
220 |
221 | # 所有参数
222 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS "
223 |
224 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/dpo_train.py "
225 |
226 | export CMD="$LAUNCHER $ALL_ARGS"
227 | echo $CMD
228 |
229 | killall dpo_train.py
230 |
231 | # 执行训练
232 | $CMD 2>&1 | tee ${TRAIN_LOG}
233 |
234 | killall dpo_train.py
235 |
236 | echo "train end : ${OUTPUT_DIR}"
237 |
--------------------------------------------------------------------------------
/train/sft_train.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import numpy as np
3 | import os
4 | import sys
5 | from dataclasses import dataclass, field
6 | from typing import Optional, List, Dict, Any, Mapping
7 | import datasets
8 | import torch
9 | import torch.nn as nn
10 | import transformers
11 | from transformers import (
12 | CONFIG_MAPPING,
13 | MODEL_FOR_CAUSAL_LM_MAPPING,
14 | AutoConfig,
15 | AutoModelForCausalLM,
16 | HfArgumentParser,
17 | Trainer,
18 | TrainingArguments,
19 | is_torch_tpu_available,
20 | set_seed,
21 | )
22 | from transformers.utils.versions import require_version
23 | from sklearn.metrics import accuracy_score
24 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
25 |
26 | from configuration_tinyllm import TinyllmConfig
27 | from modeling_tinyllm import TinyllmForCausalLM
28 | from tinyllm_dataset import SFTDataset
29 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
30 |
31 | MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
32 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
33 |
34 | @dataclass
35 | class ModelArguments:
36 | """ 模型相关参数
37 | """
38 | hidden_size : Optional[int] = field(
39 | default=512,
40 | metadata={"help": "hidden_size"}
41 | )
42 |
43 | num_hidden_layers : Optional[int] = field(
44 | default=8,
45 | metadata={"help": "num_hidden_layers"}
46 | )
47 |
48 | num_attention_heads : Optional[int] = field(
49 | default=8,
50 | metadata={"help": "transformer num_attention_heads"}
51 | )
52 |
53 | intermediate_size : Optional[int] = field(
54 | default=1408,
55 | metadata={"help": "intermediate_size"}
56 | )
57 |
58 | rope_theta : Optional[float] = field(
59 | default=10000.0,
60 | metadata={"help": "rope_theta"}
61 | )
62 |
63 | max_position_embeddings : Optional[int] = field(
64 | default=1024,
65 | metadata={"help": "max_position_embeddings"}
66 | )
67 |
68 | vocab_size : Optional[int] = field(
69 | default=64798,
70 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"}
71 | )
72 |
73 | @dataclass
74 | class ScriptArguments:
75 | """ 其他相关参数
76 | """
77 | mode : Optional[str] = field(
78 | default="ptm",
79 |         metadata={"help": "train mode: one of [ptm, sft, rm, rl]"}
80 | )
81 |
82 | dataset_dir_or_path : Optional[str] = field(
83 | default="data/pre_train",
84 |         metadata={"help": "SFT dataset file or directory"}
85 | )
86 |
87 | resume : Optional[bool] = field(
88 | default=False,
89 |         metadata={"help": "whether to resume training from the last checkpoint"}
90 | )
91 |
92 | base_model_path : Optional[str] = field(
93 | default=" ",
94 | metadata={"help": "SFT train, the base model path"}
95 | )
96 |
97 | def data_collator_fn(examples):
98 | # 将所有样本的输入 (`X`) 和标签 (`Y`) 分别堆叠
99 | input_ids = torch.stack([example[0] for example in examples])
100 | labels = torch.stack([example[1] for example in examples])
101 |
102 | # 返回一个字典,包含模型需要的键和值
103 | data_dict = {
104 | "input_ids": input_ids,
105 | "labels": labels
106 | }
107 | return data_dict
108 |
109 | logger = logging.getLogger(__name__)
110 |
111 | def main():
112 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments))
113 | model_args, script_args, training_args = parser.parse_args_into_dataclasses()
114 |
115 | # logger format
116 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S",
117 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN,
118 | handlers = [logging.StreamHandler(sys.stdout)],)
119 | if training_args.should_log:
120 | # The default of training_args.log_level is passive, so we set log level at info here to have that default.
121 | transformers.utils.logging.set_verbosity_info()
122 |
123 | log_level = training_args.get_process_log_level()
124 | logger.setLevel(log_level)
125 | datasets.utils.logging.set_verbosity(log_level)
126 | transformers.utils.logging.set_verbosity(log_level)
127 | transformers.utils.logging.enable_default_handler()
128 | transformers.utils.logging.enable_explicit_format()
129 |
130 | # Log on each process the small summary:
131 | logger.warning(
132 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
133 |         + f", distributed training: {bool(training_args.local_rank != -1)}, 16-bit training: {training_args.fp16}"
134 | )
135 |
136 | set_seed(training_args.seed)
137 |
138 | device = "cuda" if torch.cuda.is_available() else "cpu"
139 |
140 | # init model
141 | tokenizer = transformers.AutoTokenizer.from_pretrained(
142 | script_args.base_model_path,
143 | use_fast=False,
144 | trust_remote_code=True,
145 | model_max_length=model_args.max_position_embeddings
146 | )
147 |
148 | config = transformers.AutoConfig.from_pretrained(
149 | script_args.base_model_path,
150 | trust_remote_code=True
151 | )
152 | config.use_cache = False
153 |
154 | model = transformers.AutoModelForCausalLM.from_pretrained(
155 | script_args.base_model_path,
156 | config=config,
157 | trust_remote_code=True
158 | )
159 |
160 | model.to(device)
161 |
162 | ################
163 | total_params = sum(p.numel() for p in model.parameters())
164 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
165 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params")
166 | logger.info(f"可训练参数: {trainable_params}")
167 | ##############
168 |
169 | sft_dataset = SFTDataset(
170 | script_args.dataset_dir_or_path,
171 | tokenizer,
172 | model_args.max_position_embeddings
173 | )
174 |
175 | trainer = Trainer(
176 | model = model,
177 | args = training_args,
178 | train_dataset = sft_dataset,
179 | # eval_dataset = None,
180 | # data_collator = data_collator_fn,
181 | )
182 | # Training
183 | trainer.train(script_args.resume)
184 | # torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir))
185 | last_model_dir = os.path.join(training_args.output_dir, 'last_sft_model')
186 | os.makedirs(last_model_dir, exist_ok=True)
187 | tokenizer.save_pretrained(last_model_dir)
188 | # # https://github.com/huggingface/transformers/issues/28630
189 | # model.save_pretrained(last_model_dir, safe_serialization=False)
190 | trainer.save_model(output_dir=last_model_dir)
191 |
192 | if __name__ == "__main__":
193 | main()
194 |
195 |
--------------------------------------------------------------------------------
/tokenizer/README.md:
--------------------------------------------------------------------------------
1 | ## Tiny LLM Tokenizer
2 |
3 | ## 1.简介
4 |
5 | 采用扩充 LLaMA2 词表的方式构建 Tiny LLM 词表。
6 |
7 | 由于原版 LLaMA2 对中文的支持非常有限,本项目在原版 LLaMA 的基础上进一步扩充了中文词表。
8 |
9 | 在通用中文语料上训练了基于 sentencepiece 的 20K 中文词表,并与原版 LLaMA 模型的 32K 词表进行合并;排除重复的 token、添加特殊 token 后,最终得到的中文 LLaMA 词表大小为 49958。
10 |
11 | 注意:预训练用的是ChatGLM3的词表,并未使用扩充的词表
12 |
13 | ## 2.词表扩充
14 |
15 | ### 2.1 训练中文分词
16 |
17 | 准备一份中文训练语料,按每行一条样本保存为 `.txt` 文件。这里选用百科的全部语料,约 8G,存储为 txt 文本;划分句子的代码如下(分词训练与词表合并的示意代码见本节代码块之后的补充示例):
18 |
19 | ```python
20 | def split_sentences(text):
21 | """
22 | 分割文本为句子列表
23 | """
24 | # 正则表达式匹配中英文句子结尾标点
25 |     endings_pattern = r'(?
160 |     custom_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|im_start|>", "<|im_end|>"]
161 | for token in custom_special_tokens:
162 | tokenizer.add_tokens(token)
163 |
164 | tokenizer.save_pretrained(output_hf_dir)
165 | print(f"tinyllm token num: {len(tokenizer)}")
166 | print(f"Tiny LLM tokenizer has been saved to {output_hf_dir}")
167 |
168 | def test_tokenizer(hf_tokenizer_dir):
169 | tinyllm_tokenizer = LlamaTokenizer.from_pretrained(hf_tokenizer_dir)
170 | print("tinyllm tokenizer nums: ", len(tinyllm_tokenizer))
171 |
172 | sys_text = "你是由wdndev开发的个人助手。"
173 | user_text = "翻译下面的句子为英文:有朋自远方来,不亦乐乎"
174 | answer_text = "It is always a pleasure to greet a friend from afar."
175 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
176 | "<|user|>", user_text.strip(),
177 | "<|assistant|>"]).strip() + "\n" + answer_text.strip()
178 |
179 | print("-----input text: \n", input_txt)
180 |
181 | encode_ids = tinyllm_tokenizer.encode(input_txt, add_special_tokens=False)
182 | print("-----encode ids: \n", encode_ids)
183 |
184 | decode_ids = tinyllm_tokenizer.decode(encode_ids)
185 |     print("-----decode ids: \n", decode_ids)
186 |
187 |
188 | if __name__ == "__main__":
189 |
190 | llama_tokenizer_dir = "input_dir/llama2_tokenizer"
191 | chinese_sp_model_file = "sp_output/chinese_spm_20000.model"
192 | output_hf_dir = "tinyllm_tokenizer_hf"
193 |
194 | merge_tokenizer(llama_tokenizer_dir, chinese_sp_model_file, output_hf_dir)
195 |
196 | test_tokenizer(output_hf_dir)
197 | ```
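
For reference, two short sketches of the steps above, in Python. They follow the commonly used sentencepiece workflow and rely only on the standard `sentencepiece` / `transformers` APIs; argument values and helper names are illustrative and may not match the exact settings of `train_chinese_sp.py` and `expend_tokenizer.py` in this repo.

```python
import sentencepiece as spm

# Step 1 (see 2.1): train a 20K-piece Chinese vocab on the line-per-sentence corpus.
# The argument values here are typical choices, not necessarily the repo's exact settings.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                          # one sentence per line
    model_prefix="sp_output/chinese_spm_20000",  # writes chinese_spm_20000.model / .vocab
    vocab_size=20000,
    model_type="bpe",
    character_coverage=0.9995,                   # keep the vast majority of Chinese characters
)
```

```python
import os
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

def merge_vocab_sketch(llama_tokenizer_dir, chinese_sp_model_file, output_dir="merged_sp"):
    """Step 2: append new Chinese pieces to the LLaMA2 sentencepiece model, skipping duplicates."""
    llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
    chinese_sp = spm.SentencePieceProcessor()
    chinese_sp.Load(chinese_sp_model_file)

    # parse both models into protobuf form so their pieces can be edited
    llama_spm = sp_pb2_model.ModelProto()
    llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
    chinese_spm = sp_pb2_model.ModelProto()
    chinese_spm.ParseFromString(chinese_sp.serialized_model_proto())

    existing = set(p.piece for p in llama_spm.pieces)
    for p in chinese_spm.pieces:
        if p.piece not in existing:
            new_p = sp_pb2_model.ModelProto().SentencePiece()
            new_p.piece = p.piece
            new_p.score = 0
            llama_spm.pieces.append(new_p)

    # save the merged sentencepiece model; it can then be wrapped with LlamaTokenizer
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "tokenizer.model"), "wb") as f:
        f.write(llama_spm.SerializeToString())
    print(f"merged vocab size: {len(llama_spm.pieces)}")
```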
198 |
199 | 至此,完成了 LLaMA 中文词表的扩充。扩充垂直领域词表的流程相同:准备垂直领域的训练语料,最好与通用领域的训练语料混合后再训练。
200 |
201 |
--------------------------------------------------------------------------------
/vllm/README.md:
--------------------------------------------------------------------------------
1 | # Tiny LLM vLLM 模型部署
2 |
3 | ## 1.vLLM 环境
4 |
5 | 注意:测试环境为 vllm=0.4.0
6 |
7 | 如果使用**CUDA 12 以上和PyTorch 2.1 以上**,可以直接使用以下命令安装vLLM。
8 |
9 | ```shell
10 | pip install vllm==0.4.0
11 | ```
12 |
13 | 否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
14 |
15 | 安装完成后,还需要以下操作~
16 |
17 | 1. 把 `vllm/tinyllm.py` 文件复制到env环境对应的 `vllm/model_executor/models` 目录下。
18 | 2. 然后在 `vllm/model_executor/models/__init__.py` 文件增加一行代码
19 |
20 | ```python
21 | "TinyllmForCausalLM": ("tinyllm", "TinyllmForCausalLM"),
22 | ```
23 |
24 | > 由于模型结构是自己定义的,vllm官方未实现,需要自己手动加入
25 |
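For reference, in vLLM 0.4.0 this entry goes into the `_MODELS` registry dict inside that `__init__.py`. A minimal sketch of the edited file is shown below; the surrounding entries are only examples of the existing format, and the layout may differ in other vLLM versions.

```python
# vllm/model_executor/models/__init__.py (sketch of the vLLM 0.4.0 registry layout)
_MODELS = {
    # ... existing entries, for example:
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    # added for Tiny LLM: maps the architecture name in config.json
    # to (module file name, model class name)
    "TinyllmForCausalLM": ("tinyllm", "TinyllmForCausalLM"),
}
```
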
26 | ## 2.vLLM OpenAI API 接口
27 |
28 | 使用 vLLM 部署实现 OpenAI API 协议的服务器非常方便。默认会在 http://localhost:8000 启动服务器。服务器一次只托管一个模型,并实现了列出模型(list models)、completions 和 chat completions 等接口。
29 |
30 | - completions:是基本的文本生成任务,模型会在给定的提示后生成一段文本。这种类型的任务通常用于生成文章、故事、邮件等。
31 | - chat completions:是面向对话的任务,模型需要理解和生成对话。这种类型的任务通常用于构建聊天机器人或者对话系统。
32 |
33 | 在创建服务器时,可以指定模型名称、模型路径、聊天模板等参数。
34 |
35 | - --host 和 --port 参数指定地址。
36 | - --model 参数指定模型名称。
37 | - --chat-template 参数指定聊天模板。
38 | - --served-model-name 指定服务模型的名称。
39 | - --max-model-len 指定模型的最大长度。
40 |
41 | #### 启动服务
42 |
43 | ```shell
44 | python -m vllm.entrypoints.openai.api_server \
45 | --served-model-name tinyllm_92m \
46 | --model wdn/tiny_llm_sft_92m \
47 | --trust-remote-code \
48 | --tensor-parallel-size 1 \
49 | --max-model-len 1024 \
50 | ```
51 | #### 查看当前模型列表
52 |
53 | ```shell
54 | curl http://localhost:8000/v1/models
55 | ```
56 |
57 | 得到的返回值如下所示
58 |
59 | ```json
60 | {
61 | "object": "list",
62 | "data": [
63 | {
64 | "id": "tinyllm_92m",
65 | "object": "model",
66 | "created": 1717735884,
67 | "owned_by": "vllm",
68 | "root": "tiny_llm_sft_92m",
69 | "parent": null,
70 | "permission": [
71 | {
72 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18",
73 | "object": "model_permission",
74 | "created": 1720594920,
75 | "allow_create_engine": false,
76 | "allow_sampling": true,
77 | "allow_logprobs": true,
78 | "allow_search_indices": false,
79 | "allow_view": true,
80 | "allow_fine_tuning": false,
81 | "organization": "*",
82 | "group": null,
83 | "is_blocking": false
84 | }
85 | ]
86 | }
87 | ]
88 | }
89 | ```
90 | #### 测试OpenAI Completions API
91 |
92 | ```shell
93 | curl http://localhost:8000/v1/completions \
94 | -H "Content-Type: application/json" \
95 | -d '{
96 | "model": "tinyllm_92m",
97 | "prompt": "你好",
98 | "max_tokens": 50,
99 | "temperature": 0
100 | }'
101 | ```
102 |
103 | 得到返回值
104 |
105 | ```json
106 | {
107 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18",
108 | "object": "text_completion",
109 | "created": 1720594920,
110 | "model": "tinyllm_92m",
111 | "choices": [
112 | {
113 | "index": 0,
114 | "text": "你好,我是TinyLLM,一个由wdndev开发的人工智能助手。我可以回答各种问题、提供信息、执行任务和提供帮助。",
115 | "logprobs": null,
116 | "finish_reason": "length",
117 | "stop_reason": null
118 | }
119 | ],
120 | "usage": {
121 | "prompt_tokens": 1,
122 | "total_tokens": 51,
123 | "completion_tokens": 50
124 | }
125 | }
126 | ```
127 |
128 | #### 使用Python脚本请求 OpenAI Completions API
129 |
130 | ```python
131 | from openai import OpenAI
132 | client = OpenAI(
133 | base_url="http://localhost:8000/v1",
134 | api_key="sk-xxx", # 随便填写,只是为了通过接口参数校验
135 | )
136 |
137 | completion = client.chat.completions.create(
138 | model="tinyllm_92m",
139 | messages=[
140 | {"role": "user", "content": "你好"}
141 | ]
142 | )
143 |
144 | print(completion.choices[0].message)
145 | ```
146 |
147 | 返回值
148 |
149 | ```shell
150 | ChatCompletionMessage(content='
151 | 你好,我是TinyLLM,一个由wdndev开发的人工智能助手。我可以回答各种问题、提供信息、执行任务和提供帮助。', role='assistant', function_call=None, tool_calls=None)
152 | ```
153 |
154 | #### 使用curl测试 OpenAI Chat Completions API
155 |
156 | ```shell
157 | curl http://localhost:8000/v1/chat/completions \
158 | -H "Content-Type: application/json" \
159 | -d '{
160 | "model": "tinyllm_92m",
161 | "messages": [
162 | {"role": "system", "content": "You are a helpful assistant."},
163 | {"role": "user", "content": "请介绍一下北京"}
164 | ]
165 | }'
166 |
167 | ```
168 | 返回结果
169 | ```json
170 | {
171 | "id": "cmpl-55520539697749e7bc6f0243bf2dae18",
172 | "object": "chat.completion",
173 | "created": 1720594920,
174 | "model": "tinyllm_92m",
175 | "choices": [
176 | {
177 | "index": 0,
178 | "message": {
179 | "role": "assistant",
180 | "content": ":北京是中国的首都,也是中国改革开放的前沿城市之一,也是中国的首都。首都有着丰富的历史和文化底蕴,是中国的重要首都之一。"
181 | },
182 | "logprobs": null,
183 | "finish_reason": "stop",
184 | "stop_reason": null
185 | }
186 | ],
187 | "usage": {
188 | "prompt_tokens": 24,
189 | "total_tokens": 55,
190 | "completion_tokens": 31
191 | }
192 | }
193 | ```
194 |
195 | #### 使用 python 测试OpenAI Chat Completions API
196 |
197 | ```python
198 | # vllm_openai_chat_completions.py
199 | from openai import OpenAI
200 | openai_api_key = "sk-xxx" # 随便填写,只是为了通过接口参数校验
201 | openai_api_base = "http://localhost:8000/v1"
202 |
203 | client = OpenAI(
204 | api_key=openai_api_key,
205 | base_url=openai_api_base,
206 | )
207 |
208 | chat_outputs = client.chat.completions.create(
209 | model="tinyllm_92m",
210 | messages=[
211 | {"role": "system", "content": "You are a helpful assistant."},
212 | {"role": "user", "content": "你好"},
213 | ]
214 | )
215 | print(chat_outputs)
216 | ```
217 |
218 | ## 3.vLLM python调用
219 |
220 | 首先从 vLLM 库中导入 LLM 和 SamplingParams 类。LLM 类是使用 vLLM 引擎运行离线推理的主要类。SamplingParams 类指定采样过程的参数,用于控制和调整生成文本的随机性和多样性。
221 |
222 | vLLM 提供了非常方便的封装,直接传入模型名称或模型路径即可,不必手动初始化模型和分词器。
223 |
224 | ```python
225 | # vllm_model.py
226 | from vllm import LLM, SamplingParams
227 | from transformers import AutoTokenizer
228 | import os
229 | import json
230 |
231 | # 自动下载模型时,指定使用modelscope。不设置的话,会从 huggingface 下载
232 | os.environ['VLLM_USE_MODELSCOPE']='True'
233 |
234 | def get_completion(prompts, model, tokenizer=None, max_tokens=512, temperature=0.8, top_p=0.95, max_model_len=2048):
235 | stop_token_ids = [151329, 151336, 151338]
236 | # 创建采样参数。temperature 控制生成文本的多样性,top_p 控制核心采样的概率
237 | sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
238 | # 初始化 vLLM 推理引擎
239 | llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len,trust_remote_code=True)
240 | outputs = llm.generate(prompts, sampling_params)
241 | return outputs
242 |
243 |
244 | if __name__ == "__main__":
245 | # 初始化 vLLM 推理引擎
246 | model='/personal/wdn/tiny_llm_sft_92m' # 指定模型路径
247 | # model="wdn/tiny_llm_sft_92m" # 指定模型名称,自动下载模型
248 | tokenizer = None
249 | # 加载分词器后传入vLLM 模型,但不是必要的。
250 | # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
251 |
252 | text = ["你好。",
253 | "请介绍一下北京。"]
254 |
255 | outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=512, temperature=1, top_p=1, max_model_len=2048)
256 |
257 | # 输出是一个包含 prompt、生成文本和其他信息的 RequestOutput 对象列表。
258 | # 打印输出。
259 | for output in outputs:
260 | prompt = output.prompt
261 | generated_text = output.outputs[0].text
262 | print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
263 | ```
264 |
265 |
266 |
--------------------------------------------------------------------------------
/train/ptm_train.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import numpy as np
3 | import os
4 | import glob
5 | import sys
6 | import math
7 | import json
8 | from dataclasses import dataclass, field
9 | # from itertools import chain
10 | from typing import Optional, List, Dict, Any, Mapping
11 | # from pathlib import Path
12 | import datasets
13 | import torch
14 | import torch.nn as nn
15 | # from torch.optim import AdamW
16 | # from torch.optim.lr_scheduler import LambdaLR
17 | # from datasets import load_dataset, concatenate_datasets, Dataset
18 | from datetime import datetime, timezone
19 | import transformers
20 | from transformers import (
21 | CONFIG_MAPPING,
22 | MODEL_FOR_CAUSAL_LM_MAPPING,
23 | AutoConfig,
24 | AutoModelForCausalLM,
25 | HfArgumentParser,
26 | Trainer,
27 | TrainingArguments,
28 | is_torch_tpu_available,
29 | set_seed,
30 | )
31 | from transformers.utils.versions import require_version
32 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
33 |
34 | from configuration_tinyllm import TinyllmConfig
35 | from modeling_tinyllm import TinyllmForCausalLM
36 | from tinyllm_dataset import PTMDataset
37 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
38 |
39 | MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
40 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
41 |
42 | @dataclass
43 | class ModelArguments:
44 | """ 模型相关参数
45 | """
46 | hidden_size : Optional[int] = field(
47 | default=512,
48 | metadata={"help": "hidden_size"}
49 | )
50 |
51 | num_hidden_layers : Optional[int] = field(
52 | default=8,
53 | metadata={"help": "num_hidden_layers"}
54 | )
55 |
56 | num_attention_heads : Optional[int] = field(
57 | default=8,
58 | metadata={"help": "transformer num_attention_heads"}
59 | )
60 |
61 | intermediate_size : Optional[int] = field(
62 | default=1408,
63 | metadata={"help": "intermediate_size"}
64 | )
65 |
66 | rope_theta : Optional[float] = field(
67 | default=10000.0,
68 | metadata={"help": "rope_theta"}
69 | )
70 |
71 | max_position_embeddings : Optional[int] = field(
72 | default=1024,
73 | metadata={"help": "max_position_embeddings"}
74 | )
75 |
76 | vocab_size : Optional[int] = field(
77 | default=64798,
78 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"}
79 | )
80 |
81 | @dataclass
82 | class ScriptArguments:
83 | """ 其他相关参数
84 | """
85 | mode : Optional[str] = field(
86 | default="ptm",
87 | metadata={"help": "save pretrain *bin file dir"}
88 | )
89 |
90 | dataset_dir_or_path : Optional[str] = field(
91 | default="data/pre_train",
92 | metadata={"help": "save pretrain *bin file dir"}
93 | )
94 |
95 | resume : Optional[bool] = field(
96 | default=False,
97 |         metadata={"help": "whether to resume training from the last checkpoint"}
98 | )
99 |
100 | base_model_path : Optional[str] = field(
101 | default=" ",
102 | metadata={"help": "SFT train, the base model path"}
103 | )
104 |
105 | def data_collator_fn(examples):
106 | # 将所有样本的输入 (`X`) 和标签 (`Y`) 分别堆叠
107 | input_ids = torch.stack([example[0] for example in examples])
108 | labels = torch.stack([example[1] for example in examples])
109 |
110 | # 返回一个字典,包含模型需要的键和值
111 | data_dict = {
112 | "input_ids": input_ids,
113 | "labels": labels
114 | }
115 | return data_dict
116 |
117 | logger = logging.getLogger(__name__)
118 |
119 | def main():
120 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments))
121 | model_args, script_args, training_args = parser.parse_args_into_dataclasses()
122 |
123 | # logger format
124 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S",
125 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN,
126 | handlers = [logging.StreamHandler(sys.stdout)],)
127 | if training_args.should_log:
128 | # The default of training_args.log_level is passive, so we set log level at info here to have that default.
129 | transformers.utils.logging.set_verbosity_info()
130 |
131 | log_level = training_args.get_process_log_level()
132 | logger.setLevel(log_level)
133 | datasets.utils.logging.set_verbosity(log_level)
134 | transformers.utils.logging.set_verbosity(log_level)
135 | transformers.utils.logging.enable_default_handler()
136 | transformers.utils.logging.enable_explicit_format()
137 |
138 | # Log on each process the small summary:
139 | logger.warning(
140 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
141 |         + f", distributed training: {bool(training_args.local_rank != -1)}, 16-bit training: {training_args.fp16}"
142 | )
143 |
144 | set_seed(training_args.seed)
145 |
146 | device = "cuda" if torch.cuda.is_available() else "cpu"
147 |
148 | # init model
149 | gpt_args = dict(
150 | hidden_size = model_args.hidden_size,
151 | num_hidden_layers = model_args.num_hidden_layers,
152 | num_attention_heads = model_args.num_attention_heads,
153 | intermediate_size = model_args.intermediate_size,
154 | rope_theta = model_args.rope_theta,
155 | max_position_embeddings = model_args.max_position_embeddings,
156 | vocab_size = model_args.vocab_size, # 64798
157 | )
158 | gpt_conf = TinyllmConfig(**gpt_args)
159 | model = TinyllmForCausalLM(gpt_conf)
160 | model.to(device)
161 |
162 | ################
163 | total_params = sum(p.numel() for p in model.parameters())
164 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
165 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params")
166 | logger.info(f"可训练参数: {trainable_params}")
167 | ##############
168 |
169 | def get_bin_files_abs_paths(directory):
170 | bin_files_paths = []
171 | for root, dirs, files in os.walk(directory):
172 | for file in files:
173 | if file.endswith('.bin'):
174 | bin_files_paths.append(os.path.abspath(os.path.join(root, file)))
175 | return bin_files_paths
176 | # data_path_list = glob.glob(os.path.join(script_args.dataset_dir_or_path, '*.bin'))
177 | data_path_list = get_bin_files_abs_paths(script_args.dataset_dir_or_path)
178 | if len(data_path_list) == 0:
179 | logger.error("***************NO INPUT DATA********************")
180 |
181 | train_ds = PTMDataset(data_path_list, max_length = model_args.max_position_embeddings, memmap=False)
182 |
183 | trainer = Trainer(
184 | model = model,
185 | args = training_args,
186 | train_dataset = train_ds,
187 | # eval_dataset = None,
188 | # data_collator = data_collator_fn,
189 | )
190 | # Training
191 | trainer.train(script_args.resume)
192 | torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir))
193 | last_model_dir = os.path.join(training_args.output_dir, 'last_ptm_model')
194 | os.makedirs(last_model_dir, exist_ok=True)
195 | # https://github.com/huggingface/transformers/issues/28630
196 | model.save_pretrained(last_model_dir, safe_serialization=False)
197 |
198 |
199 | if __name__ == "__main__":
200 | main()
201 |
202 |
--------------------------------------------------------------------------------
/script/rm_demo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 |
5 | # export CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7"
6 |
7 | source /home/.bashrc
8 | source /home/miniconda3/etc/profile.d/conda.sh
9 | conda activate md_llm
10 | which python
11 |
12 | function killall {
13 | echo `ps -ef | grep $1 | grep -v grep | awk '{print $2}'`
14 | ps -ef | grep $1 | grep -v grep | awk '{print $2}' |xargs kill -9
15 | }
16 |
17 | WORK_DIR="/personal/tiny-llm-zh"
18 | cd ${WORK_DIR}
19 |
20 | # 常见参数
21 | N_NODES=1
22 | N_GPUS=8
23 | MBS=16 # 单卡bs
24 | GAS=1 # 梯度累积
25 | GRAD_CLIP=1 # 梯度裁剪
26 | RANK=0
27 | MASTER_ADDR=`hostname -i`
28 | MASTER_PORT=2345
29 |
30 | LR=1e-4 # 初始学习率
31 | LR_SCHEDULER_TYPE="cosine"
32 | WARMUP_RATIO=0.00
33 |
34 | TRAIN_EPOCHS=5 # 训练轮次
35 | LOGGING_STEPS=50 # 记录日志步数
36 | CKPT_SAVE_STEPS=5000 # ckpt保存步数
37 |
38 | SEED=12
39 | DS_DTYPE="bf16" # [fp16, bf16]
40 | RESUME="False"
41 |
42 | IS_EVAL="False"
43 | EVAL_STEP=1000
44 | EVAL_MBS=16
45 |
46 | # 数据
47 | MODE="rm" # [ptm, sft, rm, rl]
48 | DATASET_DIR_OR_PATH="data/rm_train/rm_data.jsonl"
49 | BASE_MODEL_PATH="outputs/ckpt/ptm_tiny_llm_92m_epoch5_2/last_ptm_model"
50 |
51 | MODEL_SIZE="92m" # [16m, 42m, 92m, 210m, 440m]
52 | MODEL_NAME="${MODE}_tiny_llm_${MODEL_SIZE}"
53 | OUTPUT_DIR="outputs/ckpt/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
54 | mkdir -p $OUTPUT_DIR
55 | TRAIN_LOG="${OUTPUT_DIR}/train_$(date "+%Y%m%d%H%M").log"
56 | # tensorboard输出路径
57 | TB_DIR="outputs/tensorboard/${MODEL_NAME}_epoch${TRAIN_EPOCHS}"
58 | mkdir -p $TB_DIR
59 |
60 | TRAIN_ARGS=""
61 |
62 | DS_CONFIG_JSON=${OUTPUT_DIR}/${MODEL_SIZE}_ds_config.json
63 | ZERO_STAGE=2
64 |
65 | if [ $DS_DTYPE = "fp16" ];then
66 | TRAIN_ARGS+=" \
67 | --fp16 \
68 | "
69 | DS_FP16=true
70 | DS_BF16=false
71 | GAS_DTYPE=$DS_DTYPE
72 | elif [ $DS_DTYPE = "bf16" ];then
73 | TRAIN_ARGS+=" \
74 | --bf16 \
75 | "
76 | DS_FP16=false
77 | DS_BF16=true
78 | GAS_DTYPE="fp32"
79 |
80 | fi
81 |
82 | cat <<EOT > $DS_CONFIG_JSON
83 | {
84 | "train_micro_batch_size_per_gpu": $MBS,
85 | "train_batch_size": "auto",
86 | "gradient_clipping": ${GRAD_CLIP},
87 | "zero_optimization": {
88 | "stage": $ZERO_STAGE
89 | },
90 | "bf16": {
91 | "enabled": ${DS_BF16}
92 | },
93 | "data_types": {
94 | "grad_accum_dtype": "${GAS_DTYPE}"
95 | },
96 | "fp16": {
97 | "enabled": ${DS_FP16},
98 | "loss_scale": 0,
99 | "loss_scale_window": 200,
100 | "hysteresis": 5,
101 | "min_loss_scale": 1,
102 | "initial_scale_power": 12
103 | },
104 | "steps_per_print": 10,
105 | "wall_clock_breakdown": true,
106 | "comms_logger": {
107 | "enabled": true,
108 | "verbose": false,
109 | "prof_all": false,
110 | "debug": false
111 | },
112 | "flops_profiler": {
113 | "enabled": false,
114 | "profile_step": 30,
115 | "module_depth": -1,
116 | "top_modules": 1,
117 | "detailed": true,
118 | "output_file": null
119 | }
120 | }
121 | EOT
122 |
123 |
124 | TRAIN_ARGS+=" \
125 | --seed ${SEED} \
126 | --output_dir ${OUTPUT_DIR} \
127 | --overwrite_output_dir \
128 | --deepspeed ${DS_CONFIG_JSON} \
129 | --per_device_train_batch_size ${MBS} \
130 | --gradient_accumulation_steps ${GAS} \
131 | --do_train \
132 | --num_train_epochs ${TRAIN_EPOCHS} \
133 | --logging_dir ${TB_DIR} \
134 | --logging_strategy steps \
135 | --logging_steps ${LOGGING_STEPS} \
136 | --weight_decay 0.01 \
137 | --adam_beta1 0.9 \
138 |     --adam_beta2 0.95 \
139 | --max_grad_norm ${GRAD_CLIP} \
140 | --lr_scheduler_type ${LR_SCHEDULER_TYPE} \
141 | --learning_rate ${LR} \
142 |     --warmup_ratio ${WARMUP_RATIO} \
143 | --weight_decay 0.01 \
144 | --save_strategy steps \
145 | --save_total_limit 3 \
146 | --save_steps ${CKPT_SAVE_STEPS} \
147 | --ddp_timeout 30000 \
148 | --logging_first_step True \
149 | --save_safetensors False \
150 | --ddp_find_unused_parameters False \
151 | --remove_unused_columns False \
152 | "
153 |
154 | if [ $IS_EVAL = "True" ];then
155 | TRAIN_ARGS+=" \
156 | --per_device_eval_batch_size ${EVAL_MBS} \
157 | --evaluation_strategy steps \
158 | --eval_steps ${EVAL_STEP} \
159 | "
160 | fi
161 |
162 | if [[ $MODEL_SIZE == "16m" ]];then
163 | HIDDEN_SIZE=120
164 | NUM_HIDDEN_LAYERS=6
165 | NUM_ATTENTION_HEADS=6
166 | INTERMEDIATE_SIZE=384
167 | ROPE_THETA=10000.0
168 | MAX_POSITION_EMBEDDINGS=512
169 | VOCAB_SIZE=64798
170 | elif [[ $MODEL_SIZE == "42m" ]];then
171 | HIDDEN_SIZE=288
172 | NUM_HIDDEN_LAYERS=6
173 | NUM_ATTENTION_HEADS=6
174 | INTERMEDIATE_SIZE=768
175 | ROPE_THETA=10000.0
176 | MAX_POSITION_EMBEDDINGS=512
177 | VOCAB_SIZE=64798
178 | elif [[ $MODEL_SIZE == "92m" ]];then
179 | HIDDEN_SIZE=512
180 | NUM_HIDDEN_LAYERS=8
181 | NUM_ATTENTION_HEADS=8
182 | INTERMEDIATE_SIZE=1408
183 | ROPE_THETA=10000.0
184 | MAX_POSITION_EMBEDDINGS=1024
185 | VOCAB_SIZE=64798
186 | elif [[ $MODEL_SIZE == "210m" ]];then
187 | HIDDEN_SIZE=768
188 | NUM_HIDDEN_LAYERS=16
189 | NUM_ATTENTION_HEADS=12
190 | INTERMEDIATE_SIZE=2048
191 | ROPE_THETA=10000.0
192 | MAX_POSITION_EMBEDDINGS=1024
193 | VOCAB_SIZE=64798
194 | elif [[ $MODEL_SIZE == "440m" ]];then
195 | HIDDEN_SIZE=1024
196 | NUM_HIDDEN_LAYERS=24
197 | NUM_ATTENTION_HEADS=16
198 | INTERMEDIATE_SIZE=2816
199 | ROPE_THETA=10000.0
200 | MAX_POSITION_EMBEDDINGS=1024
201 | VOCAB_SIZE=64798
202 | fi
203 |
204 | GPT_ARGS=" \
205 | --hidden_size ${HIDDEN_SIZE} \
206 | --num_hidden_layers ${NUM_HIDDEN_LAYERS} \
207 | --num_attention_heads ${NUM_ATTENTION_HEADS} \
208 | --intermediate_size ${INTERMEDIATE_SIZE} \
209 | --rope_theta ${ROPE_THETA} \
210 | --max_position_embeddings ${MAX_POSITION_EMBEDDINGS} \
211 | --vocab_size ${VOCAB_SIZE} \
212 | "
213 | SCRIPT_ARGS=" \
214 | --mode ${MODE} \
215 | --dataset_dir_or_path ${DATASET_DIR_OR_PATH} \
216 | --resume ${RESUME} \
217 | --base_model_path ${BASE_MODEL_PATH} \
218 | "
219 |
220 | DISTRIBUTED_ARGS=" \
221 | --nnodes $N_NODES \
222 | --nproc_per_node $N_GPUS \
223 | --node_rank $RANK \
224 | --master_addr $MASTER_ADDR \
225 | --master_port $MASTER_PORT \
226 | "
227 |
228 | # 检查num是否大于1
229 | if [ "$N_NODES" -ge 2 ]; then
230 | DISTRIBUTED_ARGS+=" \
231 | --node_rank $RANK \
232 | --master_addr $MASTER_ADDR \
233 | --master_port $MASTER_PORT \
234 | "
235 | fi
236 |
237 | # 所有参数
238 | ALL_ARGS=" $GPT_ARGS $TRAIN_ARGS $SCRIPT_ARGS "
239 |
240 | LAUNCHER="torchrun $DISTRIBUTED_ARGS train/rm_train.py "
241 |
242 | export CMD="$LAUNCHER $ALL_ARGS"
243 | echo $CMD
244 |
245 | killall train/rm_train.py
246 |
247 | # 执行训练
248 | $CMD 2>&1 | tee ${TRAIN_LOG}
249 |
250 | killall train/rm_train.py
251 |
252 | echo "train end : ${OUTPUT_DIR}"
253 | # nohup torchrun --standalone --nproc_per_node=$N_GPUS pretrain.py \
254 | # --out_dir="$OUTPUT_DIR/$MODEL_NAME" \
255 | # --vocab_size=$VOCAB_SIZE \
256 | # --max_seq_len=$VOCAB_SIZE \
257 | # --dim=$DIM \
258 | # --n_layers=$N_LAYERS \
259 | # --n_heads=$N_HEADS \
260 | # --n_kv_heads=$N_KV_HEADS \
261 | # --multiple_of=$MULTIPLE_OF \
262 | # --dropout=$DROPOUT \
263 | # --batch_size=$BATCH_SIZE \
264 | # >> $log_file 2>&1 &
265 |
266 |
267 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Tiny LLM zh
2 |
3 | ## 1.简介
4 |
5 | 本项目旨在构建一个小参数量的中文语言大模型,用于快速入门学习大模型相关知识;如果此项目对你有用,可以点一下 star,谢谢!
6 |
7 | 模型架构:整体模型架构采用开源通用架构,包括:RMSNorm,RoPE,MHA等
8 |
9 | 实现细节:实现大模型两阶段训练及后续人类对齐,即:分词(Tokenizer) -> 预训练(PTM) -> 指令微调(SFT) -> 人类对齐(RLHF, DPO) -> 测评 -> 量化 -> 部署。
10 |
11 | 项目已部署,可以在如下网站上体验。
12 |
13 | - [ModelScope Tiny LLM](https://www.modelscope.cn/studios/wdndev/tiny_llm_92m_demo/summary)
14 |
15 | 项目特点:
16 |
17 | - 公开全部数据及代码,包括预训练数据,tokenizer等;([Tiny LLM Datasets](doc/datasets_download.md))
18 | - 走通大模型整个流程:分词(Tokenizer) -> 预训练(PTM) -> 指令微调(SFT) -> 人类对齐(RLHF, DPO) -> 测评 -> 部署;
19 | - 公开预训练token 42B,SFT数据400w条,RL数据 17w条;
20 | - 训练 Tokenizer:10G 中文百科文本训练 20K 中文词表,与 Llama2 词表合并,构建Tiny LLM词表;
21 | - 使用 Transformers deepspeed 进行训练,支持多机多卡,支持 Zero 等优化技术;
22 | - 所有代码 `Bash` 脚本启动,支持不同大小的模型,如16m, 42m, 92m, 210m, 440m等;
23 | - 支持 MoE 架构,在 [tiny_llm_moe](https://github.com/wdndev/tiny-llm-zh/tree/tiny_llm_moe) 支持最新共享专家,平衡专家等技术;
24 | - 支持 vLLM 推理框架;
25 | - 支持 llama.cpp 推理框架;
26 |
27 |
28 | 本项目主要有三个分支,推荐学习 主分支,具体区别如下:
29 |
30 | - [llama2_torch](https://github.com/wdndev/tiny-llm-zh/tree/llama2_torch) : 模型架构采用原版 Llama2 架构,只是将部分的输入输出修改为适合训练的格式;
31 | - `main` `tiny_llm` : 对齐开源社区模型,使用Transformers库构建底层模型,也使用Transformers库进行多卡多机训练;
32 | - [tiny_llm_moe](https://github.com/wdndev/tiny-llm-zh/tree/tiny_llm_moe) : 在`tiny_llm`的基础上,修改 `MLP`层为MoE模型,使用Transformers库进行多卡多机训练。
33 |
34 | 注意:
35 |
36 | 1. 因资源限制,本项目的第一要务是走通大模型整个流程,而不是调教比较好的效果,故评测结果分数较低,部分生成错误。
37 | 2. 详细的数据处理,训练过程见 `doc` 文件夹(正在整理。。。)
38 |
39 |
40 | ## 2.快速开始
41 |
42 | 模型已托管在 [Huggingface](https://huggingface.co/wdndev/tiny_llm_sft_92m) 和 [ModelScope](https://www.modelscope.cn/models/wdndev/tiny_llm_sft_92m) 中,可运行代码自动下载。
43 |
44 | 建议使用 Huggingface 在线加载模型,如果运行不了,再试 ModelScope;如果需要本地运行,修改`model_id`中的路径为本地目录,即可运行。
45 |
46 | #### 依赖安装
47 |
48 | - python 3.8 and above
49 | - pytorch 2.0 and above
50 | - transformers 4.37.2 and above
51 | - CUDA 11.4 and above are recommended. (if training)
52 |
53 | ```bash
54 | pip install -r requirements.txt
55 | ```
56 |
57 |
58 | #### 🤗 HuggingFace
59 |
60 | ```python
61 | from transformers import AutoTokenizer, AutoModelForCausalLM
62 | from transformers.generation import GenerationConfig
63 |
64 | model_id = "wdndev/tiny_llm_sft_92m"
65 |
66 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
67 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
68 | generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True)
69 | sys_text = "你是由wdndev开发的个人助手。"
70 | # user_text = "世界上最大的动物是什么?"
71 | # user_text = "介绍一下刘德华。"
72 | user_text = "介绍一下中国。"
73 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
74 | "<|user|>", user_text.strip(),
75 | "<|assistant|>"]).strip() + "\n"
76 |
77 | generation_config.max_new_tokens = 200
78 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
79 | generated_ids = model.generate(model_inputs.input_ids, generation_config=generation_config)
80 | generated_ids = [
81 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
82 | ]
83 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
84 | print(response)
85 | ```
86 |
87 | #### 🤖 ModelScope
88 |
89 | ```python
90 | from modelscope import AutoModelForCausalLM, AutoTokenizer
91 |
92 | model_id = "wdndev/tiny_llm_sft_92m"
93 |
94 | tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
95 | model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
96 |
97 | sys_text = "你是由wdndev开发的个人助手。"
98 | # user_text = "世界上最大的动物是什么?"
99 | # user_text = "介绍一下刘德华。"
100 | user_text = "介绍一下中国。"
101 | input_txt = "\n".join(["<|system|>", sys_text.strip(),
102 | "<|user|>", user_text.strip(),
103 | "<|assistant|>"]).strip() + "\n"
104 |
105 | model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
106 | generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=200)
107 | generated_ids = [
108 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
109 | ]
110 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
111 | print(response)
112 | ```
113 |
114 |
115 | 生成效果
116 | ```bash
117 | 问:世界上最大的动物是什么?
118 | 答:目前已知最大的动物是蓝鲸(Balaenoptera musculus),这是一个庞大的哺乳动物,属于须鲸亚目、须鲸科中的最大物种。蓝鲸的身长可达30米以上,体重可达175吨。它们在海洋中生活,主要以浮游生物为食,如甲壳类动物和小型鱼类等。由于其巨大的体型和复杂的生态群落,蓝鲸成为海洋旅游的热门景点之一。
119 |
120 | 问:介绍一下刘德华。
121 | 答:刘德华是一位香港流行歌手、演员和导演,他在音乐界的贡献非常巨大。他是华语乐坛历史上最伟大的艺人之一,代表作品包括《爱我身体》和《肥皂泡》。他也经常参演电影和电视剧,并在电视上受到好评。
122 |
123 | 问:介绍一下中国。
124 | 答:中国是位于东亚的大陆,被欧洲以及亚洲和其他大陆所包围。它是中国第二大文明和世界上最大的经济体之一。中国的历史可以追溯到公元前5000年左右,从古至今都有其独特的文化和语言传承者。
125 |
126 | ```
127 |
128 | ## 3.模型
129 |
130 | ### 3.1 Tokenizer
131 |
132 | LLM分词器的构建方式有两种:一种是自己构造词表,训练一个分词器;另一种是选择开源模型训练好的分词器。
133 |
134 | 本项目为了方便,从优秀的开源项目中选择词表,考虑到训练的模型较小,且词表大小影响模型大小,故优先选择词表较小的开源项目;经过比较,最终选择 [ChatGLM3](https://huggingface.co/THUDM/chatglm3-6b) 的词表,该词表大小为 64798 。
135 |
136 | 自己构造词表方式见 [tokenizer](tokenizer/),扩充 LLaMA2的32K词表为50K,增加20K中文词表,详细扩充方式见[文档](./doc/)或[tokenizer/README.md](./tokenizer/README.md).
137 |
138 | 注意:本项目使用的ChatGLM3的词表。
139 |
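A quick sanity check of the vocabulary size (a small sketch; it assumes the tokenizer shipped with the released model loads the same way as in the quick-start code above):

```python
from transformers import AutoTokenizer

# the released SFT model ships the ChatGLM3-derived tokenizer used in all training stages
tokenizer = AutoTokenizer.from_pretrained("wdndev/tiny_llm_sft_92m", trust_remote_code=True)
print(len(tokenizer))  # should line up with the vocab_size=64798 used in the training scripts
```
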
140 | ### 3.2 模型结构
141 |
142 | 模型结构采用类Llama2的结构,具体包括:RMSNorm,RoPE,MHA等;
143 |
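For reference, a minimal PyTorch sketch of the RMSNorm building block, written from the standard formula; the repo's own `modeling_tinyllm.py` may differ in details:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: w * x / sqrt(mean(x^2) + eps), with no mean-centering."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```
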
144 |
145 | ### 3.3 模型尺寸
146 |
147 | 具体参数细节如下所示:
148 |
149 | | model | hidden size | intermediate size | n_layers | n_heads | max context length | params | vocab size |
150 | | ---------------- | ----------- | ----------------- | -------- | ------- | ------------------ | ------ | ---------- |
151 | | tiny-llm-16m | 120 | 384 | 6 | 6 | 512 | 16M | 64798 |
152 | | tiny-llm-42m | 288 | 768 | 6 | 6 | 512 | 42M | 64798 |
153 | | tiny-llm-92m     | 512         | 1408              | 8        | 8       | 1024               | 92M    | 64798      |
154 | | tiny-llm-210m | 768 | 2048 | 16 | 12 | 1024 | 210M | 64798 |
155 | | tiny-llm-440m | 1024 | 2816 | 24 | 16 | 1024 | 440M | 64798 |
156 | | tiny-llm-1_5b | 2048 | 5504 | 24 | 16 | 1024 | 1.5B | 64798 |
157 |
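The parameter counts in the table can be reproduced with a back-of-the-envelope estimate (a sketch that assumes untied input/output embeddings and ignores the small norm weights):

```python
def approx_params(vocab: int, hidden: int, inter: int, layers: int) -> int:
    embed = 2 * vocab * hidden                             # token embedding + untied lm_head
    per_layer = 4 * hidden * hidden + 3 * hidden * inter   # Q/K/V/O proj + gate/up/down MLP
    return embed + layers * per_layer

print(f"{approx_params(64798, 512, 1408, 8) / 1e6:.0f}M")    # ~92M  (tiny-llm-92m)
print(f"{approx_params(64798, 1024, 2816, 24) / 1e6:.0f}M")  # ~441M (tiny-llm-440m)
```
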
158 |
159 | ### 3.4 模型评估
160 |
161 | 因训练数据和微调数据,大部分都是中文数据,所以在`C-Eval`和`CMMLU`这两个数据集上进行模型的评估;使用[OpenCompass](https://github.com/open-compass/opencompass)工具,进行模型评估,评估分数如下所示:
162 |
163 | | model | Type | C-Eval | CMMLU |
164 | | ---------------- | ----- | ------- | ------- |
165 | | tiny-llm-92m | Base | 23.48 | 25.02 |
166 | | tiny-llm-92m | Chat | 26.79 | 26.59 |
167 |
168 | Base模型,采用评测方式 ppl 方式进行评测;Chat模型,采用 gen 方式评测。具体区别如下图所示:
169 |
170 | 
171 |
172 | > 来源:[ppl和gen模式有什么区别](https://github.com/open-compass/opencompass/discussions/597)
173 |
174 | 注意:只对常用的两个模型进行了评测,分数较低,其余模型评测意义不大。
175 |
176 |
177 | ## 4.模型部署
178 |
179 | ### 4.1 网页Demo
180 |
181 | 网页Demo已部署,可以在如下网站上体验:[ModelScope Tiny LLM](https://www.modelscope.cn/studios/wdndev/tiny_llm_92m_demo/summary)
182 |
183 | 如果想在本地运行网页Demo,注意修改 `web_demo.py` 文件中模型的路径`model_id`,输入如下命令即可运行:
184 |
185 | ```shell
186 | streamlit run web_demo.py
187 | ```
188 |
189 | 
190 |
191 | ### 4.2 Transformers
192 |
193 | Transformers 框架部署,位于 `demo/infer_chat.py` 和 `demo/infer_func.py` 文件中,和其他LLM运行无太大区别,注意输入的拼接即可。
194 |
195 |
196 | ### 4.3 FastAPI
197 |
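A minimal FastAPI serving sketch (the prompt concatenation follows the quick-start examples above; the file name, endpoint and request fields here are illustrative and not part of this repo):

```python
# fastapi_demo.py (illustrative sketch)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "wdndev/tiny_llm_sft_92m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

app = FastAPI()

class ChatRequest(BaseModel):
    system: str = "你是由wdndev开发的个人助手。"
    query: str

@app.post("/chat")
def chat(req: ChatRequest):
    # same prompt concatenation as in the quick-start examples
    input_txt = "\n".join(["<|system|>", req.system.strip(),
                           "<|user|>", req.query.strip(),
                           "<|assistant|>"]).strip() + "\n"
    inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
    generated = model.generate(inputs.input_ids, max_new_tokens=200)
    new_tokens = generated[0][inputs.input_ids.shape[1]:]
    return {"response": tokenizer.decode(new_tokens, skip_special_tokens=True)}

# run with: uvicorn fastapi_demo:app --host 0.0.0.0 --port 8000
```
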
198 |
199 |
200 | ### 4.4 vllm
201 |
202 | 详细vllm部署见 [vllm](vllm/README.md)
203 |
204 | 如果使用**CUDA 12 以上和PyTorch 2.1 以上**,可以直接使用以下命令安装vLLM。
205 |
206 | ```shell
207 | pip install vllm==0.4.0
208 | ```
209 |
210 | 否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
211 |
212 | 安装完成后,还需要以下操作~
213 |
214 | 1. 把 `vllm/tinyllm.py` 文件复制到env环境对应的 `vllm/model_executor/models` 目录下。
215 | 2. 然后在 `vllm/model_executor/models/__init__.py` 文件增加一行代码
216 |
217 | ```python
218 | "TinyllmForCausalLM": ("tinyllm", "TinyllmForCausalLM"),
219 | ```
220 |
221 | > 由于模型结构是自己定义的,vllm官方未实现,需要自己手动加入
222 |
223 | ### 4.5 llama.cpp
224 |
225 | 详细 llama.cpp 部署见 [llama.cpp](llama.cpp/README.md)
226 |
227 | Tiny LLM 92M 模型已支持 llama.cpp C++ 推理框架,建议在 Linux 环境下测试,Windows 下效果不好;
228 |
229 | 所支持 llama.cpp 为自己修改的版本,仓库链接为: [llama.cpp.tinyllm](https://github.com/wdndev/llama.cpp.tinyllm)
230 |
--------------------------------------------------------------------------------
/utils/pre_train_process.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import glob
4 | import numpy as np
5 | from tqdm import tqdm
6 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
7 | import pandas as pd
8 |
9 |
10 | def process_wiki_clean(file_path, tokenizer):
11 | """ https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered
12 | """
13 | with open(file_path, 'r', encoding='utf-8') as f:
14 | data = json.load(f)
15 | all_tokens = []
16 | for line in tqdm(data):
17 | text = line['completion']
18 | tokens = tokenizer.encode(text, add_special_tokens=False)
19 |         tokens.append(tokenizer.special_tokens['<eos>'])
20 | if len(tokens) > 5:
21 | all_tokens += tokens
22 | arr = np.array(all_tokens, dtype=np.uint16)
23 | base_name, ext = os.path.splitext(file_path)
24 | output_file_path = base_name + '.bin'
25 | with open(output_file_path, 'wb') as f:
26 | f.write(arr.tobytes())
27 |
28 | def process_webnovel(input_dir, tokenizer):
29 | for subdir, dirs, files in os.walk(input_dir):
30 | for idx, file in enumerate(files):
31 |             # 只处理 jsonl 文件
32 | if file.endswith('.jsonl'):
33 | # 获取当前文件的绝对路径
34 | file_path = os.path.join(subdir, file)
35 | all_tokens = []
36 | # 读取jsonl文件
37 | with open(file_path, 'r', encoding='utf-8') as infile:
38 | lines = infile.readlines()
39 |
40 | for line in tqdm(lines):
41 | json_obj = json.loads(line) # 解析json字符串为python对象
42 | text = json_obj['text']
43 | tokens = tokenizer.encode(text, add_special_tokens=False)
44 |                     tokens.append(tokenizer.special_tokens['<eos>'])
45 | if len(tokens) > 5:
46 | all_tokens += tokens
47 |
48 | arr = np.array(all_tokens, dtype = np.uint16)
49 | base_name, ext = os.path.splitext(file_path)
50 | output_file_path = base_name + '.bin'
51 | with open(output_file_path, 'wb') as f:
52 | f.write(arr.tobytes())
53 |
54 | def process_tigerbot_wiki(input_dir, tokenizer):
55 | """ https://huggingface.co/datasets/TigerResearch/tigerbot-wiki-plugin
56 | """
57 | for subdir, dirs, files in os.walk(input_dir):
58 | for idx, file in enumerate(files):
59 |             # 只处理 json 文件
60 | if file.endswith('.json'):
61 | # 获取当前文件的绝对路径
62 | file_path = os.path.join(subdir, file)
63 | all_tokens = []
64 | # 读取jsonl文件
65 | with open(file_path, 'r', encoding='utf-8') as infile:
66 | lines = infile.readlines()
67 |
68 | for line in tqdm(lines):
69 | json_obj = json.loads(line) # 解析json字符串为python对象
70 | text = json_obj['text']
71 | tokens = tokenizer.encode(text, add_special_tokens=False)
72 |                     tokens.append(tokenizer.special_tokens['<eos>'])
73 | if len(tokens) > 5:
74 | all_tokens += tokens
75 |
76 | arr = np.array(all_tokens, dtype = np.uint16)
77 | base_name, ext = os.path.splitext(file_path)
78 | output_file_path = base_name + '.bin'
79 | with open(output_file_path, 'wb') as f:
80 | f.write(arr.tobytes())
81 |
82 | def process_tigerbot_part(input_dir, tokenizer):
83 | """ https://huggingface.co/datasets/TigerResearch/pretrain_zh
84 | """
85 | # df = pd.read_parquet("zhizhu/train-00000-of-00005-a1278ede4e8c5cdb.parquet")
86 | # responses = df['RESPONSE']
87 | # print(len(responses))
88 | # print(responses[4000])
89 | all_tokens = []
90 | total_len = 0
91 | file_idx = 7
92 | # 使用glob找出文件夹下所有的.parquet
93 | for file in glob.glob(os.path.join(input_dir, '*.parquet')):
94 | # log which file is being processed
95 | print(file)
96 | # 读取parquet文件
97 | df = pd.read_parquet(file)
98 |
99 | # extract the content column
100 | responses = df['content']
101 |
102 | for text in tqdm(responses):
103 | tokens = tokenizer.encode(text, add_special_tokens=False)
104 | tokens.append(tokenizer.special_tokens['<eos>'])
105 | if len(tokens) > 5:
106 | all_tokens += tokens
107 |
108 | total_len += len(df)
109 | if total_len > 600000:
110 | arr = np.array(all_tokens, dtype=np.uint16)
111 | output_file_path = "tigerbot_part_" + str(file_idx) + '.bin'
112 | with open(output_file_path, 'wb') as f:
113 | f.write(arr.tobytes())
114 |
115 | all_tokens = []
116 | total_len = 0
117 | file_idx += 1
118 |
119 | if len(all_tokens) > 0:
120 | arr = np.array(all_tokens, dtype=np.uint16)
121 | output_file_path = "tigerbot_part_" + str(file_idx) + '.bin'
122 | with open(output_file_path, 'wb') as f:
123 | f.write(arr.tobytes())
124 |
125 | def process_zhihu(input_dir, tokenizer):
126 | """ https://huggingface.co/datasets/wangrui6/Zhihu-KOL
127 | """
128 | # df = pd.read_parquet("zhizhu/train-00000-of-00005-a1278ede4e8c5cdb.parquet")
129 | # responses = df['RESPONSE']
130 | # print(len(responses))
131 | # print(responses[4000])
132 | all_tokens = []
133 | # 使用glob找出文件夹下所有的.parquet
134 | for file in glob.glob(os.path.join(input_dir, '*.parquet')):
135 | # log which file is being processed
136 | print(file)
137 | # 读取parquet文件
138 | df = pd.read_parquet(file)
139 |
140 | # 提取RESPONSE列
141 | responses = df['RESPONSE']
142 |
143 | for text in tqdm(responses):
144 | tokens = tokenizer.encode(text, add_special_tokens=False)
145 | tokens.append(tokenizer.special_tokens['<eos>'])
146 | if len(tokens) > 5:
147 | all_tokens += tokens
148 | arr = np.array(all_tokens, dtype=np.uint16)
149 | # base_name, ext = os.path.splitext(file_path)
150 | output_file_path = "zhihu" + '.bin'
151 | with open(output_file_path, 'wb') as f:
152 | f.write(arr.tobytes())
153 |
154 | def process_baidu_baike(input_path, tokenizer):
155 | """ https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M
156 | """
157 | BATCH_SIZE = 1000000
158 |
159 | cnt = 0
160 | batch_cnt = 0
161 | token = 0
162 | doc_ids = []
163 |
164 | f1 = open(input_path, 'r', encoding='utf-8')
165 |
166 | while True:
167 | line = f1.readline()
168 | if not line:
169 | break
170 | line = json.loads(line)
171 | text = ''
172 | try:
173 | text += line['title']+':' + line['summary']
174 | except:
175 | pass
176 | for per in line['sections']:
177 | text += per['title']+':'+per['content']+'。'
178 | text_id = tokenizer.encode(text, add_special_tokens=False)
179 | text_id.append(tokenizer.special_tokens['<eos>'])
180 | if len(text_id) > 5:
181 | doc_ids += text_id
182 | cnt += 1
183 | if cnt % BATCH_SIZE==0:
184 | batch_cnt += 1
185 | arr = np.array(doc_ids, dtype=np.uint16)
186 | doc_ids=[]
187 | print('cnt:',cnt,'arr_shape:',arr.shape)
188 | with open('./baidubaike_563w_{}.bin'.format(batch_cnt),'wb') as f2:
189 | f2.write(arr.tobytes())
190 | del arr
191 |
192 | if doc_ids:  # flush the remaining tokens that did not fill a full batch
193 | batch_cnt += 1
194 | arr = np.array(doc_ids, dtype=np.uint16)
195 | print('cnt:',cnt,'arr_shape:',arr.shape)
196 | with open('./baidubaike_563w_{}.bin'.format(batch_cnt),'wb') as f:
197 | f.write(arr.tobytes())
198 |
199 | def merge_bin(data_path_list : list):
200 | """ 合并所有bin文件
201 | """
202 | data_arr = []
203 | for data_path in tqdm(data_path_list):
204 | with open(data_path,'rb') as f:
205 | data = np.fromfile(f,dtype = np.uint16)
206 | data_arr.append(data)
207 | arr = np.concatenate(data_arr)
208 | print(arr.shape)
209 | with open('./data/pretrain_data.bin','wb') as f:
210 | f.write(arr.tobytes())
211 |
212 | if __name__=="__main__":
213 | tokenizer = ChatGLMTokenizer(vocab_file='utils/chatglm3_tokenizer/tokenizer.model')
214 |
215 | # process_webnovel("webnovel-chinese/data", tokenizer)
216 | # process_wiki_clean("corpus/pre_train/wiki_cn/wikipedia-cn.json", tokenizer)
217 | # process_zhihu("corpus/pre_train/zhihu", tokenizer)
218 | process_tigerbot_part("corpus/pre_train/tigerbot2", tokenizer)
219 | # process_baidu_baike('corpus/pre_train/baidubaike/563w_baidubaike.json', tokenizer)
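220 | 
221 | # A hypothetical usage sketch (not part of the original script): once the corpora above
222 | # have been tokenized into .bin shards, merge_bin can combine them into the single
223 | # pretraining file written under ./data. The glob pattern below is an assumption about
224 | # where the shards live.
225 | # bin_files = sorted(glob.glob("corpus/pre_train/**/*.bin", recursive=True))
226 | # merge_bin(bin_files)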
--------------------------------------------------------------------------------
/train/rm_train.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import numpy as np
3 | import os
4 | import evaluate
5 | import glob
6 | import sys
7 | import math
8 | import json
9 | from dataclasses import dataclass, field
10 | # from itertools import chain
11 | from typing import Optional, List, Dict, Any, Mapping
12 | # from pathlib import Path
13 | import datasets
14 | import torch
15 | import torch.nn as nn
16 | # from torch.optim import AdamW
17 | # from torch.optim.lr_scheduler import LambdaLR
18 | # from datasets import load_dataset, concatenate_datasets, Dataset
19 | from torch.utils.data import Dataset, DataLoader, random_split
20 | from datetime import datetime, timezone
21 | import transformers
22 | from transformers import (
23 | CONFIG_MAPPING,
24 | MODEL_FOR_CAUSAL_LM_MAPPING,
25 | AutoConfig,
26 | AutoModelForCausalLM,
27 | HfArgumentParser,
28 | Trainer,
29 | TrainingArguments,
30 | is_torch_tpu_available,
31 | set_seed,
32 | )
33 | from transformers.utils.versions import require_version
34 | from sklearn.metrics import accuracy_score
35 | from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
36 |
37 | from configuration_tinyllm import TinyllmConfig
38 | from modeling_tinyllm import TinyllmForCausalLM, TinyllmForSequenceClassification
39 | from tinyllm_dataset import RMDataset
40 | from utils.chatglm3_tokenizer.tokenization_chatglm import ChatGLMTokenizer
41 |
42 | @dataclass
43 | class ModelArguments:
44 | """ 模型相关参数
45 | """
46 | hidden_size : Optional[int] = field(
47 | default=512,
48 | metadata={"help": "hidden_size"}
49 | )
50 |
51 | num_hidden_layers : Optional[int] = field(
52 | default=8,
53 | metadata={"help": "num_hidden_layers"}
54 | )
55 |
56 | num_attention_heads : Optional[int] = field(
57 | default=8,
58 | metadata={"help": "transformer num_attention_heads"}
59 | )
60 |
61 | intermediate_size : Optional[int] = field(
62 | default=1408,
63 | metadata={"help": "intermediate_size"}
64 | )
65 |
66 | rope_theta : Optional[float] = field(
67 | default=10000.0,
68 | metadata={"help": "rope_theta"}
69 | )
70 |
71 | max_position_embeddings : Optional[int] = field(
72 | default=1024,
73 | metadata={"help": "max_position_embeddings"}
74 | )
75 |
76 | vocab_size : Optional[int] = field(
77 | default=64798,
78 | metadata={"help": "vocab_size, ref https://github.com/THUDM/ChatGLM3/issues/634"}
79 | )
80 |
81 | @dataclass
82 | class ScriptArguments:
83 | """ 其他相关参数
84 | """
85 | mode : Optional[str] = field(
86 | default="rm",
87 | metadata={"help": "save sft *bin file dir"}
88 | )
89 |
90 | dataset_dir_or_path : Optional[str] = field(
91 | default="data/rm_train",
92 | metadata={"help": "save rmtrain *bin file dir"}
93 | )
94 |
95 | resume : Optional[bool] = field(
96 | default=False,
97 | metadata={"help": "use PyTorch 2.0 to compile the model to be faster"}
98 | )
99 |
100 | base_model_path : Optional[str] = field(
101 | default=" ",
102 | metadata={"help": "SFT train, the base model path"}
103 | )
104 |
105 | class RMTrainer(Trainer):
106 | def compute_loss(self, model, inputs, return_outputs=False):
107 | """ Define how to compute the reward loss.
108 | We use the InstructGPT pairwise logloss: https://arxiv.org/abs/2203.02155
109 | """
110 | rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
111 | rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
112 | loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
113 | if return_outputs:
114 | return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
115 | return loss
116 |
117 | # Define the metric that we'll use for validation.
118 | # accuracy = evaluate.load("accuracy")
119 | # def compute_metrics(eval_pred):
120 | # predictions, _ = eval_pred
121 | # # Here, predictions is rewards_j and rewards_k.
122 | # # We want to see how much of the time rewards_j > rewards_k.
123 | # # 是这么计算的:
124 | # # 通过 argmax,得到最大值的 index,当 rewards_j 最大时,返回 0,rewards_k 最大时,返回 1
125 | # # 正确标签应该是全部为 0(index都在 0 这里)
126 |
127 | # # Q: model的输出不是一个score吗,为什么这里可以使用argmax?
128 | # # A: 下面的 compute_loss 中定义了新的model forward 方法,即会接受两个输入产生两个输出
129 | # # Trainer 中会把这种两个输出拼起来,从而得到一个在axis=0维度上有两项的形式,因此argmax就是看哪一项更大
130 | # # 具体可以参考 Trainer 中对 涉及到 compute_loss/logits/training_step/prediction_step 的部分,以及 _gather_and_numpify 方法
131 | # predictions = np.argmax(predictions, axis=0)
132 | # labels = np.zeros(predictions.shape)
133 | # return accuracy.compute(predictions=predictions, references=labels)
134 |
135 | from sklearn.metrics import accuracy_score
136 | def compute_metrics(eval_preds):
137 | predictions = eval_preds.predictions
138 | preds = np.argmax(predictions, axis=1).reshape(-1)
139 | labels = np.zeros(preds.shape)
140 | metric = {
141 | "accuracy": float(
142 | accuracy_score(labels, preds, normalize=True)
143 | ),
144 | }
145 | return metric
146 |
147 | logger = logging.getLogger(__name__)
148 |
149 | def main():
150 | parser = HfArgumentParser((ModelArguments, ScriptArguments, TrainingArguments))
151 | model_args, script_args, training_args = parser.parse_args_into_dataclasses()
152 |
153 | # logger format
154 | logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S",
155 | level = logging.WARN, # if training_args.local_rank in [-1, 0] else logging.WARN,
156 | handlers = [logging.StreamHandler(sys.stdout)],)
157 | if training_args.should_log:
158 | # The default of training_args.log_level is passive, so we set log level at info here to have that default.
159 | transformers.utils.logging.set_verbosity_info()
160 |
161 | log_level = training_args.get_process_log_level()
162 | logger.setLevel(log_level)
163 | datasets.utils.logging.set_verbosity(log_level)
164 | transformers.utils.logging.set_verbosity(log_level)
165 | transformers.utils.logging.enable_default_handler()
166 | transformers.utils.logging.enable_explicit_format()
167 |
168 | # Log on each process the small summary:
169 | logger.warning(
170 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
171 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
172 | )
173 |
174 | set_seed(training_args.seed)
175 |
176 | device = "cuda" if torch.cuda.is_available() else "cpu"
177 |
178 | # init model
179 | config = transformers.AutoConfig.from_pretrained(
180 | script_args.base_model_path,
181 | trust_remote_code=True
182 | )
183 |
184 | tokenizer = transformers.AutoTokenizer.from_pretrained(
185 | script_args.base_model_path,
186 | use_fast=False,
187 | trust_remote_code=True,
188 | model_max_length=config.max_position_embeddings
189 | )
190 |
191 | config.use_cache = False
192 | config.num_labels = 1
193 | config.pad_token_id = tokenizer.eos_token_id
194 |
195 | model = TinyllmForSequenceClassification.from_pretrained(
196 | script_args.base_model_path,
197 | config=config,
198 | trust_remote_code=True
199 | )
200 |
201 | model.to(device)
202 |
203 | ################
204 | total_params = sum(p.numel() for p in model.parameters())
205 | trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
206 | logger.info(f"总参数: {total_params}, {total_params/2**20:.2f}M params")
207 | logger.info(f"可训练参数: {trainable_params}")
208 | ##############
209 |
210 | rm_dataset = RMDataset(
211 | script_args.dataset_dir_or_path,
212 | tokenizer,
213 | config.max_position_embeddings
214 | )
215 | total_len = len(rm_dataset)
216 | eval_size = int(0.01 * total_len)
217 | # 划分训练集和验证集
218 | train_ds, eval_ds = random_split(rm_dataset, [total_len - eval_size, eval_size])
219 |
220 | trainer = RMTrainer(
221 | model = model,
222 | args = training_args,
223 | train_dataset = train_ds,
224 | eval_dataset = eval_ds,
225 | compute_metrics=compute_metrics,
226 | )
227 |
228 | # Training
229 | trainer.train(script_args.resume)
230 | torch.save(model.state_dict(),'{}/last_model.pth'.format(training_args.output_dir))
231 | last_model_dir = os.path.join(training_args.output_dir, 'last_rm_model')
232 | os.makedirs(last_model_dir, exist_ok=True)
233 | tokenizer.save_pretrained(last_model_dir)
234 | # https://github.com/huggingface/transformers/issues/28630
235 | model.save_pretrained(last_model_dir, safe_serialization=False)
236 |
237 | if __name__ == "__main__":
238 | main()
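239 | 
240 | # Hypothetical launch sketch (all paths and flag values below are placeholders):
241 | # torchrun --nproc_per_node=1 train/rm_train.py \
242 | #     --base_model_path outputs/sft_model \
243 | #     --dataset_dir_or_path data/rm_train \
244 | #     --output_dir outputs/rm_model \
245 | #     --per_device_train_batch_size 8 --num_train_epochs 2 \
246 | #     --evaluation_strategy steps --eval_steps 500 \
247 | #     --save_steps 500 --logging_steps 50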
--------------------------------------------------------------------------------
/train/dpo_train.py:
--------------------------------------------------------------------------------
1 | # https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py
2 | # https://huggingface.co/docs/trl/main/en/dpo_trainer
3 | # https://huggingface.co/datasets/lvwerra/stack-exchange-paired
4 | # https://huggingface.co/blog/zh/dpo-trl
5 |
6 | # https://github.dev/RUCAIBox/LLMBox
7 |
8 | # 0. imports
9 | import os
10 | from dataclasses import dataclass, field
11 | from typing import Dict, Optional
12 |
13 | import torch
14 | from accelerate import Accelerator
15 | from datasets import Dataset, load_dataset
16 | from peft import LoraConfig
17 | from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, set_seed
18 | from torch.utils.data import Dataset, DataLoader, random_split
19 | from trl import DPOTrainer
20 | from tinyllm_dataset import load_dpo_dataset
21 |
22 |
23 | # Define and parse arguments.
24 | @dataclass
25 | class ScriptArguments:
26 | """
27 | The arguments for the DPO training script.
28 | """
29 |
30 | # data parameters
31 | beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})
32 |
33 | # training parameters
34 | model_name: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"})
35 | dataset_dir_or_path: Optional[str] = field(default="",metadata={"help": "the location of the DPO training data (file or directory)"})
36 | eval_dataset_dir_or_path: Optional[str] = field(default="",metadata={"help": "the location of the DPO evaluation data (file or directory)"})
37 | resume: Optional[bool] = field(default=False,metadata={"help": "whether to resume training from the last checkpoint"})
38 | base_model_path: Optional[str] = field(default="",metadata={"help": "the location of the SFT model name or path"})
39 |
40 | learning_rate: Optional[float] = field(default=5e-4, metadata={"help": "optimizer learning rate"})
41 | lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"})
42 | warmup_ratio: Optional[float] = field(default=0.01, metadata={"help": "the warmup ratio"})
43 | weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"})
44 | optimizer_type: Optional[str] = field(default="adamw_torch", metadata={"help": "the optimizer type"})
45 |
46 | per_device_train_batch_size: Optional[int] = field(default=1, metadata={"help": "train batch size per device"})
47 | per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"})
48 | gradient_accumulation_steps: Optional[int] = field(
49 | default=4, metadata={"help": "the number of gradient accumulation steps"}
50 | )
51 | gradient_checkpointing: Optional[bool] = field(
52 | default=True, metadata={"help": "whether to use gradient checkpointing"}
53 | )
54 |
55 | gradient_checkpointing_use_reentrant: Optional[bool] = field(
56 | default=False, metadata={"help": "whether to use reentrant for gradient checkpointing"}
57 | )
58 |
59 | bf16: bool = field(
60 | default=False,
61 | metadata={
62 | "help": (
63 | "Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA"
64 | " architecture or using CPU (use_cpu) or Ascend NPU. This is an experimental API and it may change."
65 | )
66 | },
67 | )
68 | fp16: bool = field(
69 | default=False,
70 | metadata={"help": "Whether to use fp16 (mixed) precision instead of 32-bit"},
71 | )
72 |
73 | max_prompt_length: Optional[int] = field(default=512, metadata={"help": "the maximum prompt length"})
74 | max_length: Optional[int] = field(default=1024, metadata={"help": "the maximum sequence length"})
75 | num_train_epochs: Optional[int] = field(default=5, metadata={"help": "epoch of training steps"})
76 | logging_strategy: Optional[str] = field(default="steps", metadata={"help": "logging_strategy"})
77 | logging_dir: Optional[str] = field(default="", metadata={"help": "logging_dir"})
78 | logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"})
79 | save_steps: Optional[int] = field(default=100, metadata={"help": "the saving frequency"})
80 | eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"})
81 |
82 | output_dir: Optional[str] = field(default="./results", metadata={"help": "the output directory"})
83 |
84 | # instrumentation
85 | sanity_check: Optional[bool] = field(default=False, metadata={"help": "only train on 1000 samples"})
86 | report_to: Optional[str] = field(
87 | default="tensorboard",
88 | metadata={
89 | "help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
90 | '`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
91 | 'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
92 | },
93 | )
94 | # debug argument for distributed training
95 | ignore_bias_buffers: Optional[bool] = field(
96 | default=False,
97 | metadata={
98 | "help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
99 | "https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
100 | },
101 | )
102 | seed: Optional[int] = field(
103 | default=0, metadata={"help": "Random seed that will be set at the beginning of training."}
104 | )
105 |
106 |
107 | if __name__ == "__main__":
108 | parser = HfArgumentParser(ScriptArguments)
109 | script_args = parser.parse_args_into_dataclasses()[0]
110 |
111 | set_seed(script_args.seed)
112 |
113 | # 1. load a pretrained model
114 | model = AutoModelForCausalLM.from_pretrained(
115 | script_args.base_model_path,
116 | trust_remote_code=True,
117 | )
118 | model.config.use_cache = False
119 |
120 | if script_args.ignore_bias_buffers:
121 | # torch distributed hack
122 | model._ddp_params_and_buffers_to_ignore = [
123 | name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
124 | ]
125 |
126 | tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_path, trust_remote_code=True)
127 | tokenizer.pad_token = tokenizer.eos_token
128 |
129 | tokenizer.add_special_tokens({"bos_token": tokenizer.eos_token})
130 | tokenizer.bos_token_id = tokenizer.eos_token_id
131 |
132 | # 2. Load the Stack-exchange paired dataset
133 | # dpo_dataset = get_stack_exchange_paired(data_dir="data/rl", sanity_check=script_args.sanity_check)
134 | # dpo_dataset = dpo_dataset.filter(
135 | # lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length
136 | # and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length
137 | # )
138 |
139 | # data_path = "/mnt/cephfs-xiongzhuang/wangdongnian/tiny-llm-zh/data/rm_train/rm_data.jsonl"  # unused machine-specific path; the dataset path comes from script_args
140 | dpo_dataset = load_dpo_dataset(script_args.dataset_dir_or_path, max_length=script_args.max_length, sanity_check=script_args.sanity_check)
141 |
142 | train_loader = torch.utils.data.DataLoader(
143 | dpo_dataset,
144 | batch_size=2,
145 | pin_memory=False,
146 | drop_last=False,
147 | shuffle=False,
148 | num_workers=8,
149 | )
150 | for i, item in enumerate(train_loader):  # debug sanity check: print the first batch, then stop
151 | print(item)
152 | break
153 |
154 |
155 | # 3. Load evaluation dataset
156 | if script_args.eval_dataset_dir_or_path == "":
157 | evaluation_strategy = "no"
158 | else:
159 | evaluation_strategy = "steps"
160 | eval_dataset = load_dpo_dataset(script_args.eval_dataset_dir_or_path, max_length=script_args.max_length, sanity_check=script_args.sanity_check)
161 |
162 |
163 | # 4. initialize training arguments:
164 | training_args = TrainingArguments(
165 | per_device_train_batch_size=script_args.per_device_train_batch_size,
166 | per_device_eval_batch_size=script_args.per_device_eval_batch_size,
167 | num_train_epochs=script_args.num_train_epochs,
168 | logging_dir=script_args.logging_dir,
169 | logging_strategy=script_args.logging_strategy,
170 | logging_steps=script_args.logging_steps,
171 | save_steps=script_args.save_steps,
172 | gradient_accumulation_steps=script_args.gradient_accumulation_steps,
173 | gradient_checkpointing=script_args.gradient_checkpointing,
174 | learning_rate=script_args.learning_rate,
175 | evaluation_strategy="no",
176 | eval_steps=script_args.eval_steps,
177 | output_dir=script_args.output_dir,
178 | report_to=script_args.report_to,
179 | lr_scheduler_type=script_args.lr_scheduler_type,
180 | warmup_ratio=script_args.warmup_ratio,
181 | optim=script_args.optimizer_type,
182 | bf16=script_args.bf16,
183 | fp16=script_args.fp16,
184 | remove_unused_columns=False,
185 | run_name=script_args.model_name,
186 | gradient_checkpointing_kwargs=dict(use_reentrant=script_args.gradient_checkpointing_use_reentrant),
187 | seed=script_args.seed,
188 | # project_kwargs={"logging_dir": script_args.output_dir},
189 | )
190 |
191 | # peft_config = LoraConfig(
192 | # r=script_args.lora_r,
193 | # lora_alpha=script_args.lora_alpha,
194 | # lora_dropout=script_args.lora_dropout,
195 | # target_modules=[
196 | # "q_proj",
197 | # "v_proj",
198 | # "k_proj",
199 | # "out_proj",
200 | # "fc_in",
201 | # "fc_out",
202 | # "wte",
203 | # ],
204 | # bias="none",
205 | # task_type="CAUSAL_LM",
206 | # )
207 |
208 | # 5. initialize the DPO trainer
209 | dpo_trainer = DPOTrainer(
210 | model,
211 | ref_model=None,
212 | args=training_args,
213 | beta=script_args.beta,
214 | train_dataset=dpo_dataset,
215 | eval_dataset=eval_dataset,
216 | tokenizer=tokenizer,
217 | # peft_config=peft_config,
218 | max_prompt_length=script_args.max_prompt_length,
219 | max_length=script_args.max_length,
220 | # data_collator=collator_fn,
221 | )
222 |
223 | # 6. train
224 | dpo_trainer.train(script_args.resume)
225 |
226 |
227 | # 7. save
228 | output_dir = os.path.join(script_args.output_dir, "last_dpo_model")
229 | tokenizer.save_pretrained(output_dir)
230 | dpo_trainer.save_model(output_dir)
231 | # dpo_trainer.model.save_pretrained(output_dir)
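232 | 
233 | # Hypothetical launch sketch (all paths and flag values below are placeholders):
234 | # accelerate launch train/dpo_train.py \
235 | #     --base_model_path outputs/sft_model \
236 | #     --dataset_dir_or_path data/dpo_train/dpo_data.jsonl \
237 | #     --output_dir outputs/dpo_model \
238 | #     --beta 0.1 --learning_rate 5e-5 --num_train_epochs 2 \
239 | #     --per_device_train_batch_size 2 --gradient_accumulation_steps 8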
--------------------------------------------------------------------------------
/train/generation_utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import torch
3 | import numpy as np
4 | from queue import Queue
5 | from typing import Tuple, List, Union, Iterable
6 | from transformers.utils import logging, add_start_docstrings
7 | from transformers.generation.logits_process import LogitsProcessor, LOGITS_PROCESSOR_INPUTS_DOCSTRING, LogitsProcessorList
8 |
9 | def make_context(model, tokenizer,
10 | messages: List[dict],
11 | system: str = "You are a helpful assistant.",
12 | max_new_tokens: int=0,
13 | ):
14 | # 确定新生成的token数量,优先使用传入参数,否则使用模型配置中的默认值
15 | max_new_tokens = max_new_tokens or model.generation_config.max_new_tokens
16 | # 计算模型允许的最大输入长度(模型最大长度减去新生成的token数)
17 | max_input_length = model.config.max_position_embeddings - max_new_tokens
18 |
19 | nl_tokens = tokenizer.encode("\n", add_special_tokens=False)
20 |
21 | def _parse_messages(messages):
22 | """ 解析消息列表,分离系统消息、查询和对话历史
23 | """
24 | system, query, history = "", "", []
25 | ## system
26 | if messages[0]["role"] == "system":
27 | system = messages[0]["content"]
28 | messages = messages[1:]
29 | ## query
30 | ### 确保最后一项是用户消息
31 | assert messages[-1]["role"] == "user"
32 | query = messages[-1]["content"]
33 | messages = messages[:-1]
34 | ## history
35 | assert len(messages) % 2 == 0
36 | for i in range(0, len(messages), 2):
37 | assert messages[i]["role"] == "user" and messages[i+1]["role"] == "assistant"
38 | history.append([messages[i]["content"], messages[i+1]["content"]])
39 |
40 | return system, query, history
41 |
42 | # 调用_parse_messages解析消息
43 | _system, query, history = _parse_messages(messages)
44 |
45 | ## system
46 | system_text = _system if _system != "" else system
47 | system_tokens = []
48 | if system_text:
49 | # system_tokens = tokenizer.build_single_message("system", "", system_text.strip())
50 | system_tokens = tokenizer.encode(text=("<|system|>\n"+system_text.strip()), add_special_tokens=True, truncation=True) + nl_tokens
51 | ## query
52 | # query_tokens = tokenizer.build_single_message("user", "", query.strip())
53 | query_tokens = tokenizer.encode(text=("<|user|>\n"+query.strip()), add_special_tokens=False, truncation=True) + nl_tokens
54 | ## final assistant
55 | # final_tokens = tokenizer.build_single_message("assistant", "", "")
56 | final_tokens = tokenizer.encode("<|assistant|>", add_special_tokens=False, truncation=True) + nl_tokens
57 |
58 | ## max_history_tokens
59 | max_history_length = max_input_length - len(system_tokens) - len(query_tokens) - len(final_tokens)
60 |
61 | ## history
62 | ## 逆序遍历对话历史,构建token序列
63 | context_tokens = []
64 | for turn_query, turn_response in reversed(history):
65 | ## query tokens
66 | history_query_tokens = tokenizer.encode("<|user|>\n"+turn_query.strip(), add_special_tokens=False, truncation=True) + nl_tokens
67 | ## answer tokens
68 | history_response_tokens = tokenizer.encode("<|assistant|>\n"+turn_response.strip(), add_special_tokens=False, truncation=True) + nl_tokens
69 | ## this round tokens
70 | next_context_tokens = history_query_tokens + history_response_tokens
71 | ## concat
72 | ## 确保加入这些token后总长度不超过允许的最大历史长度
73 | current_context_size = len(next_context_tokens) + len(context_tokens)
74 | if current_context_size < max_history_length:
75 | context_tokens = next_context_tokens + context_tokens
76 | else:
77 | break
78 | input_tokens = system_tokens + context_tokens + query_tokens + final_tokens
79 |
80 | return torch.LongTensor([input_tokens]).to(model.device)
81 |
82 | def parse_pot_no_stream(inputs):
83 | """ 解析并处理输入字符串中特定格式(形如 <<...>>)的代码片段。
84 | 这些代码片段可以是简单的数学表达式赋值,也可以是定义和调用函数。
85 | 1. 对于包含 "func" 的代码片段,它会识别函数定义,执行该函数,
86 | 并将函数返回的结果替换到原始字符串中的相应位置。
87 | 如果函数涉及到 sympy(一个符号计算库),
88 | 则还会做一些特定的字符串替换处理。
89 | 2. 对于不包含 "func" 的代码片段,它会直接计算等号右边的表达式,
90 | 并将计算结果替换到原始字符串中,同时也会进行一些类型转换
91 | (如将浮点数转为整数)。
92 | """
93 | try:
94 | # 尝试从输入字符串中找到形如 "<<...>>" 的模式
95 | s = re.findall(r'<<(.*?)>>', inputs, re.DOTALL)
96 | # 如果没有找到匹配项,则直接返回原始输入
97 | if not s:
98 | #print("err inputs: ", origin_inputs, flush=True)
99 | return inputs
100 |
101 | index = 0
102 | # 遍历所有匹配到的模式
103 | for k in s:
104 | try:
105 | # 检查模式内是否包含 "func"
106 | if "func" in k:
107 | # 分割并处理函数定义
108 | var = k.split("=", 1)
109 | try:
110 | # 去除空白字符并执行函数定义
111 | var[1] = var[1].strip(" ")
112 | exec(var[1], globals())
113 | # 调用函数获取结果
114 | ans = func()
115 | except:
116 | # 特殊处理包含 'sympy' 的情况
117 | if 'sympy' in var[1]:
118 | var[1] = var[1].replace('res[x]', 'res[0][0]').replace('res[y]', 'res[0][1]')
119 | exec(var[1], globals())
120 | ans = func()
121 | pass
122 | var_list = [c.strip(" ") for c in var[0].split(",")]
123 | # 如果只有一个变量名,则将结果放入列表
124 | if len(var_list) == 1:
125 | ans = [ans]
126 |
127 | # 将结果转换为浮点数或整数形式,并替换到输入字符串中
128 | for i in range(len(ans)):
129 | try:
130 | ans[i] = float(ans[i])
131 | if abs(ans[i] - int(ans[i])) < 1e-10:
132 | ans[i] = str(int(ans[i]))
133 | except:
134 | pass
135 |
136 | # 替换原字符串中的模式和变量名
137 | inputs = inputs.replace("<<"+k+">>", "")
138 | for i in range(len(var_list)):
139 | inputs = inputs.replace(var_list[i], str(ans[i]))
140 | index += 1
141 | # 更新后续模式中的变量值
142 | for c in range(index, len(s)):
143 | for i in range(len(var_list)):
144 | s[c] = s[c].replace(var_list[i], str(ans[i]))
145 | else:
146 | # 处理非函数的情况,直接计算并替换
147 | var = k.replace(" ", "").split("=")
148 | var[1] = var[1].replace("eval", "")
149 | ans = round(eval(var[1]), 10)
150 | ans = float(ans)
151 | if abs(ans - int(ans)) < 1e-10:
152 | ans = str(int(ans))
153 | # 替换原字符串中的模式和变量名
154 | inputs = inputs.replace("<<"+k+">>", "").replace(var[0], str(ans))
155 | index += 1
156 | # 更新后续模式中的变量值
157 | for c in range(index, len(s)):
158 | s[c] = s[c].replace(var[0], str(ans))
159 | except:
160 | return inputs
161 | except Exception as e:
162 | return inputs
163 |
164 | return inputs
165 |
166 |
167 | class TextIterStreamer:
168 | """ 实现文本的流式处理
169 | 能够逐个或逐段生成和输出文本,而不是一次性输出全部内容
170 | """
171 | def __init__(self, tokenizer, skip_prompt=False, skip_special_tokens=False, use_pot=True):
172 | self.tokenizer = tokenizer
173 | self.skip_prompt = skip_prompt
174 | self.skip_special_tokens = skip_special_tokens
175 | self.tokens = []
176 | # 使用队列来缓存生成的文本片段,以便于逐块输出
177 | self.text_queue = Queue()
178 | self.next_tokens_are_prompt = True
179 | # whether to run the PoT (program-of-thought) post-processing in parse_pot_no_stream; defaults to True
180 | self.use_pot = use_pot
181 |
182 | def put(self, value):
183 | # 接收并处理生成的token值
184 | if self.skip_prompt and self.next_tokens_are_prompt:
185 | self.next_tokens_are_prompt = False
186 | else:
187 | if len(value.shape) > 1:
188 | value = value[0]
189 | self.tokens.extend(value.tolist())
190 | tokens_str = self.tokenizer.decode(self.tokens, skip_special_tokens=self.skip_special_tokens, errors='ignore')
191 | if self.use_pot:
192 | tokens_str = parse_pot_no_stream(tokens_str)
193 | self.text_queue.put(tokens_str)
194 |
195 | def end(self):
196 | self.text_queue.put(None)
197 |
198 | def __iter__(self):
199 | return self
200 |
201 | def __next__(self):
202 | # 实现迭代器的下一步方法,从队列中获取并返回文本,
203 | # 或在无更多内容时抛出StopIteration异常
204 | value = self.text_queue.get()
205 | if value is None:
206 | raise StopIteration()
207 | else:
208 | return value
209 |
210 |
211 | class OutputRepetitionPenaltyLogitsProcessor(LogitsProcessor):
212 | r"""
213 | [`OutputLogitsProcessor`] that prevents the repetition of previous tokens through a penalty. This penalty is applied at
214 | most once per token. Note that, for decoder-only models like most LLMs, the considered tokens include the prompt.
215 |
216 | In the original [paper](https://arxiv.org/pdf/1909.05858.pdf), the authors suggest the use of a penalty of around
217 | 1.2 to achieve a good balance between truthful generation and lack of repetition. To penalize and reduce
218 | repetition, use `penalty` values above 1.0, where a higher value penalizes more strongly. To reward and encourage
219 | repetition, use `penalty` values between 0.0 and 1.0, where a lower value rewards more strongly.
220 |
221 | Args:
222 | penalty (`float`):
223 | The parameter for repetition penalty. 1.0 means no penalty. Above 1.0 penalizes previously generated
224 | tokens. Between 0.0 and 1.0 rewards previously generated tokens.
225 | """
226 |
227 | def __init__(self, input_length: int,
228 | presence_penalties: float = 1.0,
229 | frequency_penalties: float = 0,
230 | repetition_penalties: float = 1.0):  # 1.0 = no penalty; must be > 0
231 | if not (repetition_penalties > 0):
232 | raise ValueError(f"`repetition_penalties` has to be a strictly positive float, but is {repetition_penalties}")
233 | if not ( (frequency_penalties >= -2) and (frequency_penalties <= 2) ):
234 | raise ValueError(f"`frequency_penalties` has to be [-2, 2], but is {frequency_penalties}")
235 | if not ( (presence_penalties >= -2) and (presence_penalties <= 2) ):
236 | raise ValueError(f"`presence_penalties` has to be [-2, 2], but is {presence_penalties}")
237 |
238 | self.repetition_penalties = repetition_penalties
239 | self.frequency_penalties = frequency_penalties
240 | self.presence_penalties = presence_penalties
241 | self.input_length = input_length
242 |
243 | def _get_bin_counts_and_mask(
244 | self,
245 | tokens: torch.Tensor,
246 | vocab_size: int,
247 | num_seqs: int,
248 | ) -> Tuple[torch.Tensor, torch.Tensor]:
249 | # Compute the bin counts for the tokens.
250 | # vocab_size + 1 for padding.
251 | bin_counts = torch.zeros((num_seqs, vocab_size + 1),
252 | dtype=torch.long,
253 | device=tokens.device)
254 | bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
255 | bin_counts = bin_counts[:, :vocab_size]
256 | mask = bin_counts > 0
257 |
258 | return bin_counts, mask
259 |
260 | @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
261 | def __call__(self, input_ids: torch.LongTensor, logits: torch.FloatTensor) -> torch.FloatTensor:
262 | prompt_tokens_tensor = input_ids[:, :self.input_length+1]
263 | output_tokens_tensor = input_ids[:, self.input_length+1:]
264 |
265 | num_seqs, vocab_size = logits.shape
266 | _, prompt_mask = self._get_bin_counts_and_mask(
267 | prompt_tokens_tensor, vocab_size, num_seqs)
268 | output_bin_counts, output_mask = self._get_bin_counts_and_mask(
269 | output_tokens_tensor, vocab_size, num_seqs)
270 |
271 | repetition_penalties = torch.Tensor([self.repetition_penalties]).to(logits.device)
272 | frequency_penalties = torch.Tensor([self.frequency_penalties]).to(logits.device)
273 | presence_penalties = torch.Tensor([self.presence_penalties]).to(logits.device)
274 |
275 | repetition_penalties = repetition_penalties[:, None].repeat(1, vocab_size)
276 | repetition_penalties[~(prompt_mask | output_mask)] = 1.0
277 | logits = torch.where(logits > 0, logits / repetition_penalties,
278 | logits * repetition_penalties)
279 |
280 | # We follow the definition in OpenAI API.
281 | # Refer to https://platform.openai.com/docs/api-reference/parameter-details
282 | logits -= frequency_penalties.unsqueeze_(dim=1) * output_bin_counts
283 | logits -= presence_penalties.unsqueeze_(dim=1) * output_mask
284 |
285 | return logits
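286 | 
287 | # Hypothetical usage sketch (not part of the original file; the model path is a placeholder).
288 | # It shows how the pieces above fit together: make_context builds the prompt ids,
289 | # OutputRepetitionPenaltyLogitsProcessor is passed to generate() via a LogitsProcessorList,
290 | # and parse_pot_no_stream post-processes the decoded text (e.g. "共<<x = 3 * 4>>x元" -> "共12元").
291 | #
292 | # from transformers import AutoModelForCausalLM, AutoTokenizer
293 | # model = AutoModelForCausalLM.from_pretrained("outputs/sft_model", trust_remote_code=True)
294 | # tokenizer = AutoTokenizer.from_pretrained("outputs/sft_model", trust_remote_code=True)
295 | # messages = [{"role": "user", "content": "你好"}]
296 | # input_ids = make_context(model, tokenizer, messages, max_new_tokens=256)
297 | # processors = LogitsProcessorList([
298 | #     OutputRepetitionPenaltyLogitsProcessor(input_length=input_ids.shape[1],
299 | #                                            repetition_penalties=1.1)
300 | # ])
301 | # output_ids = model.generate(input_ids, max_new_tokens=256, logits_processor=processors)
302 | # text = parse_pot_no_stream(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))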
--------------------------------------------------------------------------------
/vllm/tinyllm.py:
--------------------------------------------------------------------------------
1 | """ Model Architecture
2 | TinyllmForCausalLM(
3 | (model): TinyllmModel(
4 | (embed_tokens): Embedding(64798, 512)
5 | (layers): ModuleList(
6 | (0-7): 8 x TinyllmDecoderLayer(
7 | (self_attn): TinyllmSdpaAttention(
8 | (q_proj): Linear(in_features=512, out_features=512, bias=True)
9 | (k_proj): Linear(in_features=512, out_features=512, bias=True)
10 | (v_proj): Linear(in_features=512, out_features=512, bias=True)
11 | (o_proj): Linear(in_features=512, out_features=512, bias=False)
12 | (rotary_emb): TinyllmRotaryEmbedding()
13 | )
14 | (mlp): TinyllmMLP(
15 | (gate_proj): Linear(in_features=512, out_features=1408, bias=False)
16 | (up_proj): Linear(in_features=512, out_features=1408, bias=False)
17 | (down_proj): Linear(in_features=1408, out_features=512, bias=False)
18 | (act_fn): SiLU()
19 | )
20 | (input_layernorm): TinyllmRMSNorm()
21 | (post_attention_layernorm): TinyllmRMSNorm()
22 | )
23 | )
24 | (norm): TinyllmRMSNorm()
25 | )
26 | (lm_head): Linear(in_features=512, out_features=64798, bias=False)
27 | )
28 | """
29 |
30 | # tiny llm model vllm implement
31 |
32 | from typing import List, Optional, Tuple
33 |
34 | import torch
35 | from torch import nn
36 | from transformers import LlamaConfig
37 |
38 | from vllm.attention import Attention, AttentionMetadata
39 | from vllm.config import LoRAConfig
40 | from vllm.model_executor.parallel_utils.parallel_state import (
41 | get_tensor_model_parallel_world_size)
42 | from vllm.model_executor.layers.activation import SiluAndMul
43 | from vllm.model_executor.layers.layernorm import RMSNorm
44 | from vllm.model_executor.layers.linear import (LinearMethodBase,
45 | MergedColumnParallelLinear,
46 | QKVParallelLinear,
47 | RowParallelLinear)
48 | from vllm.model_executor.layers.logits_processor import LogitsProcessor
49 | from vllm.model_executor.layers.rotary_embedding import get_rope
50 | from vllm.model_executor.layers.sampler import Sampler
51 | from vllm.model_executor.layers.vocab_parallel_embedding import (
52 | ParallelLMHead, VocabParallelEmbedding)
53 | from vllm.model_executor.sampling_metadata import SamplingMetadata
54 | from vllm.model_executor.weight_utils import (default_weight_loader,
55 | hf_model_weights_iterator)
56 | from vllm.sequence import SamplerOutput
57 |
58 | class TinyllmMLP(nn.Module):
59 | def __init__(
60 | self,
61 | hidden_size: int,
62 | intermediate_size: int,
63 | hidden_act: str,
64 | linear_method: Optional[LinearMethodBase] = None,
65 | ) -> None:
66 | super().__init__()
67 | self.gate_up_proj = MergedColumnParallelLinear(
68 | hidden_size, [intermediate_size] * 2,
69 | bias=False,
70 | linear_method=linear_method)
71 | self.down_proj = RowParallelLinear(intermediate_size,
72 | hidden_size,
73 | bias=False,
74 | linear_method=linear_method)
75 | if hidden_act != "silu":
76 | raise ValueError(f"Unsupported activation: {hidden_act}. "
77 | "Only silu is supported for now.")
78 | self.act_fn = SiluAndMul()
79 |
80 | def forward(self, x):
81 | gate_up, _ = self.gate_up_proj(x)
82 | x = self.act_fn(gate_up)
83 | x, _ = self.down_proj(x)
84 | return x
85 |
86 | class TinyllmAttention(nn.Module):
87 |
88 | def __init__(self,
89 | hidden_size: int,
90 | num_heads: int,
91 | num_kv_heads: int,
92 | max_position: int = 1024,
93 | rope_theta: float = 10000,
94 | linear_method: Optional[LinearMethodBase] = None,
95 | sliding_window: Optional[int] = None) -> None:
96 | super().__init__()
97 | self.hidden_size = hidden_size
98 | tp_size = get_tensor_model_parallel_world_size()
99 | self.total_num_heads = num_heads
100 | assert self.total_num_heads % tp_size == 0
101 | self.num_heads = self.total_num_heads // tp_size
102 | self.total_num_kv_heads = num_kv_heads
103 | if self.total_num_kv_heads >= tp_size:
104 | # Number of KV heads is greater than TP size, so we partition
105 | # the KV heads across multiple tensor parallel GPUs.
106 | assert self.total_num_kv_heads % tp_size == 0
107 | else:
108 | # Number of KV heads is less than TP size, so we replicate
109 | # the KV heads across multiple tensor parallel GPUs.
110 | assert tp_size % self.total_num_kv_heads == 0
111 | self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size)
112 | self.head_dim = hidden_size // self.total_num_heads
113 | self.q_size = self.num_heads * self.head_dim
114 | self.kv_size = self.num_kv_heads * self.head_dim
115 | self.scaling = self.head_dim**-0.5
116 | self.rope_theta = rope_theta
117 |
118 | self.qkv_proj = QKVParallelLinear(
119 | hidden_size,
120 | self.head_dim,
121 | self.total_num_heads,
122 | self.total_num_kv_heads,
123 | bias=True,
124 | linear_method=linear_method,
125 | )
126 | self.o_proj = RowParallelLinear(
127 | self.total_num_heads * self.head_dim,
128 | hidden_size,
129 | bias=False,
130 | linear_method=linear_method,
131 | )
132 |
133 | self.rotary_emb = get_rope(
134 | self.head_dim,
135 | rotary_dim=self.head_dim,
136 | max_position=max_position,
137 | base=self.rope_theta,
138 | )
139 | self.attn = Attention(self.num_heads,
140 | self.head_dim,
141 | self.scaling,
142 | num_kv_heads=self.num_kv_heads,
143 | sliding_window=sliding_window)
144 |
145 | def forward(
146 | self,
147 | positions: torch.Tensor,
148 | hidden_states: torch.Tensor,
149 | kv_cache: torch.Tensor,
150 | attn_metadata: AttentionMetadata,
151 | ) -> torch.Tensor:
152 | qkv, _ = self.qkv_proj(hidden_states)
153 | q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
154 | q, k = self.rotary_emb(positions, q, k)
155 | attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
156 | output, _ = self.o_proj(attn_output)
157 | return output
158 |
159 | class TinyllmDecoderLayer(nn.Module):
160 | def __init__(
161 | self,
162 | config: LlamaConfig,
163 | linear_method: Optional[LinearMethodBase] = None,
164 | ) -> None:
165 | super().__init__()
166 |
167 | self.hidden_size = config.hidden_size
168 | rope_theta = getattr(config, "rope_theta", 10000)
169 | max_position_embeddings = getattr(config, "max_position_embeddings", 1024)
170 | sliding_window = getattr(config, "sliding_window", None)
171 |
172 | self.self_attn = TinyllmAttention(
173 | hidden_size=self.hidden_size,
174 | num_heads=config.num_attention_heads,
175 | num_kv_heads=getattr(config, "num_key_value_heads", config.num_attention_heads),
176 | max_position=max_position_embeddings,
177 | rope_theta=rope_theta,
178 | linear_method=linear_method,
179 | sliding_window=sliding_window
180 | )
181 |
182 | self.mlp = TinyllmMLP(
183 | hidden_size=self.hidden_size,
184 | intermediate_size=config.intermediate_size,
185 | hidden_act=config.hidden_act,
186 | linear_method=linear_method,
187 | )
188 |
189 | self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
190 | self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
191 |
192 | def forward(
193 | self,
194 | positions: torch.Tensor,
195 | hidden_states: torch.Tensor,
196 | kv_cache: torch.Tensor,
197 | attn_metadata: AttentionMetadata,
198 | residual: Optional[torch.Tensor],
199 | ) -> Tuple[torch.Tensor, torch.Tensor]:
200 | # Self Attention
201 | if residual is None:
202 | residual = hidden_states
203 | hidden_states = self.input_layernorm(hidden_states)
204 | else:
205 | hidden_states, residual = self.input_layernorm(
206 | hidden_states, residual)
207 | hidden_states = self.self_attn(
208 | positions=positions,
209 | hidden_states=hidden_states,
210 | kv_cache=kv_cache,
211 | attn_metadata=attn_metadata,
212 | )
213 |
214 | # Fully Connected
215 | hidden_states, residual = self.post_attention_layernorm(
216 | hidden_states, residual)
217 | hidden_states = self.mlp(hidden_states)
218 |
219 | return hidden_states, residual
220 |
221 | class TinyllmModel(nn.Module):
222 | def __init__(
223 | self,
224 | config: LlamaConfig,
225 | linear_method: Optional[LinearMethodBase] = None,
226 | ) -> None:
227 | super().__init__()
228 |
229 | self.config = config
230 | self.padding_idx = config.pad_token_id
231 | self.vocab_size = config.vocab_size
232 |
233 | self.embed_tokens = VocabParallelEmbedding(
234 | config.vocab_size,
235 | config.hidden_size
236 | )
237 | self.layers = nn.ModuleList([
238 | TinyllmDecoderLayer(config, linear_method)
239 | for _ in range(config.num_hidden_layers)
240 | ])
241 | self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
242 |
243 | def forward(
244 | self,
245 | input_ids: torch.Tensor,
246 | positions: torch.Tensor,
247 | kv_caches: List[torch.Tensor],
248 | attn_metadata: AttentionMetadata,
249 | ) -> torch.Tensor:
250 | hidden_states = self.embed_tokens(input_ids)
251 | residual = None
252 | for i in range(len(self.layers)):
253 | layer = self.layers[i]
254 | hidden_states, residual = layer(
255 | positions,
256 | hidden_states,
257 | kv_caches[i],
258 | attn_metadata,
259 | residual,
260 | )
261 | hidden_states, _ = self.norm(hidden_states, residual)
262 | return hidden_states
263 |
264 | class TinyllmForCausalLM(nn.Module):
265 | packed_modules_mapping = {
266 | "qkv_proj": [
267 | "q_proj",
268 | "k_proj",
269 | "v_proj",
270 | ],
271 | "gate_up_proj": [
272 | "gate_proj",
273 | "up_proj",
274 | ],
275 | }
276 |
277 | # LoRA specific attributes
278 | supported_lora_modules = [
279 | "qkv_proj",
280 | "o_proj",
281 | "gate_up_proj",
282 | "down_proj",
283 | ]
284 | embedding_modules = {}
285 | embedding_padding_modules = []
286 |
287 | def __init__(
288 | self,
289 | config: LlamaConfig,
290 | linear_method: Optional[LinearMethodBase] = None,
291 | lora_config: Optional[LoRAConfig] = None,
292 | ) -> None:
293 | del lora_config
294 | super().__init__()
295 | self.config = config
296 | self.linear_method = linear_method
297 | self.model = TinyllmModel(config, linear_method)
298 |
299 | if config.tie_word_embeddings:
300 | self.lm_head_weight = self.model.embed_tokens.weight
301 | else:
302 | self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size)
303 | self.lm_head_weight = self.lm_head.weight
304 |
305 | self.logits_processor = LogitsProcessor(config.vocab_size)
306 | self.sampler = Sampler()
307 |
308 | def forward(
309 | self,
310 | input_ids: torch.Tensor,
311 | positions: torch.Tensor,
312 | kv_caches: List[torch.Tensor],
313 | attn_metadata: AttentionMetadata,
314 | ) -> torch.Tensor:
315 | hidden_states = self.model(input_ids, positions, kv_caches, attn_metadata)
316 | return hidden_states
317 |
318 | def compute_logits(self, hidden_states: torch.Tensor,
319 | sampling_metadata: SamplingMetadata) -> torch.Tensor:
320 | logits = self.logits_processor(self.lm_head_weight, hidden_states, sampling_metadata)
321 | return logits
322 |
323 | def sample(
324 | self,
325 | logits: torch.Tensor,
326 | sampling_metadata: SamplingMetadata,
327 | ) -> Optional[SamplerOutput]:
328 | next_tokens = self.sampler(logits, sampling_metadata)
329 | return next_tokens
330 |
331 | def load_weights(self,
332 | model_name_or_path: str,
333 | cache_dir: Optional[str] = None,
334 | load_format: str = "auto",
335 | revision: Optional[str] = None):
336 | stacked_params_mapping = [
337 | # (param_name, shard_name, shard_id)
338 | ("qkv_proj", "q_proj", "q"),
339 | ("qkv_proj", "k_proj", "k"),
340 | ("qkv_proj", "v_proj", "v"),
341 | ("gate_up_proj", "gate_proj", 0),
342 | ("gate_up_proj", "up_proj", 1),
343 | ]
344 | params_dict = dict(self.named_parameters(remove_duplicate=False))
345 | for name, loaded_weight in hf_model_weights_iterator(
346 | model_name_or_path, cache_dir, load_format, revision):
347 | if "rotary_emb.inv_freq" in name:
348 | continue
349 | if self.config.tie_word_embeddings and "lm_head.weight" in name:
350 | continue
351 | for (param_name, weight_name, shard_id) in stacked_params_mapping:
352 | if weight_name not in name:
353 | continue
354 | name = name.replace(weight_name, param_name)
355 | # Skip loading extra bias for GPTQ models.
356 | if name.endswith(".bias") and name not in params_dict:
357 | continue
358 | param = params_dict[name]
359 | weight_loader = param.weight_loader
360 | weight_loader(param, loaded_weight, shard_id)
361 | break
362 | else:
363 | # Skip loading extra bias for GPTQ models.
364 | if name.endswith(".bias") and name not in params_dict:
365 | continue
366 | param = params_dict[name]
367 | weight_loader = getattr(param, "weight_loader",
368 | default_weight_loader)
369 | weight_loader(param, loaded_weight)
370 |
371 |
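372 | # Hypothetical usage sketch (not part of the original file): register this implementation with
373 | # vLLM's model registry and run offline inference. The model path is a placeholder, and the
374 | # registry import matches the older vLLM API this file is written against; newer versions differ.
375 | #
376 | # from vllm import LLM, SamplingParams
377 | # from vllm.model_executor.models import ModelRegistry
378 | #
379 | # ModelRegistry.register_model("TinyllmForCausalLM", TinyllmForCausalLM)
380 | # llm = LLM(model="outputs/tiny_llm_92m", trust_remote_code=True)
381 | # outputs = llm.generate(["介绍一下北京"], SamplingParams(temperature=0.7, max_tokens=256))
382 | # print(outputs[0].outputs[0].text)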
--------------------------------------------------------------------------------
/utils/chatglm3_tokenizer/tokenization_chatglm.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import re
4 | from typing import List, Optional, Union, Dict
5 | from sentencepiece import SentencePieceProcessor
6 | from transformers import PreTrainedTokenizer
7 | from transformers.utils import logging, PaddingStrategy
8 | from transformers.tokenization_utils_base import EncodedInput, BatchEncoding
9 |
10 |
11 | logger = logging.get_logger(__name__)
12 |
13 |
14 | class SPTokenizer:
15 | def __init__(self, model_path: str):
16 | # reload tokenizer
17 | assert os.path.isfile(model_path), model_path
18 | self.sp_model = SentencePieceProcessor(model_file=model_path)
19 |
20 | # BOS / EOS token IDs
21 | self.n_words: int = self.sp_model.vocab_size()
22 | self.bos_id: int = self.sp_model.bos_id()
23 | self.eos_id: int = self.sp_model.eos_id()
24 | self.pad_id: int = self.sp_model.unk_id()
25 | # 确保vocab_size与piece数量一致
26 | assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()
27 |
28 | # 定义聊天角色相关的特殊token
29 | role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
30 | # 添加额外的通用特殊token
31 | special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"] + role_special_tokens
32 | # 创建特殊token与ID之间的映射关系
33 | self.special_tokens = {}
34 | self.index_special_tokens = {}
35 | for token in special_tokens:
36 | # 分配新的词汇表ID给特殊token
37 | self.special_tokens[token] = self.n_words
38 | self.index_special_tokens[self.n_words] = token
39 | self.n_words += 1
40 | # 生成正则表达式,用于在apply_chat_template方法中查找特殊token
41 | self.role_special_token_expression = "|".join([re.escape(token) for token in special_tokens]) # for apply_chat_template
42 |
43 | def tokenize(self, s: str, encode_special_tokens=False):
44 | """ 对输入字符串进行分词操作,可选择是否编码特殊token
45 | """
46 | if encode_special_tokens:
47 | # 对特殊字符进行处理
48 | last_index = 0
49 | t = []
50 | for match in re.finditer(self.role_special_token_expression, s):
51 | # 查找并保留非特殊token部分的分词结果
52 | if last_index < match.start():
53 | t.extend(self.sp_model.EncodeAsPieces(s[last_index:match.start()]))
54 | # 直接添加特殊token
55 | t.append(s[match.start():match.end()])
56 | last_index = match.end()
57 | # 处理剩余非特殊token部分
58 | if last_index < len(s):
59 | t.extend(self.sp_model.EncodeAsPieces(s[last_index:]))
60 | return t
61 | else:
62 | # 当encode_special_tokens为False时,直接调用SentencePiece模型进行分词
63 | return self.sp_model.EncodeAsPieces(s)
64 |
65 | def encode(self, s: str, bos: bool = False, eos: bool = False) -> List[int]:
66 | """ 将字符串转化为ID列表,可选择是否添加BOS/EOS token
67 | """
68 | assert type(s) is str
69 | t = self.sp_model.encode(s)
70 | if bos:
71 | t = [self.bos_id] + t
72 | if eos:
73 | t = t + [self.eos_id]
74 | return t
75 |
76 | def decode(self, t: List[int]) -> str:
77 | """ 将ID列表解码为字符串
78 | """
79 | text, buffer = "", []
80 | for token in t:
81 | # 处理特殊tokenID转字符串
82 | if token in self.index_special_tokens:
83 | if buffer:
84 | text += self.sp_model.decode(buffer)
85 | buffer = []
86 | text += self.index_special_tokens[token]
87 | else:
88 | buffer.append(token)
89 | # 解码剩余普通tokenID
90 | if buffer:
91 | text += self.sp_model.decode(buffer)
92 | return text
93 |
94 | def decode_tokens(self, tokens: List[str]) -> str:
95 | """ 将分词结果(List[str])解码为字符串
96 | """
97 | text = self.sp_model.DecodePieces(tokens)
98 | return text
99 |
100 | def convert_token_to_id(self, token):
101 | """ 将给定的token字符串转化为对应的ID
102 | """
103 | if token in self.special_tokens:
104 | return self.special_tokens[token]
105 | return self.sp_model.PieceToId(token)
106 |
107 | def convert_id_to_token(self, index):
108 | """ 将给定的ID转化为对应的token字符串
109 | """
110 | # 处理特殊tokenID
111 | if index in self.index_special_tokens:
112 | return self.index_special_tokens[index]
113 | # 处理边界情况和其他特殊ID
114 | if index in [self.eos_id, self.bos_id, self.pad_id] or index < 0 or index > self.sp_model.vocab_size():
115 | return ""
116 | # 将普通ID转换为token
117 | return self.sp_model.IdToPiece(index)
118 |
119 |
120 | class ChatGLMTokenizer(PreTrainedTokenizer):
121 | # 预训练模型所需的文件名配置,这里指向tokenizer的model文件
122 | vocab_files_names = {"vocab_file": "tokenizer.model"}
123 | # 模型输入的特征名称列表
124 | model_input_names = ["input_ids", "attention_mask", "position_ids"]
125 |
126 | def __init__(
127 | self,
128 | vocab_file,
129 | padding_side="left",
130 | clean_up_tokenization_spaces=False,
131 | encode_special_tokens=False,
132 | **kwargs
133 | ):
134 | # 设置tokenizer的名称
135 | self.name = "GLMTokenizer"
136 | # 存储vocab文件路径
137 | self.vocab_file = vocab_file
138 | # 使用SPTokenizer作为基础分词器
139 | self.tokenizer = SPTokenizer(vocab_file)
140 | # 定义特殊token及其对应的ID
141 | self.special_tokens = {
142 | "<bos>": self.tokenizer.bos_id,
143 | "<eos>": self.tokenizer.eos_id,
144 | "<unk>": self.tokenizer.pad_id,
145 | "<pad>": self.tokenizer.pad_id
146 | }
147 | self.encode_special_tokens = encode_special_tokens
148 |
149 | super().__init__(
150 | padding_side=padding_side,
151 | clean_up_tokenization_spaces=clean_up_tokenization_spaces,
152 | **kwargs
153 | )
154 |
155 | # self.chat_template = "{% for message in messages %}{% if loop.first %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}"
156 |
157 | def get_command(self, token):
158 | """ 获取指定特殊 token 对应的 id
159 | """
160 | if token in self.special_tokens:
161 | return self.special_tokens[token]
162 | # 如果不在自定义特殊 token 中,则从基础SPTokenizer的特殊 token 中查找
163 | assert token in self.tokenizer.special_tokens, f"{token} is not a special token for {self.name}"
164 | return self.tokenizer.special_tokens[token]
165 |
166 | @property
167 | def unk_token(self) -> str:
168 | """ 通过ID获取未登录词、填充符和结束符的字符串形式
169 | """
170 | return self.tokenizer.sp_model.IdToPiece(self.get_command("<unk>"))
171 |
172 | @property
173 | def pad_token(self) -> str:
174 | return self.tokenizer.sp_model.IdToPiece(self.get_command("<pad>"))
175 |
176 | @property
177 | def eos_token(self) -> str:
178 | return self.tokenizer.sp_model.IdToPiece(self.get_command("<eos>"))
179 |
180 | @property
181 | def unk_token_id(self) -> int:
182 | """ 获取未登录词、填充符和结束符的ID形式
183 | """
184 | return self.get_command("<unk>")
185 |
186 | @property
187 | def pad_token_id(self) -> int:
188 | return self.get_command("<pad>")
189 |
190 | @property
191 | def eos_token_id(self):
192 | return self.get_command("<eos>")
193 |
194 | @unk_token.setter
195 | def unk_token(self, value):
196 | """ 不支持设置未登录词、填充符和结束符,输出警告信息
197 | """
198 | logger.warning("Setting unk_token is not supported, use the default one.")
199 |
200 | @pad_token.setter
201 | def pad_token(self, value):
202 | logger.warning("Setting pad_token is not supported, use the default one.")
203 |
204 | @eos_token.setter
205 | def eos_token(self, value):
206 | logger.warning("Setting eos_token is not supported, use the default one.")
207 |
208 | @property
209 | def vocab_size(self):
210 | """ 返回整个词汇表的大小
211 | """
212 | return self.tokenizer.n_words
213 |
214 | def get_vocab(self):
215 | """ 获取词汇表字典,其中键是token,值是其对应的ID
216 | """
217 | vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
218 | vocab.update(self.added_tokens_encoder)
219 | return vocab
220 |
221 | def _tokenize(self, text, **kwargs):
222 | """ 实现分词功能,利用SPTokenizer进行分词操作
223 | """
224 | return self.tokenizer.tokenize(text, encode_special_tokens=self.encode_special_tokens)
225 |
226 | def _convert_token_to_id(self, token):
227 | """ 将token字符串转化为ID
228 | """
229 | return self.tokenizer.convert_token_to_id(token)
230 |
231 | def _convert_id_to_token(self, index):
232 | """ 将ID转化为token字符串
233 | """
234 | return self.tokenizer.convert_id_to_token(index)
235 |
236 | def convert_tokens_to_string(self, tokens: List[str]) -> str:
237 | """ 将分词结果的tokens列表还原为字符串
238 | """
239 | return self.tokenizer.decode_tokens(tokens)
240 |
241 | def save_vocabulary(self, save_directory, filename_prefix=None):
242 | """ 将词汇表和特殊令牌token保存到指定目录。
243 |
244 | Args:
245 | save_directory (`str`): 将词汇表和特殊令牌文件保存到指定目录。
246 | filename_prefix (`str`, *optional*): 可选添加到保存文件名前的前缀。
247 |
248 | Returns:
249 | `Tuple(str)`: 保存文件的路径
250 | """
251 | if os.path.isdir(save_directory):
252 | vocab_file = os.path.join(
253 | save_directory, self.vocab_files_names["vocab_file"]
254 | )
255 | else:
256 | vocab_file = save_directory
257 |
258 | with open(self.vocab_file, 'rb') as fin:
259 | proto_str = fin.read()
260 |
261 | with open(vocab_file, "wb") as writer:
262 | writer.write(proto_str)
263 |
264 | return (vocab_file,)
265 |
266 | def get_prefix_tokens(self):
267 | """ 获取用于模型输入的前缀 token
268 | """
269 | prefix_tokens = [self.get_command("[gMASK]"), self.get_command("sop")]
270 | return prefix_tokens
271 |
272 | def build_single_message(self, role, metadata, message):
273 | """ 构建单条消息的 token 序列
274 | """
275 | assert role in ["system", "user", "assistant", "observation"], role
276 | # 构建角色标识Token序列
277 | role_tokens = [self.get_command(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n")
278 | # 构建消息正文Token序列
279 | message_tokens = self.tokenizer.encode(message)
280 | # 合并角色标识Token与消息正文Token
281 | tokens = role_tokens + message_tokens
282 | return tokens
283 |
284 | def build_chat_input(self, query, history=None, role="user"):
285 | """ 根据对话历史及当前query构建模型输入
286 | """
287 | if history is None:
288 | history = []
289 | input_ids = []
290 | # 遍历对话历史
291 | for item in history:
292 | # 获取内容
293 | content = item["content"]
294 | # 若为系统消息且包含工具信息,将其加入内容
295 | if item["role"] == "system" and "tools" in item:
296 | content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
297 | # 构建单条历史消息的Token序列并加入到模型输入ID列表
298 | input_ids.extend(self.build_single_message(item["role"], item.get("metadata", ""), content))
299 | # 构建当前query的Token序列并加入到模型输入ID列表
300 | input_ids.extend(self.build_single_message(role, "", query))
301 | # 添加表示回复的assistant标记
302 | input_ids.extend([self.get_command("<|assistant|>")])
303 | # 调用tokenizer批量编码方法,返回PyTorch张量形式的模型输入
304 | return self.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True)
305 |
306 | def build_inputs_with_special_tokens(
307 | self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
308 | ) -> List[int]:
309 | """ 通过拼接和添加特殊标记,从一个或两个序列构建用于序列分类任务的模型输入。
310 |
311 | BERT序列格式如下:
312 | - 单一序列:`[CLS] X [SEP]`
313 | - 序列对:`[CLS] A [SEP] B [SEP]`
314 |
315 | Args:
316 | token_ids_0 (`List[int]`): 将添加特殊token的IDs列表
317 | token_ids_1 (`List[int]`, *optional*): 可选的第二个序列的IDs列表,用于序列对。
318 |
319 | Returns:
320 | `List[int]`: 包含适当特殊标记的[输入IDs](../glossary#input-ids)列表。
321 | """
322 | # 获取前缀标记
323 | prefix_tokens = self.get_prefix_tokens()
324 | # Prepend the prefix tokens to token_ids_0
325 | token_ids_0 = prefix_tokens + token_ids_0
326 | # If token_ids_1 is given, concatenate token_ids_0 and token_ids_1, append the end-of-sequence token, and return
327 | if token_ids_1 is not None:
328 | token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]
329 | return token_ids_0
330 |
331 | def _pad(
332 | self,
333 | encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
334 | max_length: Optional[int] = None,
335 | padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
336 | pad_to_multiple_of: Optional[int] = None,
337 | return_attention_mask: Optional[bool] = None,
338 | ) -> dict:
339 | """ 此方法用于对编码后的输入进行填充(左右两侧填充,直至达到预设长度或批次中的最大长度)
340 |
341 | Args:
342 | encoded_inputs: 字典形式的编码后输入,键为特征名称,值为整数列表(例如,`List[int]`),或者一批编码后的输入(例如,`List[List[int]]`)。
343 | max_length: 返回列表的最大长度,也可作为填充长度
344 | padding_strategy: 填充策略,有以下选项:
345 | - PaddingStrategy.LONGEST : 根据批次中最长序列进行填充
346 | - PaddingStrategy.MAX_LENGTH: 默认策略,填充至最大长度
347 | - PaddingStrategy.DO_NOT_PAD: 不进行填充
348 | 本tokenizer的填充方向由self.padding_side属性决定:
349 | - 'left': 在序列左侧填充
350 | - 'right': 在序列右侧填充
351 | pad_to_multiple_of: (可选)若设置,则将序列填充至给定值的倍数。这对于在NVIDIA硬件上启用具有计算能力`>= 7.5`(Volta及以上)的Tensor Core非常有用。
352 | return_attention_mask:(可选)若设置为False,则避免返回注意力掩码(默认:根据模型特性设置
353 | """
354 | # 从模型默认设置中加载填充侧信息
355 | assert self.padding_side == "left"
356 |
357 | # Get the required input feature; the first feature is assumed to be the main input
358 | required_input = encoded_inputs[self.model_input_names[0]]
359 | seq_length = len(required_input)
360 |
361 | # If the padding strategy is LONGEST, set the maximum length to the current sequence length
362 | if padding_strategy == PaddingStrategy.LONGEST:
363 | max_length = len(required_input)
364 |
365 | # Compute the effective maximum length so that it satisfies pad_to_multiple_of
366 | if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
367 | max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
368 |
369 | # Determine whether padding is needed
370 | needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length
371 |
372 | # Initialize the attention mask if it is not present
373 | if "attention_mask" not in encoded_inputs:
374 | encoded_inputs["attention_mask"] = [1] * seq_length
375 |
376 | if "position_ids" not in encoded_inputs:
377 | encoded_inputs["position_ids"] = list(range(seq_length))
378 |
379 | # Perform the padding if needed
380 | if needs_to_be_padded:
381 | difference = max_length - len(required_input)
382 | # Pad the attention mask
383 | if "attention_mask" in encoded_inputs:
384 | encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
385 | # Pad the position IDs
386 | if "position_ids" in encoded_inputs:
387 | encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
388 | # Pad the main input feature
389 | encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
390 |
391 | return encoded_inputs
392 |
--------------------------------------------------------------------------------
/doc/Trainer参数.md:
--------------------------------------------------------------------------------
1 | # Trainer Parameters
2 |
3 | This document is a reference for the **Transformers Trainer and its training arguments**. It is based mainly on the official documentation, version 4.34.0; the parameters are listed in the same order as the official docs, and the simpler ones are translated more or less directly. A minimal usage sketch follows each of the two parameter lists below.
4 |
5 | ## **1. Parameters accepted by the Transformers Trainer class:**
6 |
7 | 1. **`model`** (`PreTrainedModel` or `torch.nn.Module`, *optional*): The instantiated model to train, evaluate, or use for prediction. If not provided, a `model_init` must be passed to build one.
8 | 2. **`args`** (`TrainingArguments`, *optional*): The training arguments. If not provided, the default `TrainingArguments` are used, with `output_dir` set to a directory named "tmp_trainer" in the current directory.
9 | 3. **`data_collator`** (`DataCollator`, *optional*): The function used to form a batch from elements of `train_dataset` or `eval_dataset`. Defaults to `default_data_collator()` if no tokenizer is provided, and to `DataCollatorWithPadding` if one is.
10 | 4. **`train_dataset`** (`torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`, *optional*): The dataset used for training. If it is a `torch.utils.data.Dataset`, columns not accepted by the model's `forward()` method are automatically removed.
11 | 5. **`eval_dataset`** (Union[torch.utils.data.Dataset, Dict[str, torch.utils.data.Dataset]], *optional*): Same as above, but for evaluation. If it is a dict, each dataset is evaluated separately and the dict key is prepended to the metric names.
12 | 6. **`tokenizer`** (PreTrainedTokenizerBase, *optional*): The tokenizer used to preprocess the data. If provided, inputs are automatically padded to the maximum length when batched, and the tokenizer is saved in the model directory, which makes it easier to rerun an interrupted training job or to fine-tune the model again.
13 | 7. **`model_init`** (Callable[[], PreTrainedModel], *optional*): A function that instantiates the model to use. If provided, every call to `train()` starts from a fresh instance of the model returned by this function.
14 | 8. **`compute_metrics`** (Callable[[EvalPrediction], Dict], *optional*): The function used to compute metrics during evaluation. It must take an `EvalPrediction` as input and return a dictionary mapping metric names to values, typically accuracy, precision, recall, F1, and so on.
15 | 9. **`callbacks`** (list of TrainerCallback, *optional*): Custom callbacks. To remove one of the default callbacks, use the `Trainer.remove_callback()` method.
16 | 10. **`optimizers`** (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], *optional*): A tuple whose two elements are the optimizer (`torch.optim.Optimizer`) and the learning-rate scheduler (`torch.optim.lr_scheduler.LambdaLR`) to use.
17 |
18 | By default an AdamW-based optimizer instance is created, and a learning-rate scheduler is built with `get_linear_schedule_with_warmup()`.
19 | 11. **`preprocess_logits_for_metrics`** (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], *optional*): A function that preprocesses the model's output logits right before each evaluation step, i.e. before they reach `compute_metrics`. It takes two tensors, the model's output logits and the true labels, and returns the preprocessed logits, which are then passed to `compute_metrics`.
20 |
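To make the list above concrete, here is a minimal sketch of wiring a `Trainer` together. It is not taken from this repository's training scripts; the `gpt2` checkpoint and the `wikitext` dataset are placeholder assumptions used only for illustration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder model/tokenizer; substitute your own checkpoint.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus; substitute your own training data.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = (
    raw.map(tokenize, batched=True, remove_columns=raw.column_names)
       .filter(lambda x: len(x["input_ids"]) > 0)  # drop empty lines
)

trainer = Trainer(
    model=model,                                        # 1. instantiated model
    args=TrainingArguments(output_dir="tmp_trainer"),   # 2. training arguments
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # 3.
    train_dataset=train_dataset,                        # 4. training dataset
    tokenizer=tokenizer,                                # 6. saved alongside the model
)
trainer.train()
```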
21 | ## **2. TrainingArguments parameters**
22 |
23 | 1. **`output_dir`** (str): The output directory where model checkpoints and final results are written.
24 | 2. **`overwrite_output_dir`** (bool, *optional*, defaults to False): If True, **overwrite the contents of the output directory**. Useful when you want to continue training and `output_dir` points to a checkpoint directory.
25 | 3. **`do_train`** (bool, *optional*, defaults to False): Whether to run training. The Trainer does not use this argument directly; it is mainly meant for training scripts, as an if-condition that decides whether the training code runs.
26 | 4. **`do_eval`** (bool, *optional*): Whether to evaluate on the validation set. Automatically set to True if the evaluation strategy (evaluation_strategy) is not "no". Like do_train, it is not used directly by the Trainer and is mainly for training scripts.
27 | 5. **`do_predict`** (bool, *optional*, defaults to False): Whether to run predictions on the test set.
28 | 6. **`evaluation_strategy`** (str, *optional*, defaults to "no"): The evaluation strategy to use during training. Possible values:
29 | - "no": no evaluation during training.
30 | - "steps": evaluate every eval_steps steps.
31 | - "epoch": evaluate at the end of each training epoch.
32 | 7. **`prediction_loss_only`** (bool, *optional*, defaults to False): If True, only the loss is returned during evaluation and prediction, without the other metrics.
33 | 8. **`per_device_train_batch_size`** (int, *optional*, defaults to 8): The training batch size per GPU/XPU/TPU/MPS/NPU/CPU, i.e. the number of samples on each device per training step.
34 | 9. **`per_device_eval_batch_size`** (int, *optional*, defaults to 8): The evaluation batch size per GPU/XPU/TPU/MPS/NPU/CPU, i.e. the number of samples on each device per evaluation step.
35 | 10. **`gradient_accumulation_steps`** (int, *optional*, defaults to 1): The number of update steps over which gradients are accumulated before the model parameters are updated. Gradient accumulation lets you accumulate gradients over several batches and then update the parameters once, so a large effective batch can be used even when GPU memory is tight.
36 |
37 | For example, with 4 GPUs and a per-device batch size of 8, one step already covers 32 samples; if this argument is set to 4, the effective batch per update is 128. **Benefit: increase this value when GPU memory is insufficient.**
38 | 11. **`eval_accumulation_steps`** (int, *optional*): How many prediction steps' output tensors are accumulated before being moved from GPU/NPU/TPU to CPU during evaluation. By default the entire evaluation output is accumulated on the device and transferred to CPU in one go, which is faster but uses more device memory.
39 | 12. **`eval_delay`** (float, *optional*): The number of epochs or steps to wait before running the first evaluation. For example, if evaluation_strategy is "steps" and this is set to 10, the first evaluation happens only after 10 steps.
40 | 13. **`learning_rate`** (float, *optional*, defaults to 5e-5): The initial learning rate for the AdamW optimizer.
41 | 14. **`weight_decay`** (float, *optional*, defaults to 0): The weight decay applied to all layers of the AdamW optimizer except the biases and the LayerNorm weights.
42 |
43 | Briefly, weight decay is a regularization technique that penalizes large weights by adding an extra term to the loss function, which helps keep the model from overfitting the training data.
44 | 15. **`adam_beta1`** (float, *optional*, defaults to 0.9): The beta1 hyperparameter of the AdamW optimizer; see its paper for details.
45 | 16. **`adam_beta2`** (float, *optional*, defaults to 0.999): The beta2 hyperparameter of the AdamW optimizer; see its paper for details.
46 | 17. **`adam_epsilon`** (float, *optional*, defaults to 1e-8): The epsilon hyperparameter of the AdamW optimizer; see its paper for details.
47 | 18. **`max_grad_norm`** (float, *optional*, defaults to 1.0): The maximum gradient norm for gradient clipping, which prevents exploding gradients; 1.0 is the usual choice. If the L2 norm of the gradients at a step exceeds this value, the gradients are rescaled so that their norm stays within it.
48 | 19. **`num_train_epochs`** (float, *optional*, defaults to 3.0): The total number of training epochs.
49 | 20. **`max_steps`** (int, *optional*, defaults to -1): If set to a positive number, the total number of training steps to run; **it overrides num_train_epochs**. Note that with this argument, training may stop once the data is exhausted even if the given number of steps has not been reached.
50 | 21. **`lr_scheduler_type`** (str, *optional*, defaults to "linear"): The type of learning-rate scheduler, which adjusts the learning rate automatically as training progresses. The options include:
51 | - **"linear"**: linear scheduler, the learning rate changes linearly.
52 | - **"cosine"**: cosine scheduler, the learning rate follows a cosine curve.
53 | - **"constant"**: constant learning rate that stays fixed for the whole run.
54 | - **"polynomial"**: polynomial scheduler, the learning rate follows a polynomial function.
55 | - **"piecewise"**: piecewise-constant scheduler, with a different learning rate per stage.
56 | - **"exponential"**: exponential scheduler, the learning rate changes exponentially.
57 | 22. **`warmup_ratio`** (float, *optional*, defaults to 0.0): The fraction of total training steps used for linear warmup. Linear warmup is a training strategy in which the learning rate grows linearly from 0 to its maximum value (usually the configured learning rate) at the start, after which it stays constant or follows the chosen schedule. 0.0 means no warmup.
58 | 23. **`warmup_steps`** (int, *optional*, defaults to 0): The number of linear warmup steps, given directly. This overrides warmup_ratio: if warmup_steps is set, warmup_ratio is ignored.
59 | 24. **`log_level`** (str, *optional*, defaults to "passive"): The log level to use on the main process.
60 | - debug: the most verbose level.
61 | - info: general informational messages.
62 | - warning: warnings.
63 | - error: errors.
64 | - critical: critical errors.
65 | - passive: set nothing and keep the current log level of the Transformers library (defaults to "warning").
66 | Using the info level during training is recommended.
67 | 25. **`log_level_replica`** (str, *optional*, defaults to "warning"): The log level to use on replicas; same options as log_level.
68 | 26. **`log_on_each_node`** (bool, *optional*, defaults to True): In multi-node distributed training, whether to log with log_level on every node or only on the main node.
69 | 27. **`logging_dir`** (str, *optional*): The TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
70 | 28. **`logging_strategy`** (str, *optional*, defaults to "steps"): The logging strategy to use during training. Options:
71 | - "no": no logging during training.
72 | - "epoch": log at the end of each epoch.
73 | - "steps": log according to logging_steps.
74 | 29. **`logging_steps`** (int or float, *optional*, defaults to 500): If logging_strategy="steps", the number of steps between two logs.
75 | 30. **`logging_nan_inf_filter`** (bool, *optional*, defaults to True): Whether to filter out nan and inf losses in the logs. If True, the loss of every step is filtered, and when it is nan or inf, the average loss of the current logging window is reported instead.
76 | 31. **`save_strategy`** (str, *optional*, defaults to "steps"): The checkpoint-saving strategy to use during training. Options:
77 | - "no": no checkpoints are saved during training.
78 | - "epoch": save a checkpoint at the end of each epoch.
79 | - "steps": save checkpoints according to save_steps.
80 | 32. **`save_steps`** (int or float, *optional*, defaults to 500): If save_strategy="steps", the number of update steps between two checkpoint saves. If it is a float in [0, 1), it is interpreted as a fraction of the total training steps.
81 | 33. **`save_total_limit`** (int, *optional*): If set, limits the total number of checkpoints, since checkpoints take up a lot of disk space; older checkpoints in the output directory are deleted. When load_best_model_at_end is enabled, the best checkpoint according to metric_for_best_model is kept, along with the most recent checkpoints.
82 |
83 | For example, with `save_total_limit=5` and `load_best_model_at_end` set, the four most recent checkpoints are always kept together with the best one; with `save_total_limit=1` and `load_best_model_at_end` set, two checkpoints are kept: the last one and the best one (if they are not the same).
84 | 34. **`load_best_model_at_end`** (bool, *optional*, defaults to False): Whether to load the best checkpoint found during training when training ends. When True, the checkpoint with the best metric on the validation set is kept and saved, together with the last checkpoint. For ordinary multi-epoch training it is usually best to set this to True; for large-model training, which generally runs a single epoch, the last checkpoint is what gets used.
85 | 35. **`save_safetensors`** (bool, *optional*, defaults to False): Whether to use "safetensors" when saving and loading model parameters; "safetensors" handles compatibility of parameter loading across different PyTorch versions better.
86 | 36. **`save_on_each_node`** (bool, *optional*, defaults to False): In multi-node distributed training, whether to save checkpoints on every node or only on the main node. Note that if the nodes share the same storage, for example the same mounted NAS, enabling this causes errors because the file names are identical.
87 | 37. **`use_cpu`** (bool, *optional*, defaults to False): Whether to train on CPU. If False, CUDA or another available device is used.
88 | 38. **`seed`** (int, *optional*, defaults to 42): The random seed for the training run, which makes training reproducible; it is mainly used with model_init for random weight initialization.
89 | 39. **`data_seed`** (int, *optional*): The random seed for data sampling. If unset, the same seed as seed is used; it makes data sampling reproducible.
90 | 40. **`jit_mode_eval`** (bool, *optional*, defaults to False): Whether to use PyTorch JIT (Just-In-Time) tracing during inference. PyTorch JIT compiles the model's forward pass into high-performance machine code, which speeds up inference.
91 | 41. **`use_ipex`** (bool, *optional*, defaults to False): Whether to use the Intel extension for PyTorch; IPEX must be installed. IPEX is a set of tools and libraries for optimizing deep-learning frameworks, improving training and inference performance with optimizations targeted at Intel processors.
92 | 42. **`bf16`** (bool, *optional*, defaults to False): Whether to use bf16 mixed-precision training instead of fp32. Requires an Ampere or newer NVIDIA architecture.
93 |
94 | A quick note on mixed-precision training: the model parameters and gradients are stored in fp32, but the forward and backward computations use fp16, which reduces memory usage and computation time and speeds up training. This is only a rough explanation; this article covers mixed-precision training well: [click here](https://mp.weixin.qq.com/s%3F__biz%3DMzI4MDYzNzg4Mw%3D%3D%26mid%3D2247550159%26idx%3D5%26sn%3Df5db2afa547970bc429112e32d2e7daf%26chksm%3Debb73c1bdcc0b50d0e85039bd5d8349a23330e3e0f138a7dd2da218a20174d0965837682dd14%26scene%3D27 "click here").
95 | 43. **`fp16`** (bool, *optional*, defaults to False): Whether to use fp16 mixed-precision training instead of fp32.
96 | 44. **`fp16_opt_level`** (str, *optional*, defaults to 'O1'): For fp16 training, the Apex AMP optimization level to use, one of ['O0', 'O1', 'O2', 'O3']. See the Apex documentation for details.
97 | 45. **`half_precision_backend`** (str, *optional*, defaults to "auto"): The backend to use for mixed-precision training; must be one of "auto", "cuda_amp", "apex", "cpu_amp". "auto" picks the backend based on the detected PyTorch version, while the other values force the requested backend. The default is fine.
98 | 46. **`bf16_full_eval`** (bool, *optional*, defaults to False): Whether to run evaluation entirely in bf16 instead of fp32. Faster and more memory-efficient, but the metrics may degrade because of the reduced precision.
99 | 47. **`fp16_full_eval`** (bool, *optional*, defaults to False): Same as above, but using fp16.
100 | 48. **`tf32`** (bool, *optional*): Whether to enable TF32 mode, available on Ampere or newer NVIDIA architectures. The default depends on the PyTorch version's default for torch.backends.cuda.matmul.allow_tf32.
101 | 49. **`local_rank`** (int, *optional*, defaults to -1): The rank of the current process (local rank) in distributed training. You do not need to set it; PyTorch distributed training sets it automatically.
102 | 50. **`ddp_backend`** (str, *optional*): The backend framework that handles distributed computation. These frameworks coordinate multiple compute nodes to speed up training and handle the synchronization and communication of model parameters and gradients. Possible values:
103 | - **"nccl"**: the NVIDIA Collective Communications Library (NCCL) backend.
104 | - **"mpi"**: the Message Passing Interface (MPI) backend, a standard protocol for communication between compute nodes.
105 | - **"ccl"**: Intel's oneCCL (oneAPI Collective Communications Library) backend.
106 | - **"gloo"**: the distributed communication backend developed by Facebook.
107 | - **"hccl"**: the Huawei Collective Communications Library (HCCL) backend, for distributed training on Huawei Ascend NPU systems.
108 | It is set automatically based on the system, usually to nccl.
109 | 51. **`tpu_num_cores`** (int, *optional*): The number of TPU cores to use when training on TPU.
110 | 52. **`dataloader_drop_last`** (bool, *optional*, defaults to False): Whether to drop the last incomplete batch, which occurs when the dataset size is not a multiple of batch_size.
111 | 53. **`eval_steps`** (int or float, *optional*): If evaluation_strategy="steps", the number of update steps between two evaluations. If unset, it defaults to the same value as logging_steps. If it is a float in [0, 1), it is interpreted as a fraction of the total training steps.
112 | 54. **`dataloader_num_workers`** (int, *optional*, defaults to 0): The number of subprocesses used for data loading (PyTorch only); this is simply PyTorch's num_workers parameter. 0 means the data is loaded in the main process.
113 | 55. **`past_index`** (int, *optional*, defaults to -1): Some models (such as TransformerXL or XLNet) can use past hidden states for prediction. If this is set to a positive integer, the Trainer uses the corresponding output (usually index 2) as the past state and passes it to the model as the mems keyword argument at the next training step. Only relevant for a few specific models.
114 | 56. **`run_name`** (str, *optional*): A string describing the training run, used with logging integrations such as wandb and mlflow. It does not affect training; it is just a hook for external logging tools (wandb is personally recommended and quite pleasant to use).
115 | 57. **`disable_tqdm`** (bool, *optional*): Whether to disable the tqdm progress bars produced by ~notebook.NotebookTrainingTracker in Jupyter notebooks. Defaults to True if the log level is set to warn or lower, otherwise False.
116 | 58. **`remove_unused_columns`** (bool, *optional*, defaults to True): Whether to automatically remove data columns the model does not use during training; they are removed by default. For example, if your data has two columns, content and id, and id is never used, it is dropped at training time.
117 | 59. **`label_names`** (List[str], *optional*): The keys in the model's input dictionary that correspond to the labels; normally this does not need to be set explicitly.
118 | 60. **`metric_for_best_model`** (str, *optional*): Used together with load_best_model_at_end to specify the metric for comparing models. If unset, the validation "loss" is used by default; accuracy, F1, loss, and so on can also be used.
119 | 61. **`greater_is_better`** (bool, *optional*): Used together with load_best_model_at_end and metric_for_best_model; it states whether the chosen metric is better when larger or when smaller. For loss, smaller is better, so this is set to False; for accuracy, you need to set it to True.
120 | 62. **`ignore_data_skip`** (bool, *optional*, defaults to False): When training is interrupted and resumed, whether to ignore the step that skips over the training data already seen before the interruption.
121 | 63. **`resume_from_checkpoint`** (str, *optional*): The path of a checkpoint to resume training from.
122 | 64. **`sharded_ddp`** (bool, str, or list of ShardedDDPOption, *optional*, defaults to ''): Whether to use Sharded DDP (Sharded Data Parallelism) in distributed training, provided by FairScale; not used by default. Briefly: FairScale is a PyTorch extension library developed by Meta for high-performance, large-scale training. It extends core PyTorch functionality and brings in recent scaling techniques, providing state-of-the-art distributed-training features through composable modules and an easy-to-use API. See its website for details.
123 | 65. **`fsdp`** (bool, str, or list of FSDPOption, *optional*, defaults to ''): Whether to enable PyTorch FSDP (Fully Sharded Data Parallel) training, and how to configure the distributed parallel training.
124 | 66. **`fsdp_config`** (str or dict, *optional*): The configuration file for PyTorch FSDP (Fully Sharded Data Parallel) training.
125 | 67. **`deepspeed`** (str or dict, *optional*): Whether to enable DeepSpeed, and how to configure it. DeepSpeed is currently the most widely used distributed-training framework, with broader adoption than native PyTorch distributed training or FairScale; see its website for details.
126 | 68. **`label_smoothing_factor`** (float, *optional*, defaults to 0.0): The label-smoothing factor to use.
127 | 69. **`debug`** (str or list of DebugOption, *optional*, defaults to ''): Enables one or more debugging features.
128 |
129 | Supported options:
130 | - "underflow_overflow": detect overflow in the model's inputs/outputs.
131 | - "tpu_metrics_debug": print debug metrics on TPU.
132 | 70. **`optim`** (str or training_args.OptimizerNames, *optional*, defaults to "adamw_torch"): The optimizer to use.
133 |
134 | Options:
135 | - "adamw_hf"
136 | - "adamw_torch"
137 | - "adamw_torch_fused"
138 | - "adamw_apex_fused"
139 | - "adamw_anyprecision"
140 | - "adafactor"
141 | 71. **`optim_args`** (str, *optional*): Extra arguments or custom configuration passed to specific optimizers (such as adamw_anyprecision).
142 | 72. **`group_by_length`** (bool, *optional*, defaults to False): Whether to group training samples of roughly the same length into the same batch, to minimize padding during training and improve efficiency.
143 | 73. **`length_column_name`** (str, *optional*, defaults to "length"): When the previous argument is True, you can add a precomputed "length" column to your training data to speed up the grouping; the default column name is length.
144 | 74. **`report_to`** (str or list of str, *optional*, defaults to "all"): The logging integrations to report results and logs to. Many are supported: "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". The default is fine; it reports to all of them.
145 | 75. **`ddp_find_unused_parameters`** (bool, *optional*): In distributed training, controls whether to find and handle parameters that are not used in the computation. If gradient checkpointing is enabled, some parameters are lazily loaded and the default is False, because gradient checkpointing already accounts for unused parameters; if gradient checkpointing is not enabled, the default is True, so that all parameters are checked and their gradients are propagated correctly.
146 | 76. **`ddp_bucket_cap_mb`** (int, *optional*): In distributed training, data is usually processed in small chunks called "buckets"; this argument sets the maximum memory size of each bucket. The automatic setting is generally fine.
147 | 77. **`ddp_broadcast_buffers`** (bool, *optional*): In distributed training, some parts of the model contain buffers, such as the statistics of Batch Normalization layers; this argument controls whether those buffers are broadcast to all devices so the model stays in sync. If gradient checkpointing is enabled, the buffers do not need to be broadcast because they are not used; if it is not enabled, the default is True, so the buffers are broadcast and the different parts of the model stay consistent across devices.
148 | 78. **`gradient_checkpointing`** (bool, *optional*, defaults to False): Whether to enable gradient checkpointing. Briefly: training large models needs a lot of memory, because the intermediate results of the forward pass must be kept to compute gradients in the backward pass, and those intermediates can exhaust memory. Gradient checkpointing frees intermediates that are no longer needed during training to reduce memory usage, at the cost of slower training.
149 | 79. **`dataloader_pin_memory`** (bool, *optional*, defaults to True): Whether the dataloader uses "pin memory" when loading data. "Pin memory" copies the data into page-locked (pinned) memory before it is moved to the GPU; pinned memory transfers data to the GPU faster, which speeds up training, but it uses extra CPU memory and can cause out-of-memory problems. If the dataset is very large, say hundreds of GB or more, setting this to False is recommended.
150 | 80. **`skip_memory_metrics`** (bool, *optional*, defaults to True): Controls whether memory-profiling reports are added to the metrics. This is skipped by default to keep training and evaluation fast; enabling the report gives a much clearer picture of memory usage at each step.
151 | 81. **`include_inputs_for_metrics`** (bool, *optional*, defaults to False): Whether to pass the inputs to the `compute_metrics` function. Metrics are usually computed from the model's predictions and the provided labels, but some metrics also need the inputs, such as the IoU (Intersection over Union) metric in computer vision.
152 | 82. **`auto_find_batch_size`** (bool, *optional*, defaults to False): Whether to automatically search for a batch size that fits in memory, to avoid CUDA out-of-memory errors; requires `accelerate` (install with `pip install accelerate`). This feature is genuinely handy.
153 | 83. **`full_determinism`** (bool, *optional*, defaults to False): If set to `True`, `enable_full_determinism()` is called instead of `set_seed()` and training becomes fully deterministic: all sources of randomness are removed, so every run produces exactly the same results. Note that this hurts performance, so use it only for debugging.
154 | 84. **`torchdynamo`** (str, *optional*): Selects the backend compiler for TorchDynamo, a PyTorch library for improving model performance and deployment efficiency. The options include "eager", "aot_eager", "inductor", "nvfuser", "aot_nvfuser", "aot_cudagraphs", "ofi", "fx2trt", "onnxrt" and "ipex". The default is fine; it is chosen automatically.
155 | 85. **`ray_scope`** (str, *optional*, defaults to "last"): The scope to use when doing hyperparameter search with Ray. With the default, "last", Ray uses the last checkpoint of all trials, compares them, and picks the best one. See the Ray documentation for details.
156 | 86. **`ddp_timeout`** (int, *optional*, defaults to 1800): The timeout for the torch.distributed.init_process_group call, used to avoid timeouts when slow operations run in distributed jobs. See the [PyTorch documentation](https://link.zhihu.com/?target=https%3A//pytorch.org/docs/stable/distributed.html%23torch.distributed.init_process_group "PyTorch documentation") for details.
157 | 87. **`torch_compile`** (bool, *optional*, defaults to False): Whether to compile the model with torch.compile (PyTorch 2.0 and later). See the [PyTorch documentation](https://link.zhihu.com/?target=https%3A//pytorch.org/docs/stable/distributed.html%23torch.distributed.init_process_group "PyTorch documentation") for details.
158 | 88. **`torch_compile_backend`** (str, *optional*): The backend to use in torch.compile. Setting any value enables torch_compile.
159 | 89. **`torch_compile_mode`** (str, *optional*): The mode to use in torch.compile. Setting any value enables torch_compile.
160 | 90. **`include_tokens_per_second`** (bool, *optional*): Whether to compute tokens per second per device as a training-speed metric. It iterates over the whole training dataloader beforehand and slows training down slightly; turning it on is still recommended.
161 | 91. **`push_to_hub`** (bool, *optional*, defaults to False): Whether to push the model to the Hugging Face Hub every time it is saved.
162 | 92. **`hub_model_id`** (str, *optional*): The name of the repository to keep in sync with the local output_dir.
163 | 93. **`hub_strategy`** (str or HubStrategy, *optional*, defaults to "every_save"): How to push to the Hugging Face Hub.
164 | 94. **`hub_token`** (str, *optional*): The token used to push the model to the Hugging Face Hub.
165 | 95. **`hub_private_repo`** (bool, *optional*, defaults to False): If set to True, the Hugging Face Hub repository is made private.
166 | 96. **`hub_always_push`** (bool, *optional*, defaults to False): Whether to push the model on every save, no matter what.
167 |
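To tie the parameters above together, the following is a sketch of a typical `TrainingArguments` configuration for transformers 4.34. The values (output directory, batch sizes, schedule, precision) are illustrative assumptions, not the settings used by this repository's training scripts; the comments point back to the item numbers in the list above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/sft_demo",      # 1. where checkpoints are written (placeholder path)
    overwrite_output_dir=True,         # 2. reuse the directory
    do_train=True,                     # 3. script-level switch
    evaluation_strategy="steps",       # 6. evaluate every eval_steps
    eval_steps=500,                    # 53.
    per_device_train_batch_size=8,     # 8.
    gradient_accumulation_steps=4,     # 10. effective batch = 8 * 4 * n_gpus
    learning_rate=5e-5,                # 13.
    weight_decay=0.01,                 # 14.
    max_grad_norm=1.0,                 # 18.
    num_train_epochs=1,                # 19.
    lr_scheduler_type="cosine",        # 21.
    warmup_ratio=0.03,                 # 22.
    log_level="info",                  # 24. recommended above
    logging_strategy="steps",          # 28.
    logging_steps=100,                 # 29.
    save_strategy="steps",             # 31.
    save_steps=500,                    # 32.
    save_total_limit=3,                # 33.
    seed=42,                           # 38.
    bf16=True,                         # 42. needs an Ampere or newer GPU
    report_to=["tensorboard"],         # 74.
    gradient_checkpointing=True,       # 78. trade compute for memory
)
```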
--------------------------------------------------------------------------------