├── log └── README.md ├── pics ├── llama.png ├── main_table.png └── combined_plot.png ├── .gitignore ├── config ├── __pycache__ │ ├── prompt.cpython-38.pyc │ └── prompt.cpython-39.pyc └── prompt.py ├── requirements.txt ├── scripts ├── preprocess_lb.sh ├── merge_config.yaml ├── qwen14b_lora.yaml ├── qwen_warmup.sh ├── llama_warmup.sh ├── llama_sft.sh ├── qwen_sft.sh ├── qwen_lora.sh └── lora_sft.sh ├── evaltoolkits ├── cache │ ├── user_config.yaml │ ├── ds_z2_config.json │ ├── ds_z2_offload_config.json │ ├── ds_z3_config.json │ └── ds_z3_offload_config.json ├── filter.sh ├── loop_sample.sh ├── loop_eval.sh ├── utils.py ├── launch_lbv2.sh ├── launch_lbv2m.sh ├── launch_lbv1.sh ├── launch_inference.sh ├── step2_extract_preds_from_raw.py ├── launch_lbv1big.sh ├── step3_eval_f1.py ├── step1_eval_inference.py ├── metrics.py ├── filter_data.py ├── inference2.out └── inference.out ├── preprocess_train.py ├── cache └── ds_z3_config.json ├── preprocess_lbv1.py ├── preprocess_lbv2.py ├── data └── dataset_info.json └── README.md /log/README.md: -------------------------------------------------------------------------------- 1 | This is a folder for log saving. -------------------------------------------------------------------------------- /pics/llama.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/llama.png -------------------------------------------------------------------------------- /pics/main_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/main_table.png -------------------------------------------------------------------------------- /pics/combined_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/combined_plot.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | dataset/ 2 | data/ 3 | saves/ 4 | evaltoolkits/pred*/ 5 | **__pycache__/ 6 | log/ 7 | *.out 8 | **bak/ 9 | **dev/ -------------------------------------------------------------------------------- /config/__pycache__/prompt.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/config/__pycache__/prompt.cpython-38.pyc -------------------------------------------------------------------------------- /config/__pycache__/prompt.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/config/__pycache__/prompt.cpython-39.pyc -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | vllm==0.6.0 2 | vllm-flash-attn==2.6.1 3 | deepspeed==0.15.4 4 | openai 5 | jsonlines 6 | json-repair 7 | fuzzywuzzy 8 | jieba -------------------------------------------------------------------------------- /scripts/preprocess_lb.sh: -------------------------------------------------------------------------------- 1 | mkdir -p dataset/longbenchv1 2 | mkdir -p dataset/longbenchv2 3 | 4 | python preprocess_lbv1.py 5 | python preprocess_lbv2.py 
-------------------------------------------------------------------------------- /evaltoolkits/cache/user_config.yaml: -------------------------------------------------------------------------------- 1 | cache_dir: null 2 | lang: zh 3 | last_model: Qwen2.5-14B 4 | path_dict: 5 | Qwen2.5-14B: /mnt/xiyu/Model/Qwen/Qwen2.5-14B 6 | -------------------------------------------------------------------------------- /evaltoolkits/filter.sh: -------------------------------------------------------------------------------- 1 | python /mnt/xiyu/LongRePS/evaltoolkits/filter_data.py \ 2 | --path_to_src_file /mnt/xiyu/musique-Qwen-2.5-32B.temp0.7sample30.cot_eval.jsonl \ 3 | --path_to_stage1_file /mnt/xiyu/LongRePS/dataset/musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage1.jsonl \ 4 | --sample_num 30 5 | -------------------------------------------------------------------------------- /scripts/merge_config.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1 2 | adapter_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm_train_lr5e-5_maxlen16k_2025-03-30-17-41-39/checkpoint-88 3 | template: qwen 4 | finetuning_type: lora 5 | 6 | export_dir: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2 7 | export_size: 2 8 | export_device: cpu 9 | export_legacy_format: false -------------------------------------------------------------------------------- /evaltoolkits/loop_sample.sh: -------------------------------------------------------------------------------- 1 | eval_model_list=( 2 | "Qwen-7B-Instruct-yarn-example /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct musique-Qwen-2.5-7B_prm_train.jsonl" 3 | ) 4 | model_list=("${eval_model_list[@]}") 5 | 6 | for model in "${model_list[@]}"; do 7 | model_name=$(echo $model | cut -d' ' -f1) 8 | model_path=$(echo $model | cut -d' ' -f2) 9 | file_name=$(echo $model | cut -d' ' -f3) 10 | echo "Launching inference for ${model_name}..." 11 | echo "Model path: ${model_path}" 12 | echo "File name: ${file_name}" 13 | bash launch_inference.sh ${model_name} ${model_path} ${file_name} 14 | done -------------------------------------------------------------------------------- /evaltoolkits/loop_eval.sh: -------------------------------------------------------------------------------- 1 | eval_model_list=( 2 | "Llama-8B-warmup-lr1e-5-epoch1 ../saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10 " 3 | ) 4 | model_list=("${eval_model_list[@]}") 5 | 6 | for model in "${model_list[@]}"; do 7 | model_name=$(echo $model | cut -d' ' -f1) 8 | model_path=$(echo $model | cut -d' ' -f2) 9 | file_name=$(echo $model | cut -d' ' -f3) 10 | echo "Launching inference for ${model_name}..." 
11 | echo "Model path: ${model_path}" 12 | echo "File name: ${file_name}" 13 | bash new_launch_inference.sh ${model_name} ${model_path} ${file_name} 14 | done -------------------------------------------------------------------------------- /scripts/qwen14b_lora.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: /mnt/xiyu/Model/Qwen/Qwen2.5-14B 2 | 3 | stage: sft 4 | do_train: true 5 | finetuning_type: lora 6 | lora_target: all 7 | 8 | dataset: Qwen-2.5-14B_warmup 9 | dataset_dir: data 10 | template: qwen 11 | cutoff_len: 16384 12 | max_samples: 100000 13 | overwrite_cache: true 14 | preprocessing_num_workers: 16 15 | 16 | output_dir: saves/llama3-8b/lora/sft 17 | logging_steps: 5 18 | save_strategy: epoch 19 | plot_loss: true 20 | overwrite_output_dir: true 21 | 22 | per_device_train_batch_size: 1 23 | gradient_accumulation_steps: 4 24 | learning_rate: 1e-5 25 | num_train_epochs: 2.0 26 | lr_scheduler_type: constant_with_warmup 27 | warmup_steps: 5 28 | bf16: true 29 | ddp_timeout: 180000000 -------------------------------------------------------------------------------- /preprocess_train.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import jsonlines 3 | 4 | 5 | def preprocess(model:str): 6 | dataset = load_dataset(f"Lemon123prog/{model}-LongRePS") 7 | warmup_data=dataset['warmup'].to_list() 8 | orm_data=dataset['train_orm'].to_list() 9 | prm_data=dataset['train_prm'].to_list() 10 | 11 | with jsonlines.open(f"./data/{model}_warmup.jsonl", 'w') as writer: 12 | writer.write_all(warmup_data) 13 | 14 | with jsonlines.open(f"./data/musique-{model}_orm_train.jsonl", 'w') as writer: 15 | writer.write_all(orm_data) 16 | 17 | with jsonlines.open(f"./data/musique-{model}_prm_train.jsonl", 'w') as writer: 18 | writer.write_all(prm_data) 19 | 20 | 21 | preprocess("Llama-3.1-8B") 22 | preprocess("Qwen-2.5-7B") -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z2_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 2, 20 | "allgather_partitions": true, 21 | "allgather_bucket_size": 500000000.0, 22 | "overlap_comm": true, 23 | "reduce_scatter": true, 24 | "reduce_bucket_size": 500000000.0, 25 | "contiguous_gradients": true, 26 | "round_robin_gradients": true 27 | } 28 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z2_offload_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | 
"zero_optimization": { 19 | "stage": 2, 20 | "allgather_partitions": true, 21 | "allgather_bucket_size": 500000000.0, 22 | "overlap_comm": true, 23 | "reduce_scatter": true, 24 | "reduce_bucket_size": 500000000.0, 25 | "contiguous_gradients": true, 26 | "round_robin_gradients": true, 27 | "offload_optimizer": { 28 | "device": "cpu", 29 | "pin_memory": true 30 | } 31 | } 32 | } -------------------------------------------------------------------------------- /cache/ds_z3_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | "stage3_gather_16bit_weights_on_model_save": true 29 | } 30 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z3_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | "stage3_gather_16bit_weights_on_model_save": true 29 | } 30 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z3_offload_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | 
"stage3_gather_16bit_weights_on_model_save": true, 29 | "offload_optimizer": { 30 | "device": "cpu", 31 | "pin_memory": true 32 | }, 33 | "offload_param": { 34 | "device": "cpu", 35 | "pin_memory": true 36 | } 37 | } 38 | } -------------------------------------------------------------------------------- /scripts/qwen_warmup.sh: -------------------------------------------------------------------------------- 1 | model_path="../Model/Qwen/Qwen2.5-7B" 2 | template="qwen" 3 | learning_rate=1e-5 4 | dataset="Qwen-2.5-7B_warmup" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/llama_warmup.sh: -------------------------------------------------------------------------------- 1 | model_path="../Model/meta-llama/Llama-3.1-8B" 2 | template="llama3" 3 | learning_rate=1e-5 4 | dataset="Llama-3.1-8B_warmup" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/llama_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10" 2 | template="llama3" 3 | learning_rate=5e-6 4 | dataset="Llama-3.1-8B_sample30_thresh1.0_v2_prm" 5 | echo 
"Dataname: ${dataset}" 6 | 7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/qwen_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-7B/full/Qwen-2.5-7B_warmup_train_lr1e-5_maxlen16k_2025-03-26-17-19-47/checkpoint-18" 2 | template="qwen" 3 | learning_rate=5e-6 4 | dataset="Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/qwen_lora.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-14B" 2 | template="qwen" 3 | learning_rate=1e-5 4 | dataset="Qwen-2.5-7B_orm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-14B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | cp /mnt/xiyu/LongRePS/scripts/qwen_lora.sh ${output_path} 10 | 11 | llamafactory-cli train \ 12 | --stage sft \ 13 | --do_train True \ 14 | --model_name_or_path ${model_path} \ 15 | --preprocessing_num_workers 16 \ 16 | --finetuning_type lora \ 17 | --lora_rank 128 \ 18 | --lora_alpha 128 \ 19 | --lora_dropout 0.05 \ 20 | --lora_target all \ 21 | 
--template ${template} \ 22 | --flash_attn auto \ 23 | --dataset_dir data \ 24 | --dataset ${dataset} \ 25 | --cutoff_len 16384 \ 26 | --learning_rate ${learning_rate} \ 27 | --num_train_epochs 2 \ 28 | --max_samples 100000 \ 29 | --per_device_train_batch_size 1 \ 30 | --gradient_accumulation_steps 4 \ 31 | --lr_scheduler_type cosine \ 32 | --max_grad_norm 1.0 \ 33 | --logging_steps 5 \ 34 | --save_strategy epoch \ 35 | --packing False \ 36 | --save_only_model True \ 37 | --report_to none \ 38 | --output_dir ${output_path} \ 39 | --bf16 True \ 40 | --plot_loss True \ 41 | --ddp_timeout 180000000 \ 42 | --optim adamw_torch \ 43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 44 | done -------------------------------------------------------------------------------- /scripts/lora_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1" 2 | template="qwen" 3 | learning_rate=5e-5 4 | dataset="Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-32B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | cp /mnt/xiyu/LongRePS/scripts/lora_sft.sh ${output_path} 10 | 11 | llamafactory-cli train \ 12 | --stage sft \ 13 | --do_train True \ 14 | --model_name_or_path ${model_path} \ 15 | --preprocessing_num_workers 16 \ 16 | --finetuning_type lora \ 17 | --lora_rank 128 \ 18 | --lora_alpha 128 \ 19 | --lora_dropout 0.05 \ 20 | --lora_target all \ 21 | --template ${template} \ 22 | --flash_attn auto \ 23 | --dataset_dir data \ 24 | --dataset ${dataset} \ 25 | --cutoff_len 16384 \ 26 | --learning_rate ${learning_rate} \ 27 | --num_train_epochs 2 \ 28 | --max_samples 100000 \ 29 | --per_device_train_batch_size 1 \ 30 | --gradient_accumulation_steps 4 \ 31 | --lr_scheduler_type cosine \ 32 | --max_grad_norm 1.0 \ 33 | --logging_steps 5 \ 34 | --save_strategy epoch \ 35 | --packing False \ 36 | --save_only_model True \ 37 | --report_to none \ 38 | --output_dir ${output_path} \ 39 | --bf16 True \ 40 | --plot_loss True \ 41 | --ddp_timeout 180000000 \ 42 | --optim adamw_torch \ 43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 44 | done -------------------------------------------------------------------------------- /preprocess_lbv1.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | 11 | from typing import List 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from transformers import AutoTokenizer, AutoModelForCausalLM 15 | from datasets import Dataset, load_dataset 16 | from config.prompt import prompt_lbv1_cot,prompt_lbv1_nocot,prompt_cot 17 | 18 | def construct_cot_nocot_split(split: str): 19 | data_list=load_dataset('THUDM/LongBench',split, split='test') 20 | new_cot_data_list = [] 21 | new_nocot_data_list = [] 22 | for data in data_list: 23 | context = data["context"] 24 | question = data["input"] 25 | answers = data["answers"] 26 | all_classes=data['all_classes'] 27 | id = data["_id"] 28 | output = answers[0] 29 | instruction_cot = prompt_lbv1_cot.format(context=context, question=question) 30 | instruction_nocot = prompt_lbv1_nocot.format(context=context, question=question) 31 | 
new_cot_data_list.append({"id": id, "question":question,"instruction": instruction_cot, "answers": answers,"all_classes":all_classes ,"output": output, "system": "You are a helpful assistant."}) 32 | new_nocot_data_list.append({"id": id,"question":question,"instruction": instruction_nocot, "answers": answers, "all_classes":all_classes,"output": output, "system": "You are a helpful assistant."}) 33 | print(f"size of {split} new_cot_data_list: {len(new_cot_data_list)}") 34 | print(f"size of {split} new_nocot_data_list: {len(new_nocot_data_list)}") 35 | with jsonlines.open(f"dataset/longbenchv1/{split}_cot.jsonl", 'w') as writer: 36 | writer.write_all(new_cot_data_list) 37 | with jsonlines.open(f"dataset/longbenchv1/{split}_nocot.jsonl", 'w') as writer: 38 | writer.write_all(new_nocot_data_list) 39 | print(f"Finished writing {split} dataset") 40 | return 41 | 42 | 43 | 44 | from datasets import load_dataset 45 | datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa","musique"] 46 | for dataset in datasets: 47 | construct_cot_nocot_split(dataset) 48 | -------------------------------------------------------------------------------- /preprocess_lbv2.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | 11 | from typing import List 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from transformers import AutoTokenizer, AutoModelForCausalLM 15 | from datasets import Dataset, load_dataset 16 | from config.prompt import prompt_lbv2_cot,prompt_lbv2_nocot 17 | 18 | def construct_cot_nocot_split(filtered_data,split="MQA"): 19 | new_cot_data_list = [] 20 | new_nocot_data_list = [] 21 | for item in filtered_data: 22 | context = item["context"] 23 | question=item['question'] 24 | _id=item['_id'] 25 | difficulty=item['difficulty'] 26 | instruction_cot=prompt_lbv2_cot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D']) 27 | instruction_nocot=prompt_lbv2_nocot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D']) 28 | new_cot_data_list.append({"id": id, "instruction": instruction_cot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."}) 29 | new_nocot_data_list.append({"id": id, "instruction": instruction_nocot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."}) 30 | 31 | print(f"size of new_cot_data_list: {len(new_cot_data_list)}") 32 | print(f"size of new_nocot_data_list: {len(new_nocot_data_list)}") 33 | with jsonlines.open(f"dataset/longbenchv2/{split}_cot.jsonl", 'w') as writer: 34 | writer.write_all(new_cot_data_list) 35 | with jsonlines.open(f"dataset/longbenchv2/{split}_nocot.jsonl", 'w') as writer: 36 | writer.write_all(new_nocot_data_list) 37 | 38 | 39 | 40 | split_list=["Single-Document QA","Multi-Document QA"] 41 | split_tag=["SQA","MQA"] 42 | dataset=load_dataset('Lemon123prog/Longmix-LongRePS',split='test') 43 | 44 | for (split,tag) in zip(split_list,split_tag): 45 | filter_data=[] 46 | for data in dataset: 47 | if data['domain']==split and data['token_num']<105*1024: 
#In case of the Bug for over-long data 48 | filter_data.append(data) 49 | construct_cot_nocot_split(filter_data,split=tag) 50 | 51 | -------------------------------------------------------------------------------- /evaltoolkits/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import string 4 | import json_repair 5 | 6 | from pathlib import Path 7 | from typing import Union, List 8 | 9 | # from metrics import normalize_answer 10 | def extract_number(text): 11 | match = re.search(r'\[\[([0-9]*\.?[0-9]+)\]\]', text) 12 | if match: 13 | return float(match.group(1)) 14 | match = re.search(r'\[([0-9]*\.?[0-9]+)\]', text) 15 | if match: 16 | return float(match.group(1)) 17 | return 0.0 18 | 19 | def normalize_answer(s): 20 | """Lower text and remove punctuation, articles and extra whitespace.""" 21 | 22 | def remove_articles(text): 23 | return re.sub(r"\b(a|an|the)\b", " ", text) 24 | 25 | def white_space_fix(text): 26 | return " ".join(text.split()) 27 | 28 | def remove_punc(text): 29 | exclude = set(string.punctuation) 30 | return "".join(ch for ch in text if ch not in exclude) 31 | 32 | def lower(text): 33 | return text.lower() 34 | 35 | return white_space_fix(remove_articles(remove_punc(lower(s)))) 36 | 37 | def load_jsonl_file(path_to_file: Union[str, Path]): 38 | data_list = [] 39 | error_cnt = 0 40 | with open(path_to_file) as f: 41 | for idx, line in enumerate(f): 42 | try: 43 | data = json.loads(line) 44 | data_list.append(data) 45 | except Exception as e: 46 | error_cnt += 1 47 | print(f"Failed loading line {idx}, error: {e}") 48 | print(line) 49 | print(f"Failed loading {error_cnt} lines, total {len(data_list)} lines loaded") 50 | return data_list 51 | 52 | def preprocess_pred_for_json_repair(pred: str): 53 | escaped_str = re.sub( 54 | r'(?<="reasoning": ")(.*?)(?="\s*,\s*\n\s*"answer":)', 55 | lambda match: re.sub(r'(? 
bool: 67 | # Remove punctuation from the instruction 68 | instruction_cleaned = normalize_answer(instruction) 69 | 70 | for fact in fact_list: 71 | # Remove punctuation from the fact 72 | fact_cleaned = normalize_answer(fact) 73 | # Compare the punctuation-stripped fact against the instruction 74 | if fact_cleaned not in instruction_cleaned: 75 | # print(fact) 76 | return False 77 | return True 78 | 79 | def check_pred_fact_consistency(pred: str, instruction: str): 80 | 81 | processed_pred = preprocess_pred_for_json_repair(pred) 82 | content = json_repair.loads(processed_pred) 83 | if not isinstance(content, dict) or not content or 'reasoning' not in content or len(content) > 2 or type(content['reasoning']) != str: 84 | return False 85 | fact_list = extract_fact_list(content['reasoning']) 86 | if len(fact_list) > 0 and verify_fact_list(fact_list, instruction): 87 | return True 88 | return False -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv2.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 4 | mode="cot" 5 | 6 | domain_list=("all") 7 | eval_data_dir="../dataset/longbenchv2" 8 | sample_num=1 9 | temperature=0.0 10 | 11 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Models 12 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} 13 | 14 | 15 | for gpu_id in 0 1 2 3 4 5 6 7; do 16 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 17 | --served-model-name ${model_name} \ 18 | --model ${model_path} \ 19 | --tensor-parallel-size=1 \ 20 | --trust-remote-code \ 21 | --dtype bfloat16 \ 22 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 23 | done 24 | sleep 30 # sleep 30s, wait for the servers to start 25 | 26 | 27 | for domain in "${domain_list[@]}"; do 28 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl") 29 | for file_name in "${file_name_list[@]}"; do 30 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 31 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 32 | result_dir="./pred_cot_vs_nocot" 33 | output_dir="${result_dir}/${model_name}" 34 | mkdir -p ${output_dir} 35 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 36 | 37 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 38 | eval_data_dir="${eval_data_dir%/}" 39 | 40 | echo "Launching inference for ${model_name}..." 41 | echo "Model path: ${model_path}" 42 | echo "Eval data dir: ${eval_data_dir}" 43 | echo "File name: ${file_name}" 44 | echo "Sample num: ${sample_num}" 45 | echo "Result dir: ${result_dir}" 46 | echo "Temperature: ${temperature}" 47 | echo "COT mode: ${cot_mode}" 48 | echo "Dataset: ${eval_dataset_name}" 49 | 50 | echo "Evaluating ${eval_dataset_name}..."
51 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 52 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 53 | 54 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 55 | --model ${model_name} \ 56 | --model_path ${model_path} \ 57 | --data_path ${eval_data_dir}/${file_name} \ 58 | --output_path ${path_to_inference_output} \ 59 | --sample_num ${sample_num} \ 60 | --dataset_name ${eval_dataset_name} \ 61 | --temperature ${temperature} \ 62 | > ./inference2.out 63 | 64 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 65 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 66 | 67 | done 68 | done 69 | pkill -f vllm; pkill -f spawn_main 70 | -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv2m.sh: -------------------------------------------------------------------------------- 1 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 2 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 3 | #model_config="${model_root}/tokenizer*" 4 | #model_path="${model_root}/checkpoint-58" 5 | mode="cot" 6 | 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv2" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | 15 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \ 16 | --served-model-name ${model_name} \ 17 | --model ${model_path} \ 18 | --tensor-parallel-size=4 \ 19 | --trust-remote-code \ 20 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 & 21 | 22 | CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \ 23 | --served-model-name ${model_name} \ 24 | --model ${model_path} \ 25 | --tensor-parallel-size=4 \ 26 | --trust-remote-code \ 27 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 & 28 | 29 | sleep 30 # sleep 30s, wait for the servers to start 30 | 31 | 32 | for domain in "${domain_list[@]}"; do 33 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl") 34 | for file_name in "${file_name_list[@]}"; do 35 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 36 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 37 | result_dir="./pred_cot_vs_nocot" 38 | output_dir="${result_dir}/${model_name}" 39 | mkdir -p ${output_dir} 40 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 41 | 42 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 43 | eval_data_dir="${eval_data_dir%/}" 44 | 45 | echo "Launching inference for ${model_name}..." 46 | echo "Model path: ${model_path}" 47 | echo "Eval data dir: ${eval_data_dir}" 48 | echo "File name: ${file_name}" 49 | echo "Sample num: ${sample_num}" 50 | echo "Result dir: ${result_dir}" 51 | echo "Temperature: ${temperature}" 52 | echo "COT mode: ${cot_mode}" 53 | echo "Dataset: ${eval_dataset_name}" 54 | 55 | echo "Evaluating ${eval_dataset_name}..." 
56 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 57 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 58 | 59 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 60 | --model ${model_name} \ 61 | --model_path ${model_path} \ 62 | --data_path ${eval_data_dir}/${file_name} \ 63 | --output_path ${path_to_inference_output} \ 64 | --sample_num ${sample_num} \ 65 | --dataset_name ${eval_dataset_name} \ 66 | --temperature ${temperature} \ 67 | --world_size 2 \ 68 | > ./inference2.out 69 | 70 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 71 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 72 | 73 | done 74 | done 75 | pkill -f vllm; pkill -f spawn_main -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv1.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-7B-Instruct" 3 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct" 4 | #model_config="${model_root}/tokenizer*" 5 | #model_path="${model_root}/checkpoint-58" 6 | mode="cot" 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv1" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | #cp ${model_config} ${model_path} 15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models 16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} 17 | 18 | 19 | for gpu_id in 0 1 2 3 4 5 6 7; do 20 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 21 | --served-model-name ${model_name} \ 22 | --model ${model_path} \ 23 | --tensor-parallel-size=1 \ 24 | --swap-space 32\ 25 | --trust-remote-code \ 26 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 27 | done 28 | sleep 30 # sleep 30s, wait for the servers to start 29 | 30 | 31 | for domain in "${domain_list[@]}"; do 32 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl" 33 | for file_name in "${file_name_list[@]}"; do 34 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 35 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 36 | result_dir="./pred_cot_vs_nocot" 37 | output_dir="${result_dir}/${model_name}" 38 | mkdir -p ${output_dir} 39 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 40 | 41 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 42 | eval_data_dir="${eval_data_dir%/}" 43 | 44 | echo "Launching inference for ${model_name}..." 45 | echo "Model path: ${model_path}" 46 | echo "Eval data dir: ${eval_data_dir}" 47 | echo "File name: ${file_name}" 48 | echo "Sample num: ${sample_num}" 49 | echo "Result dir: ${result_dir}" 50 | echo "Temperature: ${temperature}" 51 | echo "COT mode: ${cot_mode}" 52 | echo "Dataset: ${eval_dataset_name}" 53 | 54 | echo "Evaluating ${eval_dataset_name}..." 
55 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 56 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 57 | 58 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 59 | --model ${model_name} \ 60 | --model_path ${model_path} \ 61 | --data_path ${eval_data_dir}/${file_name} \ 62 | --output_path ${path_to_inference_output} \ 63 | --sample_num ${sample_num} \ 64 | --dataset_name ${eval_dataset_name} \ 65 | --temperature ${temperature} \ 66 | > ./inference2.out 67 | 68 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 69 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 70 | 71 | done 72 | done 73 | pkill -f vllm; pkill -f spawn_main 74 | -------------------------------------------------------------------------------- /evaltoolkits/launch_inference.sh: -------------------------------------------------------------------------------- 1 | # if model_name, model_path, eval_data_dir, file_name, sample_num, thresh, temperature, filtered_filename, are passed as arguments, then parse these arguments 2 | if [[ $# -eq 8 ]]; then 3 | model_name=$1 4 | model_path=$2 5 | eval_data_dir=$3 6 | file_name=$4 7 | sample_num=$5 8 | thresh=$6 9 | temperature=$7 10 | filtered_filename=$8 11 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation) 12 | else 13 | model_name=$1 14 | model_path=$2 15 | file_name=$3 16 | eval_data_dir="../dataset" 17 | mode="predicted_answer" 18 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation) 19 | if [[ $inference_mode == "train" ]]; then 20 | sample_num=1 21 | thresh=1.0 22 | temperature=0.7 23 | filtered_filename="${model_name}_sample${sample_num}temp${temperature}thresh${thresh}.jsonl" 24 | elif [[ $inference_mode == "eval" ]]; then 25 | sample_num=1 26 | temperature=0.0 27 | fi 28 | 29 | fi 30 | 31 | 32 | 33 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 34 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 35 | result_dir="./pred_${inference_mode}" 36 | output_dir="${result_dir}/${model_name}" 37 | mkdir -p ${output_dir} 38 | echo -e "\nScript executed with parameters: $@" >> ${output_dir}/new_launch_inference.sh 39 | 40 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 41 | eval_data_dir="${eval_data_dir%/}" 42 | 43 | echo "Launching inference for ${model_name}..." 
44 | echo "Model path: ${model_path}" 45 | echo "Eval data dir: ${eval_data_dir}" 46 | echo "File name: ${file_name}" 47 | echo "Inference mode: ${inference_mode}" 48 | echo "Sample num: ${sample_num}" 49 | echo "Result dir: ${result_dir}" 50 | echo "Temperature: ${temperature}" 51 | echo "COT mode: ${cot_mode}" 52 | echo "Dataset: ${eval_dataset_name}" 53 | echo "Filtered filename: ${filtered_filename}" 54 | echo "Thresh: ${thresh}" 55 | 56 | #cp Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Model 57 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Model 58 | 59 | for gpu_id in 0 1 2 3 4 5 6 7; do 60 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 61 | --served-model-name ${model_name} \ 62 | --model ${model_path} \ 63 | --tensor-parallel-size=1 \ 64 | --trust-remote-code \ 65 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 66 | done 67 | 68 | sleep 30 # sleep 30s, wait for the servers to start 69 | 70 | echo "Evaluating ${eval_dataset_name}..." 71 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 72 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 73 | 74 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 75 | --model ${model_name} \ 76 | --model_path ${model_path} \ 77 | --data_path ${eval_data_dir}/${file_name} \ 78 | --output_path ${path_to_inference_output} \ 79 | --sample_num ${sample_num} \ 80 | --dataset_name ${eval_dataset_name} \ 81 | --temperature ${temperature} \ 82 | > ./inference.out 83 | 84 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 85 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 86 | 87 | pkill -f vllm; pkill -f spawn_main 88 | -------------------------------------------------------------------------------- /evaltoolkits/step2_extract_preds_from_raw.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import json_repair 4 | import argparse 5 | import jsonlines 6 | 7 | from typing import List 8 | from tqdm import tqdm 9 | from utils import load_jsonl_file 10 | 11 | def parse_args(args=None): 12 | parser = argparse.ArgumentParser() 13 | parser.add_argument('--path_to_src_file', type=str, default=None) 14 | return parser.parse_args(args) 15 | 16 | def extract_pred_list(raw_pred_list:List[str]): 17 | extracted_pred_list = [] 18 | for pred in raw_pred_list: 19 | if pred.startswith("```json"): 20 | # remove the begining and ending ```json 21 | pred = pred.replace("```json", "").replace("```", "") 22 | pred = pred.strip() 23 | if pred.startswith("{"): 24 | pred = pred.strip() 25 | try: 26 | content = json_repair.loads(pred) 27 | if type(content)==list: 28 | content=content[0] 29 | content = content["answer"] 30 | extracted_pred_list.append(str(content)) 31 | except Exception as e: 32 | # print(e, pred) 33 | # try to extract the answer from the raw pred, if failed, append the raw pred 34 | # use re to extract the content after "answer: " and before "." 
(inclusive) 35 | try: 36 | #content =re.findall(r'"answer": (.+?)(?=\n|$)', pred)[0].strip() 37 | #print(content) 38 | #content = content.strip('\'"[]') 39 | pattern = r'"answer": "([^"]+)"' 40 | match = re.search(pattern, pred) 41 | content = match.group(1) 42 | #print(content) 43 | # print(f"Extracted re: {content}") 44 | extracted_pred_list.append(content) 45 | except Exception as e2: 46 | extracted_pred_list.append(pred) 47 | else: 48 | # extract plain text format response 49 | # print("extracting plain text format response") 50 | ''' 51 | try: 52 | content = pred.split("Answer:")[1].split("Reasoning:")[0] 53 | except: 54 | try: 55 | content = pred.split("Answer:")[1] 56 | except: 57 | content = pred 58 | try: 59 | content = pred.split("Reasoning:")[0] 60 | except: 61 | content = pred 62 | ''' 63 | try: 64 | content=pred.split("Answer:")[1] 65 | #content=pred.split("Reasoning:")[1] 66 | except: 67 | try: 68 | content=pred.split("Reasoning:")[1] 69 | except: 70 | try: 71 | #print("Pred:",pred) 72 | content=pred.split('\n')[0] 73 | #print("Content:",content) 74 | except: 75 | content=pred 76 | try: 77 | content=content.split("\n")[0] 78 | except: 79 | content=content 80 | extracted_pred_list.append(content) 81 | return extracted_pred_list 82 | 83 | 84 | def main(): 85 | args = parse_args() 86 | data_list = load_jsonl_file(args.path_to_src_file) 87 | for data in tqdm(data_list, desc="Extracting preds"): 88 | extracted_pred_list = extract_pred_list(data["pred"]) 89 | data["extracted_pred_list"] = extracted_pred_list 90 | path_to_tgt_file = args.path_to_src_file.replace(".jsonl", "_eval.jsonl") 91 | with jsonlines.open(path_to_tgt_file, mode='w') as writer: 92 | writer.write_all(data_list) 93 | 94 | if __name__ == "__main__": 95 | main() 96 | 97 | 98 | 99 | -------------------------------------------------------------------------------- /data/dataset_info.json: -------------------------------------------------------------------------------- 1 | { 2 | "Llama-3.1-8B_warmup": { 3 | "file_name": "Llama-3.1-8B_warmup.jsonl", 4 | "columns": { 5 | "prompt": "instruction", 6 | "response": "output", 7 | "system": "system" 8 | } 9 | }, 10 | "Llama-3.1-8B_orm": { 11 | "file_name": "Llama-3.1-8B_orm.jsonl", 12 | "columns": { 13 | "prompt": "instruction", 14 | "response": "output", 15 | "system": "system" 16 | } 17 | }, 18 | "Llama-3.1-8B_sample30_thresh1.0_prm": { 19 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_llamastage2.jsonl", 20 | "columns": { 21 | "prompt": "instruction", 22 | "response": "output", 23 | "system": "system" 24 | } 25 | }, 26 | "Llama-3.1-8B_sample30_thresh1.0_checkstage3_prm": { 27 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 28 | "columns": { 29 | "prompt": "instruction", 30 | "response": "output", 31 | "system": "system" 32 | } 33 | }, 34 | "Qwen-2.5-7B_warmup": { 35 | "file_name": "Qwen-2.5-7B_warmup.jsonl", 36 | "columns": { 37 | "prompt": "instruction", 38 | "response": "output", 39 | "system": "system" 40 | } 41 | }, 42 | "Qwen-2.5-7B_orm": { 43 | "file_name": "Qwen-2.5-7B_orm.jsonl", 44 | "columns": { 45 | "prompt": "instruction", 46 | "response": "output", 47 | "system": "system" 48 | } 49 | } 50 | , 51 | "Qwen-2.5-7B_prm": { 52 | "file_name": "Qwen-2.5-7B_prm.jsonl", 53 | "columns": { 54 | "prompt": "instruction", 55 | "response": "output", 56 | "system": "system" 57 | } 58 | }, 59 | "Qwen-2.5-7B_sample30_thresh1.0_prm": { 60 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_qwenstage2.jsonl", 61 | "columns": { 62 | 
"prompt": "instruction", 63 | "response": "output", 64 | "system": "system" 65 | } 66 | }, 67 | "Qwen-2.5-7B_sample30_thresh1.0_checkstage3_prm": { 68 | "file_name": "musique-qwen2.5_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 69 | "columns": { 70 | "prompt": "instruction", 71 | "response": "output", 72 | "system": "system" 73 | } 74 | }, 75 | "Qwen-2.5-7B_sample30_thresh1.0_yarn_prm": { 76 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_yarn_qwenstage2.jsonl", 77 | "columns": { 78 | "prompt": "instruction", 79 | "response": "output", 80 | "system": "system" 81 | } 82 | }, 83 | "Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm": { 84 | "file_name": "musique-qwen2.5_sample100temp0.7thresh1.0_yarn_factcheck_stage3.jsonl", 85 | "columns": { 86 | "prompt": "instruction", 87 | "response": "output", 88 | "system": "system" 89 | } 90 | }, 91 | "Llama-3.1-8B_sample30_thresh1.0_v2_checkstage3_prm": { 92 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_v2_factcheck_stage3.jsonl", 93 | "columns": { 94 | "prompt": "instruction", 95 | "response": "output", 96 | "system": "system" 97 | } 98 | }, 99 | "Llama-3.1-8B_sample30_thresh1.0_v2_prm": { 100 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_llamastage2.jsonl", 101 | "columns": { 102 | "prompt": "instruction", 103 | "response": "output", 104 | "system": "system" 105 | } 106 | }, 107 | "Qwen-2.5-14B_warmup": { 108 | "file_name": "musique_train_warmup_qwen14b.jsonl", 109 | "columns": { 110 | "prompt": "instruction", 111 | "response": "output", 112 | "system": "system" 113 | } 114 | }, 115 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_prm": { 116 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_qwen14bstage2.jsonl", 117 | "columns": { 118 | "prompt": "instruction", 119 | "response": "output", 120 | "system": "system" 121 | } 122 | }, 123 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_checkstage3_prm": { 124 | "file_name": "musique-qwen14b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 125 | "columns": { 126 | "prompt": "instruction", 127 | "response": "output", 128 | "system": "system" 129 | } 130 | }, 131 | "Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm": { 132 | "file_name": "musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 133 | "columns": { 134 | "prompt": "instruction", 135 | "response": "output", 136 | "system": "system" 137 | } 138 | } 139 | } -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv1big.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 4 | #model_config="${model_root}/tokenizer*" 5 | #model_path="${model_root}/checkpoint-58" 6 | mode="cot" 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv1" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | #cp ${model_config} ${model_path} 15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models 16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} 17 | 18 | CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \ 19 | --served-model-name ${model_name} \ 20 | --model ${model_path} \ 21 | --tensor-parallel-size=2 \ 22 | --max_model_len 25000\ 23 | --swap-space 32\ 24 | --trust-remote-code \ 25 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 & 26 | 27 | CUDA_VISIBLE_DEVICES=2,3 
python -m vllm.entrypoints.openai.api_server \ 28 | --served-model-name ${model_name} \ 29 | --model ${model_path} \ 30 | --tensor-parallel-size=2 \ 31 | --max_model_len 25000\ 32 | --swap-space 32\ 33 | --trust-remote-code \ 34 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 & 35 | 36 | CUDA_VISIBLE_DEVICES=4,5 python -m vllm.entrypoints.openai.api_server \ 37 | --served-model-name ${model_name} \ 38 | --model ${model_path} \ 39 | --tensor-parallel-size=2 \ 40 | --max_model_len 25000\ 41 | --swap-space 32\ 42 | --trust-remote-code \ 43 | --port 8002 > ../log/vllm_${model_name}_gpu2.log 2>&1 & 44 | 45 | CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server \ 46 | --served-model-name ${model_name} \ 47 | --model ${model_path} \ 48 | --tensor-parallel-size=2 \ 49 | --max_model_len 25000\ 50 | --swap-space 32\ 51 | --trust-remote-code \ 52 | --port 8003 > ../log/vllm_${model_name}_gpu3.log 2>&1 & 53 | 54 | sleep 30 # sleep 30s, wait for the servers to start 55 | 56 | 57 | for domain in "${domain_list[@]}"; do 58 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" #"musique_${mode}.jsonl" 59 | for file_name in "${file_name_list[@]}"; do 60 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 61 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 62 | result_dir="./pred_cot_vs_nocot" 63 | output_dir="${result_dir}/${model_name}" 64 | mkdir -p ${output_dir} 65 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 66 | 67 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 68 | eval_data_dir="${eval_data_dir%/}" 69 | 70 | echo "Launching inference for ${model_name}..." 71 | echo "Model path: ${model_path}" 72 | echo "Eval data dir: ${eval_data_dir}" 73 | echo "File name: ${file_name}" 74 | echo "Sample num: ${sample_num}" 75 | echo "Result dir: ${result_dir}" 76 | echo "Temperature: ${temperature}" 77 | echo "COT mode: ${cot_mode}" 78 | echo "Dataset: ${eval_dataset_name}" 79 | 80 | echo "Evaluating ${eval_dataset_name}..." 
81 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 82 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 83 | 84 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 85 | --model ${model_name} \ 86 | --model_path ${model_path} \ 87 | --data_path ${eval_data_dir}/${file_name} \ 88 | --output_path ${path_to_inference_output} \ 89 | --sample_num ${sample_num} \ 90 | --world_size 4 \ 91 | --dataset_name ${eval_dataset_name} \ 92 | --temperature ${temperature} \ 93 | > ./inference2.out 94 | 95 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 96 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 97 | 98 | done 99 | done 100 | pkill -f vllm; pkill -f spawn_main -------------------------------------------------------------------------------- /evaltoolkits/step3_eval_f1.py: -------------------------------------------------------------------------------- 1 | import json 2 | import json_repair 3 | import argparse 4 | import numpy as np 5 | import jsonlines 6 | 7 | from pathlib import Path 8 | from typing import List 9 | from tqdm import tqdm 10 | 11 | from utils import load_jsonl_file 12 | from metrics import ( 13 | qa_f1_score, 14 | rouge_zh_score, 15 | qa_f1_zh_score, 16 | rouge_score, 17 | classification_score, 18 | retrieval_score, 19 | retrieval_zh_score, 20 | count_score, 21 | code_sim_score, 22 | qa_recall_score, 23 | babiq3_score, 24 | ruler_score, 25 | babi_score, 26 | accuracy_score 27 | ) 28 | 29 | dataset2metric = { 30 | "narrativeqa": qa_f1_score, 31 | "qasper": qa_f1_score, 32 | "Ruler": ruler_score, 33 | "MQA-Medium": accuracy_score, 34 | "MQA-Medium-v2": accuracy_score, 35 | "babiq3": qa_f1_score, 36 | "Babi": babi_score, 37 | "Babiq3": babiq3_score, 38 | "MQA": accuracy_score, 39 | "ICL": accuracy_score, 40 | "LIL": accuracy_score, 41 | "LSDU": accuracy_score, 42 | "NIAH": accuracy_score, 43 | "BABILong": accuracy_score, 44 | "LHU": accuracy_score, 45 | "CRU": accuracy_score, 46 | "SQA": accuracy_score, 47 | "SQA-Medium-v2": accuracy_score, 48 | "multifieldqa": qa_f1_score, 49 | "multifieldqa_en": qa_f1_score, 50 | "multifieldqa_zh": qa_f1_zh_score, 51 | "hotpotqa": qa_f1_score, 52 | "2wikimqa": qa_f1_score, 53 | "musique": qa_f1_score, 54 | "dureader": rouge_zh_score, 55 | "gov_report": rouge_score, 56 | "gov": rouge_score, 57 | "qmsum": rouge_score, 58 | "multi_news": rouge_score, 59 | "multi": rouge_score, 60 | "vcsum": rouge_zh_score, 61 | "trec": classification_score, 62 | "triviaqa": qa_f1_score, 63 | "samsum": rouge_score, 64 | "lsht": classification_score, 65 | "passage_retrieval_en": retrieval_score, 66 | "passage_count": count_score, 67 | "passage_retrieval_zh": retrieval_zh_score, 68 | "lcc": code_sim_score, 69 | "repobench-p": code_sim_score, 70 | } 71 | 72 | def parse_args(args=None): 73 | parser = argparse.ArgumentParser() 74 | parser.add_argument('--path_to_src_file', type=str, default=None) 75 | return parser.parse_args(args) 76 | 77 | def get_score_list(dataset:str, predictions:List[str], answers:List[str],**kwargs) -> List[float]: 78 | score_list: List[float] = [] 79 | for pred in predictions: 80 | if dataset=="Ruler": 81 | score=1 82 | else: 83 | score=0 84 | for answer in answers: 85 | if dataset=="Ruler": 86 | score = min(dataset2metric[dataset](pred, answer,**kwargs),score) 87 | else: 88 | score = 
max(dataset2metric[dataset](pred, answer,**kwargs),score) 89 | score_list.append(score) 90 | return score_list 91 | 92 | def main(): 93 | args = parse_args() 94 | data_list = load_jsonl_file(args.path_to_src_file) 95 | best_score_list = [] 96 | file_name = Path(args.path_to_src_file).name 97 | dataset = file_name.split('.')[0].split("-")[0] 98 | print(f"Eval {dataset}") 99 | for data in tqdm(data_list, desc="Calculating F1 score"): 100 | extracted_pred_list:List[str] = data["extracted_pred_list"] 101 | answers = data["answers"] 102 | if type(answers) !=type(["!"]): 103 | answers=[answers] 104 | if "all_classes" in data.keys(): 105 | all_classes=data['all_classes'] 106 | else: 107 | all_classes=[] 108 | score_list = get_score_list(dataset, extracted_pred_list, answers,all_classes=all_classes) 109 | best_score_in_this_data = max(score_list) 110 | best_score_list.append(best_score_in_this_data) 111 | data["f1_score_list"] = score_list 112 | final_score = np.mean(best_score_list) # *100 and round to 2 decimal places 113 | final_score = round(final_score*100, 2) 114 | print(f"Final score: {final_score}") 115 | with jsonlines.open(args.path_to_src_file, mode='w') as writer: 116 | writer.write_all(data_list) 117 | data_list_noinstr = [{k:v for k,v in data.items() if k!="instruction"} for data in data_list] 118 | with jsonlines.open(args.path_to_src_file.replace(".jsonl","_noinstr.jsonl"), mode='w') as writer: 119 | writer.write_all(data_list_noinstr) 120 | 121 | # add to result.json, overwrite the value if it already exists 122 | # check if result.json exists 123 | result_path = Path(args.path_to_src_file).parent / "result.json" 124 | if result_path.exists(): 125 | with open(result_path, 'r') as f: 126 | result = json.load(f) 127 | else: 128 | result = {} 129 | result[file_name] = final_score 130 | with open(result_path, 'w') as f: 131 | json.dump(result, f, ensure_ascii=False, indent=4) 132 | print(f"Result saved in {result_path}") 133 | 134 | if __name__ == '__main__': 135 | main() 136 | 137 | 138 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 |
3 |
4 | 5 | # 📖 Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision 6 | 7 |
10 | 
 11 | **LongRePS** tackles quality bottlenecks in CoT reasoning for extended contexts by integrating process supervision. As shown in the figure, we find that in complex task scenarios, chain-of-thought reasoning consistently improves model performance. We further observe that, although the benefit of vanilla CoT grows with context length, self-sampled reasoning paths exhibit significant inconsistency and hallucination risks, especially in multi-hop QA and other complex scenarios. 
 12 | 
 13 | 
 14 | The framework operates in two phases: (1) **Self-sampling** generates diverse CoT candidates to capture reasoning variability, and (2) **Context-aware assessment** enforces answer correctness, grounding via text matching, and intrinsic consistency via LLM-based scoring. 
 15 | 
 16 | 
 17 | Evaluations on long-context tasks show that LongRePS achieves gains of 13.6/3.8 points on MuSiQue (LLaMA/Qwen) together with cross-task robustness, outperforming outcome supervision. These results establish process supervision as pivotal for scalable long-context reasoning, and the open-source code enables community adoption. 
 18 | *** 
 19 |  
 20 | *** 
 21 |  
 22 | 
 23 | 
 24 | ## 🔥 News 
 25 | **[2025/03/03]** Released training and evaluation data for **LongRePS**. The model parameters and complete code will be available soon. 
 26 | 
 27 | ## 🔍 List of Contents 
 28 | - [🔨 Requirements](#requirements) 
 29 | - [⚙️ How to Prepare Data for Training](#how-to-Prepare-Data-for-Training) 
 30 | - [🖥️ How to Prepare Data for Evaluating](#how-to-Prepare-Data-for-Evaluating) 
 31 | - [🍧 Training](#training) 
 32 | - [📊 Evaluation](#evaluation) 
 33 | - [📄 Acknowledgement](#acknowledgement) 
 34 | 
 35 | 
 36 | 
 37 | ## 🔨 Requirements 
 38 | 
 39 | **Install LLaMA-Factory** 
 40 | 
 41 | Please refer to this tutorial for [installation](https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/installation.html). 
 42 | Or you can use the following command: 
 43 | ```bash 
 44 | git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git 
 45 | cd LLaMA-Factory 
 46 | pip install -e ".[torch,metrics]" 
 47 | ``` 
 48 | 
 49 | **Install Other Supporting Libraries** 
 50 | 
 51 | ```bash 
 52 | cd .. 
53 | git clone https://github.com/lemon-prog123/LongRePS.git 
 54 | cd LongRePS 
 55 | pip install -r requirements.txt 
 56 | ``` 
 57 | 
 58 | 
 59 | 
 60 | ## ⚙️ How to Prepare Data for Training 
 61 | 
 62 | **Llama-3.1-8B**: 
 63 | ```python 
 64 | from datasets import load_dataset 
 65 | import jsonlines 
 66 | model="Llama-3.1-8B" 
 67 | dataset = load_dataset("Lemon123prog/Llama-3.1-8B-LongRePS") 
 68 | warmup_data=dataset['warmup'].to_list() 
 69 | orm_data=dataset['train_orm'].to_list() 
 70 | prm_data=dataset['train_prm'].to_list() 
 71 | 
 72 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer: 
 73 | writer.write_all(warmup_data) 
 74 | 
 75 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer: 
 76 | writer.write_all(orm_data) 
 77 | 
 78 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer: 
 79 | writer.write_all(prm_data) 
 80 | ``` 
 81 | 
 82 | **Qwen-2.5-7B**: 
 83 | ```python 
 84 | from datasets import load_dataset 
 85 | import jsonlines 
 86 | model="Qwen-2.5-7B" 
 87 | dataset = load_dataset("Lemon123prog/Qwen-2.5-7B-LongRePS") 
 88 | warmup_data=dataset['warmup'].to_list() 
 89 | orm_data=dataset['train_orm'].to_list() 
 90 | prm_data=dataset['train_prm'].to_list() 
 91 | 
 92 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer: 
 93 | writer.write_all(warmup_data) 
 94 | 
 95 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer: 
 96 | writer.write_all(orm_data) 
 97 | 
 98 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer: 
 99 | writer.write_all(prm_data) 
 100 | ``` 
 101 | 
 102 | Or you can simply run [preprocess_train.py](preprocess_train.py): 
 103 | ```bash 
 104 | python preprocess_train.py 
 105 | ``` 
 106 | 
 107 | 
 108 | 
 109 | ## 🖥️ How to Prepare Data for Evaluating 
 110 | 
 111 | ```bash 
 112 | bash scripts/preprocess_lb.sh 
 113 | ``` 
 114 | Then you will obtain the processed evaluation data in the **dataset** directory. 
 115 | 
 116 | 
 117 | 
 118 | ## 🍧 Training 
 119 | 
 120 | ### Download base models 
 121 | 
 122 | ```python 
 123 | from huggingface_hub import snapshot_download 
 124 | from pathlib import Path 
 125 | repo_id ="Qwen/Qwen2.5-7B" 
 126 | root_dir = Path("Your own path for Qwen") 
 127 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model") 
 128 | 
 129 | repo_id ="meta-llama/Llama-3.1-8B" 
 130 | root_dir = Path("Your own path for Llama") 
 131 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model") 
 132 | ``` 
 133 | 
 134 | Set **Model_Path** in the scripts before training. 
 135 | 
 136 | ### Warm Up Stage 
 137 | 
 138 | **Llama-3.1-8B** 
 139 | ```bash 
 140 | bash scripts/llama_warmup.sh 
 141 | ``` 
 142 | 
 143 | **Qwen-2.5-7B** 
 144 | ```bash 
 145 | bash scripts/qwen_warmup.sh 
 146 | ``` 
 147 | 
 148 | ### Sample Data and Fine-tune Models 
 149 | 
 150 | Set **Model-Name**, **Model-Path**, and **File-Name** in the scripts before sampling. 
 151 | ```bash 
 152 | cd evaltoolkits 
 153 | bash loop_sample.sh 
 154 | ``` 
 155 | 
 156 | After the sampling process, you can use [filter_data.py](evaltoolkits/filter_data.py) to launch the filtering framework. 
 157 | 
 158 | ```bash 
 159 | cd evaltoolkits 
 160 | python filter_data.py \ 
 161 | --path_to_src_file [Sampling Data] \ 
 162 | --path_to_stage1_file [Output Data Path] 
 163 | ``` 
 164 | 
 165 | Add the **filtered dataset** to [dataset_info.json](data/dataset_info.json) so that it appears in the dataset list; a sketch of how to register it is given below. 
 166 | 
 167 | Finally, set the **warm-up model path** and **dataset_name** in the scripts to launch the fine-tuning process. 
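Below is a minimal sketch of how a filtered dataset could be registered programmatically. The dataset name (`musique_filtered_prm`), the file name, and the column mapping are placeholder assumptions for illustration only; align the actual fields with the existing entries in [dataset_info.json](data/dataset_info.json) and the LLaMA-Factory dataset conventions.

```python
import json
from pathlib import Path

# Hypothetical names -- replace with the file actually produced by filter_data.py.
dataset_name = "musique_filtered_prm"
filtered_file = "musique_filtered_prm.jsonl"

info_path = Path("data/dataset_info.json")
info = json.loads(info_path.read_text(encoding="utf-8"))

# Register the filtered dataset so the SFT scripts can refer to it by name.
info[dataset_name] = {
    "file_name": filtered_file,
    "columns": {
        "prompt": "instruction",  # assumed field names; match your jsonl schema
        "response": "output",
    },
}

info_path.write_text(json.dumps(info, ensure_ascii=False, indent=4), encoding="utf-8")
print(f"Registered {dataset_name} in {info_path}")
```

Once the entry exists, point **dataset_name** in the SFT script (e.g. `scripts/llama_sft.sh` or `scripts/qwen_sft.sh`) at the same name.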
168 | 169 | **Llama-3.1-8B** 170 | ```bash 171 | bash scripts/llama_sft.sh 172 | ``` 173 | 174 | **Qwen-2.5-7B** 175 | ```bash 176 | bash scripts/qwen_sft.sh 177 | ``` 178 | 179 | 180 | 181 | ## 📊 Evaluation 182 | 183 | **LongBench v1** 184 | ```bash 185 | cd evaltoolkits 186 | bash launch_lbv1.sh 187 | ``` 188 | 189 | **LongBench v2** 190 | ```bash 191 | cd evaltoolkits 192 | bash launch_lbv2.sh 193 | ``` 194 | 195 | Note: Set **model_path** and **mode** to the desired target model. 196 | 197 | 198 | 199 | ## 📝 Citation 200 | ``` 201 | @article{zhu2025chain, 202 | title={Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision}, 203 | author={Zhu, Dawei and Wei, Xiyu and Zhao, Guangxiang and Wu, Wenhao and Zou, Haosheng and Ran, Junfeng and Wang, Xun and Sun, Lin and Zhang, Xiangzheng and Li, Sujian}, 204 | journal={arXiv preprint arXiv:2502.20790}, 205 | year={2025} 206 | } 207 | ``` 208 | ## 📄 Acknowledgement 209 | We are deeply thankful for the following projects that serve as the foundation for LongRePS: 210 | 211 | * [**SEALONG**](https://github.com/SihengLi99/SEALONG) 212 | * [**LongBench**](https://github.com/THUDM/LongBench) 213 | * [**LLaMA-Factory**](https://github.com/hiyouga/LLaMA-Factory) 214 | * [**360-LLaMA-Factory**](https://github.com/Qihoo360/360-LLaMA-Factory) 215 | 216 | -------------------------------------------------------------------------------- /config/prompt.py: -------------------------------------------------------------------------------- 1 | prompt_lbv1_cot=""" 2 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as a concise answer to the question: 3 | 4 | **Instructions:** 5 | Step 1. **Reasoning:** Imagine you are a student who has no prior knowledge about the giving context. Your task is to answer the questions based solely on the information presented here. First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information. 6 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. Ensure that your answer should be brief and to the point. 7 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, as concise and to the point as possible. 8 | 9 | Illustrative Examples: 10 | 11 | Example #1: 12 | 13 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...] 14 | **Question:** What is Saltram's living situation? 
15 | 16 | **Response:** 17 | {{ 18 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.", 19 | "answer": "He is a guest in the home of the Mulvilles." 20 | }} 21 | 22 | Example #2: 23 | 24 | **Context:** [... The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas ... Houston Baptist University, affiliated with the Baptist General Convention of Texas, offers bachelor's and graduate degrees. It was founded in 1960 ...] 25 | **Question:** When was the institute that owned The Collegian founded? 26 | 27 | **Response:** 28 | {{ 29 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about the founding date of the institute that owned The Collegian. In the document, I can first locate that [Excerpt 1] `The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas`, so I need to look for information about Houston Baptist University. I find that [Excerpt 2] `Houston Baptist University was founded in 1960`. Therefore, the institute that owned The Collegian was founded in 1960.", 30 | "answer": "1960" 31 | }} 32 | 33 | 34 | Now, based on the context provided below, answer the question as concisely as you can, using a single phrase or sentence if possible. Meanwhile, reasoning must comply with the original text, and any knowledge should be derived from the original text. 35 | 36 | **Context:** {context} 37 | **Question:** {question} 38 | 39 | **Response:** 40 | """ 41 | 42 | prompt_lbv1_nocot=""" 43 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nContext:{context}\n\nNow, answer the question based on the context as concisely as you can, using a single phrase if possible. Do not provide any explanation and only give the best answer once.\n\nQuestion:{question}\n\nAnswer:""" 44 | 45 | 46 | prompt_lbv2_cot=""" 47 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as a answer chosen from ABCD options to the question: 48 | 49 | **Instructions:** 50 | Step 1. **Reasoning:** First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information. 51 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. 
Ensure that your answer should be brief and to the point. 52 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, which is a choice selected from the ABCD options. 53 | 54 | Illustrative Examples: 55 | 56 | Example #1: 57 | 58 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...] 59 | **Question:** What is Saltram's living situation? 60 | **Choices:** 61 | (A) He is a guest in the home of the Mulvilles. 62 | (B) He is in a hotel. 63 | (C) He is homeless now. 64 | (D) Unkonwn 65 | 66 | **Response:** 67 | {{ 68 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.", 69 | "answer": "A" 70 | }} 71 | 72 | Now, based on the context provided below, answer the question with a choice selected from ABCD. 73 | 74 | **Context:** {context} 75 | **Question:** {question} 76 | **Choices:** 77 | (A) {choice_A} 78 | (B) {choice_B} 79 | (C) {choice_C} 80 | (D) {choice_D} 81 | 82 | **Response:** 83 | """ 84 | 85 | prompt_lbv2_nocot=""" 86 | Please read the following text and answer the question below. 87 | 88 | {context} 89 | 90 | What is the correct answer to this question: {question} 91 | Choices: 92 | (A) {choice_A} 93 | (B) {choice_B} 94 | (C) {choice_C} 95 | (D) {choice_D} 96 | 97 | Format your response as follows: "The correct answer is (insert answer here)". 98 | """ 99 | 100 | prompt_cot="""Given a document and a question, answer concisely using a single phrase and provide a brief reasoning process. \n\nContext:{context}\n\n Now, answer the question based on the context as concisely as you can and give a reasoning paths, using a single phrase if possible. 
\n\nQuestion:{question}\n\n Format your response as: 101 | Answer: [] 102 | Reasoning: [] 103 | Ensure both sections are separated clearly for easy extraction.""" -------------------------------------------------------------------------------- /evaltoolkits/step1_eval_inference.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import json 4 | import argparse 5 | import torch.distributed as dist 6 | import numpy as np 7 | import random 8 | import torch.multiprocessing as mp 9 | import time 10 | 11 | from openai import OpenAI 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from pydantic import BaseModel 15 | from typing import List 16 | 17 | from utils import load_jsonl_file, check_pred_fact_consistency 18 | 19 | class Response(BaseModel): 20 | reasoning: str 21 | answer: str 22 | 23 | 24 | def seed_everything(seed): 25 | torch.manual_seed(seed) 26 | torch.cuda.manual_seed(seed) 27 | np.random.seed(seed) 28 | random.seed(seed) 29 | torch.backends.cudnn.benchmark = False 30 | torch.backends.cudnn.deterministic = True 31 | torch.cuda.manual_seed_all(seed) 32 | 33 | def parse_args(args=None): 34 | parser = argparse.ArgumentParser() 35 | parser.add_argument('--model', type=str, default=None) 36 | parser.add_argument('--test', action='store_true', help="Evaluate on test mode") 37 | parser.add_argument('--model_path', type=str, default=None) 38 | parser.add_argument('--data_path', type=str, default=None) 39 | parser.add_argument('--output_path', type=str, default=None) 40 | parser.add_argument('--dataset_name', type=str, default=None) 41 | parser.add_argument('--sample_num', type=int, default=30) 42 | parser.add_argument('--world_size', type=int, default=8) 43 | parser.add_argument('--max_gen', type=int, default=512) 44 | parser.add_argument('--gpt', action='store_true', help="Evaluate on test mode") 45 | parser.add_argument('--temperature', type=float, default=0) 46 | return parser.parse_args(args) 47 | 48 | def get_api_results(model_name, prompt, gpu_id, sample_num, max_gen, temp,gpt=False): 49 | max_retries = 5 50 | response_list = [] 51 | # json_schema = Response.model_json_schema() #TODO 52 | for i in range(max_retries): 53 | if gpt: 54 | api_key = os.getenv("OPENAI_API_KEY") 55 | client = OpenAI(api_key=api_key, base_url="Your Online Model URL") 56 | else: 57 | client = OpenAI(api_key="EMPTY", base_url=f"http://localhost:800{gpu_id}/v1") 58 | try: 59 | ''' 60 | response=client.completions.create( 61 | prompt=prompt, 62 | #messages=[ 63 | # {"role":"system", "content": "You are a helpful assistant."}, 64 | # {"role": "user","content": prompt} 65 | #], 66 | model=model_name, 67 | temperature=temp, 68 | n=sample_num, 69 | stop=["}"], 70 | max_tokens=max_gen, 71 | # extra_body={"guided_json": json_schema}, 72 | ) 73 | 74 | for choice in response.choices: 75 | #response_list.append(choice.message.content) 76 | response_list.append(choice.text) 77 | return response_list 78 | ''' 79 | 80 | response=client.chat.completions.create( 81 | #prompt=prompt, 82 | messages=[ 83 | {"role":"system", "content": "You are a helpful assistant."}, 84 | {"role": "user","content": prompt} 85 | ], 86 | #stop=["}","."], 87 | model=model_name, 88 | temperature=temp, 89 | n=sample_num, 90 | max_tokens=max_gen, 91 | # extra_body={"guided_json": json_schema}, 92 | ) 93 | 94 | for choice in response.choices: 95 | response_list.append(choice.message.content) 96 | #response_list.append(choice.text) 97 | return response_list 98 | except 
Exception as e: 99 | print(e) 100 | time.sleep(50) 101 | return None 102 | 103 | def get_pred_from_vllm(rank, data, max_gen, model_name, out_path, sample_num,lock, temp,gpt): 104 | # print("Temp: ",temp) 105 | if gpt: 106 | print("Eval On ",model_name) 107 | for json_obj in tqdm(data): 108 | prompt = json_obj['instruction'] 109 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt) 110 | def check_pred_validity(pred:str, prompt): 111 | if prompt.endswith("Answer:") or prompt.endswith("Type:") or prompt.endswith("Summary: ") or prompt.endswith("Answer:\n") or prompt.endswith("\".\n"): 112 | return True 113 | if "\"answer\"" not in pred: 114 | return False 115 | return True 116 | 117 | if preds==None: 118 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3,gpt=gpt) 119 | if new_preds==None: 120 | continue 121 | else: 122 | preds=new_preds 123 | 124 | check_flag=False 125 | if len(preds) == 1: 126 | if not check_pred_validity(preds[0], prompt): 127 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3) 128 | if new_preds!=None: 129 | for pred in new_preds: 130 | if check_pred_validity(pred, prompt): 131 | preds = [pred] 132 | check_flag=True 133 | break 134 | else: 135 | check_flag=True 136 | 137 | if not check_pred_validity(preds[0], prompt): 138 | new_preds = get_api_results(model_name, prompt, rank, 10, max_gen=max_gen, temp=0.3) 139 | if new_preds!=None: 140 | for pred in new_preds: 141 | if check_pred_validity(pred, prompt): 142 | preds = [pred] 143 | check_flag=True 144 | break 145 | else: 146 | continue 147 | else: 148 | check_flag=True 149 | 150 | if "answers" in json_obj.keys(): 151 | instruction, answers, _id = json_obj["instruction"], json_obj["answers"], json_obj["id"] 152 | else: 153 | instruction, answers, _id = json_obj["instruction"], [json_obj["output"]], json_obj["id"] 154 | if "all_classes" in json_obj.keys(): 155 | all_classes=json_obj['all_classes'] 156 | else: 157 | all_classes=[] 158 | 159 | try: 160 | question=json_obj['question'] 161 | except: 162 | question = instruction.split("Question:")[-1].replace("\n\nAnswer:","").replace("*","").strip() 163 | with lock: 164 | with open(out_path, "a", encoding="utf-8") as f: 165 | json.dump({"pred": preds, "instruction": instruction, "question":question, "answers": answers, "id":_id,"check_flag":str(check_flag) ,"all_classes": all_classes, "length": 0}, f, ensure_ascii=False) 166 | f.write('\n') 167 | dist.destroy_process_group() 168 | return 169 | 170 | if __name__ == '__main__': 171 | print(os.getpid()) 172 | seed_everything(42) 173 | args = parse_args() 174 | world_size = args.world_size #torch.cuda.device_count() 175 | print("Wold Size ",world_size) 176 | mp.set_start_method('fork', force=True) 177 | model_name = args.model 178 | dataset_name = args.dataset_name 179 | 180 | sources = load_jsonl_file(args.data_path) 181 | data_all = [data_sample for data_sample in sources] 182 | data_subsets = [data_all[i::world_size] for i in range(world_size)] 183 | 184 | out_path = Path(args.output_path) 185 | out_path.parent.mkdir(parents=True, exist_ok=True) 186 | if out_path.exists(): 187 | out_path.unlink() 188 | processes = [] 189 | lock = mp.RLock() 190 | for rank in range(world_size): 191 | p = mp.Process(target=get_pred_from_vllm, args=(rank, data_subsets[rank], args.max_gen, model_name, out_path, args.sample_num, lock, args.temperature,args.gpt)) 192 | p.start() 193 | processes.append(p) 194 | 195 | for p in processes: 196 
| p.join() -------------------------------------------------------------------------------- /evaltoolkits/metrics.py: -------------------------------------------------------------------------------- 1 | import re 2 | import string 3 | 4 | import jieba 5 | from fuzzywuzzy import fuzz 6 | import difflib 7 | 8 | from typing import List 9 | from collections import Counter 10 | from rouge import Rouge 11 | 12 | def normalize_answer(s): 13 | """Lower text and remove punctuation, articles and extra whitespace.""" 14 | 15 | def remove_articles(text): 16 | return re.sub(r"\b(a|an|the)\b", " ", text) 17 | 18 | def white_space_fix(text): 19 | return " ".join(text.split()) 20 | 21 | def remove_punc(text): 22 | exclude = set(string.punctuation) 23 | return "".join(ch for ch in text if ch not in exclude) 24 | 25 | def lower(text): 26 | return text.lower() 27 | 28 | return white_space_fix(remove_articles(remove_punc(lower(s)))) 29 | 30 | 31 | def normalize_zh_answer(s): 32 | """Lower text and remove punctuation, extra whitespace.""" 33 | 34 | def white_space_fix(text): 35 | return "".join(text.split()) 36 | 37 | def remove_punc(text): 38 | cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏." 39 | all_punctuation = set(string.punctuation + cn_punctuation) 40 | return "".join(ch for ch in text if ch not in all_punctuation) 41 | 42 | def lower(text): 43 | return text.lower() 44 | 45 | return white_space_fix(remove_punc(lower(s))) 46 | 47 | def count_score(prediction, ground_truth, **kwargs): 48 | numbers = re.findall(r"\d+", prediction) 49 | right_num = 0 50 | for number in numbers: 51 | if str(number) == str(ground_truth): 52 | right_num += 1 53 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 54 | return float(final_score) 55 | 56 | def retrieval_score(prediction, ground_truth, **kwargs): 57 | pattern = r'Paragraph (\d+)' 58 | matches = re.findall(pattern, ground_truth) 59 | ground_truth_id = matches[0] 60 | numbers = re.findall(r"\d+", prediction) 61 | right_num = 0 62 | for number in numbers: 63 | if str(number) == str(ground_truth_id): 64 | right_num += 1 65 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 66 | return float(final_score) 67 | 68 | def babi_score(prediction, ground_truth, **kwargs): 69 | 70 | if ground_truth in prediction: 71 | return 1.0 72 | elif prediction==ground_truth: 73 | return 1.0 74 | elif prediction==(ground_truth+')'): 75 | return 1.0 76 | elif prediction==('('+ground_truth+')'): 77 | return 1.0 78 | else: 79 | return 0 80 | 81 | def babiq3_score(prediction, ground_truth, **kwargs): 82 | try: 83 | answer=prediction.split('was in')[1] 84 | except: 85 | answer=prediction 86 | if ground_truth in answer: 87 | return 1.0 88 | else: 89 | return 0 90 | 91 | def ruler_score(prediction, ground_truth, **kwargs): 92 | def postprocess_pred(predict_str: str): 93 | 94 | predict_str = predict_str.strip() 95 | # Remove all non-printable characters 96 | np_pattern = re.compile(r'[\x00-\x1f]') 97 | predict_str = np_pattern.sub('\n', predict_str).strip() 98 | return predict_str 99 | positive_pattern1 = r':\s*(\d+)' 100 | positive_pattern2 = r'is\s*(\d+)' 101 | negative_pattern = r'no \s* magic number' 102 | positive_match1 = re.search(positive_pattern1, prediction) 103 | positive_match2 = re.search(positive_pattern2, prediction) 104 | negative_match = re.search(negative_pattern, prediction) 105 | if negative_match: 106 | return 0 107 | elif positive_match1 or positive_match2: 108 | return 1.0 109 | 110 | if 
ground_truth in prediction: 111 | return 1.0 112 | elif prediction==ground_truth: 113 | return 1.0 114 | elif prediction==(ground_truth+')'): 115 | return 1.0 116 | elif prediction==('('+ground_truth+')'): 117 | return 1.0 118 | elif postprocess_pred(prediction)==ground_truth: 119 | return 1.0 120 | else: 121 | return 0 122 | 123 | def accuracy_score(prediction, ground_truth, **kwargs): 124 | def extract_answer(response): 125 | response = response.replace('*', '') 126 | match = re.search(r'The correct answer is \(([A-D])\)', response) 127 | if match: 128 | return match.group(1) 129 | else: 130 | match = re.search(r'The correct answer is ([A-D])', response) 131 | if match: 132 | return match.group(1) 133 | else: 134 | return None 135 | bool_cnt=0 136 | choice_list=['A','B','C','D'] 137 | pattern1 = f'{ground_truth} ' 138 | pattern3=f'\* {ground_truth}' 139 | pattern4=f'{ground_truth}$' 140 | for choice in choice_list: 141 | if ('('+choice+')') in prediction: 142 | bool_cnt+=1 143 | continue 144 | pattern2 = f'{choice} ' 145 | matches = re.findall(pattern2, prediction) 146 | if matches: 147 | bool_cnt+=1 148 | continue 149 | if bool_cnt>=2: #m choices 150 | return 0 151 | if ('('+ground_truth+')') in prediction: 152 | return 1.0 153 | if ground_truth==prediction: 154 | return 1.0 155 | matches1 = re.findall(pattern1, prediction) 156 | matches2 = re.findall(pattern3, prediction) 157 | matches3 = re.findall(pattern4, prediction) 158 | if matches1 or matches2 or matches3: 159 | return 1.0 160 | else: 161 | return 0 162 | 163 | def retrieval_zh_score(prediction, ground_truth, **kwargs): 164 | pattern = r'段落(\d+)' 165 | matches = re.findall(pattern, ground_truth) 166 | ground_truth_id = matches[0] 167 | numbers = re.findall(r"\d+", prediction) 168 | right_num = 0 169 | for number in numbers: 170 | if str(number) == str(ground_truth_id): 171 | right_num += 1 172 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 173 | return float(final_score) 174 | 175 | def code_sim_score(prediction, ground_truth, **kwargs): 176 | all_lines = prediction.lstrip('\n').split('\n') 177 | prediction = "" 178 | for line in all_lines: 179 | if ('`' not in line) and ('#' not in line) and ('//' not in line): 180 | prediction = line 181 | break 182 | return (fuzz.ratio(prediction, ground_truth) / 100) 183 | 184 | def classification_score(prediction, ground_truth, **kwargs): 185 | #print(prediction) 186 | #if '\n' in prediction: 187 | # prediction = prediction.lstrip('\n').split('\n')[0] 188 | em_match_list = [] 189 | all_classes = kwargs["all_classes"] 190 | for class_name in all_classes: 191 | if class_name in prediction: 192 | em_match_list.append(class_name) 193 | for match_term in em_match_list: 194 | if match_term in ground_truth and match_term != ground_truth: 195 | em_match_list.remove(match_term) 196 | if ground_truth in em_match_list: 197 | score = (1.0 / len(em_match_list)) 198 | else: 199 | score = 0.0 200 | return score 201 | 202 | def rouge_score(prediction, ground_truth, **kwargs): 203 | rouge = Rouge() 204 | try: 205 | scores = rouge.get_scores([prediction], [ground_truth], avg=True) 206 | except: 207 | return 0.0 208 | return scores["rouge-l"]["f"] 209 | 210 | def rouge_zh_score(prediction, ground_truth, **kwargs): 211 | prediction = " ".join(list(jieba.cut(prediction, cut_all=False))) 212 | ground_truth = " ".join(list(jieba.cut(ground_truth, cut_all=False))) 213 | score = rouge_score(prediction, ground_truth) 214 | return score 215 | 216 | def f1_score(prediction, ground_truth, 
**kwargs): 217 | common = Counter(prediction) & Counter(ground_truth) 218 | num_same = sum(common.values()) 219 | if num_same == 0: 220 | return 0 221 | precision = 1.0 * num_same / len(prediction) 222 | recall = 1.0 * num_same / len(ground_truth) 223 | f1 = (2 * precision * recall) / (precision + recall) 224 | return f1 225 | 226 | def recall_score(prediction, ground_truth, **kwargs): 227 | common = Counter(prediction) & Counter(ground_truth) 228 | num_same = sum(common.values()) 229 | if num_same == 0: 230 | return 0 231 | recall = 1.0 * num_same / len(ground_truth) 232 | return recall 233 | 234 | def qa_f1_score(prediction, ground_truth, **kwargs): 235 | normalized_prediction = normalize_answer(prediction) 236 | normalized_ground_truth = normalize_answer(ground_truth) 237 | 238 | prediction_tokens = normalized_prediction.split() 239 | ground_truth_tokens = normalized_ground_truth.split() 240 | return f1_score(prediction_tokens, ground_truth_tokens) 241 | 242 | def qa_recall_score(prediction, ground_truth, **kwargs): 243 | normalized_prediction = normalize_answer(prediction) 244 | normalized_ground_truth = normalize_answer(ground_truth) 245 | 246 | prediction_tokens = normalized_prediction.split() 247 | ground_truth_tokens = normalized_ground_truth.split() 248 | return recall_score(prediction_tokens, ground_truth_tokens) 249 | 250 | 251 | def qa_f1_zh_score(prediction, ground_truth, **kwargs): 252 | prediction_tokens = list(jieba.cut(prediction, cut_all=False)) 253 | ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False)) 254 | prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens] 255 | ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens] 256 | prediction_tokens = [token for token in prediction_tokens if len(token) > 0] 257 | ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0] 258 | return f1_score(prediction_tokens, ground_truth_tokens) 259 | -------------------------------------------------------------------------------- /evaltoolkits/filter_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | import string 11 | import time 12 | import os 13 | import argparse 14 | from typing import List, Tuple 15 | from tqdm import tqdm 16 | from pathlib import Path 17 | from json_repair import repair_json 18 | from transformers import AutoTokenizer, AutoModelForCausalLM 19 | from datasets import Dataset 20 | from openai import OpenAI 21 | 22 | from utils import normalize_answer, preprocess_pred_for_json_repair, extract_fact_list, verify_fact_list, extract_number 23 | def parse_args(args=None): 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('--path_to_src_file', type=str, default=None) 26 | parser.add_argument('--path_to_stage1_file', type=str, default=None) 27 | parser.add_argument('--sample_num', type=int, default=30) 28 | return parser.parse_args(args) 29 | # 我们需要你帮忙评价模型推理过程的质量。模型的接收的输入是一段长文本,以及一个复杂的问题,它的任务是根据问题的需要,从长文本中检索出相关信息(以[Excerpt xxx]的形式开头,包含在``中),并给出正确的答案。现在,我们已经在上面给出了问题和模型的推理过程。模型最终得到的结果是正确的,但是我们需要你来评价模型的推理过程是否合理。请你根据以下几个方面来评价模型的推理过程:- 逻辑性:模型对问题的拆解应当合理。推理过程对于检索到的信息的使用应该符合逻辑,根据检索到的信息得出答案的逻辑链条应该合理。 - 完整性:推理过程应该主要使用从文中检索到的信息,即[Excerpts xxx]后内容,而非模型自身的知识。 - 简洁性:只应当检索回答问题相关的信息,不应罗列过多无关的信息。 30 | 31 | EVAL_PROMPT = '''[Question] 32 | {question} 33 | 34 | [The Start of 
Assistant's Reasoning Path] 35 | {reasoning} 36 | [The End of Assistant's Reasoning Path] 37 | 38 | [System] 39 | We would like to request your feedback on the quality of the reasoning process in the given response. 40 | The model receives a long text input and a complex question. Its task is to retrieve relevant information from the long text (marked as [Excerpt xxx] and enclosed in ``) based on the question's requirements and provide the correct answer. Above, we have provided both the question and the model's reasoning process. While the model's final answer is correct, we need you to evaluate whether its reasoning process is sound. 41 | 42 | Please assess the model's reasoning process based on the following aspects: 43 | 44 | 1. Logical Coherence: 45 | - The model should break down the question appropriately 46 | - The use of retrieved information should follow logical patterns 47 | - The chain of reasoning from retrieved information to the final answer should be sound 48 | 49 | 2. Completeness: 50 | - The reasoning process should primarily rely on information retrieved from the text ([Excerpts xxx]) 51 | - The model should not heavily depend on its own knowledge base 52 | 53 | 3. Conciseness: 54 | - Only information relevant to answering the question should be retrieved 55 | - The model should avoid listing excessive or irrelevant information 56 | 57 | Please rate whether this reasoning path is suitable for the question. The assistant receives an overall score on a scale of 1 to 100, where a higher score indicates better overall performance. 58 | Please note that if the assistant's reasoning process fully meets the above criteria, its overall rating should be full marks (100). 59 | Please first provide a comprehensive explanation of your evaluation, avoiding any potential bias. 60 | Then, output a line indicating the score of the Assistant. 61 | 62 | PLEASE OUTPUT WITH THE FOLLOWING FORMAT, WHERE THE SCORE IS ON A SCALE OF 1 TO 100 BY STRICTLY FOLLOWING THIS FORMAT: "[[score]]", FOR EXAMPLE "Rating: [[100]]": 63 |