├── log └── README.md ├── pics ├── llama.png ├── main_table.png └── combined_plot.png ├── .gitignore ├── config ├── __pycache__ │ ├── prompt.cpython-38.pyc │ └── prompt.cpython-39.pyc └── prompt.py ├── requirements.txt ├── scripts ├── preprocess_lb.sh ├── merge_config.yaml ├── qwen14b_lora.yaml ├── qwen_warmup.sh ├── llama_warmup.sh ├── llama_sft.sh ├── qwen_sft.sh ├── qwen_lora.sh └── lora_sft.sh ├── evaltoolkits ├── cache │ ├── user_config.yaml │ ├── ds_z2_config.json │ ├── ds_z2_offload_config.json │ ├── ds_z3_config.json │ └── ds_z3_offload_config.json ├── filter.sh ├── loop_sample.sh ├── loop_eval.sh ├── utils.py ├── launch_lbv2.sh ├── launch_lbv2m.sh ├── launch_lbv1.sh ├── launch_inference.sh ├── step2_extract_preds_from_raw.py ├── launch_lbv1big.sh ├── step3_eval_f1.py ├── step1_eval_inference.py ├── metrics.py ├── filter_data.py ├── inference2.out └── inference.out ├── preprocess_train.py ├── cache └── ds_z3_config.json ├── preprocess_lbv1.py ├── preprocess_lbv2.py ├── data └── dataset_info.json └── README.md /log/README.md: -------------------------------------------------------------------------------- 1 | This is a folder for log saving. -------------------------------------------------------------------------------- /pics/llama.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/llama.png -------------------------------------------------------------------------------- /pics/main_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/main_table.png -------------------------------------------------------------------------------- /pics/combined_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/pics/combined_plot.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | dataset/ 2 | data/ 3 | saves/ 4 | evaltoolkits/pred*/ 5 | **__pycache__/ 6 | log/ 7 | *.out 8 | **bak/ 9 | **dev/ -------------------------------------------------------------------------------- /config/__pycache__/prompt.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/config/__pycache__/prompt.cpython-38.pyc -------------------------------------------------------------------------------- /config/__pycache__/prompt.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lemon-prog123/LongRePS/HEAD/config/__pycache__/prompt.cpython-39.pyc -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | vllm==0.6.0 2 | vllm-flash-attn==2.6.1 3 | deepspeed==0.15.4 4 | openai 5 | jsonlines 6 | json-repair 7 | fuzzywuzzy 8 | jieba -------------------------------------------------------------------------------- /scripts/preprocess_lb.sh: -------------------------------------------------------------------------------- 1 | mkdir -p dataset/longbenchv1 2 | mkdir -p dataset/longbenchv2 3 | 4 | python preprocess_lbv1.py 5 | python preprocess_lbv2.py 
-------------------------------------------------------------------------------- /evaltoolkits/cache/user_config.yaml: -------------------------------------------------------------------------------- 1 | cache_dir: null 2 | lang: zh 3 | last_model: Qwen2.5-14B 4 | path_dict: 5 | Qwen2.5-14B: /mnt/xiyu/Model/Qwen/Qwen2.5-14B 6 | -------------------------------------------------------------------------------- /evaltoolkits/filter.sh: -------------------------------------------------------------------------------- 1 | python /mnt/xiyu/LongRePS/evaltoolkits/filter_data.py \ 2 | --path_to_src_file /mnt/xiyu/musique-Qwen-2.5-32B.temp0.7sample30.cot_eval.jsonl \ 3 | --path_to_stage1_file /mnt/xiyu/LongRePS/dataset/musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage1.jsonl \ 4 | --sample_num 30 5 | -------------------------------------------------------------------------------- /scripts/merge_config.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1 2 | adapter_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm_train_lr5e-5_maxlen16k_2025-03-30-17-41-39/checkpoint-88 3 | template: qwen 4 | finetuning_type: lora 5 | 6 | export_dir: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2 7 | export_size: 2 8 | export_device: cpu 9 | export_legacy_format: false -------------------------------------------------------------------------------- /evaltoolkits/loop_sample.sh: -------------------------------------------------------------------------------- 1 | eval_model_list=( 2 | "Qwen-7B-Instruct-yarn-example /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct musique-Qwen-2.5-7B_prm_train.jsonl" 3 | ) 4 | model_list=("${eval_model_list[@]}") 5 | 6 | for model in "${model_list[@]}"; do 7 | model_name=$(echo $model | cut -d' ' -f1) 8 | model_path=$(echo $model | cut -d' ' -f2) 9 | file_name=$(echo $model | cut -d' ' -f3) 10 | echo "Launching inference for ${model_name}..." 11 | echo "Model path: ${model_path}" 12 | echo "File name: ${file_name}" 13 | bash launch_inference.sh ${model_name} ${model_path} ${file_name} 14 | done -------------------------------------------------------------------------------- /evaltoolkits/loop_eval.sh: -------------------------------------------------------------------------------- 1 | eval_model_list=( 2 | "Llama-8B-warmup-lr1e-5-epoch1 ../saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10 " 3 | ) 4 | model_list=("${eval_model_list[@]}") 5 | 6 | for model in "${model_list[@]}"; do 7 | model_name=$(echo $model | cut -d' ' -f1) 8 | model_path=$(echo $model | cut -d' ' -f2) 9 | file_name=$(echo $model | cut -d' ' -f3) 10 | echo "Launching inference for ${model_name}..." 
11 | echo "Model path: ${model_path}" 12 | echo "File name: ${file_name}" 13 | bash new_launch_inference.sh ${model_name} ${model_path} ${file_name} 14 | done -------------------------------------------------------------------------------- /scripts/qwen14b_lora.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: /mnt/xiyu/Model/Qwen/Qwen2.5-14B 2 | 3 | stage: sft 4 | do_train: true 5 | finetuning_type: lora 6 | lora_target: all 7 | 8 | dataset: Qwen-2.5-14B_warmup 9 | dataset_dir: data 10 | template: qwen 11 | cutoff_len: 16384 12 | max_samples: 100000 13 | overwrite_cache: true 14 | preprocessing_num_workers: 16 15 | 16 | output_dir: saves/llama3-8b/lora/sft 17 | logging_steps: 5 18 | save_strategy: epoch 19 | plot_loss: true 20 | overwrite_output_dir: true 21 | 22 | per_device_train_batch_size: 1 23 | gradient_accumulation_steps: 4 24 | learning_rate: 1e-5 25 | num_train_epochs: 2.0 26 | lr_scheduler_type: constant_with_warmup 27 | warmup_steps: 5 28 | bf16: true 29 | ddp_timeout: 180000000 -------------------------------------------------------------------------------- /preprocess_train.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import jsonlines 3 | 4 | 5 | def preprocess(model:str): 6 | dataset = load_dataset(f"Lemon123prog/{model}-LongRePS") 7 | warmup_data=dataset['warmup'].to_list() 8 | orm_data=dataset['train_orm'].to_list() 9 | prm_data=dataset['train_prm'].to_list() 10 | 11 | with jsonlines.open(f"./data/{model}_warmup.jsonl", 'w') as writer: 12 | writer.write_all(warmup_data) 13 | 14 | with jsonlines.open(f"./data/musique-{model}_orm_train.jsonl", 'w') as writer: 15 | writer.write_all(orm_data) 16 | 17 | with jsonlines.open(f"./data/musique-{model}_prm_train.jsonl", 'w') as writer: 18 | writer.write_all(prm_data) 19 | 20 | 21 | preprocess("Llama-3.1-8B") 22 | preprocess("Qwen-2.5-7B") -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z2_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 2, 20 | "allgather_partitions": true, 21 | "allgather_bucket_size": 500000000.0, 22 | "overlap_comm": true, 23 | "reduce_scatter": true, 24 | "reduce_bucket_size": 500000000.0, 25 | "contiguous_gradients": true, 26 | "round_robin_gradients": true 27 | } 28 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z2_offload_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | 
"zero_optimization": { 19 | "stage": 2, 20 | "allgather_partitions": true, 21 | "allgather_bucket_size": 500000000.0, 22 | "overlap_comm": true, 23 | "reduce_scatter": true, 24 | "reduce_bucket_size": 500000000.0, 25 | "contiguous_gradients": true, 26 | "round_robin_gradients": true, 27 | "offload_optimizer": { 28 | "device": "cpu", 29 | "pin_memory": true 30 | } 31 | } 32 | } -------------------------------------------------------------------------------- /cache/ds_z3_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | "stage3_gather_16bit_weights_on_model_save": true 29 | } 30 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z3_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | "stage3_gather_16bit_weights_on_model_save": true 29 | } 30 | } -------------------------------------------------------------------------------- /evaltoolkits/cache/ds_z3_offload_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": "auto", 3 | "train_micro_batch_size_per_gpu": "auto", 4 | "gradient_accumulation_steps": "auto", 5 | "gradient_clipping": "auto", 6 | "zero_allow_untested_optimizer": true, 7 | "fp16": { 8 | "enabled": "auto", 9 | "loss_scale": 0, 10 | "loss_scale_window": 1000, 11 | "initial_scale_power": 16, 12 | "hysteresis": 2, 13 | "min_loss_scale": 1 14 | }, 15 | "bf16": { 16 | "enabled": "auto" 17 | }, 18 | "zero_optimization": { 19 | "stage": 3, 20 | "overlap_comm": true, 21 | "contiguous_gradients": true, 22 | "sub_group_size": 1000000000.0, 23 | "reduce_bucket_size": "auto", 24 | "stage3_prefetch_bucket_size": "auto", 25 | "stage3_param_persistence_threshold": "auto", 26 | "stage3_max_live_parameters": 1000000000.0, 27 | "stage3_max_reuse_distance": 1000000000.0, 28 | 
"stage3_gather_16bit_weights_on_model_save": true, 29 | "offload_optimizer": { 30 | "device": "cpu", 31 | "pin_memory": true 32 | }, 33 | "offload_param": { 34 | "device": "cpu", 35 | "pin_memory": true 36 | } 37 | } 38 | } -------------------------------------------------------------------------------- /scripts/qwen_warmup.sh: -------------------------------------------------------------------------------- 1 | model_path="../Model/Qwen/Qwen2.5-7B" 2 | template="qwen" 3 | learning_rate=1e-5 4 | dataset="Qwen-2.5-7B_warmup" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/llama_warmup.sh: -------------------------------------------------------------------------------- 1 | model_path="../Model/meta-llama/Llama-3.1-8B" 2 | template="llama3" 3 | learning_rate=1e-5 4 | dataset="Llama-3.1-8B_warmup" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/llama_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10" 2 | template="llama3" 3 | learning_rate=5e-6 4 | dataset="Llama-3.1-8B_sample30_thresh1.0_v2_prm" 5 | echo 
"Dataname: ${dataset}" 6 | 7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/qwen_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-7B/full/Qwen-2.5-7B_warmup_train_lr1e-5_maxlen16k_2025-03-26-17-19-47/checkpoint-18" 2 | template="qwen" 3 | learning_rate=5e-6 4 | dataset="Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | 10 | llamafactory-cli train \ 11 | --stage sft \ 12 | --do_train True \ 13 | --model_name_or_path ${model_path} \ 14 | --preprocessing_num_workers 16 \ 15 | --finetuning_type full \ 16 | --template ${template} \ 17 | --flash_attn auto \ 18 | --dataset_dir data \ 19 | --dataset ${dataset} \ 20 | --cutoff_len 16384 \ 21 | --learning_rate ${learning_rate} \ 22 | --num_train_epochs 2 \ 23 | --max_samples 100000 \ 24 | --per_device_train_batch_size 1 \ 25 | --gradient_accumulation_steps 4 \ 26 | --lr_scheduler_type constant_with_warmup \ 27 | --max_grad_norm 1.0 \ 28 | --logging_steps 5 \ 29 | --save_strategy epoch \ 30 | --warmup_steps 5 \ 31 | --packing False \ 32 | --save_only_model True \ 33 | --report_to none \ 34 | --output_dir ${output_path} \ 35 | --bf16 True \ 36 | --plot_loss True \ 37 | --ddp_timeout 180000000 \ 38 | --optim adamw_torch \ 39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 40 | done -------------------------------------------------------------------------------- /scripts/qwen_lora.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-14B" 2 | template="qwen" 3 | learning_rate=1e-5 4 | dataset="Qwen-2.5-7B_orm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-14B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | cp /mnt/xiyu/LongRePS/scripts/qwen_lora.sh ${output_path} 10 | 11 | llamafactory-cli train \ 12 | --stage sft \ 13 | --do_train True \ 14 | --model_name_or_path ${model_path} \ 15 | --preprocessing_num_workers 16 \ 16 | --finetuning_type lora \ 17 | --lora_rank 128 \ 18 | --lora_alpha 128 \ 19 | --lora_dropout 0.05 \ 20 | --lora_target all \ 21 | 
--template ${template} \ 22 | --flash_attn auto \ 23 | --dataset_dir data \ 24 | --dataset ${dataset} \ 25 | --cutoff_len 16384 \ 26 | --learning_rate ${learning_rate} \ 27 | --num_train_epochs 2 \ 28 | --max_samples 100000 \ 29 | --per_device_train_batch_size 1 \ 30 | --gradient_accumulation_steps 4 \ 31 | --lr_scheduler_type cosine \ 32 | --max_grad_norm 1.0 \ 33 | --logging_steps 5 \ 34 | --save_strategy epoch \ 35 | --packing False \ 36 | --save_only_model True \ 37 | --report_to none \ 38 | --output_dir ${output_path} \ 39 | --bf16 True \ 40 | --plot_loss True \ 41 | --ddp_timeout 180000000 \ 42 | --optim adamw_torch \ 43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 44 | done -------------------------------------------------------------------------------- /scripts/lora_sft.sh: -------------------------------------------------------------------------------- 1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1" 2 | template="qwen" 3 | learning_rate=5e-5 4 | dataset="Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm" 5 | echo "Dataname: ${dataset}" 6 | 7 | output_path="saves/Qwen2.5-32B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S") 8 | mkdir -p ${output_path} 9 | cp /mnt/xiyu/LongRePS/scripts/lora_sft.sh ${output_path} 10 | 11 | llamafactory-cli train \ 12 | --stage sft \ 13 | --do_train True \ 14 | --model_name_or_path ${model_path} \ 15 | --preprocessing_num_workers 16 \ 16 | --finetuning_type lora \ 17 | --lora_rank 128 \ 18 | --lora_alpha 128 \ 19 | --lora_dropout 0.05 \ 20 | --lora_target all \ 21 | --template ${template} \ 22 | --flash_attn auto \ 23 | --dataset_dir data \ 24 | --dataset ${dataset} \ 25 | --cutoff_len 16384 \ 26 | --learning_rate ${learning_rate} \ 27 | --num_train_epochs 2 \ 28 | --max_samples 100000 \ 29 | --per_device_train_batch_size 1 \ 30 | --gradient_accumulation_steps 4 \ 31 | --lr_scheduler_type cosine \ 32 | --max_grad_norm 1.0 \ 33 | --logging_steps 5 \ 34 | --save_strategy epoch \ 35 | --packing False \ 36 | --save_only_model True \ 37 | --report_to none \ 38 | --output_dir ${output_path} \ 39 | --bf16 True \ 40 | --plot_loss True \ 41 | --ddp_timeout 180000000 \ 42 | --optim adamw_torch \ 43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1 44 | done -------------------------------------------------------------------------------- /preprocess_lbv1.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | 11 | from typing import List 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from transformers import AutoTokenizer, AutoModelForCausalLM 15 | from datasets import Dataset, load_dataset 16 | from config.prompt import prompt_lbv1_cot,prompt_lbv1_nocot,prompt_cot 17 | 18 | def construct_cot_nocot_split(split: str): 19 | data_list=load_dataset('THUDM/LongBench',split, split='test') 20 | new_cot_data_list = [] 21 | new_nocot_data_list = [] 22 | for data in data_list: 23 | context = data["context"] 24 | question = data["input"] 25 | answers = data["answers"] 26 | all_classes=data['all_classes'] 27 | id = data["_id"] 28 | output = answers[0] 29 | instruction_cot = prompt_lbv1_cot.format(context=context, question=question) 30 | instruction_nocot = prompt_lbv1_nocot.format(context=context, question=question) 31 | 
new_cot_data_list.append({"id": id, "question":question,"instruction": instruction_cot, "answers": answers,"all_classes":all_classes ,"output": output, "system": "You are a helpful assistant."}) 32 | new_nocot_data_list.append({"id": id,"question":question,"instruction": instruction_nocot, "answers": answers, "all_classes":all_classes,"output": output, "system": "You are a helpful assistant."}) 33 | print(f"size of {split} new_cot_data_list: {len(new_cot_data_list)}") 34 | print(f"size of {split} new_nocot_data_list: {len(new_nocot_data_list)}") 35 | with jsonlines.open(f"dataset/longbenchv1/{split}_cot.jsonl", 'w') as writer: 36 | writer.write_all(new_cot_data_list) 37 | with jsonlines.open(f"dataset/longbenchv1/{split}_nocot.jsonl", 'w') as writer: 38 | writer.write_all(new_nocot_data_list) 39 | print(f"Finished writing {split} dataset") 40 | return 41 | 42 | 43 | 44 | from datasets import load_dataset 45 | datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa","musique"] 46 | for dataset in datasets: 47 | construct_cot_nocot_split(dataset) 48 | -------------------------------------------------------------------------------- /preprocess_lbv2.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | 11 | from typing import List 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from transformers import AutoTokenizer, AutoModelForCausalLM 15 | from datasets import Dataset, load_dataset 16 | from config.prompt import prompt_lbv2_cot,prompt_lbv2_nocot 17 | 18 | def construct_cot_nocot_split(filtered_data,split="MQA"): 19 | new_cot_data_list = [] 20 | new_nocot_data_list = [] 21 | for item in filtered_data: 22 | context = item["context"] 23 | question=item['question'] 24 | _id=item['_id'] 25 | difficulty=item['difficulty'] 26 | instruction_cot=prompt_lbv2_cot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D']) 27 | instruction_nocot=prompt_lbv2_nocot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D']) 28 | new_cot_data_list.append({"id": id, "instruction": instruction_cot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."}) 29 | new_nocot_data_list.append({"id": id, "instruction": instruction_nocot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."}) 30 | 31 | print(f"size of new_cot_data_list: {len(new_cot_data_list)}") 32 | print(f"size of new_nocot_data_list: {len(new_nocot_data_list)}") 33 | with jsonlines.open(f"dataset/longbenchv2/{split}_cot.jsonl", 'w') as writer: 34 | writer.write_all(new_cot_data_list) 35 | with jsonlines.open(f"dataset/longbenchv2/{split}_nocot.jsonl", 'w') as writer: 36 | writer.write_all(new_nocot_data_list) 37 | 38 | 39 | 40 | split_list=["Single-Document QA","Multi-Document QA"] 41 | split_tag=["SQA","MQA"] 42 | dataset=load_dataset('Lemon123prog/Longmix-LongRePS',split='test') 43 | 44 | for (split,tag) in zip(split_list,split_tag): 45 | filter_data=[] 46 | for data in dataset: 47 | if data['domain']==split and data['token_num']<105*1024: 
#In case of the Bug for over-long data 48 | filter_data.append(data) 49 | construct_cot_nocot_split(filter_data,split=tag) 50 | 51 | -------------------------------------------------------------------------------- /evaltoolkits/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import string 4 | import json_repair 5 | 6 | from pathlib import Path 7 | from typing import Union, List 8 | 9 | # from metrics import normalize_answer 10 | def extract_number(text): 11 | match = re.search(r'\[\[([0-9]*\.?[0-9]+)\]\]', text) 12 | if match: 13 | return float(match.group(1)) 14 | match = re.search(r'\[([0-9]*\.?[0-9]+)\]', text) 15 | if match: 16 | return float(match.group(1)) 17 | return 0.0 18 | 19 | def normalize_answer(s): 20 | """Lower text and remove punctuation, articles and extra whitespace.""" 21 | 22 | def remove_articles(text): 23 | return re.sub(r"\b(a|an|the)\b", " ", text) 24 | 25 | def white_space_fix(text): 26 | return " ".join(text.split()) 27 | 28 | def remove_punc(text): 29 | exclude = set(string.punctuation) 30 | return "".join(ch for ch in text if ch not in exclude) 31 | 32 | def lower(text): 33 | return text.lower() 34 | 35 | return white_space_fix(remove_articles(remove_punc(lower(s)))) 36 | 37 | def load_jsonl_file(path_to_file: Union[str, Path]): 38 | data_list = [] 39 | error_cnt = 0 40 | with open(path_to_file) as f: 41 | for idx, line in enumerate(f): 42 | try: 43 | data = json.loads(line) 44 | data_list.append(data) 45 | except Exception as e: 46 | error_cnt += 1 47 | print(f"Failed loading line {idx}, error: {e}") 48 | print(line) 49 | print(f"Failed loading {error_cnt} lines, total {len(data_list)} lines loaded") 50 | return data_list 51 | 52 | def preprocess_pred_for_json_repair(pred: str): 53 | escaped_str = re.sub( 54 | r'(?<="reasoning": ")(.*?)(?="\s*,\s*\n\s*"answer":)', 55 | lambda match: re.sub(r'(? 
bool: 67 | # Remove punctuation from the instruction 68 | instruction_cleaned = normalize_answer(instruction) 69 | 70 | for fact in fact_list: 71 | # Remove punctuation from the fact 72 | fact_cleaned = normalize_answer(fact) 73 | # Compare the punctuation-stripped fact against the instruction 74 | if fact_cleaned not in instruction_cleaned: 75 | # print(fact) 76 | return False 77 | return True 78 | 79 | def check_pred_fact_consistency(pred: str, instruction: str): 80 | 81 | processed_pred = preprocess_pred_for_json_repair(pred) 82 | content = json_repair.loads(processed_pred) 83 | if not isinstance(content, dict) or not content or 'reasoning' not in content or len(content) > 2 or type(content['reasoning']) != str: 84 | return False 85 | fact_list = extract_fact_list(content['reasoning']) 86 | if len(fact_list) > 0 and verify_fact_list(fact_list, instruction): 87 | return True 88 | return False -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv2.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 4 | mode="cot" 5 | 6 | domain_list=("all") 7 | eval_data_dir="../dataset/longbenchv2" 8 | sample_num=1 9 | temperature=0.0 10 | 11 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Models 12 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} 13 | 14 | 15 | for gpu_id in 0 1 2 3 4 5 6 7; do 16 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 17 | --served-model-name ${model_name} \ 18 | --model ${model_path} \ 19 | --tensor-parallel-size=1 \ 20 | --trust-remote-code \ 21 | --dtype bfloat16 \ 22 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 23 | done 24 | sleep 30 # sleep 30s, wait for the servers to start 25 | 26 | 27 | for domain in "${domain_list[@]}"; do 28 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl") 29 | for file_name in "${file_name_list[@]}"; do 30 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 31 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 32 | result_dir="./pred_cot_vs_nocot" 33 | output_dir="${result_dir}/${model_name}" 34 | mkdir -p ${output_dir} 35 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 36 | 37 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 38 | eval_data_dir="${eval_data_dir%/}" 39 | 40 | echo "Launching inference for ${model_name}..." 41 | echo "Model path: ${model_path}" 42 | echo "Eval data dir: ${eval_data_dir}" 43 | echo "File name: ${file_name}" 44 | echo "Sample num: ${sample_num}" 45 | echo "Result dir: ${result_dir}" 46 | echo "Temperature: ${temperature}" 47 | echo "COT mode: ${cot_mode}" 48 | echo "Dataset: ${eval_dataset_name}" 49 | 50 | echo "Evaluating ${eval_dataset_name}..."
51 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 52 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 53 | 54 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 55 | --model ${model_name} \ 56 | --model_path ${model_path} \ 57 | --data_path ${eval_data_dir}/${file_name} \ 58 | --output_path ${path_to_inference_output} \ 59 | --sample_num ${sample_num} \ 60 | --dataset_name ${eval_dataset_name} \ 61 | --temperature ${temperature} \ 62 | > ./inference2.out 63 | 64 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 65 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 66 | 67 | done 68 | done 69 | pkill -f vllm; pkill -f spawn_main 70 | -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv2m.sh: -------------------------------------------------------------------------------- 1 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 2 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 3 | #model_config="${model_root}/tokenizer*" 4 | #model_path="${model_root}/checkpoint-58" 5 | mode="cot" 6 | 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv2" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | 15 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \ 16 | --served-model-name ${model_name} \ 17 | --model ${model_path} \ 18 | --tensor-parallel-size=4 \ 19 | --trust-remote-code \ 20 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 & 21 | 22 | CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \ 23 | --served-model-name ${model_name} \ 24 | --model ${model_path} \ 25 | --tensor-parallel-size=4 \ 26 | --trust-remote-code \ 27 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 & 28 | 29 | sleep 30 # sleep 30s, wait for the servers to start 30 | 31 | 32 | for domain in "${domain_list[@]}"; do 33 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl") 34 | for file_name in "${file_name_list[@]}"; do 35 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 36 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 37 | result_dir="./pred_cot_vs_nocot" 38 | output_dir="${result_dir}/${model_name}" 39 | mkdir -p ${output_dir} 40 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 41 | 42 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 43 | eval_data_dir="${eval_data_dir%/}" 44 | 45 | echo "Launching inference for ${model_name}..." 46 | echo "Model path: ${model_path}" 47 | echo "Eval data dir: ${eval_data_dir}" 48 | echo "File name: ${file_name}" 49 | echo "Sample num: ${sample_num}" 50 | echo "Result dir: ${result_dir}" 51 | echo "Temperature: ${temperature}" 52 | echo "COT mode: ${cot_mode}" 53 | echo "Dataset: ${eval_dataset_name}" 54 | 55 | echo "Evaluating ${eval_dataset_name}..." 
56 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 57 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 58 | 59 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 60 | --model ${model_name} \ 61 | --model_path ${model_path} \ 62 | --data_path ${eval_data_dir}/${file_name} \ 63 | --output_path ${path_to_inference_output} \ 64 | --sample_num ${sample_num} \ 65 | --dataset_name ${eval_dataset_name} \ 66 | --temperature ${temperature} \ 67 | --world_size 2 \ 68 | > ./inference2.out 69 | 70 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 71 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 72 | 73 | done 74 | done 75 | pkill -f vllm; pkill -f spawn_main -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv1.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-7B-Instruct" 3 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct" 4 | #model_config="${model_root}/tokenizer*" 5 | #model_path="${model_root}/checkpoint-58" 6 | mode="cot" 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv1" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | #cp ${model_config} ${model_path} 15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models 16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} 17 | 18 | 19 | for gpu_id in 0 1 2 3 4 5 6 7; do 20 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 21 | --served-model-name ${model_name} \ 22 | --model ${model_path} \ 23 | --tensor-parallel-size=1 \ 24 | --swap-space 32\ 25 | --trust-remote-code \ 26 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 27 | done 28 | sleep 30 # sleep 30s, wait for the servers to start 29 | 30 | 31 | for domain in "${domain_list[@]}"; do 32 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl" 33 | for file_name in "${file_name_list[@]}"; do 34 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 35 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 36 | result_dir="./pred_cot_vs_nocot" 37 | output_dir="${result_dir}/${model_name}" 38 | mkdir -p ${output_dir} 39 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 40 | 41 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 42 | eval_data_dir="${eval_data_dir%/}" 43 | 44 | echo "Launching inference for ${model_name}..." 45 | echo "Model path: ${model_path}" 46 | echo "Eval data dir: ${eval_data_dir}" 47 | echo "File name: ${file_name}" 48 | echo "Sample num: ${sample_num}" 49 | echo "Result dir: ${result_dir}" 50 | echo "Temperature: ${temperature}" 51 | echo "COT mode: ${cot_mode}" 52 | echo "Dataset: ${eval_dataset_name}" 53 | 54 | echo "Evaluating ${eval_dataset_name}..." 
55 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 56 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 57 | 58 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 59 | --model ${model_name} \ 60 | --model_path ${model_path} \ 61 | --data_path ${eval_data_dir}/${file_name} \ 62 | --output_path ${path_to_inference_output} \ 63 | --sample_num ${sample_num} \ 64 | --dataset_name ${eval_dataset_name} \ 65 | --temperature ${temperature} \ 66 | > ./inference2.out 67 | 68 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 69 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 70 | 71 | done 72 | done 73 | pkill -f vllm; pkill -f spawn_main 74 | -------------------------------------------------------------------------------- /evaltoolkits/launch_inference.sh: -------------------------------------------------------------------------------- 1 | # if model_name, model_path, eval_data_dir, file_name, sample_num, thresh, temperature, filtered_filename, are passed as arguments, then parse these arguments 2 | if [[ $# -eq 8 ]]; then 3 | model_name=$1 4 | model_path=$2 5 | eval_data_dir=$3 6 | file_name=$4 7 | sample_num=$5 8 | thresh=$6 9 | temperature=$7 10 | filtered_filename=$8 11 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation) 12 | else 13 | model_name=$1 14 | model_path=$2 15 | file_name=$3 16 | eval_data_dir="../dataset" 17 | mode="predicted_answer" 18 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation) 19 | if [[ $inference_mode == "train" ]]; then 20 | sample_num=1 21 | thresh=1.0 22 | temperature=0.7 23 | filtered_filename="${model_name}_sample${sample_num}temp${temperature}thresh${thresh}.jsonl" 24 | elif [[ $inference_mode == "eval" ]]; then 25 | sample_num=1 26 | temperature=0.0 27 | fi 28 | 29 | fi 30 | 31 | 32 | 33 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 34 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 35 | result_dir="./pred_${inference_mode}" 36 | output_dir="${result_dir}/${model_name}" 37 | mkdir -p ${output_dir} 38 | echo -e "\nScript executed with parameters: $@" >> ${output_dir}/new_launch_inference.sh 39 | 40 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 41 | eval_data_dir="${eval_data_dir%/}" 42 | 43 | echo "Launching inference for ${model_name}..." 
44 | echo "Model path: ${model_path}" 45 | echo "Eval data dir: ${eval_data_dir}" 46 | echo "File name: ${file_name}" 47 | echo "Inference mode: ${inference_mode}" 48 | echo "Sample num: ${sample_num}" 49 | echo "Result dir: ${result_dir}" 50 | echo "Temperature: ${temperature}" 51 | echo "COT mode: ${cot_mode}" 52 | echo "Dataset: ${eval_dataset_name}" 53 | echo "Filtered filename: ${filtered_filename}" 54 | echo "Thresh: ${thresh}" 55 | 56 | #cp Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Model 57 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Model 58 | 59 | for gpu_id in 0 1 2 3 4 5 6 7; do 60 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \ 61 | --served-model-name ${model_name} \ 62 | --model ${model_path} \ 63 | --tensor-parallel-size=1 \ 64 | --trust-remote-code \ 65 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 & 66 | done 67 | 68 | sleep 30 # sleep 30s, wait for the servers to start 69 | 70 | echo "Evaluating ${eval_dataset_name}..." 71 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 72 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 73 | 74 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 75 | --model ${model_name} \ 76 | --model_path ${model_path} \ 77 | --data_path ${eval_data_dir}/${file_name} \ 78 | --output_path ${path_to_inference_output} \ 79 | --sample_num ${sample_num} \ 80 | --dataset_name ${eval_dataset_name} \ 81 | --temperature ${temperature} \ 82 | > ./inference.out 83 | 84 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 85 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 86 | 87 | pkill -f vllm; pkill -f spawn_main 88 | -------------------------------------------------------------------------------- /evaltoolkits/step2_extract_preds_from_raw.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import json_repair 4 | import argparse 5 | import jsonlines 6 | 7 | from typing import List 8 | from tqdm import tqdm 9 | from utils import load_jsonl_file 10 | 11 | def parse_args(args=None): 12 | parser = argparse.ArgumentParser() 13 | parser.add_argument('--path_to_src_file', type=str, default=None) 14 | return parser.parse_args(args) 15 | 16 | def extract_pred_list(raw_pred_list:List[str]): 17 | extracted_pred_list = [] 18 | for pred in raw_pred_list: 19 | if pred.startswith("```json"): 20 | # remove the begining and ending ```json 21 | pred = pred.replace("```json", "").replace("```", "") 22 | pred = pred.strip() 23 | if pred.startswith("{"): 24 | pred = pred.strip() 25 | try: 26 | content = json_repair.loads(pred) 27 | if type(content)==list: 28 | content=content[0] 29 | content = content["answer"] 30 | extracted_pred_list.append(str(content)) 31 | except Exception as e: 32 | # print(e, pred) 33 | # try to extract the answer from the raw pred, if failed, append the raw pred 34 | # use re to extract the content after "answer: " and before "." 
(inclusive) 35 | try: 36 | #content =re.findall(r'"answer": (.+?)(?=\n|$)', pred)[0].strip() 37 | #print(content) 38 | #content = content.strip('\'"[]') 39 | pattern = r'"answer": "([^"]+)"' 40 | match = re.search(pattern, pred) 41 | content = match.group(1) 42 | #print(content) 43 | # print(f"Extracted re: {content}") 44 | extracted_pred_list.append(content) 45 | except Exception as e2: 46 | extracted_pred_list.append(pred) 47 | else: 48 | # extract plain text format response 49 | # print("extracting plain text format response") 50 | ''' 51 | try: 52 | content = pred.split("Answer:")[1].split("Reasoning:")[0] 53 | except: 54 | try: 55 | content = pred.split("Answer:")[1] 56 | except: 57 | content = pred 58 | try: 59 | content = pred.split("Reasoning:")[0] 60 | except: 61 | content = pred 62 | ''' 63 | try: 64 | content=pred.split("Answer:")[1] 65 | #content=pred.split("Reasoning:")[1] 66 | except: 67 | try: 68 | content=pred.split("Reasoning:")[1] 69 | except: 70 | try: 71 | #print("Pred:",pred) 72 | content=pred.split('\n')[0] 73 | #print("Content:",content) 74 | except: 75 | content=pred 76 | try: 77 | content=content.split("\n")[0] 78 | except: 79 | content=content 80 | extracted_pred_list.append(content) 81 | return extracted_pred_list 82 | 83 | 84 | def main(): 85 | args = parse_args() 86 | data_list = load_jsonl_file(args.path_to_src_file) 87 | for data in tqdm(data_list, desc="Extracting preds"): 88 | extracted_pred_list = extract_pred_list(data["pred"]) 89 | data["extracted_pred_list"] = extracted_pred_list 90 | path_to_tgt_file = args.path_to_src_file.replace(".jsonl", "_eval.jsonl") 91 | with jsonlines.open(path_to_tgt_file, mode='w') as writer: 92 | writer.write_all(data_list) 93 | 94 | if __name__ == "__main__": 95 | main() 96 | 97 | 98 | 99 | -------------------------------------------------------------------------------- /data/dataset_info.json: -------------------------------------------------------------------------------- 1 | { 2 | "Llama-3.1-8B_warmup": { 3 | "file_name": "Llama-3.1-8B_warmup.jsonl", 4 | "columns": { 5 | "prompt": "instruction", 6 | "response": "output", 7 | "system": "system" 8 | } 9 | }, 10 | "Llama-3.1-8B_orm": { 11 | "file_name": "Llama-3.1-8B_orm.jsonl", 12 | "columns": { 13 | "prompt": "instruction", 14 | "response": "output", 15 | "system": "system" 16 | } 17 | }, 18 | "Llama-3.1-8B_sample30_thresh1.0_prm": { 19 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_llamastage2.jsonl", 20 | "columns": { 21 | "prompt": "instruction", 22 | "response": "output", 23 | "system": "system" 24 | } 25 | }, 26 | "Llama-3.1-8B_sample30_thresh1.0_checkstage3_prm": { 27 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 28 | "columns": { 29 | "prompt": "instruction", 30 | "response": "output", 31 | "system": "system" 32 | } 33 | }, 34 | "Qwen-2.5-7B_warmup": { 35 | "file_name": "Qwen-2.5-7B_warmup.jsonl", 36 | "columns": { 37 | "prompt": "instruction", 38 | "response": "output", 39 | "system": "system" 40 | } 41 | }, 42 | "Qwen-2.5-7B_orm": { 43 | "file_name": "Qwen-2.5-7B_orm.jsonl", 44 | "columns": { 45 | "prompt": "instruction", 46 | "response": "output", 47 | "system": "system" 48 | } 49 | } 50 | , 51 | "Qwen-2.5-7B_prm": { 52 | "file_name": "Qwen-2.5-7B_prm.jsonl", 53 | "columns": { 54 | "prompt": "instruction", 55 | "response": "output", 56 | "system": "system" 57 | } 58 | }, 59 | "Qwen-2.5-7B_sample30_thresh1.0_prm": { 60 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_qwenstage2.jsonl", 61 | "columns": { 62 | 
"prompt": "instruction", 63 | "response": "output", 64 | "system": "system" 65 | } 66 | }, 67 | "Qwen-2.5-7B_sample30_thresh1.0_checkstage3_prm": { 68 | "file_name": "musique-qwen2.5_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 69 | "columns": { 70 | "prompt": "instruction", 71 | "response": "output", 72 | "system": "system" 73 | } 74 | }, 75 | "Qwen-2.5-7B_sample30_thresh1.0_yarn_prm": { 76 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_yarn_qwenstage2.jsonl", 77 | "columns": { 78 | "prompt": "instruction", 79 | "response": "output", 80 | "system": "system" 81 | } 82 | }, 83 | "Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm": { 84 | "file_name": "musique-qwen2.5_sample100temp0.7thresh1.0_yarn_factcheck_stage3.jsonl", 85 | "columns": { 86 | "prompt": "instruction", 87 | "response": "output", 88 | "system": "system" 89 | } 90 | }, 91 | "Llama-3.1-8B_sample30_thresh1.0_v2_checkstage3_prm": { 92 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_v2_factcheck_stage3.jsonl", 93 | "columns": { 94 | "prompt": "instruction", 95 | "response": "output", 96 | "system": "system" 97 | } 98 | }, 99 | "Llama-3.1-8B_sample30_thresh1.0_v2_prm": { 100 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_llamastage2.jsonl", 101 | "columns": { 102 | "prompt": "instruction", 103 | "response": "output", 104 | "system": "system" 105 | } 106 | }, 107 | "Qwen-2.5-14B_warmup": { 108 | "file_name": "musique_train_warmup_qwen14b.jsonl", 109 | "columns": { 110 | "prompt": "instruction", 111 | "response": "output", 112 | "system": "system" 113 | } 114 | }, 115 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_prm": { 116 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_qwen14bstage2.jsonl", 117 | "columns": { 118 | "prompt": "instruction", 119 | "response": "output", 120 | "system": "system" 121 | } 122 | }, 123 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_checkstage3_prm": { 124 | "file_name": "musique-qwen14b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 125 | "columns": { 126 | "prompt": "instruction", 127 | "response": "output", 128 | "system": "system" 129 | } 130 | }, 131 | "Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm": { 132 | "file_name": "musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl", 133 | "columns": { 134 | "prompt": "instruction", 135 | "response": "output", 136 | "system": "system" 137 | } 138 | } 139 | } -------------------------------------------------------------------------------- /evaltoolkits/launch_lbv1big.sh: -------------------------------------------------------------------------------- 1 | 2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2" 3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2" 4 | #model_config="${model_root}/tokenizer*" 5 | #model_path="${model_root}/checkpoint-58" 6 | mode="cot" 7 | 8 | domain_list=("all") 9 | eval_data_dir="../dataset/longbenchv1" 10 | sample_num=1 11 | temperature=0.0 12 | 13 | 14 | #cp ${model_config} ${model_path} 15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models 16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} 17 | 18 | CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \ 19 | --served-model-name ${model_name} \ 20 | --model ${model_path} \ 21 | --tensor-parallel-size=2 \ 22 | --max_model_len 25000\ 23 | --swap-space 32\ 24 | --trust-remote-code \ 25 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 & 26 | 27 | CUDA_VISIBLE_DEVICES=2,3 
python -m vllm.entrypoints.openai.api_server \ 28 | --served-model-name ${model_name} \ 29 | --model ${model_path} \ 30 | --tensor-parallel-size=2 \ 31 | --max_model_len 25000\ 32 | --swap-space 32\ 33 | --trust-remote-code \ 34 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 & 35 | 36 | CUDA_VISIBLE_DEVICES=4,5 python -m vllm.entrypoints.openai.api_server \ 37 | --served-model-name ${model_name} \ 38 | --model ${model_path} \ 39 | --tensor-parallel-size=2 \ 40 | --max_model_len 25000\ 41 | --swap-space 32\ 42 | --trust-remote-code \ 43 | --port 8002 > ../log/vllm_${model_name}_gpu2.log 2>&1 & 44 | 45 | CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server \ 46 | --served-model-name ${model_name} \ 47 | --model ${model_path} \ 48 | --tensor-parallel-size=2 \ 49 | --max_model_len 25000\ 50 | --swap-space 32\ 51 | --trust-remote-code \ 52 | --port 8003 > ../log/vllm_${model_name}_gpu3.log 2>&1 & 53 | 54 | sleep 30 # sleep 30s, wait for the servers to start 55 | 56 | 57 | for domain in "${domain_list[@]}"; do 58 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" #"musique_${mode}.jsonl" 59 | for file_name in "${file_name_list[@]}"; do 60 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1) 61 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and || 62 | result_dir="./pred_cot_vs_nocot" 63 | output_dir="${result_dir}/${model_name}" 64 | mkdir -p ${output_dir} 65 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh 66 | 67 | result_dir="${result_dir%/}" # remove the trailing slash if there is any 68 | eval_data_dir="${eval_data_dir%/}" 69 | 70 | echo "Launching inference for ${model_name}..." 71 | echo "Model path: ${model_path}" 72 | echo "Eval data dir: ${eval_data_dir}" 73 | echo "File name: ${file_name}" 74 | echo "Sample num: ${sample_num}" 75 | echo "Result dir: ${result_dir}" 76 | echo "Temperature: ${temperature}" 77 | echo "COT mode: ${cot_mode}" 78 | echo "Dataset: ${eval_dataset_name}" 79 | 80 | echo "Evaluating ${eval_dataset_name}..." 
81 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl" 82 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl" 83 | 84 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \ 85 | --model ${model_name} \ 86 | --model_path ${model_path} \ 87 | --data_path ${eval_data_dir}/${file_name} \ 88 | --output_path ${path_to_inference_output} \ 89 | --sample_num ${sample_num} \ 90 | --world_size 4 \ 91 | --dataset_name ${eval_dataset_name} \ 92 | --temperature ${temperature} \ 93 | > ./inference2.out 94 | 95 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output} 96 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result} 97 | 98 | done 99 | done 100 | pkill -f vllm; pkill -f spawn_main -------------------------------------------------------------------------------- /evaltoolkits/step3_eval_f1.py: -------------------------------------------------------------------------------- 1 | import json 2 | import json_repair 3 | import argparse 4 | import numpy as np 5 | import jsonlines 6 | 7 | from pathlib import Path 8 | from typing import List 9 | from tqdm import tqdm 10 | 11 | from utils import load_jsonl_file 12 | from metrics import ( 13 | qa_f1_score, 14 | rouge_zh_score, 15 | qa_f1_zh_score, 16 | rouge_score, 17 | classification_score, 18 | retrieval_score, 19 | retrieval_zh_score, 20 | count_score, 21 | code_sim_score, 22 | qa_recall_score, 23 | babiq3_score, 24 | ruler_score, 25 | babi_score, 26 | accuracy_score 27 | ) 28 | 29 | dataset2metric = { 30 | "narrativeqa": qa_f1_score, 31 | "qasper": qa_f1_score, 32 | "Ruler": ruler_score, 33 | "MQA-Medium": accuracy_score, 34 | "MQA-Medium-v2": accuracy_score, 35 | "babiq3": qa_f1_score, 36 | "Babi": babi_score, 37 | "Babiq3": babiq3_score, 38 | "MQA": accuracy_score, 39 | "ICL": accuracy_score, 40 | "LIL": accuracy_score, 41 | "LSDU": accuracy_score, 42 | "NIAH": accuracy_score, 43 | "BABILong": accuracy_score, 44 | "LHU": accuracy_score, 45 | "CRU": accuracy_score, 46 | "SQA": accuracy_score, 47 | "SQA-Medium-v2": accuracy_score, 48 | "multifieldqa": qa_f1_score, 49 | "multifieldqa_en": qa_f1_score, 50 | "multifieldqa_zh": qa_f1_zh_score, 51 | "hotpotqa": qa_f1_score, 52 | "2wikimqa": qa_f1_score, 53 | "musique": qa_f1_score, 54 | "dureader": rouge_zh_score, 55 | "gov_report": rouge_score, 56 | "gov": rouge_score, 57 | "qmsum": rouge_score, 58 | "multi_news": rouge_score, 59 | "multi": rouge_score, 60 | "vcsum": rouge_zh_score, 61 | "trec": classification_score, 62 | "triviaqa": qa_f1_score, 63 | "samsum": rouge_score, 64 | "lsht": classification_score, 65 | "passage_retrieval_en": retrieval_score, 66 | "passage_count": count_score, 67 | "passage_retrieval_zh": retrieval_zh_score, 68 | "lcc": code_sim_score, 69 | "repobench-p": code_sim_score, 70 | } 71 | 72 | def parse_args(args=None): 73 | parser = argparse.ArgumentParser() 74 | parser.add_argument('--path_to_src_file', type=str, default=None) 75 | return parser.parse_args(args) 76 | 77 | def get_score_list(dataset:str, predictions:List[str], answers:List[str],**kwargs) -> List[float]: 78 | score_list: List[float] = [] 79 | for pred in predictions: 80 | if dataset=="Ruler": 81 | score=1 82 | else: 83 | score=0 84 | for answer in answers: 85 | if dataset=="Ruler": 86 | score = min(dataset2metric[dataset](pred, answer,**kwargs),score) 87 | else: 88 | score = 
max(dataset2metric[dataset](pred, answer,**kwargs),score) 89 | score_list.append(score) 90 | return score_list 91 | 92 | def main(): 93 | args = parse_args() 94 | data_list = load_jsonl_file(args.path_to_src_file) 95 | best_score_list = [] 96 | file_name = Path(args.path_to_src_file).name 97 | dataset = file_name.split('.')[0].split("-")[0] 98 | print(f"Eval {dataset}") 99 | for data in tqdm(data_list, desc="Calculating F1 score"): 100 | extracted_pred_list:List[str] = data["extracted_pred_list"] 101 | answers = data["answers"] 102 | if type(answers) !=type(["!"]): 103 | answers=[answers] 104 | if "all_classes" in data.keys(): 105 | all_classes=data['all_classes'] 106 | else: 107 | all_classes=[] 108 | score_list = get_score_list(dataset, extracted_pred_list, answers,all_classes=all_classes) 109 | best_score_in_this_data = max(score_list) 110 | best_score_list.append(best_score_in_this_data) 111 | data["f1_score_list"] = score_list 112 | final_score = np.mean(best_score_list) # *100 and round to 2 decimal places 113 | final_score = round(final_score*100, 2) 114 | print(f"Final score: {final_score}") 115 | with jsonlines.open(args.path_to_src_file, mode='w') as writer: 116 | writer.write_all(data_list) 117 | data_list_noinstr = [{k:v for k,v in data.items() if k!="instruction"} for data in data_list] 118 | with jsonlines.open(args.path_to_src_file.replace(".jsonl","_noinstr.jsonl"), mode='w') as writer: 119 | writer.write_all(data_list_noinstr) 120 | 121 | # add to result.json, overwrite the value if it already exists 122 | # check if result.json exists 123 | result_path = Path(args.path_to_src_file).parent / "result.json" 124 | if result_path.exists(): 125 | with open(result_path, 'r') as f: 126 | result = json.load(f) 127 | else: 128 | result = {} 129 | result[file_name] = final_score 130 | with open(result_path, 'w') as f: 131 | json.dump(result, f, ensure_ascii=False, indent=4) 132 | print(f"Result saved in {result_path}") 133 | 134 | if __name__ == '__main__': 135 | main() 136 | 137 | 138 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 |
3 |
4 | 5 | # 📖 Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision 6 | 7 |
10 | 
 11 | **LongRePS** tackles quality bottlenecks in CoT reasoning for extended contexts by integrating process supervision. As shown in the figure, we find that in complex task scenarios, chain-of-thought reasoning consistently improves model performance. We further observe that, although the benefit of vanilla CoT grows with context length, self-sampled reasoning paths exhibit significant inconsistency and hallucination risks, especially in multi-hop QA and other complex scenarios. 
 12 | 
 13 | 
 14 | The framework operates in two phases: (1) **Self-sampling** generates diverse CoT candidates to capture reasoning variability, and (2) **Context-aware assessment** enforces answer correctness, grounding via text matching, and intrinsic consistency via LLM-based scoring. 
 15 | 
 16 | 
 17 | Evaluations on long-context tasks show that LongRePS achieves gains of 13.6/3.8 points on MuSiQue (LLaMA/Qwen) together with cross-task robustness, outperforming outcome supervision. These results establish process supervision as pivotal for scalable long-context reasoning, and the open-source code enables community adoption. 
 18 | *** 
 19 |  
 20 | *** 
 21 |  
 22 | 
 23 | 
 24 | ## 🔥 News 
 25 | **[2025/03/03]** Released training and evaluation data for **LongRePS**. The model parameters and complete code will be available soon. 
 26 | 
 27 | ## 🔍 List of Contents 
 28 | - [🔨 Requirements](#requirements) 
 29 | - [⚙️ How to Prepare Data for Training](#how-to-Prepare-Data-for-Training) 
 30 | - [🖥️ How to Prepare Data for Evaluating](#how-to-Prepare-Data-for-Evaluating) 
 31 | - [🍧 Training](#training) 
 32 | - [📊 Evaluation](#evaluation) 
 33 | - [📄 Acknowledgement](#acknowledgement) 
 34 | 
 35 | 
 36 | 
 37 | ## 🔨 Requirements 
 38 | 
 39 | **Install LLaMA-Factory** 
 40 | 
 41 | Please refer to this tutorial for [installation](https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/installation.html). 
 42 | Or you can use the following command: 
 43 | ```bash 
 44 | git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git 
 45 | cd LLaMA-Factory 
 46 | pip install -e ".[torch,metrics]" 
 47 | ``` 
 48 | 
 49 | **Install Other Supporting Libraries** 
 50 | 
 51 | ```bash 
 52 | cd .. 
53 | git clone https://github.com/lemon-prog123/LongRePS.git 
 54 | cd LongRePS 
 55 | pip install -r requirements.txt 
 56 | ``` 
 57 | 
 58 | 
 59 | 
 60 | ## ⚙️ How to Prepare Data for Training 
 61 | 
 62 | **Llama-3.1-8B**: 
 63 | ```python 
 64 | from datasets import load_dataset 
 65 | import jsonlines 
 66 | model="Llama-3.1-8B" 
 67 | dataset = load_dataset("Lemon123prog/Llama-3.1-8B-LongRePS") 
 68 | warmup_data=dataset['warmup'].to_list() 
 69 | orm_data=dataset['train_orm'].to_list() 
 70 | prm_data=dataset['train_prm'].to_list() 
 71 | 
 72 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer: 
 73 | writer.write_all(warmup_data) 
 74 | 
 75 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer: 
 76 | writer.write_all(orm_data) 
 77 | 
 78 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer: 
 79 | writer.write_all(prm_data) 
 80 | ``` 
 81 | 
 82 | **Qwen-2.5-7B**: 
 83 | ```python 
 84 | from datasets import load_dataset 
 85 | import jsonlines 
 86 | model="Qwen-2.5-7B" 
 87 | dataset = load_dataset("Lemon123prog/Qwen-2.5-7B-LongRePS") 
 88 | warmup_data=dataset['warmup'].to_list() 
 89 | orm_data=dataset['train_orm'].to_list() 
 90 | prm_data=dataset['train_prm'].to_list() 
 91 | 
 92 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer: 
 93 | writer.write_all(warmup_data) 
 94 | 
 95 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer: 
 96 | writer.write_all(orm_data) 
 97 | 
 98 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer: 
 99 | writer.write_all(prm_data) 
 100 | ``` 
 101 | 
 102 | Or you can simply run [preprocess_train.py](preprocess_train.py): 
 103 | ```bash 
 104 | python preprocess_train.py 
 105 | ``` 
 106 | 
 107 | 
 108 | 
 109 | ## 🖥️ How to Prepare Data for Evaluating 
 110 | 
 111 | ```bash 
 112 | bash scripts/preprocess_lb.sh 
 113 | ``` 
 114 | Then you will obtain the processed evaluation data in the **dataset** directory. 
 115 | 
 116 | 
 117 | 
 118 | ## 🍧 Training 
 119 | 
 120 | ### Download base models 
 121 | 
 122 | ```python 
 123 | from huggingface_hub import snapshot_download 
 124 | from pathlib import Path 
 125 | repo_id ="Qwen/Qwen2.5-7B" 
 126 | root_dir = Path("Your own path for Qwen") 
 127 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model") 
 128 | 
 129 | repo_id ="meta-llama/Llama-3.1-8B" 
 130 | root_dir = Path("Your own path for Llama") 
 131 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model") 
 132 | ``` 
 133 | 
 134 | Set **Model_Path** in the scripts before training. 
 135 | 
 136 | ### Warm Up Stage 
 137 | 
 138 | **Llama-3.1-8B** 
 139 | ```bash 
 140 | bash scripts/llama_warmup.sh 
 141 | ``` 
 142 | 
 143 | **Qwen-2.5-7B** 
 144 | ```bash 
 145 | bash scripts/qwen_warmup.sh 
 146 | ``` 
 147 | 
 148 | ### Sample Data and Fine-tune Models 
 149 | 
 150 | Set **Model-Name**, **Model-Path**, and **File-Name** in the scripts before sampling. 
 151 | ```bash 
 152 | cd evaltoolkits 
 153 | bash loop_sample.sh 
 154 | ``` 
 155 | 
 156 | After the sampling process, you can use [filter_data.py](evaltoolkits/filter_data.py) to launch the filtering framework. 
 157 | 
 158 | ```bash 
 159 | cd evaltoolkits 
 160 | python filter_data.py \ 
 161 | --path_to_src_file [Sampling Data] \ 
 162 | --path_to_stage1_file [Output Data Path] 
 163 | ``` 
 164 | 
 165 | Add the **filtered dataset** to [dataset_info.json](data/dataset_info.json) so that it appears in the dataset list; a sketch of how to register it is given below. 
 166 | 
 167 | Finally, set the **warm-up model path** and **dataset_name** in the scripts to launch the fine-tuning process. 
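Below is a minimal sketch of how a filtered dataset could be registered programmatically. The dataset name (`musique_filtered_prm`), the file name, and the column mapping are placeholder assumptions for illustration only; align the actual fields with the existing entries in [dataset_info.json](data/dataset_info.json) and the LLaMA-Factory dataset conventions.

```python
import json
from pathlib import Path

# Hypothetical names -- replace with the file actually produced by filter_data.py.
dataset_name = "musique_filtered_prm"
filtered_file = "musique_filtered_prm.jsonl"

info_path = Path("data/dataset_info.json")
info = json.loads(info_path.read_text(encoding="utf-8"))

# Register the filtered dataset so the SFT scripts can refer to it by name.
info[dataset_name] = {
    "file_name": filtered_file,
    "columns": {
        "prompt": "instruction",  # assumed field names; match your jsonl schema
        "response": "output",
    },
}

info_path.write_text(json.dumps(info, ensure_ascii=False, indent=4), encoding="utf-8")
print(f"Registered {dataset_name} in {info_path}")
```

Once the entry exists, point **dataset_name** in the SFT script (e.g. `scripts/llama_sft.sh` or `scripts/qwen_sft.sh`) at the same name.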
168 | 169 | **Llama-3.1-8B** 170 | ```bash 171 | bash scripts/llama_sft.sh 172 | ``` 173 | 174 | **Qwen-2.5-7B** 175 | ```bash 176 | bash scripts/qwen_sft.sh 177 | ``` 178 | 179 | 180 | 181 | ## 📊 Evaluation 182 | 183 | **LongBench v1** 184 | ```bash 185 | cd evaltoolkits 186 | bash launch_lbv1.sh 187 | ``` 188 | 189 | **LongBench v2** 190 | ```bash 191 | cd evaltoolkits 192 | bash launch_lbv2.sh 193 | ``` 194 | 195 | Note: Set **model_path** and **mode** to the desired target model. 196 | 197 | 198 | 199 | ## 📝 Citation 200 | ``` 201 | @article{zhu2025chain, 202 | title={Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision}, 203 | author={Zhu, Dawei and Wei, Xiyu and Zhao, Guangxiang and Wu, Wenhao and Zou, Haosheng and Ran, Junfeng and Wang, Xun and Sun, Lin and Zhang, Xiangzheng and Li, Sujian}, 204 | journal={arXiv preprint arXiv:2502.20790}, 205 | year={2025} 206 | } 207 | ``` 208 | ## 📄 Acknowledgement 209 | We are deeply thankful for the following projects that serve as the foundation for LongRePS: 210 | 211 | * [**SEALONG**](https://github.com/SihengLi99/SEALONG) 212 | * [**LongBench**](https://github.com/THUDM/LongBench) 213 | * [**LLaMA-Factory**](https://github.com/hiyouga/LLaMA-Factory) 214 | * [**360-LLaMA-Factory**](https://github.com/Qihoo360/360-LLaMA-Factory) 215 | 216 | -------------------------------------------------------------------------------- /config/prompt.py: -------------------------------------------------------------------------------- 1 | prompt_lbv1_cot=""" 2 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as a concise answer to the question: 3 | 4 | **Instructions:** 5 | Step 1. **Reasoning:** Imagine you are a student who has no prior knowledge about the giving context. Your task is to answer the questions based solely on the information presented here. First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information. 6 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. Ensure that your answer should be brief and to the point. 7 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, as concise and to the point as possible. 8 | 9 | Illustrative Examples: 10 | 11 | Example #1: 12 | 13 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...] 14 | **Question:** What is Saltram's living situation? 
15 | 16 | **Response:** 17 | {{ 18 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.", 19 | "answer": "He is a guest in the home of the Mulvilles." 20 | }} 21 | 22 | Example #2: 23 | 24 | **Context:** [... The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas ... Houston Baptist University, affiliated with the Baptist General Convention of Texas, offers bachelor's and graduate degrees. It was founded in 1960 ...] 25 | **Question:** When was the institute that owned The Collegian founded? 26 | 27 | **Response:** 28 | {{ 29 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about the founding date of the institute that owned The Collegian. In the document, I can first locate that [Excerpt 1] `The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas`, so I need to look for information about Houston Baptist University. I find that [Excerpt 2] `Houston Baptist University was founded in 1960`. Therefore, the institute that owned The Collegian was founded in 1960.", 30 | "answer": "1960" 31 | }} 32 | 33 | 34 | Now, based on the context provided below, answer the question as concisely as you can, using a single phrase or sentence if possible. Meanwhile, reasoning must comply with the original text, and any knowledge should be derived from the original text. 35 | 36 | **Context:** {context} 37 | **Question:** {question} 38 | 39 | **Response:** 40 | """ 41 | 42 | prompt_lbv1_nocot=""" 43 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nContext:{context}\n\nNow, answer the question based on the context as concisely as you can, using a single phrase if possible. Do not provide any explanation and only give the best answer once.\n\nQuestion:{question}\n\nAnswer:""" 44 | 45 | 46 | prompt_lbv2_cot=""" 47 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as a answer chosen from ABCD options to the question: 48 | 49 | **Instructions:** 50 | Step 1. **Reasoning:** First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information. 51 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. 
Ensure that your answer should be brief and to the point. 52 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, which is a choice selected from the ABCD options. 53 | 54 | Illustrative Examples: 55 | 56 | Example #1: 57 | 58 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...] 59 | **Question:** What is Saltram's living situation? 60 | **Choices:** 61 | (A) He is a guest in the home of the Mulvilles. 62 | (B) He is in a hotel. 63 | (C) He is homeless now. 64 | (D) Unkonwn 65 | 66 | **Response:** 67 | {{ 68 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.", 69 | "answer": "A" 70 | }} 71 | 72 | Now, based on the context provided below, answer the question with a choice selected from ABCD. 73 | 74 | **Context:** {context} 75 | **Question:** {question} 76 | **Choices:** 77 | (A) {choice_A} 78 | (B) {choice_B} 79 | (C) {choice_C} 80 | (D) {choice_D} 81 | 82 | **Response:** 83 | """ 84 | 85 | prompt_lbv2_nocot=""" 86 | Please read the following text and answer the question below. 87 | 88 | {context} 89 | 90 | What is the correct answer to this question: {question} 91 | Choices: 92 | (A) {choice_A} 93 | (B) {choice_B} 94 | (C) {choice_C} 95 | (D) {choice_D} 96 | 97 | Format your response as follows: "The correct answer is (insert answer here)". 98 | """ 99 | 100 | prompt_cot="""Given a document and a question, answer concisely using a single phrase and provide a brief reasoning process. \n\nContext:{context}\n\n Now, answer the question based on the context as concisely as you can and give a reasoning paths, using a single phrase if possible. 
\n\nQuestion:{question}\n\n Format your response as: 101 | Answer: [] 102 | Reasoning: [] 103 | Ensure both sections are separated clearly for easy extraction.""" -------------------------------------------------------------------------------- /evaltoolkits/step1_eval_inference.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import json 4 | import argparse 5 | import torch.distributed as dist 6 | import numpy as np 7 | import random 8 | import torch.multiprocessing as mp 9 | import time 10 | 11 | from openai import OpenAI 12 | from tqdm import tqdm 13 | from pathlib import Path 14 | from pydantic import BaseModel 15 | from typing import List 16 | 17 | from utils import load_jsonl_file, check_pred_fact_consistency 18 | 19 | class Response(BaseModel): 20 | reasoning: str 21 | answer: str 22 | 23 | 24 | def seed_everything(seed): 25 | torch.manual_seed(seed) 26 | torch.cuda.manual_seed(seed) 27 | np.random.seed(seed) 28 | random.seed(seed) 29 | torch.backends.cudnn.benchmark = False 30 | torch.backends.cudnn.deterministic = True 31 | torch.cuda.manual_seed_all(seed) 32 | 33 | def parse_args(args=None): 34 | parser = argparse.ArgumentParser() 35 | parser.add_argument('--model', type=str, default=None) 36 | parser.add_argument('--test', action='store_true', help="Evaluate on test mode") 37 | parser.add_argument('--model_path', type=str, default=None) 38 | parser.add_argument('--data_path', type=str, default=None) 39 | parser.add_argument('--output_path', type=str, default=None) 40 | parser.add_argument('--dataset_name', type=str, default=None) 41 | parser.add_argument('--sample_num', type=int, default=30) 42 | parser.add_argument('--world_size', type=int, default=8) 43 | parser.add_argument('--max_gen', type=int, default=512) 44 | parser.add_argument('--gpt', action='store_true', help="Evaluate on test mode") 45 | parser.add_argument('--temperature', type=float, default=0) 46 | return parser.parse_args(args) 47 | 48 | def get_api_results(model_name, prompt, gpu_id, sample_num, max_gen, temp,gpt=False): 49 | max_retries = 5 50 | response_list = [] 51 | # json_schema = Response.model_json_schema() #TODO 52 | for i in range(max_retries): 53 | if gpt: 54 | api_key = os.getenv("OPENAI_API_KEY") 55 | client = OpenAI(api_key=api_key, base_url="Your Online Model URL") 56 | else: 57 | client = OpenAI(api_key="EMPTY", base_url=f"http://localhost:800{gpu_id}/v1") 58 | try: 59 | ''' 60 | response=client.completions.create( 61 | prompt=prompt, 62 | #messages=[ 63 | # {"role":"system", "content": "You are a helpful assistant."}, 64 | # {"role": "user","content": prompt} 65 | #], 66 | model=model_name, 67 | temperature=temp, 68 | n=sample_num, 69 | stop=["}"], 70 | max_tokens=max_gen, 71 | # extra_body={"guided_json": json_schema}, 72 | ) 73 | 74 | for choice in response.choices: 75 | #response_list.append(choice.message.content) 76 | response_list.append(choice.text) 77 | return response_list 78 | ''' 79 | 80 | response=client.chat.completions.create( 81 | #prompt=prompt, 82 | messages=[ 83 | {"role":"system", "content": "You are a helpful assistant."}, 84 | {"role": "user","content": prompt} 85 | ], 86 | #stop=["}","."], 87 | model=model_name, 88 | temperature=temp, 89 | n=sample_num, 90 | max_tokens=max_gen, 91 | # extra_body={"guided_json": json_schema}, 92 | ) 93 | 94 | for choice in response.choices: 95 | response_list.append(choice.message.content) 96 | #response_list.append(choice.text) 97 | return response_list 98 | except 
Exception as e: 99 | print(e) 100 | time.sleep(50) 101 | return None 102 | 103 | def get_pred_from_vllm(rank, data, max_gen, model_name, out_path, sample_num,lock, temp,gpt): 104 | # print("Temp: ",temp) 105 | if gpt: 106 | print("Eval On ",model_name) 107 | for json_obj in tqdm(data): 108 | prompt = json_obj['instruction'] 109 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt) 110 | def check_pred_validity(pred:str, prompt): 111 | if prompt.endswith("Answer:") or prompt.endswith("Type:") or prompt.endswith("Summary: ") or prompt.endswith("Answer:\n") or prompt.endswith("\".\n"): 112 | return True 113 | if "\"answer\"" not in pred: 114 | return False 115 | return True 116 | 117 | if preds==None: 118 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3,gpt=gpt) 119 | if new_preds==None: 120 | continue 121 | else: 122 | preds=new_preds 123 | 124 | check_flag=False 125 | if len(preds) == 1: 126 | if not check_pred_validity(preds[0], prompt): 127 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3) 128 | if new_preds!=None: 129 | for pred in new_preds: 130 | if check_pred_validity(pred, prompt): 131 | preds = [pred] 132 | check_flag=True 133 | break 134 | else: 135 | check_flag=True 136 | 137 | if not check_pred_validity(preds[0], prompt): 138 | new_preds = get_api_results(model_name, prompt, rank, 10, max_gen=max_gen, temp=0.3) 139 | if new_preds!=None: 140 | for pred in new_preds: 141 | if check_pred_validity(pred, prompt): 142 | preds = [pred] 143 | check_flag=True 144 | break 145 | else: 146 | continue 147 | else: 148 | check_flag=True 149 | 150 | if "answers" in json_obj.keys(): 151 | instruction, answers, _id = json_obj["instruction"], json_obj["answers"], json_obj["id"] 152 | else: 153 | instruction, answers, _id = json_obj["instruction"], [json_obj["output"]], json_obj["id"] 154 | if "all_classes" in json_obj.keys(): 155 | all_classes=json_obj['all_classes'] 156 | else: 157 | all_classes=[] 158 | 159 | try: 160 | question=json_obj['question'] 161 | except: 162 | question = instruction.split("Question:")[-1].replace("\n\nAnswer:","").replace("*","").strip() 163 | with lock: 164 | with open(out_path, "a", encoding="utf-8") as f: 165 | json.dump({"pred": preds, "instruction": instruction, "question":question, "answers": answers, "id":_id,"check_flag":str(check_flag) ,"all_classes": all_classes, "length": 0}, f, ensure_ascii=False) 166 | f.write('\n') 167 | dist.destroy_process_group() 168 | return 169 | 170 | if __name__ == '__main__': 171 | print(os.getpid()) 172 | seed_everything(42) 173 | args = parse_args() 174 | world_size = args.world_size #torch.cuda.device_count() 175 | print("Wold Size ",world_size) 176 | mp.set_start_method('fork', force=True) 177 | model_name = args.model 178 | dataset_name = args.dataset_name 179 | 180 | sources = load_jsonl_file(args.data_path) 181 | data_all = [data_sample for data_sample in sources] 182 | data_subsets = [data_all[i::world_size] for i in range(world_size)] 183 | 184 | out_path = Path(args.output_path) 185 | out_path.parent.mkdir(parents=True, exist_ok=True) 186 | if out_path.exists(): 187 | out_path.unlink() 188 | processes = [] 189 | lock = mp.RLock() 190 | for rank in range(world_size): 191 | p = mp.Process(target=get_pred_from_vllm, args=(rank, data_subsets[rank], args.max_gen, model_name, out_path, args.sample_num, lock, args.temperature,args.gpt)) 192 | p.start() 193 | processes.append(p) 194 | 195 | for p in processes: 196 
| p.join() -------------------------------------------------------------------------------- /evaltoolkits/metrics.py: -------------------------------------------------------------------------------- 1 | import re 2 | import string 3 | 4 | import jieba 5 | from fuzzywuzzy import fuzz 6 | import difflib 7 | 8 | from typing import List 9 | from collections import Counter 10 | from rouge import Rouge 11 | 12 | def normalize_answer(s): 13 | """Lower text and remove punctuation, articles and extra whitespace.""" 14 | 15 | def remove_articles(text): 16 | return re.sub(r"\b(a|an|the)\b", " ", text) 17 | 18 | def white_space_fix(text): 19 | return " ".join(text.split()) 20 | 21 | def remove_punc(text): 22 | exclude = set(string.punctuation) 23 | return "".join(ch for ch in text if ch not in exclude) 24 | 25 | def lower(text): 26 | return text.lower() 27 | 28 | return white_space_fix(remove_articles(remove_punc(lower(s)))) 29 | 30 | 31 | def normalize_zh_answer(s): 32 | """Lower text and remove punctuation, extra whitespace.""" 33 | 34 | def white_space_fix(text): 35 | return "".join(text.split()) 36 | 37 | def remove_punc(text): 38 | cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏." 39 | all_punctuation = set(string.punctuation + cn_punctuation) 40 | return "".join(ch for ch in text if ch not in all_punctuation) 41 | 42 | def lower(text): 43 | return text.lower() 44 | 45 | return white_space_fix(remove_punc(lower(s))) 46 | 47 | def count_score(prediction, ground_truth, **kwargs): 48 | numbers = re.findall(r"\d+", prediction) 49 | right_num = 0 50 | for number in numbers: 51 | if str(number) == str(ground_truth): 52 | right_num += 1 53 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 54 | return float(final_score) 55 | 56 | def retrieval_score(prediction, ground_truth, **kwargs): 57 | pattern = r'Paragraph (\d+)' 58 | matches = re.findall(pattern, ground_truth) 59 | ground_truth_id = matches[0] 60 | numbers = re.findall(r"\d+", prediction) 61 | right_num = 0 62 | for number in numbers: 63 | if str(number) == str(ground_truth_id): 64 | right_num += 1 65 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 66 | return float(final_score) 67 | 68 | def babi_score(prediction, ground_truth, **kwargs): 69 | 70 | if ground_truth in prediction: 71 | return 1.0 72 | elif prediction==ground_truth: 73 | return 1.0 74 | elif prediction==(ground_truth+')'): 75 | return 1.0 76 | elif prediction==('('+ground_truth+')'): 77 | return 1.0 78 | else: 79 | return 0 80 | 81 | def babiq3_score(prediction, ground_truth, **kwargs): 82 | try: 83 | answer=prediction.split('was in')[1] 84 | except: 85 | answer=prediction 86 | if ground_truth in answer: 87 | return 1.0 88 | else: 89 | return 0 90 | 91 | def ruler_score(prediction, ground_truth, **kwargs): 92 | def postprocess_pred(predict_str: str): 93 | 94 | predict_str = predict_str.strip() 95 | # Remove all non-printable characters 96 | np_pattern = re.compile(r'[\x00-\x1f]') 97 | predict_str = np_pattern.sub('\n', predict_str).strip() 98 | return predict_str 99 | positive_pattern1 = r':\s*(\d+)' 100 | positive_pattern2 = r'is\s*(\d+)' 101 | negative_pattern = r'no \s* magic number' 102 | positive_match1 = re.search(positive_pattern1, prediction) 103 | positive_match2 = re.search(positive_pattern2, prediction) 104 | negative_match = re.search(negative_pattern, prediction) 105 | if negative_match: 106 | return 0 107 | elif positive_match1 or positive_match2: 108 | return 1.0 109 | 110 | if 
ground_truth in prediction: 111 | return 1.0 112 | elif prediction==ground_truth: 113 | return 1.0 114 | elif prediction==(ground_truth+')'): 115 | return 1.0 116 | elif prediction==('('+ground_truth+')'): 117 | return 1.0 118 | elif postprocess_pred(prediction)==ground_truth: 119 | return 1.0 120 | else: 121 | return 0 122 | 123 | def accuracy_score(prediction, ground_truth, **kwargs): 124 | def extract_answer(response): 125 | response = response.replace('*', '') 126 | match = re.search(r'The correct answer is \(([A-D])\)', response) 127 | if match: 128 | return match.group(1) 129 | else: 130 | match = re.search(r'The correct answer is ([A-D])', response) 131 | if match: 132 | return match.group(1) 133 | else: 134 | return None 135 | bool_cnt=0 136 | choice_list=['A','B','C','D'] 137 | pattern1 = f'{ground_truth} ' 138 | pattern3=f'\* {ground_truth}' 139 | pattern4=f'{ground_truth}$' 140 | for choice in choice_list: 141 | if ('('+choice+')') in prediction: 142 | bool_cnt+=1 143 | continue 144 | pattern2 = f'{choice} ' 145 | matches = re.findall(pattern2, prediction) 146 | if matches: 147 | bool_cnt+=1 148 | continue 149 | if bool_cnt>=2: #m choices 150 | return 0 151 | if ('('+ground_truth+')') in prediction: 152 | return 1.0 153 | if ground_truth==prediction: 154 | return 1.0 155 | matches1 = re.findall(pattern1, prediction) 156 | matches2 = re.findall(pattern3, prediction) 157 | matches3 = re.findall(pattern4, prediction) 158 | if matches1 or matches2 or matches3: 159 | return 1.0 160 | else: 161 | return 0 162 | 163 | def retrieval_zh_score(prediction, ground_truth, **kwargs): 164 | pattern = r'段落(\d+)' 165 | matches = re.findall(pattern, ground_truth) 166 | ground_truth_id = matches[0] 167 | numbers = re.findall(r"\d+", prediction) 168 | right_num = 0 169 | for number in numbers: 170 | if str(number) == str(ground_truth_id): 171 | right_num += 1 172 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers) 173 | return float(final_score) 174 | 175 | def code_sim_score(prediction, ground_truth, **kwargs): 176 | all_lines = prediction.lstrip('\n').split('\n') 177 | prediction = "" 178 | for line in all_lines: 179 | if ('`' not in line) and ('#' not in line) and ('//' not in line): 180 | prediction = line 181 | break 182 | return (fuzz.ratio(prediction, ground_truth) / 100) 183 | 184 | def classification_score(prediction, ground_truth, **kwargs): 185 | #print(prediction) 186 | #if '\n' in prediction: 187 | # prediction = prediction.lstrip('\n').split('\n')[0] 188 | em_match_list = [] 189 | all_classes = kwargs["all_classes"] 190 | for class_name in all_classes: 191 | if class_name in prediction: 192 | em_match_list.append(class_name) 193 | for match_term in em_match_list: 194 | if match_term in ground_truth and match_term != ground_truth: 195 | em_match_list.remove(match_term) 196 | if ground_truth in em_match_list: 197 | score = (1.0 / len(em_match_list)) 198 | else: 199 | score = 0.0 200 | return score 201 | 202 | def rouge_score(prediction, ground_truth, **kwargs): 203 | rouge = Rouge() 204 | try: 205 | scores = rouge.get_scores([prediction], [ground_truth], avg=True) 206 | except: 207 | return 0.0 208 | return scores["rouge-l"]["f"] 209 | 210 | def rouge_zh_score(prediction, ground_truth, **kwargs): 211 | prediction = " ".join(list(jieba.cut(prediction, cut_all=False))) 212 | ground_truth = " ".join(list(jieba.cut(ground_truth, cut_all=False))) 213 | score = rouge_score(prediction, ground_truth) 214 | return score 215 | 216 | def f1_score(prediction, ground_truth, 
**kwargs): 217 | common = Counter(prediction) & Counter(ground_truth) 218 | num_same = sum(common.values()) 219 | if num_same == 0: 220 | return 0 221 | precision = 1.0 * num_same / len(prediction) 222 | recall = 1.0 * num_same / len(ground_truth) 223 | f1 = (2 * precision * recall) / (precision + recall) 224 | return f1 225 | 226 | def recall_score(prediction, ground_truth, **kwargs): 227 | common = Counter(prediction) & Counter(ground_truth) 228 | num_same = sum(common.values()) 229 | if num_same == 0: 230 | return 0 231 | recall = 1.0 * num_same / len(ground_truth) 232 | return recall 233 | 234 | def qa_f1_score(prediction, ground_truth, **kwargs): 235 | normalized_prediction = normalize_answer(prediction) 236 | normalized_ground_truth = normalize_answer(ground_truth) 237 | 238 | prediction_tokens = normalized_prediction.split() 239 | ground_truth_tokens = normalized_ground_truth.split() 240 | return f1_score(prediction_tokens, ground_truth_tokens) 241 | 242 | def qa_recall_score(prediction, ground_truth, **kwargs): 243 | normalized_prediction = normalize_answer(prediction) 244 | normalized_ground_truth = normalize_answer(ground_truth) 245 | 246 | prediction_tokens = normalized_prediction.split() 247 | ground_truth_tokens = normalized_ground_truth.split() 248 | return recall_score(prediction_tokens, ground_truth_tokens) 249 | 250 | 251 | def qa_f1_zh_score(prediction, ground_truth, **kwargs): 252 | prediction_tokens = list(jieba.cut(prediction, cut_all=False)) 253 | ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False)) 254 | prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens] 255 | ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens] 256 | prediction_tokens = [token for token in prediction_tokens if len(token) > 0] 257 | ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0] 258 | return f1_score(prediction_tokens, ground_truth_tokens) 259 | -------------------------------------------------------------------------------- /evaltoolkits/filter_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | import jsonlines 3 | import json_repair 4 | import pandas as pd 5 | import numpy as np 6 | import torch 7 | import shutil 8 | import glob 9 | import re 10 | import string 11 | import time 12 | import os 13 | import argparse 14 | from typing import List, Tuple 15 | from tqdm import tqdm 16 | from pathlib import Path 17 | from json_repair import repair_json 18 | from transformers import AutoTokenizer, AutoModelForCausalLM 19 | from datasets import Dataset 20 | from openai import OpenAI 21 | 22 | from utils import normalize_answer, preprocess_pred_for_json_repair, extract_fact_list, verify_fact_list, extract_number 23 | def parse_args(args=None): 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('--path_to_src_file', type=str, default=None) 26 | parser.add_argument('--path_to_stage1_file', type=str, default=None) 27 | parser.add_argument('--sample_num', type=int, default=30) 28 | return parser.parse_args(args) 29 | # 我们需要你帮忙评价模型推理过程的质量。模型的接收的输入是一段长文本,以及一个复杂的问题,它的任务是根据问题的需要,从长文本中检索出相关信息(以[Excerpt xxx]的形式开头,包含在``中),并给出正确的答案。现在,我们已经在上面给出了问题和模型的推理过程。模型最终得到的结果是正确的,但是我们需要你来评价模型的推理过程是否合理。请你根据以下几个方面来评价模型的推理过程:- 逻辑性:模型对问题的拆解应当合理。推理过程对于检索到的信息的使用应该符合逻辑,根据检索到的信息得出答案的逻辑链条应该合理。 - 完整性:推理过程应该主要使用从文中检索到的信息,即[Excerpts xxx]后内容,而非模型自身的知识。 - 简洁性:只应当检索回答问题相关的信息,不应罗列过多无关的信息。 30 | 31 | EVAL_PROMPT = '''[Question] 32 | {question} 33 | 34 | [The Start of 
Assistant's Reasoning Path] 35 | {reasoning} 36 | [The End of Assistant's Reasoning Path] 37 | 38 | [System] 39 | We would like to request your feedback on the quality of the reasoning process in the given response. 40 | The model receives a long text input and a complex question. Its task is to retrieve relevant information from the long text (marked as [Excerpt xxx] and enclosed in ``) based on the question's requirements and provide the correct answer. Above, we have provided both the question and the model's reasoning process. While the model's final answer is correct, we need you to evaluate whether its reasoning process is sound. 41 | 42 | Please assess the model's reasoning process based on the following aspects: 43 | 44 | 1. Logical Coherence: 45 | - The model should break down the question appropriately 46 | - The use of retrieved information should follow logical patterns 47 | - The chain of reasoning from retrieved information to the final answer should be sound 48 | 49 | 2. Completeness: 50 | - The reasoning process should primarily rely on information retrieved from the text ([Excerpts xxx]) 51 | - The model should not heavily depend on its own knowledge base 52 | 53 | 3. Conciseness: 54 | - Only information relevant to answering the question should be retrieved 55 | - The model should avoid listing excessive or irrelevant information 56 | 57 | Please rate whether this reasoning path is suitable for the question. The assistant receives an overall score on a scale of 1 to 100, where a higher score indicates better overall performance. 58 | Please note that if the assistant's reasoning process fully meets the above criteria, its overall rating should be full marks (100). 59 | Please first provide a comprehensive explanation of your evaluation, avoiding any potential bias. 60 | Then, output a line indicating the score of the Assistant. 61 | 62 | PLEASE OUTPUT WITH THE FOLLOWING FORMAT, WHERE THE SCORE IS ON A SCALE OF 1 TO 100 BY STRICTLY FOLLOWING THIS FORMAT: "[[score]]", FOR EXAMPLE "Rating: [[100]]": 63 |