├── .gitignore
├── README.md
├── cache
└── ds_z3_config.json
├── config
├── __pycache__
│ ├── prompt.cpython-38.pyc
│ └── prompt.cpython-39.pyc
└── prompt.py
├── data
└── dataset_info.json
├── evaltoolkits
├── cache
│ ├── ds_z2_config.json
│ ├── ds_z2_offload_config.json
│ ├── ds_z3_config.json
│ ├── ds_z3_offload_config.json
│ └── user_config.yaml
├── filter.sh
├── filter_data.py
├── inference.out
├── inference2.out
├── launch_inference.sh
├── launch_lbv1.sh
├── launch_lbv1big.sh
├── launch_lbv2.sh
├── launch_lbv2m.sh
├── loop_eval.sh
├── loop_sample.sh
├── metrics.py
├── step1_eval_inference.py
├── step2_extract_preds_from_raw.py
├── step3_eval_f1.py
└── utils.py
├── log
└── README.md
├── pics
├── combined_plot.png
├── llama.png
└── main_table.png
├── preprocess_lbv1.py
├── preprocess_lbv2.py
├── preprocess_train.py
├── requirements.txt
└── scripts
├── llama_sft.sh
├── llama_warmup.sh
├── lora_sft.sh
├── merge_config.yaml
├── preprocess_lb.sh
├── qwen14b_lora.yaml
├── qwen_lora.sh
├── qwen_sft.sh
└── qwen_warmup.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | dataset/
2 | data/
3 | saves/
4 | evaltoolkits/pred*/
5 | **__pycache__/
6 | log/
7 | *.out
8 | **bak/
9 | **dev/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | # 📖 Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision
6 |
7 |
8 | 🤗 HF Repo • 📃 Paper
9 |
10 |
11 | **LongRePS** tackles the quality bottleneck in CoT reasoning over extended contexts by integrating process supervision. As shown in the figure, we find that in complex task scenarios chain-of-thought prompting consistently improves model performance. Furthermore, we observe that although the gains from vanilla CoT grow with context length, self-sampled reasoning paths exhibit significant inconsistency and hallucination risks, especially in multi-hop QA and other complex scenarios.
12 |
13 |
14 | The framework operates in two phases: (1) **Self-sampling** generates diverse CoT candidates to capture reasoning variability, and (2) **Context-aware assessment** enforces answer correctness, grounding via text matching, and intrinsic consistency via LLM-based scoring.
15 |
16 |
17 | Evaluations on long-context tasks show that LongRePS achieves gains of 13.6 points (LLaMA) and 3.8 points (Qwen) on MuSiQue, along with cross-task robustness, outperforming outcome supervision. These results establish process supervision as pivotal for scalable long-context reasoning, and the open-source code enables community adoption.
18 | ***
19 | 
20 | ***
21 | 
22 |
23 |
24 | ## 🔥 News
25 | **[2025/03/03]** Released training and evaluation data for **LongRePS**. The model parameters and complete code will be available soon.
26 |
27 | ## 🔍 List of Contents
28 | - [🔨 Requirements](#requirements)
29 | - [⚙️ How to Prepare Data for Training](#how-to-Prepare-Data-for-Training)
30 | - [🖥️ How to Prepare Data for Evaluating](#how-to-Prepare-Data-for-Evaluating)
31 | - [🍧 Training](#training)
32 | - [📊 Evaluation](#evaluation)
33 | - [📄 Acknowledgement](#acknowledgement)
34 |
35 |
36 |
37 | ## 🔨 Requirements
38 |
39 | **Install LLaMA-Factory**
40 |
41 | Please refer to this tutorial for [installation](https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/installation.html).
42 | Or you can use the following commands:
43 | ```bash
44 | git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
45 | cd LLaMA-Factory
46 | pip install -e ".[torch,metrics]"
47 | ```
48 |
49 | **Install Other Supporting Libraries**
50 |
51 | ```bash
52 | cd ..
53 | git clone https://github.com/lemon-prog123/LongRePS.git
54 | cd LongRePS
55 | pip install -r requirements.txt
56 | ```
57 |
58 |
59 |
60 | ## ⚙️ How to Prepare Data for Training
61 |
62 | **Llama-3.1-8B**:
63 | ```python
64 | from datasets import load_dataset
65 | import jsonlines
66 | model="Llama-3.1-8B"
67 | dataset = load_dataset("Lemon123prog/Llama-3.1-8B-LongRePS")
68 | warmup_data=dataset['warmup'].to_list()
69 | orm_data=dataset['train_orm'].to_list()
70 | prm_data=dataset['train_prm'].to_list()
71 |
72 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer:
73 | writer.write_all(warmup_data)
74 |
75 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer:
76 | writer.write_all(orm_data)
77 |
78 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer:
79 | writer.write_all(prm_data)
80 | ```
81 |
82 | **Qwen-2.5-7B**:
83 | ```python
84 | from datasets import load_dataset
85 | import jsonlines
86 | model="Qwen-2.5-7B"
87 | dataset = load_dataset("Lemon123prog/Qwen-2.5-7B-LongRePS")
88 | warmup_data=dataset['warmup'].to_list()
89 | orm_data=dataset['train_orm'].to_list()
90 | prm_data=dataset['train_prm'].to_list()
91 |
92 | with jsonlines.open(f"data/{model}_warmup.jsonl", 'w') as writer:
93 | writer.write_all(warmup_data)
94 |
95 | with jsonlines.open(f"data/{model}_orm.jsonl", 'w') as writer:
96 | writer.write_all(orm_data)
97 |
98 | with jsonlines.open(f"data/{model}_prm.jsonl", 'w') as writer:
99 | writer.write_all(prm_data)
100 | ```
101 |
102 | Or you can simply run [preprocess_train.py](preprocess_train.py), which writes the same `data/{model}_warmup.jsonl`, `data/{model}_orm.jsonl`, and `data/{model}_prm.jsonl` files:
103 | ```bash
104 | python preprocess_train.py
105 | ```
106 |
107 |
108 |
109 | ## 🖥️ How to Prepare Data for Evaluating
110 |
111 | ```bash
112 | bash scripts/preprocess_lb.sh
113 | ```
114 | Then you will obtain the processed evaluation data in the **dataset** directory.
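As a quick sanity check, you can count the processed examples. This is a minimal sketch: it assumes the processed files are JSON Lines under `dataset/`, so adjust the glob pattern if your output layout differs.

```python
import glob
import jsonlines

# Count examples per processed file under dataset/ (the *.jsonl pattern is an assumption).
for path in sorted(glob.glob("dataset/*.jsonl")):
    with jsonlines.open(path) as reader:
        print(path, sum(1 for _ in reader))
```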
115 |
116 |
117 |
118 | ## 🍧 Training
119 |
120 | ### Download base models
121 |
122 | ```python
123 | from huggingface_hub import snapshot_download
124 | from pathlib import Path
125 | repo_id ="Qwen/Qwen2.5-7B"
126 | root_dir = Path("Your own path for Qwen")
127 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model")
128 |
129 | repo_id ="meta-llama/Llama-3.1-8B"
130 | root_dir = Path("Your own path for Llama")
131 | snapshot_download(repo_id=repo_id,local_dir=root_dir/repo_id,repo_type="model")
132 | ```
133 |
134 | Set **Model_Path** in the scripts before training. Note that the snippet above downloads each model to `<root_dir>/<repo_id>` (for example `.../meta-llama/Llama-3.1-8B`), so point the scripts at those local directories.
135 |
136 | ### Warm Up Stage
137 |
138 | **Llama-3.1-8B**
139 | ```bash
140 | bash scripts/llama_warmup.sh
141 | ```
142 |
143 | **Qwen-2.5-7B**
144 | ```bash
145 | bash scripts/qwen_warmup.sh
146 | ```
147 |
148 | ### Sample Data and Fine-tune Models
149 |
150 | Set **Model-Name**, **Model-Path**, and **File-Name** in the scripts before sampling.
151 | ```bash
152 | cd evaltoolkits
153 | bash loop_sample.sh
154 | ```
155 |
156 | After the sampling process, you can use [filter_data.py](evaltoolkits/filter_data.py) to launch the filtering framework.
157 |
158 | ```bash
159 | cd evaltoolkits
160 | python filter_data.py \
161 | --path_to_src_file [Sampling Data] \
162 | --path_to_stage1_file [Output Data Path]
163 | ```
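For reference, the grounding check in the filtering framework comes down to exact text matching: every excerpt quoted in a sampled reasoning path must appear verbatim in the source document. Below is a minimal sketch of that idea; the actual logic lives in `utils.extract_fact_list` / `utils.verify_fact_list` and may differ in details such as the excerpt pattern.

```python
import re
from typing import List

def extract_excerpts(reasoning: str) -> List[str]:
    # Excerpts follow the prompt convention "[Excerpt N] `quoted text`".
    return re.findall(r"\[Excerpt \d+\]\s*`([^`]+)`", reasoning)

def is_grounded(reasoning: str, document: str) -> bool:
    # Every quoted excerpt must occur verbatim in the source document.
    return all(excerpt in document for excerpt in extract_excerpts(reasoning))
```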
164 |
165 | You can then modify [dataset_info.json](data/dataset_info.json) to register the newly added **filtered dataset** in the file list, as sketched below.
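For example, a new entry can mirror the column mapping used by the existing datasets. The dataset key and file name below are placeholders for your own filtered file:

```python
import json

# Placeholder key and file name; the column mapping follows the existing
# entries in data/dataset_info.json.
entry = {
    "Llama-3.1-8B_my_filtered_prm": {
        "file_name": "my_filtered_stage3.jsonl",
        "columns": {"prompt": "instruction", "response": "output", "system": "system"},
    }
}

with open("data/dataset_info.json", "r", encoding="utf-8") as f:
    info = json.load(f)
info.update(entry)
with open("data/dataset_info.json", "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2, ensure_ascii=False)
```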
166 |
167 | Finally, after setting the **warm-up model path** and **dataset_name** in the scripts, you can launch the fine-tuning process.
168 |
169 | **Llama-3.1-8B**
170 | ```bash
171 | bash scripts/llama_sft.sh
172 | ```
173 |
174 | **Qwen-2.5-7B**
175 | ```bash
176 | bash scripts/qwen_sft.sh
177 | ```
178 |
179 |
180 |
181 | ## 📊 Evaluation
182 |
183 | **LongBench v1**
184 | ```bash
185 | cd evaltoolkits
186 | bash launch_lbv1.sh
187 | ```
188 |
189 | **LongBench v2**
190 | ```bash
191 | cd evaltoolkits
192 | bash launch_lbv2.sh
193 | ```
194 |
195 | Note: Set **model_path** and **mode** to the desired target model.
196 |
197 |
198 |
199 | ## 📝 Citation
200 | ```
201 | @article{zhu2025chain,
202 | title={Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision},
203 | author={Zhu, Dawei and Wei, Xiyu and Zhao, Guangxiang and Wu, Wenhao and Zou, Haosheng and Ran, Junfeng and Wang, Xun and Sun, Lin and Zhang, Xiangzheng and Li, Sujian},
204 | journal={arXiv preprint arXiv:2502.20790},
205 | year={2025}
206 | }
207 | ```
208 | ## 📄 Acknowledgement
209 | We are deeply thankful for the following projects that serve as the foundation for LongRePS:
210 |
211 | * [**SEALONG**](https://github.com/SihengLi99/SEALONG)
212 | * [**LongBench**](https://github.com/THUDM/LongBench)
213 | * [**LLaMA-Factory**](https://github.com/hiyouga/LLaMA-Factory)
214 | * [**360-LLaMA-Factory**](https://github.com/Qihoo360/360-LLaMA-Factory)
215 |
216 |
--------------------------------------------------------------------------------
/cache/ds_z3_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": "auto",
3 | "train_micro_batch_size_per_gpu": "auto",
4 | "gradient_accumulation_steps": "auto",
5 | "gradient_clipping": "auto",
6 | "zero_allow_untested_optimizer": true,
7 | "fp16": {
8 | "enabled": "auto",
9 | "loss_scale": 0,
10 | "loss_scale_window": 1000,
11 | "initial_scale_power": 16,
12 | "hysteresis": 2,
13 | "min_loss_scale": 1
14 | },
15 | "bf16": {
16 | "enabled": "auto"
17 | },
18 | "zero_optimization": {
19 | "stage": 3,
20 | "overlap_comm": true,
21 | "contiguous_gradients": true,
22 | "sub_group_size": 1000000000.0,
23 | "reduce_bucket_size": "auto",
24 | "stage3_prefetch_bucket_size": "auto",
25 | "stage3_param_persistence_threshold": "auto",
26 | "stage3_max_live_parameters": 1000000000.0,
27 | "stage3_max_reuse_distance": 1000000000.0,
28 | "stage3_gather_16bit_weights_on_model_save": true
29 | }
30 | }
--------------------------------------------------------------------------------
/config/__pycache__/prompt.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lemon-prog123/LongRePS/76c6f49e9ab428fb235b32a4f087652642939a82/config/__pycache__/prompt.cpython-38.pyc
--------------------------------------------------------------------------------
/config/__pycache__/prompt.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lemon-prog123/LongRePS/76c6f49e9ab428fb235b32a4f087652642939a82/config/__pycache__/prompt.cpython-39.pyc
--------------------------------------------------------------------------------
/config/prompt.py:
--------------------------------------------------------------------------------
1 | prompt_lbv1_cot="""
2 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as a concise answer to the question:
3 |
4 | **Instructions:**
5 | Step 1. **Reasoning:** Imagine you are a student who has no prior knowledge about the given context. Your task is to answer the question based solely on the information presented here. First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information.
6 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. Ensure that your answer is brief and to the point.
7 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, as concise and to the point as possible.
8 |
9 | Illustrative Examples:
10 |
11 | Example #1:
12 |
13 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...]
14 | **Question:** What is Saltram's living situation?
15 |
16 | **Response:**
17 | {{
18 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.",
19 | "answer": "He is a guest in the home of the Mulvilles."
20 | }}
21 |
22 | Example #2:
23 |
24 | **Context:** [... The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas ... Houston Baptist University, affiliated with the Baptist General Convention of Texas, offers bachelor's and graduate degrees. It was founded in 1960 ...]
25 | **Question:** When was the institute that owned The Collegian founded?
26 |
27 | **Response:**
28 | {{
29 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about the founding date of the institute that owned The Collegian. In the document, I can first locate that [Excerpt 1] `The Collegian is the bi-weekly official student publication of Houston Baptist University in Houston, Texas`, so I need to look for information about Houston Baptist University. I find that [Excerpt 2] `Houston Baptist University was founded in 1960`. Therefore, the institute that owned The Collegian was founded in 1960.",
30 | "answer": "1960"
31 | }}
32 |
33 |
34 | Now, based on the context provided below, answer the question as concisely as you can, using a single phrase or sentence if possible. Meanwhile, reasoning must comply with the original text, and any knowledge should be derived from the original text.
35 |
36 | **Context:** {context}
37 | **Question:** {question}
38 |
39 | **Response:**
40 | """
41 |
42 | prompt_lbv1_nocot="""
43 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nContext:{context}\n\nNow, answer the question based on the context as concisely as you can, using a single phrase if possible. Do not provide any explanation and only give the best answer once.\n\nQuestion:{question}\n\nAnswer:"""
44 |
45 |
46 | prompt_lbv2_cot="""
47 | You are given a long document such as a story, meeting script, a news article, etc, and a question. Your task is to answer the question based on the information provided in the document. You should follow the instructions below to provide an accurate reasoning path, as well as an answer to the question chosen from the ABCD options:
48 |
49 | **Instructions:**
50 | Step 1. **Reasoning:** First retrieve all relevant information, then deduce the correct answer. Begin by carefully reading the provided context. Identify and extract all relevant information that is directly related to the question. Be succinct and only extract the most important excerpts that will help you answer the question. Finally, deduce the correct answer based on the retrieved information.
51 | Step 2. **Answer:** Using the information you have retrieved, and your deduction, answer the question as concisely as you can, using a single phrase or sentence if possible. Ensure that your answer is brief and to the point.
52 | Step 3. **Format Your Response:** Present your response in JSON format, comprising two components: "reasoning" and "answer". The "reasoning" section should detail your thought process, including the breakdown of the question, the relevant excerpts (indicated by [Excerpt xxx] at the start), and the derived conclusion. Ensure that each excerpt is an exact match to the original document. Limit the number of excerpts to a maximum of 10. The "answer" part should contain your final answer to the question, which is a choice selected from the ABCD options.
53 |
54 | Illustrative Examples:
55 |
56 | Example #1:
57 |
58 | **Context:** [... Saltram is living with the Mulvilles at Wimbledon ... He is not working or producing anything ... He is idle and dependent on others ...]
59 | **Question:** What is Saltram's living situation?
60 | **Choices:**
61 | (A) He is a guest in the home of the Mulvilles.
62 | (B) He is in a hotel.
63 | (C) He is homeless now.
64 | (D) Unknown
65 |
66 | **Response:**
67 | {{
68 | "reasoning": "Let me first retrieve relevant excerpts from the document, then answer the question. The question asks about Saltram's living situation. In the document, I can first locate that [Excerpt 1] `Saltram is living with the Mulvilles at Wimbledon`. Additionally, it is mentioned that [Excerpt 2] `He is not working or producing anything` and [Excerpt 3] `He is idle and dependent on others`. From these excerpts, I can deduce that Saltram is a guest in the home of the Mulvilles.",
69 | "answer": "A"
70 | }}
71 |
72 | Now, based on the context provided below, answer the question with a choice selected from ABCD.
73 |
74 | **Context:** {context}
75 | **Question:** {question}
76 | **Choices:**
77 | (A) {choice_A}
78 | (B) {choice_B}
79 | (C) {choice_C}
80 | (D) {choice_D}
81 |
82 | **Response:**
83 | """
84 |
85 | prompt_lbv2_nocot="""
86 | Please read the following text and answer the question below.
87 |
88 | {context}
89 |
90 | What is the correct answer to this question: {question}
91 | Choices:
92 | (A) {choice_A}
93 | (B) {choice_B}
94 | (C) {choice_C}
95 | (D) {choice_D}
96 |
97 | Format your response as follows: "The correct answer is (insert answer here)".
98 | """
99 |
100 | prompt_cot="""Given a document and a question, answer concisely using a single phrase and provide a brief reasoning process. \n\nContext:{context}\n\n Now, answer the question based on the context as concisely as you can and give a reasoning path, using a single phrase if possible. \n\nQuestion:{question}\n\n Format your response as:
101 | Answer: []
102 | Reasoning: []
103 | Ensure both sections are separated clearly for easy extraction."""
--------------------------------------------------------------------------------
/data/dataset_info.json:
--------------------------------------------------------------------------------
1 | {
2 | "Llama-3.1-8B_warmup": {
3 | "file_name": "Llama-3.1-8B_warmup.jsonl",
4 | "columns": {
5 | "prompt": "instruction",
6 | "response": "output",
7 | "system": "system"
8 | }
9 | },
10 | "Llama-3.1-8B_orm": {
11 | "file_name": "Llama-3.1-8B_orm.jsonl",
12 | "columns": {
13 | "prompt": "instruction",
14 | "response": "output",
15 | "system": "system"
16 | }
17 | },
18 | "Llama-3.1-8B_sample30_thresh1.0_prm": {
19 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_llamastage2.jsonl",
20 | "columns": {
21 | "prompt": "instruction",
22 | "response": "output",
23 | "system": "system"
24 | }
25 | },
26 | "Llama-3.1-8B_sample30_thresh1.0_checkstage3_prm": {
27 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_factcheck_stage3.jsonl",
28 | "columns": {
29 | "prompt": "instruction",
30 | "response": "output",
31 | "system": "system"
32 | }
33 | },
34 | "Qwen-2.5-7B_warmup": {
35 | "file_name": "Qwen-2.5-7B_warmup.jsonl",
36 | "columns": {
37 | "prompt": "instruction",
38 | "response": "output",
39 | "system": "system"
40 | }
41 | },
42 | "Qwen-2.5-7B_orm": {
43 | "file_name": "Qwen-2.5-7B_orm.jsonl",
44 | "columns": {
45 | "prompt": "instruction",
46 | "response": "output",
47 | "system": "system"
48 | }
49 | }
50 | ,
51 | "Qwen-2.5-7B_prm": {
52 | "file_name": "Qwen-2.5-7B_prm.jsonl",
53 | "columns": {
54 | "prompt": "instruction",
55 | "response": "output",
56 | "system": "system"
57 | }
58 | },
59 | "Qwen-2.5-7B_sample30_thresh1.0_prm": {
60 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_qwenstage2.jsonl",
61 | "columns": {
62 | "prompt": "instruction",
63 | "response": "output",
64 | "system": "system"
65 | }
66 | },
67 | "Qwen-2.5-7B_sample30_thresh1.0_checkstage3_prm": {
68 | "file_name": "musique-qwen2.5_sample30temp0.7thresh1.0_factcheck_stage3.jsonl",
69 | "columns": {
70 | "prompt": "instruction",
71 | "response": "output",
72 | "system": "system"
73 | }
74 | },
75 | "Qwen-2.5-7B_sample30_thresh1.0_yarn_prm": {
76 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_yarn_qwenstage2.jsonl",
77 | "columns": {
78 | "prompt": "instruction",
79 | "response": "output",
80 | "system": "system"
81 | }
82 | },
83 | "Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm": {
84 | "file_name": "musique-qwen2.5_sample100temp0.7thresh1.0_yarn_factcheck_stage3.jsonl",
85 | "columns": {
86 | "prompt": "instruction",
87 | "response": "output",
88 | "system": "system"
89 | }
90 | },
91 | "Llama-3.1-8B_sample30_thresh1.0_v2_checkstage3_prm": {
92 | "file_name": "musique-llama3.1_sample30temp0.7thresh1.0_v2_factcheck_stage3.jsonl",
93 | "columns": {
94 | "prompt": "instruction",
95 | "response": "output",
96 | "system": "system"
97 | }
98 | },
99 | "Llama-3.1-8B_sample30_thresh1.0_v2_prm": {
100 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_llamastage2.jsonl",
101 | "columns": {
102 | "prompt": "instruction",
103 | "response": "output",
104 | "system": "system"
105 | }
106 | },
107 | "Qwen-2.5-14B_warmup": {
108 | "file_name": "musique_train_warmup_qwen14b.jsonl",
109 | "columns": {
110 | "prompt": "instruction",
111 | "response": "output",
112 | "system": "system"
113 | }
114 | },
115 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_prm": {
116 | "file_name": "musique_train_temp0.7_sample30_thresh1.0_v2_qwen14bstage2.jsonl",
117 | "columns": {
118 | "prompt": "instruction",
119 | "response": "output",
120 | "system": "system"
121 | }
122 | },
123 | "Qwen-2.5-14B_sample30_temp0.7_thresh1.0_checkstage3_prm": {
124 | "file_name": "musique-qwen14b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl",
125 | "columns": {
126 | "prompt": "instruction",
127 | "response": "output",
128 | "system": "system"
129 | }
130 | },
131 | "Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm": {
132 | "file_name": "musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage3.jsonl",
133 | "columns": {
134 | "prompt": "instruction",
135 | "response": "output",
136 | "system": "system"
137 | }
138 | }
139 | }
--------------------------------------------------------------------------------
/evaltoolkits/cache/ds_z2_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": "auto",
3 | "train_micro_batch_size_per_gpu": "auto",
4 | "gradient_accumulation_steps": "auto",
5 | "gradient_clipping": "auto",
6 | "zero_allow_untested_optimizer": true,
7 | "fp16": {
8 | "enabled": "auto",
9 | "loss_scale": 0,
10 | "loss_scale_window": 1000,
11 | "initial_scale_power": 16,
12 | "hysteresis": 2,
13 | "min_loss_scale": 1
14 | },
15 | "bf16": {
16 | "enabled": "auto"
17 | },
18 | "zero_optimization": {
19 | "stage": 2,
20 | "allgather_partitions": true,
21 | "allgather_bucket_size": 500000000.0,
22 | "overlap_comm": true,
23 | "reduce_scatter": true,
24 | "reduce_bucket_size": 500000000.0,
25 | "contiguous_gradients": true,
26 | "round_robin_gradients": true
27 | }
28 | }
--------------------------------------------------------------------------------
/evaltoolkits/cache/ds_z2_offload_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": "auto",
3 | "train_micro_batch_size_per_gpu": "auto",
4 | "gradient_accumulation_steps": "auto",
5 | "gradient_clipping": "auto",
6 | "zero_allow_untested_optimizer": true,
7 | "fp16": {
8 | "enabled": "auto",
9 | "loss_scale": 0,
10 | "loss_scale_window": 1000,
11 | "initial_scale_power": 16,
12 | "hysteresis": 2,
13 | "min_loss_scale": 1
14 | },
15 | "bf16": {
16 | "enabled": "auto"
17 | },
18 | "zero_optimization": {
19 | "stage": 2,
20 | "allgather_partitions": true,
21 | "allgather_bucket_size": 500000000.0,
22 | "overlap_comm": true,
23 | "reduce_scatter": true,
24 | "reduce_bucket_size": 500000000.0,
25 | "contiguous_gradients": true,
26 | "round_robin_gradients": true,
27 | "offload_optimizer": {
28 | "device": "cpu",
29 | "pin_memory": true
30 | }
31 | }
32 | }
--------------------------------------------------------------------------------
/evaltoolkits/cache/ds_z3_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": "auto",
3 | "train_micro_batch_size_per_gpu": "auto",
4 | "gradient_accumulation_steps": "auto",
5 | "gradient_clipping": "auto",
6 | "zero_allow_untested_optimizer": true,
7 | "fp16": {
8 | "enabled": "auto",
9 | "loss_scale": 0,
10 | "loss_scale_window": 1000,
11 | "initial_scale_power": 16,
12 | "hysteresis": 2,
13 | "min_loss_scale": 1
14 | },
15 | "bf16": {
16 | "enabled": "auto"
17 | },
18 | "zero_optimization": {
19 | "stage": 3,
20 | "overlap_comm": true,
21 | "contiguous_gradients": true,
22 | "sub_group_size": 1000000000.0,
23 | "reduce_bucket_size": "auto",
24 | "stage3_prefetch_bucket_size": "auto",
25 | "stage3_param_persistence_threshold": "auto",
26 | "stage3_max_live_parameters": 1000000000.0,
27 | "stage3_max_reuse_distance": 1000000000.0,
28 | "stage3_gather_16bit_weights_on_model_save": true
29 | }
30 | }
--------------------------------------------------------------------------------
/evaltoolkits/cache/ds_z3_offload_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": "auto",
3 | "train_micro_batch_size_per_gpu": "auto",
4 | "gradient_accumulation_steps": "auto",
5 | "gradient_clipping": "auto",
6 | "zero_allow_untested_optimizer": true,
7 | "fp16": {
8 | "enabled": "auto",
9 | "loss_scale": 0,
10 | "loss_scale_window": 1000,
11 | "initial_scale_power": 16,
12 | "hysteresis": 2,
13 | "min_loss_scale": 1
14 | },
15 | "bf16": {
16 | "enabled": "auto"
17 | },
18 | "zero_optimization": {
19 | "stage": 3,
20 | "overlap_comm": true,
21 | "contiguous_gradients": true,
22 | "sub_group_size": 1000000000.0,
23 | "reduce_bucket_size": "auto",
24 | "stage3_prefetch_bucket_size": "auto",
25 | "stage3_param_persistence_threshold": "auto",
26 | "stage3_max_live_parameters": 1000000000.0,
27 | "stage3_max_reuse_distance": 1000000000.0,
28 | "stage3_gather_16bit_weights_on_model_save": true,
29 | "offload_optimizer": {
30 | "device": "cpu",
31 | "pin_memory": true
32 | },
33 | "offload_param": {
34 | "device": "cpu",
35 | "pin_memory": true
36 | }
37 | }
38 | }
--------------------------------------------------------------------------------
/evaltoolkits/cache/user_config.yaml:
--------------------------------------------------------------------------------
1 | cache_dir: null
2 | lang: zh
3 | last_model: Qwen2.5-14B
4 | path_dict:
5 | Qwen2.5-14B: /mnt/xiyu/Model/Qwen/Qwen2.5-14B
6 |
--------------------------------------------------------------------------------
/evaltoolkits/filter.sh:
--------------------------------------------------------------------------------
1 | python /mnt/xiyu/LongRePS/evaltoolkits/filter_data.py \
2 | --path_to_src_file /mnt/xiyu/musique-Qwen-2.5-32B.temp0.7sample30.cot_eval.jsonl \
3 | --path_to_stage1_file /mnt/xiyu/LongRePS/dataset/musique-qwen32b_sample30temp0.7thresh1.0_factcheck_stage1.jsonl \
4 | --sample_num 30
5 |
--------------------------------------------------------------------------------
/evaltoolkits/filter_data.py:
--------------------------------------------------------------------------------
1 | import json
2 | import jsonlines
3 | import json_repair
4 | import pandas as pd
5 | import numpy as np
6 | import torch
7 | import shutil
8 | import glob
9 | import re
10 | import string
11 | import time
12 | import os
13 | import argparse
14 | from typing import List, Tuple
15 | from tqdm import tqdm
16 | from pathlib import Path
17 | from json_repair import repair_json
18 | from transformers import AutoTokenizer, AutoModelForCausalLM
19 | from datasets import Dataset
20 | from openai import OpenAI
21 |
22 | from utils import normalize_answer, preprocess_pred_for_json_repair, extract_fact_list, verify_fact_list, extract_number
23 | def parse_args(args=None):
24 | parser = argparse.ArgumentParser()
25 | parser.add_argument('--path_to_src_file', type=str, default=None)
26 | parser.add_argument('--path_to_stage1_file', type=str, default=None)
27 | parser.add_argument('--sample_num', type=int, default=30)
28 | return parser.parse_args(args)
29 | # (Translated) We need your help evaluating the quality of the model's reasoning process. The model receives a long text and a complex question; its task is to retrieve the relevant information from the text (introduced by [Excerpt xxx] and enclosed in ``) and give the correct answer. The question and the model's reasoning process are given above, and the final answer is correct, but we need you to judge whether the reasoning process is sound, based on: - Logic: the decomposition of the question should be reasonable, the retrieved information should be used logically, and the chain from retrieved information to answer should be sound. - Completeness: the reasoning should mainly use information retrieved from the text (the content after [Excerpt xxx]) rather than the model's own knowledge. - Conciseness: only information relevant to the question should be retrieved, without listing excessive irrelevant content.
30 |
31 | EVAL_PROMPT = '''[Question]
32 | {question}
33 |
34 | [The Start of Assistant's Reasoning Path]
35 | {reasoning}
36 | [The End of Assistant's Reasoning Path]
37 |
38 | [System]
39 | We would like to request your feedback on the quality of the reasoning process in the given response.
40 | The model receives a long text input and a complex question. Its task is to retrieve relevant information from the long text (marked as [Excerpt xxx] and enclosed in ``) based on the question's requirements and provide the correct answer. Above, we have provided both the question and the model's reasoning process. While the model's final answer is correct, we need you to evaluate whether its reasoning process is sound.
41 |
42 | Please assess the model's reasoning process based on the following aspects:
43 |
44 | 1. Logical Coherence:
45 | - The model should break down the question appropriately
46 | - The use of retrieved information should follow logical patterns
47 | - The chain of reasoning from retrieved information to the final answer should be sound
48 |
49 | 2. Completeness:
50 | - The reasoning process should primarily rely on information retrieved from the text ([Excerpts xxx])
51 | - The model should not heavily depend on its own knowledge base
52 |
53 | 3. Conciseness:
54 | - Only information relevant to answering the question should be retrieved
55 | - The model should avoid listing excessive or irrelevant information
56 |
57 | Please rate whether this reasoning path is suitable for the question. The assistant receives an overall score on a scale of 1 to 100, where a higher score indicates better overall performance.
58 | Please note that if the assistant's reasoning process fully meets the above criteria, its overall rating should be full marks (100).
59 | Please first provide a comprehensive explanation of your evaluation, avoiding any potential bias.
60 | Then, output a line indicating the score of the Assistant.
61 |
62 | PLEASE OUTPUT WITH THE FOLLOWING FORMAT, WHERE THE SCORE IS ON A SCALE OF 1 TO 100 BY STRICTLY FOLLOWING THIS FORMAT: "[[score]]", FOR EXAMPLE "Rating: [[100]]":
63 |
64 | Evaluation evidence: your evaluation explanation here, no more than 100 words
65 | Rating: [[score]]
66 |
67 |
68 | Now, start your evaluation:'''
69 |
70 | def process_single_data(example):
71 | """处理单条数据的函数"""
72 | checked_data = {**example}
73 | checked_data['new_pred'] = []
74 | checked_data['new_f1_score_list'] = []
75 | checked_data['new_extracted_pred_list'] = []
76 | fact_valid_flag = 0
77 | json_valid_flag = 0
78 | total_num=len(example['pred'])
79 | save_num=0
80 | for pred_idx, pred in enumerate(example['pred']):
81 | try:
82 | pred = preprocess_pred_for_json_repair(pred)
83 | content = json_repair.loads(pred)
84 | if len(content) >1:
85 | content=content[0]
86 | pred=json.dumps(content)
87 | if not isinstance(content, dict) or not content or 'reasoning' not in content or len(content) > 2 or type(content['reasoning']) != str:
88 | continue
89 | else:
90 | json_valid_flag = 1
91 |
92 | fact_list = extract_fact_list(content['reasoning'])
93 |             if len(fact_list) > 0:  # only accept reasoning paths that cite at least one excerpt
94 | if verify_fact_list(fact_list, example['instruction']):
95 | save_num+=1
96 | fact_valid_flag = 1
97 | checked_data['new_pred'].append(pred)
98 | checked_data['new_f1_score_list'].append(example['f1_score_list'][pred_idx])
99 | checked_data['new_extracted_pred_list'].append(example['extracted_pred_list'][pred_idx])
100 | except:
101 | continue
102 |
103 | checked_data['fact_valid'] = fact_valid_flag
104 | checked_data['json_valid'] = json_valid_flag
105 | checked_data['save_rate']=save_num/total_num
106 | return checked_data
107 |
108 | def filter_stage_1(path_to_src_file: str, path_to_stage1_file: str, f1_score_thresh: float = 1.0, sample: int =10):
109 |
110 | with jsonlines.open(path_to_src_file) as fin:
111 | data_list = list(fin)
112 | print("Sample Number ",sample)
113 | for data in data_list:
114 | data['pred']=data['pred'][:sample]
115 | data['f1_score_list']=data['f1_score_list'][:sample]
116 | data['extracted_pred_list']=data['extracted_pred_list'][:sample]
117 |
118 | dataset = Dataset.from_list(data_list)
119 |
120 |     # Process the data in parallel
121 | processed_dataset = dataset.map(
122 | process_single_data,
123 | num_proc=16,
124 | desc="Processing data"
125 | )
126 |
127 |     # Collect statistics
128 | no_valid_fact_cnt = sum(1 for x in processed_dataset if not x['fact_valid'])
129 | no_valid_json_cnt = sum(1 for x in processed_dataset if not x['json_valid'])
130 | avg_save_rate = np.mean([x['save_rate'] for x in processed_dataset])
131 | processed_dataset = processed_dataset.filter(lambda x: len(x['new_pred']) > 0)
132 |
133 | print(f"Avg Save Rate: {avg_save_rate}")
134 | print(f"No valid fact count: {no_valid_fact_cnt}")
135 | print(f"No valid JSON count: {no_valid_json_cnt}")
136 | print(f"Checked data count: {len(processed_dataset)}")
137 |
138 |     # Build the final result list
139 | result_data_list = []
140 |
141 | for item in processed_dataset:
142 | if not item['new_f1_score_list']:
143 | continue
144 | max_value = max(item['new_f1_score_list'])
145 | # max_index = item['new_f1_score_list'].index(max_value)
146 | # score = item['new_f1_score_list'][max_index]
147 |
148 | if max_value >= f1_score_thresh:
149 |
150 | remaining_idx_list = [idx for idx, score in enumerate(item['new_f1_score_list']) if score >= f1_score_thresh][:10]
151 | remaining_pred_list = [item['new_pred'][idx] for idx in remaining_idx_list][:10]
152 | remaining_f1_score_list = [item['new_f1_score_list'][idx] for idx in remaining_idx_list][:10]
153 | remaining_extracted_pred_list = [item['new_extracted_pred_list'][idx] for idx in remaining_idx_list][:10]
154 | output = remaining_pred_list[0]
155 | st_output = item['answers']
156 | f1_score = remaining_f1_score_list[0]
157 |
158 | result_data_list.append({
159 | "instruction": item['instruction'],
160 | "filtered_pred": remaining_pred_list,
161 | "filtered_f1_score_list": remaining_f1_score_list,
162 | "filtered_extracted_pred_list": remaining_extracted_pred_list,
163 | "answers": item['answers'],
164 | "output": output,
165 | "st_output": st_output,
166 | "f1_score": f1_score,
167 | "system": "You are a helpful assistant.",
168 | "id": item['id'],
169 | })
170 |
171 | print(f"selected_cnt: {len(result_data_list)}")
172 |
173 |     # Save the results
174 | with jsonlines.open(path_to_stage1_file, 'w') as writer:
175 | writer.write_all(result_data_list)
176 |
177 | return
178 |
179 |
180 | def get_score_evidence(question:str, pred:str,ground_truth:str) -> Tuple[float, str]:
181 | prompt = EVAL_PROMPT.format(question=question, reasoning=pred)
182 | max_retries = 5
183 | API_KEY = os.getenv("OPENAI_API_KEY")
184 | client = OpenAI(api_key=API_KEY,base_url="")
185 | for _ in range(max_retries):
186 | content = ""
187 | try:
188 | response = client.chat.completions.create(
189 | messages=[{"role": "user","content": prompt}],
190 | model="gpt-4o-mini",
191 | temperature=0,
192 | max_tokens=512,
193 | )
194 | content = response.choices[0].message.content
195 | if content == None:
196 | continue
197 | evidence = content
198 | rating = extract_number(content)
199 | return (rating, evidence)
200 | except Exception as e:
201 | content = "" if content == None else content
202 | print(e, content)
203 | time.sleep(30)
204 | return (0.0, content)
205 |
206 | return (0.0, "")
207 |
208 |
209 | def get_score_evidence_list(question:str, predictions:List[str], answer:str) -> Tuple[List[float], List[str]]:
210 | score_list: List[float] = []
211 | evidence_list: List[str] = []
212 | for pred in predictions:
213 | # if there are more than 3 "100" in score_list, break
214 | if score_list.count(100) >= 3:
215 | break
216 | score, evidence = get_score_evidence(question=question, pred=pred, ground_truth=answer)
217 | score_list.append(float(score))
218 | evidence_list.append(evidence)
219 | return score_list, evidence_list
220 |
221 |
222 | def get_llm_score_for_single_data(data):
223 | answer = data["answers"][0]
224 | question = data["instruction"].split("Question:")[-1].replace("\n\nAnswer:","").replace("*","").strip().replace("\n\nResponse:","").strip()
225 | reasoning_list = [json_repair.loads(pred)['reasoning'] for pred in data["filtered_pred"]]
226 | score_list, evidence_list = get_score_evidence_list(question, reasoning_list, answer)
227 |
228 | data["llm_score_list"] = score_list
229 | data["llm_evidence_list"] = evidence_list
230 | return data
231 |
232 |
233 | def filter_stage_2(path_to_stage1_file: str, path_to_stage2_file: str, path_to_stage3_file: str, llm_score_thresh: float = 100):
234 |
235 | with jsonlines.open(path_to_stage1_file) as fin:
236 | data_list = list(fin)
237 | dataset = Dataset.from_list(data_list)
238 |
239 | preprocessed_dataset = dataset.map(
240 | get_llm_score_for_single_data,
241 | num_proc=8,
242 | desc="Processing data"
243 | )
244 | preprocessed_dataset = preprocessed_dataset.to_list()
245 | score_list=[]
246 | for data in preprocessed_dataset:
247 | score_list.append(np.mean(data['llm_score_list']))
248 | max_score = max(data["llm_score_list"])
249 | max_index = data["llm_score_list"].index(max_score)
250 | data["output"] = data["filtered_pred"][max_index]
251 | data["st_output"] = data["answers"]
252 | data["llm_score"] = max_score
253 | print(f'Avg Score in Stage2 is {np.mean(score_list)}')
254 |
255 | with jsonlines.open(path_to_stage2_file, 'w') as writer:
256 | writer.write_all(preprocessed_dataset)
257 |
258 |
259 | stage3_data_list = [data for data in preprocessed_dataset if data['llm_score'] >= llm_score_thresh]
260 | print(f"Stage3 selected_cnt: {len(stage3_data_list)}, avg score in stage3 is {np.mean([data['llm_score'] for data in stage3_data_list])}")
261 |
262 | with jsonlines.open(path_to_stage3_file, 'w') as writer:
263 | writer.write_all(stage3_data_list)
264 |
265 | return
266 |
267 | def filter_stage_3(path_to_stage2_file: str, path_to_stage3_file: str, llm_score_thresh: float = 100):
268 | with jsonlines.open(path_to_stage2_file) as fin:
269 | preprocessed_dataset= list(fin)
270 |
271 | stage3_data_list = [data for data in preprocessed_dataset if data['llm_score'] >= llm_score_thresh]
272 | print(f"Stage3 selected_cnt: {len(stage3_data_list)}, avg score in stage3 is {np.mean([data['llm_score'] for data in stage3_data_list])}")
273 |
274 | with jsonlines.open(path_to_stage3_file, 'w') as writer:
275 | writer.write_all(stage3_data_list)
276 |
277 | return
278 |
279 | args = parse_args()
280 | path_to_src_file = args.path_to_src_file
281 | path_to_stage1_file = args.path_to_stage1_file
282 | sample=args.sample_num
283 | path_to_stage2_file = path_to_stage1_file.replace("stage1", "stage2")
284 | path_to_stage3_file = path_to_stage1_file.replace("stage1", "stage3")
285 | f1_score_thresh = 1.0
286 | llm_score_thresh = 0
287 |
288 | assert "thresh" + str(f1_score_thresh) in path_to_stage1_file, "f1_score_thresh is not consistent with the one used in stage1"
289 |
290 | filter_stage_1(path_to_src_file, path_to_stage1_file, f1_score_thresh,sample)
291 | filter_stage_2(path_to_stage1_file, path_to_stage2_file, path_to_stage3_file, llm_score_thresh)
292 |
--------------------------------------------------------------------------------
/evaltoolkits/inference.out:
--------------------------------------------------------------------------------
1 | 2007527
2 | Wold Size 8
3 | Failed loading 0 lines, total 3000 lines loaded
4 |
[~120 interleaved tqdm progress lines from 8 worker processes (batches 0-15 of 375, roughly 4-60 s/it each) omitted]
12 | Process Process-8:
13 | Process Process-7:
14 | Process Process-3:
15 | Process Process-2:
16 | Process Process-1:
17 | Process Process-5:
18 | Process Process-6:
19 | Process Process-4:
20 | Traceback (most recent call last):
21 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 196, in
22 | p.join()
23 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 149, in join
24 | res = self._popen.wait(timeout)
25 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
26 | return self.poll(os.WNOHANG if timeout == 0.0 else 0)
27 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
28 | pid, sts = os.waitpid(self.pid, flag)
29 | KeyboardInterrupt
30 | Traceback (most recent call last):
31 | Traceback (most recent call last):
32 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
33 | self.run()
34 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
35 | self.run()
36 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
37 | self._target(*self._args, **self._kwargs)
38 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
39 | self._target(*self._args, **self._kwargs)
40 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
41 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
42 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
43 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
44 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
45 | response=client.completions.create(
46 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
47 | response=client.completions.create(
48 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper
49 | return func(*args, **kwargs)
50 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper
51 | return func(*args, **kwargs)
52 | Traceback (most recent call last):
53 | Traceback (most recent call last):
54 | Traceback (most recent call last):
55 | Traceback (most recent call last):
56 | Traceback (most recent call last):
57 | Traceback (most recent call last):
58 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/resources/completions.py", line 539, in create
59 | return self._post(
60 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/resources/completions.py", line 539, in create
61 | return self._post(
62 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
63 | self.run()
64 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
65 | self.run()
66 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 1242, in post
67 | return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
68 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
69 | self.run()
70 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 1242, in post
71 | return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
72 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
73 | self._target(*self._args, **self._kwargs)
74 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
75 | self.run()
76 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
77 | self._target(*self._args, **self._kwargs)
78 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 919, in request
79 | return self._request(
80 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
81 | self._target(*self._args, **self._kwargs)
82 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
83 | self.run()
84 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 919, in request
85 | return self._request(
86 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
87 | self._target(*self._args, **self._kwargs)
88 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
89 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
90 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
91 | self.run()
92 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
93 | self._target(*self._args, **self._kwargs)
94 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
95 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
96 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 955, in _request
97 | response = self._client.send(
98 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
99 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
100 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_base_client.py", line 955, in _request
101 | response = self._client.send(
102 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
103 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
104 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
105 | response=client.completions.create(
106 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
107 | self._target(*self._args, **self._kwargs)
108 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 109, in get_pred_from_vllm
109 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
110 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
111 | response=client.completions.create(
112 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/httpx/_client.py", line 914, in send
113 | response = self._send_handling_auth(
114 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
115 | response=client.completions.create(
116 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/httpx/_client.py", line 914, in send
117 | response = self._send_handling_auth(
118 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 60, in get_api_results
119 | response=client.completions.create(
120 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper
121 | return func(*args, **kwargs)
    | [remainder of log truncated: interleaved openai/httpx/httpcore stack frames from the parallel inference workers (get_pred_from_vllm -> get_api_results -> client.completions.create -> httpx request handling -> self._sock.recv), each terminated by KeyboardInterrupt, followed by repeated "Connection error." messages]
--------------------------------------------------------------------------------
/evaltoolkits/inference2.out:
--------------------------------------------------------------------------------
1 | 2058724
2 | Wold Size 8
3 | Failed loading 0 lines, total 200 lines loaded
4 |
    | [tqdm progress bars from eight parallel workers, each advancing toward 25/25 over roughly 70-85 seconds]
5 | Process Process-2:
6 | Traceback (most recent call last):
7 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
8 | self.run()
9 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/multiprocessing/process.py", line 108, in run
10 | self._target(*self._args, **self._kwargs)
11 | File "/mnt/xiyu/LongRePS/evaltoolkits/step1_eval_inference.py", line 167, in get_pred_from_vllm
12 | dist.destroy_process_group()
13 | File "/home/azureuser/miniconda3/envs/longreps/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1721, in destroy_process_group
14 | assert pg is not None
15 | AssertionError
16 |
    | [remaining progress bars complete at 25/25; the same AssertionError traceback in dist.destroy_process_group repeats for Process-1 and Process-3 through Process-8]
--------------------------------------------------------------------------------
/evaltoolkits/launch_inference.sh:
--------------------------------------------------------------------------------
1 | # if model_name, model_path, eval_data_dir, file_name, sample_num, thresh, temperature, filtered_filename, are passed as arguments, then parse these arguments
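   | # example invocation with the three-argument form (values taken from loop_sample.sh): bash launch_inference.sh Qwen-7B-Instruct-yarn-example /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct musique-Qwen-2.5-7B_prm_train.jsonl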
2 | if [[ $# -eq 8 ]]; then
3 | model_name=$1
4 | model_path=$2
5 | eval_data_dir=$3
6 | file_name=$4
7 | sample_num=$5
8 | thresh=$6
9 | temperature=$7
10 | filtered_filename=$8
11 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation)
12 | else
13 | model_name=$1
14 | model_path=$2
15 | file_name=$3
16 | eval_data_dir="../dataset"
17 | mode="predicted_answer"
18 | inference_mode=$(echo "$file_name" | grep -q "train" && echo "train" || echo "eval") # train (for sample data), eval (for evaluation)
19 | if [[ $inference_mode == "train" ]]; then
20 | sample_num=1
21 | thresh=1.0
22 | temperature=0.7
23 | filtered_filename="${model_name}_sample${sample_num}temp${temperature}thresh${thresh}.jsonl"
24 | elif [[ $inference_mode == "eval" ]]; then
25 | sample_num=1
26 | temperature=0.0
27 | fi
28 |
29 | fi
30 |
31 |
32 |
33 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1)
34 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and ||
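   | # e.g. file_name="musique_nocot.jsonl" -> cot_mode="nocot"; file_name="musique_cot.jsonl" -> cot_mode="cot"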
35 | result_dir="./pred_${inference_mode}"
36 | output_dir="${result_dir}/${model_name}"
37 | mkdir -p ${output_dir}
38 | echo -e "\nScript executed with parameters: $@" >> ${output_dir}/new_launch_inference.sh
39 |
40 | result_dir="${result_dir%/}" # remove the trailing slash if there is any
41 | eval_data_dir="${eval_data_dir%/}"
42 |
43 | echo "Launching inference for ${model_name}..."
44 | echo "Model path: ${model_path}"
45 | echo "Eval data dir: ${eval_data_dir}"
46 | echo "File name: ${file_name}"
47 | echo "Inference mode: ${inference_mode}"
48 | echo "Sample num: ${sample_num}"
49 | echo "Result dir: ${result_dir}"
50 | echo "Temperature: ${temperature}"
51 | echo "COT mode: ${cot_mode}"
52 | echo "Dataset: ${eval_dataset_name}"
53 | echo "Filtered filename: ${filtered_filename}"
54 | echo "Thresh: ${thresh}"
55 |
56 | #cp Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Model
57 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Model
58 |
59 | for gpu_id in 0 1 2 3 4 5 6 7; do
60 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \
61 | --served-model-name ${model_name} \
62 | --model ${model_path} \
63 | --tensor-parallel-size=1 \
64 | --trust-remote-code \
65 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 &
66 | done
67 |
68 | sleep 30 # sleep 30s, wait for the servers to start
69 |
70 | echo "Evaluating ${eval_dataset_name}..."
71 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl"
72 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl"
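   | # e.g. musique.temp0.0sample1.cot.jsonl -> musique.temp0.0sample1.cot_eval.jsonl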
73 |
74 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \
75 | --model ${model_name} \
76 | --model_path ${model_path} \
77 | --data_path ${eval_data_dir}/${file_name} \
78 | --output_path ${path_to_inference_output} \
79 | --sample_num ${sample_num} \
80 | --dataset_name ${eval_dataset_name} \
81 | --temperature ${temperature} \
82 | > ./inference.out
83 |
84 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output}
85 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result}
86 |
87 | pkill -f vllm; pkill -f spawn_main
88 |
--------------------------------------------------------------------------------
/evaltoolkits/launch_lbv1.sh:
--------------------------------------------------------------------------------
1 |
2 | model_name="Qwen2.5-7B-Instruct"
3 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct"
4 | #model_config="${model_root}/tokenizer*"
5 | #model_path="${model_root}/checkpoint-58"
6 | mode="cot"
7 |
8 | domain_list=("all")
9 | eval_data_dir="../dataset/longbenchv1"
10 | sample_num=1
11 | temperature=0.0
12 |
13 |
14 | #cp ${model_config} ${model_path}
15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models
16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path}
17 |
18 |
19 | for gpu_id in 0 1 2 3 4 5 6 7; do
20 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \
21 | --served-model-name ${model_name} \
22 | --model ${model_path} \
23 | --tensor-parallel-size=1 \
24 | --swap-space 32\
25 | --trust-remote-code \
26 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 &
27 | done
28 | sleep 30 # sleep 30s, wait for the servers to start
29 |
30 |
31 | for domain in "${domain_list[@]}"; do
32 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl"
33 | for file_name in "${file_name_list[@]}"; do
34 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1)
35 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and ||
36 | result_dir="./pred_cot_vs_nocot"
37 | output_dir="${result_dir}/${model_name}"
38 | mkdir -p ${output_dir}
39 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh
40 |
41 | result_dir="${result_dir%/}" # remove the trailing slash if there is any
42 | eval_data_dir="${eval_data_dir%/}"
43 |
44 | echo "Launching inference for ${model_name}..."
45 | echo "Model path: ${model_path}"
46 | echo "Eval data dir: ${eval_data_dir}"
47 | echo "File name: ${file_name}"
48 | echo "Sample num: ${sample_num}"
49 | echo "Result dir: ${result_dir}"
50 | echo "Temperature: ${temperature}"
51 | echo "COT mode: ${cot_mode}"
52 | echo "Dataset: ${eval_dataset_name}"
53 |
54 | echo "Evaluating ${eval_dataset_name}..."
55 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl"
56 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl"
57 |
58 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \
59 | --model ${model_name} \
60 | --model_path ${model_path} \
61 | --data_path ${eval_data_dir}/${file_name} \
62 | --output_path ${path_to_inference_output} \
63 | --sample_num ${sample_num} \
64 | --dataset_name ${eval_dataset_name} \
65 | --temperature ${temperature} \
66 | > ./inference2.out
67 |
68 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output}
69 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result}
70 |
71 | done
72 | done
73 | pkill -f vllm; pkill -f spawn_main
74 |
--------------------------------------------------------------------------------
/evaltoolkits/launch_lbv1big.sh:
--------------------------------------------------------------------------------
1 |
2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2"
3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2"
4 | #model_config="${model_root}/tokenizer*"
5 | #model_path="${model_root}/checkpoint-58"
6 | mode="cot"
7 |
8 | domain_list=("all")
9 | eval_data_dir="../dataset/longbenchv1"
10 | sample_num=1
11 | temperature=0.0
12 |
13 |
14 | #cp ${model_config} ${model_path}
15 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path} #for Qwen Models
16 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path}
17 |
18 | CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
19 | --served-model-name ${model_name} \
20 | --model ${model_path} \
21 | --tensor-parallel-size=2 \
22 | --max_model_len 25000\
23 | --swap-space 32\
24 | --trust-remote-code \
25 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 &
26 |
27 | CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
28 | --served-model-name ${model_name} \
29 | --model ${model_path} \
30 | --tensor-parallel-size=2 \
31 | --max_model_len 25000\
32 | --swap-space 32\
33 | --trust-remote-code \
34 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 &
35 |
36 | CUDA_VISIBLE_DEVICES=4,5 python -m vllm.entrypoints.openai.api_server \
37 | --served-model-name ${model_name} \
38 | --model ${model_path} \
39 | --tensor-parallel-size=2 \
40 | --max_model_len 25000\
41 | --swap-space 32\
42 | --trust-remote-code \
43 | --port 8002 > ../log/vllm_${model_name}_gpu2.log 2>&1 &
44 |
45 | CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server \
46 | --served-model-name ${model_name} \
47 | --model ${model_path} \
48 | --tensor-parallel-size=2 \
49 | --max_model_len 25000\
50 | --swap-space 32\
51 | --trust-remote-code \
52 | --port 8003 > ../log/vllm_${model_name}_gpu3.log 2>&1 &
53 |
54 | sleep 30 # sleep 30s, wait for the servers to start
55 |
56 |
57 | for domain in "${domain_list[@]}"; do
58 | file_name_list=( "musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" "qasper_${mode}.jsonl" "2wikimqa_${mode}.jsonl") #"musique_${mode}.jsonl" "hotpotqa_${mode}.jsonl" "multifieldqa_en_${mode}.jsonl" #"musique_${mode}.jsonl"
59 | for file_name in "${file_name_list[@]}"; do
60 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1)
61 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and ||
62 | result_dir="./pred_cot_vs_nocot"
63 | output_dir="${result_dir}/${model_name}"
64 | mkdir -p ${output_dir}
65 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh
66 |
67 | result_dir="${result_dir%/}" # remove the trailing slash if there is any
68 | eval_data_dir="${eval_data_dir%/}"
69 |
70 | echo "Launching inference for ${model_name}..."
71 | echo "Model path: ${model_path}"
72 | echo "Eval data dir: ${eval_data_dir}"
73 | echo "File name: ${file_name}"
74 | echo "Sample num: ${sample_num}"
75 | echo "Result dir: ${result_dir}"
76 | echo "Temperature: ${temperature}"
77 | echo "COT mode: ${cot_mode}"
78 | echo "Dataset: ${eval_dataset_name}"
79 |
80 | echo "Evaluating ${eval_dataset_name}..."
81 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl"
82 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl"
83 |
84 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \
85 | --model ${model_name} \
86 | --model_path ${model_path} \
87 | --data_path ${eval_data_dir}/${file_name} \
88 | --output_path ${path_to_inference_output} \
89 | --sample_num ${sample_num} \
90 | --world_size 4 \
91 | --dataset_name ${eval_dataset_name} \
92 | --temperature ${temperature} \
93 | > ./inference2.out
94 |
95 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output}
96 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result}
97 |
98 | done
99 | done
100 | pkill -f vllm; pkill -f spawn_main
--------------------------------------------------------------------------------
/evaltoolkits/launch_lbv2.sh:
--------------------------------------------------------------------------------
1 |
2 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2"
3 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2"
4 | mode="cot"
5 |
6 | domain_list=("all")
7 | eval_data_dir="../dataset/longbenchv2"
8 | sample_num=1
9 | temperature=0.0
10 |
11 | #cp /mnt/xiyu/Model/meta-llama/Llama-3.1-8B-Instruct/tokenizer* ${model_path} #for Llama Models
12 | #cp /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct/tokenizer* ${model_path}
13 |
14 |
15 | for gpu_id in 0 1 2 3 4 5 6 7; do
16 | CUDA_VISIBLE_DEVICES=${gpu_id} python -m vllm.entrypoints.openai.api_server \
17 | --served-model-name ${model_name} \
18 | --model ${model_path} \
19 | --tensor-parallel-size=1 \
20 | --trust-remote-code \
21 | --dtype bfloat16 \
22 | --port 800${gpu_id} > ../log/vllm_${model_name}_gpu${gpu_id}.log 2>&1 &
23 | done
24 | sleep 30 # sleep 30s, wait for the servers to start
25 |
26 |
27 | for domain in "${domain_list[@]}"; do
28 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl")
29 | for file_name in "${file_name_list[@]}"; do
30 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1)
31 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and ||
32 | result_dir="./pred_cot_vs_nocot"
33 | output_dir="${result_dir}/${model_name}"
34 | mkdir -p ${output_dir}
35 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh
36 |
37 | result_dir="${result_dir%/}" # remove the trailing slash if there is any
38 | eval_data_dir="${eval_data_dir%/}"
39 |
40 | echo "Launching inference for ${model_name}..."
41 | echo "Model path: ${model_path}"
42 | echo "Eval data dir: ${eval_data_dir}"
43 | echo "File name: ${file_name}"
44 | echo "Sample num: ${sample_num}"
45 | echo "Result dir: ${result_dir}"
46 | echo "Temperature: ${temperature}"
47 | echo "COT mode: ${cot_mode}"
48 | echo "Dataset: ${eval_dataset_name}"
49 |
50 | echo "Evaluating ${eval_dataset_name}..."
51 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl"
52 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl"
53 |
54 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \
55 | --model ${model_name} \
56 | --model_path ${model_path} \
57 | --data_path ${eval_data_dir}/${file_name} \
58 | --output_path ${path_to_inference_output} \
59 | --sample_num ${sample_num} \
60 | --dataset_name ${eval_dataset_name} \
61 | --temperature ${temperature} \
62 | > ./inference2.out
63 |
64 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output}
65 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result}
66 |
67 | done
68 | done
69 | pkill -f vllm; pkill -f spawn_main
70 |
--------------------------------------------------------------------------------
/evaltoolkits/launch_lbv2m.sh:
--------------------------------------------------------------------------------
1 | model_name="Qwen2.5-32B-prm-sample30-checkstage3-epoch2"
2 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2"
3 | #model_config="${model_root}/tokenizer*"
4 | #model_path="${model_root}/checkpoint-58"
5 | mode="cot"
6 |
7 |
8 | domain_list=("all")
9 | eval_data_dir="../dataset/longbenchv2"
10 | sample_num=1
11 | temperature=0.0
12 |
13 |
14 |
15 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
16 | --served-model-name ${model_name} \
17 | --model ${model_path} \
18 | --tensor-parallel-size=4 \
19 | --trust-remote-code \
20 | --port 8000 > ../log/vllm_${model_name}_gpu0.log 2>&1 &
21 |
22 | CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \
23 | --served-model-name ${model_name} \
24 | --model ${model_path} \
25 | --tensor-parallel-size=4 \
26 | --trust-remote-code \
27 | --port 8001 > ../log/vllm_${model_name}_gpu1.log 2>&1 &
28 |
29 | sleep 30 # sleep 30s, wait for the servers to start
30 |
31 |
32 | for domain in "${domain_list[@]}"; do
33 | file_name_list=("MQA_${mode}.jsonl" "SQA_${mode}.jsonl")
34 | for file_name in "${file_name_list[@]}"; do
35 | eval_dataset_name=$(echo "$file_name" | cut -d'_' -f1)
36 | cot_mode=$(echo "$file_name" | grep -q "nocot" && echo "nocot" || echo "cot") # this is a trick in bash to implement in-line if-else using && and ||
37 | result_dir="./pred_cot_vs_nocot"
38 | output_dir="${result_dir}/${model_name}"
39 | mkdir -p ${output_dir}
40 | #echo -e "\nScript executed with parameters: $@" >> ${output_dir}/dw_launch_cot_vs_nocot.sh
41 |
42 | result_dir="${result_dir%/}" # remove the trailing slash if there is any
43 | eval_data_dir="${eval_data_dir%/}"
44 |
45 | echo "Launching inference for ${model_name}..."
46 | echo "Model path: ${model_path}"
47 | echo "Eval data dir: ${eval_data_dir}"
48 | echo "File name: ${file_name}"
49 | echo "Sample num: ${sample_num}"
50 | echo "Result dir: ${result_dir}"
51 | echo "Temperature: ${temperature}"
52 | echo "COT mode: ${cot_mode}"
53 | echo "Dataset: ${eval_dataset_name}"
54 |
55 | echo "Evaluating ${eval_dataset_name}..."
56 | path_to_inference_output="${result_dir}/${model_name}/${eval_dataset_name}.temp${temperature}sample${sample_num}.${cot_mode}.jsonl"
57 | path_to_extracted_result="${path_to_inference_output%.jsonl}_eval.jsonl" # remove the last ".jsonl" and add "_eval.jsonl"
58 |
59 | CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python step1_eval_inference.py \
60 | --model ${model_name} \
61 | --model_path ${model_path} \
62 | --data_path ${eval_data_dir}/${file_name} \
63 | --output_path ${path_to_inference_output} \
64 | --sample_num ${sample_num} \
65 | --dataset_name ${eval_dataset_name} \
66 | --temperature ${temperature} \
67 | --world_size 2 \
68 | > ./inference2.out
69 |
70 | python step2_extract_preds_from_raw.py --path_to_src_file ${path_to_inference_output}
71 | python step3_eval_f1.py --path_to_src_file ${path_to_extracted_result}
72 |
73 | done
74 | done
75 | pkill -f vllm; pkill -f spawn_main
--------------------------------------------------------------------------------
/evaltoolkits/loop_eval.sh:
--------------------------------------------------------------------------------
1 | eval_model_list=(
2 | "Llama-8B-warmup-lr1e-5-epoch1 ../saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10 "
3 | )
4 | model_list=("${eval_model_list[@]}")
5 |
6 | for model in "${model_list[@]}"; do
7 | model_name=$(echo $model | cut -d' ' -f1)
8 | model_path=$(echo $model | cut -d' ' -f2)
9 | file_name=$(echo $model | cut -d' ' -f3)
10 | echo "Launching inference for ${model_name}..."
11 | echo "Model path: ${model_path}"
12 | echo "File name: ${file_name}"
13 |     bash launch_inference.sh ${model_name} ${model_path} ${file_name}
14 | done
--------------------------------------------------------------------------------
/evaltoolkits/loop_sample.sh:
--------------------------------------------------------------------------------
1 | eval_model_list=(
2 | "Qwen-7B-Instruct-yarn-example /mnt/xiyu/Model/Qwen/Qwen2.5-7B-Instruct musique-Qwen-2.5-7B_prm_train.jsonl"
3 | )
4 | model_list=("${eval_model_list[@]}")
5 |
6 | for model in "${model_list[@]}"; do
7 | model_name=$(echo $model | cut -d' ' -f1)
8 | model_path=$(echo $model | cut -d' ' -f2)
9 | file_name=$(echo $model | cut -d' ' -f3)
10 | echo "Launching inference for ${model_name}..."
11 | echo "Model path: ${model_path}"
12 | echo "File name: ${file_name}"
13 | bash launch_inference.sh ${model_name} ${model_path} ${file_name}
14 | done
--------------------------------------------------------------------------------
/evaltoolkits/metrics.py:
--------------------------------------------------------------------------------
1 | import re
2 | import string
3 |
4 | import jieba
5 | from fuzzywuzzy import fuzz
6 | import difflib
7 |
8 | from typing import List
9 | from collections import Counter
10 | from rouge import Rouge
11 |
12 | def normalize_answer(s):
13 | """Lower text and remove punctuation, articles and extra whitespace."""
14 |
15 | def remove_articles(text):
16 | return re.sub(r"\b(a|an|the)\b", " ", text)
17 |
18 | def white_space_fix(text):
19 | return " ".join(text.split())
20 |
21 | def remove_punc(text):
22 | exclude = set(string.punctuation)
23 | return "".join(ch for ch in text if ch not in exclude)
24 |
25 | def lower(text):
26 | return text.lower()
27 |
28 | return white_space_fix(remove_articles(remove_punc(lower(s))))
29 |
30 |
31 | def normalize_zh_answer(s):
32 | """Lower text and remove punctuation, extra whitespace."""
33 |
34 | def white_space_fix(text):
35 | return "".join(text.split())
36 |
37 | def remove_punc(text):
38 | cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
39 | all_punctuation = set(string.punctuation + cn_punctuation)
40 | return "".join(ch for ch in text if ch not in all_punctuation)
41 |
42 | def lower(text):
43 | return text.lower()
44 |
45 | return white_space_fix(remove_punc(lower(s)))
46 |
47 | def count_score(prediction, ground_truth, **kwargs):
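   |     # fraction of numbers in the prediction equal to the ground truth, e.g. count_score("3 or 4", "3") -> 0.5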
48 | numbers = re.findall(r"\d+", prediction)
49 | right_num = 0
50 | for number in numbers:
51 | if str(number) == str(ground_truth):
52 | right_num += 1
53 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
54 | return float(final_score)
55 |
56 | def retrieval_score(prediction, ground_truth, **kwargs):
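   |     # ground_truth is of the form "Paragraph <id>"; scores the fraction of numbers in the prediction equal to <id>, e.g. ("Paragraph 7 is relevant", "Paragraph 7") -> 1.0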
57 | pattern = r'Paragraph (\d+)'
58 | matches = re.findall(pattern, ground_truth)
59 | ground_truth_id = matches[0]
60 | numbers = re.findall(r"\d+", prediction)
61 | right_num = 0
62 | for number in numbers:
63 | if str(number) == str(ground_truth_id):
64 | right_num += 1
65 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
66 | return float(final_score)
67 |
68 | def babi_score(prediction, ground_truth, **kwargs):
69 |
70 | if ground_truth in prediction:
71 | return 1.0
72 | elif prediction==ground_truth:
73 | return 1.0
74 | elif prediction==(ground_truth+')'):
75 | return 1.0
76 | elif prediction==('('+ground_truth+')'):
77 | return 1.0
78 | else:
79 | return 0
80 |
81 | def babiq3_score(prediction, ground_truth, **kwargs):
82 | try:
83 | answer=prediction.split('was in')[1]
84 | except:
85 | answer=prediction
86 | if ground_truth in answer:
87 | return 1.0
88 | else:
89 | return 0
90 |
91 | def ruler_score(prediction, ground_truth, **kwargs):
92 | def postprocess_pred(predict_str: str):
93 |
94 | predict_str = predict_str.strip()
95 | # Remove all non-printable characters
96 | np_pattern = re.compile(r'[\x00-\x1f]')
97 | predict_str = np_pattern.sub('\n', predict_str).strip()
98 | return predict_str
99 | positive_pattern1 = r':\s*(\d+)'
100 | positive_pattern2 = r'is\s*(\d+)'
101 |     negative_pattern = r'no\s+magic\s+number'
102 | positive_match1 = re.search(positive_pattern1, prediction)
103 | positive_match2 = re.search(positive_pattern2, prediction)
104 | negative_match = re.search(negative_pattern, prediction)
105 | if negative_match:
106 | return 0
107 | elif positive_match1 or positive_match2:
108 | return 1.0
109 |
110 | if ground_truth in prediction:
111 | return 1.0
112 | elif prediction==ground_truth:
113 | return 1.0
114 | elif prediction==(ground_truth+')'):
115 | return 1.0
116 | elif prediction==('('+ground_truth+')'):
117 | return 1.0
118 | elif postprocess_pred(prediction)==ground_truth:
119 | return 1.0
120 | else:
121 | return 0
122 |
123 | def accuracy_score(prediction, ground_truth, **kwargs):
124 | def extract_answer(response):
125 | response = response.replace('*', '')
126 | match = re.search(r'The correct answer is \(([A-D])\)', response)
127 | if match:
128 | return match.group(1)
129 | else:
130 | match = re.search(r'The correct answer is ([A-D])', response)
131 | if match:
132 | return match.group(1)
133 | else:
134 | return None
135 | bool_cnt=0
136 | choice_list=['A','B','C','D']
137 | pattern1 = f'{ground_truth} '
138 | pattern3=f'\* {ground_truth}'
139 | pattern4=f'{ground_truth}$'
140 | for choice in choice_list:
141 | if ('('+choice+')') in prediction:
142 | bool_cnt+=1
143 | continue
144 | pattern2 = f'{choice} '
145 | matches = re.findall(pattern2, prediction)
146 | if matches:
147 | bool_cnt+=1
148 | continue
149 |     if bool_cnt>=2: # more than one choice mentioned, treat as wrong
150 | return 0
151 | if ('('+ground_truth+')') in prediction:
152 | return 1.0
153 | if ground_truth==prediction:
154 | return 1.0
155 | matches1 = re.findall(pattern1, prediction)
156 | matches2 = re.findall(pattern3, prediction)
157 | matches3 = re.findall(pattern4, prediction)
158 | if matches1 or matches2 or matches3:
159 | return 1.0
160 | else:
161 | return 0
162 |
163 | def retrieval_zh_score(prediction, ground_truth, **kwargs):
164 | pattern = r'段落(\d+)'
165 | matches = re.findall(pattern, ground_truth)
166 | ground_truth_id = matches[0]
167 | numbers = re.findall(r"\d+", prediction)
168 | right_num = 0
169 | for number in numbers:
170 | if str(number) == str(ground_truth_id):
171 | right_num += 1
172 | final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
173 | return float(final_score)
174 |
175 | def code_sim_score(prediction, ground_truth, **kwargs):
176 | all_lines = prediction.lstrip('\n').split('\n')
177 | prediction = ""
178 | for line in all_lines:
179 | if ('`' not in line) and ('#' not in line) and ('//' not in line):
180 | prediction = line
181 | break
182 | return (fuzz.ratio(prediction, ground_truth) / 100)
183 |
184 | def classification_score(prediction, ground_truth, **kwargs):
185 | #print(prediction)
186 | #if '\n' in prediction:
187 | # prediction = prediction.lstrip('\n').split('\n')[0]
188 | em_match_list = []
189 | all_classes = kwargs["all_classes"]
190 | for class_name in all_classes:
191 | if class_name in prediction:
192 | em_match_list.append(class_name)
193 | for match_term in em_match_list:
194 | if match_term in ground_truth and match_term != ground_truth:
195 | em_match_list.remove(match_term)
196 | if ground_truth in em_match_list:
197 | score = (1.0 / len(em_match_list))
198 | else:
199 | score = 0.0
200 | return score
201 |
202 | def rouge_score(prediction, ground_truth, **kwargs):
203 | rouge = Rouge()
204 | try:
205 | scores = rouge.get_scores([prediction], [ground_truth], avg=True)
206 | except:
207 | return 0.0
208 | return scores["rouge-l"]["f"]
209 |
210 | def rouge_zh_score(prediction, ground_truth, **kwargs):
211 | prediction = " ".join(list(jieba.cut(prediction, cut_all=False)))
212 | ground_truth = " ".join(list(jieba.cut(ground_truth, cut_all=False)))
213 | score = rouge_score(prediction, ground_truth)
214 | return score
215 |
216 | def f1_score(prediction, ground_truth, **kwargs):
217 | common = Counter(prediction) & Counter(ground_truth)
218 | num_same = sum(common.values())
219 | if num_same == 0:
220 | return 0
221 | precision = 1.0 * num_same / len(prediction)
222 | recall = 1.0 * num_same / len(ground_truth)
223 | f1 = (2 * precision * recall) / (precision + recall)
224 | return f1
225 |
226 | def recall_score(prediction, ground_truth, **kwargs):
227 | common = Counter(prediction) & Counter(ground_truth)
228 | num_same = sum(common.values())
229 | if num_same == 0:
230 | return 0
231 | recall = 1.0 * num_same / len(ground_truth)
232 | return recall
233 |
234 | def qa_f1_score(prediction, ground_truth, **kwargs):
235 | normalized_prediction = normalize_answer(prediction)
236 | normalized_ground_truth = normalize_answer(ground_truth)
237 |
238 | prediction_tokens = normalized_prediction.split()
239 | ground_truth_tokens = normalized_ground_truth.split()
240 | return f1_score(prediction_tokens, ground_truth_tokens)
241 |
242 | def qa_recall_score(prediction, ground_truth, **kwargs):
243 | normalized_prediction = normalize_answer(prediction)
244 | normalized_ground_truth = normalize_answer(ground_truth)
245 |
246 | prediction_tokens = normalized_prediction.split()
247 | ground_truth_tokens = normalized_ground_truth.split()
248 | return recall_score(prediction_tokens, ground_truth_tokens)
249 |
250 |
251 | def qa_f1_zh_score(prediction, ground_truth, **kwargs):
252 | prediction_tokens = list(jieba.cut(prediction, cut_all=False))
253 | ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False))
254 | prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens]
255 | ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens]
256 | prediction_tokens = [token for token in prediction_tokens if len(token) > 0]
257 | ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0]
258 | return f1_score(prediction_tokens, ground_truth_tokens)
259 |
--------------------------------------------------------------------------------
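A minimal usage sketch of the scorers above, on hand-written toy inputs (not repo data); it assumes metrics.py and its dependencies (jieba, fuzzywuzzy, rouge) are importable from the working directory:

```python
# Toy inputs only: exercising a few of the scorers defined in metrics.py.
from metrics import qa_f1_score, rouge_score, accuracy_score

# Token-level F1 after normalize_answer (articles/punctuation stripped, lower-cased).
print(qa_f1_score("The answer is the Eiffel Tower, in Paris.", "Eiffel Tower"))

# ROUGE-L F1 via the `rouge` package.
print(rouge_score("a short summary of the report", "short summary of the report"))

# Multiple-choice scoring: 1.0 when exactly one choice is matched and it is the ground truth.
print(accuracy_score("The correct answer is (B)", "B"))
```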
/evaltoolkits/step1_eval_inference.py:
--------------------------------------------------------------------------------
1 | import os
2 | import torch
3 | import json
4 | import argparse
5 | import torch.distributed as dist
6 | import numpy as np
7 | import random
8 | import torch.multiprocessing as mp
9 | import time
10 |
11 | from openai import OpenAI
12 | from tqdm import tqdm
13 | from pathlib import Path
14 | from pydantic import BaseModel
15 | from typing import List
16 |
17 | from utils import load_jsonl_file, check_pred_fact_consistency
18 |
19 | class Response(BaseModel):
20 | reasoning: str
21 | answer: str
22 |
23 |
24 | def seed_everything(seed):
25 | torch.manual_seed(seed)
26 | torch.cuda.manual_seed(seed)
27 | np.random.seed(seed)
28 | random.seed(seed)
29 | torch.backends.cudnn.benchmark = False
30 | torch.backends.cudnn.deterministic = True
31 | torch.cuda.manual_seed_all(seed)
32 |
33 | def parse_args(args=None):
34 | parser = argparse.ArgumentParser()
35 | parser.add_argument('--model', type=str, default=None)
36 | parser.add_argument('--test', action='store_true', help="Evaluate on test mode")
37 | parser.add_argument('--model_path', type=str, default=None)
38 | parser.add_argument('--data_path', type=str, default=None)
39 | parser.add_argument('--output_path', type=str, default=None)
40 | parser.add_argument('--dataset_name', type=str, default=None)
41 | parser.add_argument('--sample_num', type=int, default=30)
42 | parser.add_argument('--world_size', type=int, default=8)
43 | parser.add_argument('--max_gen', type=int, default=512)
44 | parser.add_argument('--gpt', action='store_true', help="Evaluate on test mode")
45 | parser.add_argument('--temperature', type=float, default=0)
46 | return parser.parse_args(args)
47 |
48 | def get_api_results(model_name, prompt, gpu_id, sample_num, max_gen, temp,gpt=False):
49 | max_retries = 5
50 | response_list = []
51 | # json_schema = Response.model_json_schema() #TODO
52 | for i in range(max_retries):
53 | if gpt:
54 | api_key = os.getenv("OPENAI_API_KEY")
55 | client = OpenAI(api_key=api_key, base_url="Your Online Model URL")
56 | else:
57 | client = OpenAI(api_key="EMPTY", base_url=f"http://localhost:800{gpu_id}/v1")
58 | try:
59 | '''
60 | response=client.completions.create(
61 | prompt=prompt,
62 | #messages=[
63 | # {"role":"system", "content": "You are a helpful assistant."},
64 | # {"role": "user","content": prompt}
65 | #],
66 | model=model_name,
67 | temperature=temp,
68 | n=sample_num,
69 | stop=["}"],
70 | max_tokens=max_gen,
71 | # extra_body={"guided_json": json_schema},
72 | )
73 |
74 | for choice in response.choices:
75 | #response_list.append(choice.message.content)
76 | response_list.append(choice.text)
77 | return response_list
78 | '''
79 |
80 | response=client.chat.completions.create(
81 | #prompt=prompt,
82 | messages=[
83 | {"role":"system", "content": "You are a helpful assistant."},
84 | {"role": "user","content": prompt}
85 | ],
86 | #stop=["}","."],
87 | model=model_name,
88 | temperature=temp,
89 | n=sample_num,
90 | max_tokens=max_gen,
91 | # extra_body={"guided_json": json_schema},
92 | )
93 |
94 | for choice in response.choices:
95 | response_list.append(choice.message.content)
96 | #response_list.append(choice.text)
97 | return response_list
98 | except Exception as e:
99 | print(e)
100 | time.sleep(50)
101 | return None
102 |
103 | def get_pred_from_vllm(rank, data, max_gen, model_name, out_path, sample_num,lock, temp,gpt):
104 | # print("Temp: ",temp)
105 | if gpt:
106 | print("Eval On ",model_name)
107 | for json_obj in tqdm(data):
108 | prompt = json_obj['instruction']
109 | preds = get_api_results(model_name, prompt, rank, sample_num, max_gen=max_gen, temp=temp,gpt=gpt)
110 | def check_pred_validity(pred:str, prompt):
111 | if prompt.endswith("Answer:") or prompt.endswith("Type:") or prompt.endswith("Summary: ") or prompt.endswith("Answer:\n") or prompt.endswith("\".\n"):
112 | return True
113 | if "\"answer\"" not in pred:
114 | return False
115 | return True
116 |
117 | if preds==None:
118 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3,gpt=gpt)
119 | if new_preds==None:
120 | continue
121 | else:
122 | preds=new_preds
123 |
124 | check_flag=False
125 | if len(preds) == 1:
126 | if not check_pred_validity(preds[0], prompt):
127 | new_preds = get_api_results(model_name, prompt, rank, 5, max_gen=max_gen, temp=0.3)
128 | if new_preds!=None:
129 | for pred in new_preds:
130 | if check_pred_validity(pred, prompt):
131 | preds = [pred]
132 | check_flag=True
133 | break
134 | else:
135 | check_flag=True
136 |
137 | if not check_pred_validity(preds[0], prompt):
138 | new_preds = get_api_results(model_name, prompt, rank, 10, max_gen=max_gen, temp=0.3)
139 | if new_preds!=None:
140 | for pred in new_preds:
141 | if check_pred_validity(pred, prompt):
142 | preds = [pred]
143 | check_flag=True
144 | break
145 | else:
146 | continue
147 | else:
148 | check_flag=True
149 |
150 | if "answers" in json_obj.keys():
151 | instruction, answers, _id = json_obj["instruction"], json_obj["answers"], json_obj["id"]
152 | else:
153 | instruction, answers, _id = json_obj["instruction"], [json_obj["output"]], json_obj["id"]
154 | if "all_classes" in json_obj.keys():
155 | all_classes=json_obj['all_classes']
156 | else:
157 | all_classes=[]
158 |
159 | try:
160 | question=json_obj['question']
161 | except:
162 | question = instruction.split("Question:")[-1].replace("\n\nAnswer:","").replace("*","").strip()
163 | with lock:
164 | with open(out_path, "a", encoding="utf-8") as f:
165 | json.dump({"pred": preds, "instruction": instruction, "question":question, "answers": answers, "id":_id,"check_flag":str(check_flag) ,"all_classes": all_classes, "length": 0}, f, ensure_ascii=False)
166 | f.write('\n')
167 |     if dist.is_initialized(): dist.destroy_process_group()
168 | return
169 |
170 | if __name__ == '__main__':
171 | print(os.getpid())
172 | seed_everything(42)
173 | args = parse_args()
174 | world_size = args.world_size #torch.cuda.device_count()
175 | print("Wold Size ",world_size)
176 | mp.set_start_method('fork', force=True)
177 | model_name = args.model
178 | dataset_name = args.dataset_name
179 |
180 | sources = load_jsonl_file(args.data_path)
181 | data_all = [data_sample for data_sample in sources]
182 | data_subsets = [data_all[i::world_size] for i in range(world_size)]
183 |
184 | out_path = Path(args.output_path)
185 | out_path.parent.mkdir(parents=True, exist_ok=True)
186 | if out_path.exists():
187 | out_path.unlink()
188 | processes = []
189 | lock = mp.RLock()
190 | for rank in range(world_size):
191 | p = mp.Process(target=get_pred_from_vllm, args=(rank, data_subsets[rank], args.max_gen, model_name, out_path, args.sample_num, lock, args.temperature,args.gpt))
192 | p.start()
193 | processes.append(p)
194 |
195 | for p in processes:
196 | p.join()
--------------------------------------------------------------------------------
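For orientation, a stripped-down sketch of the request that get_api_results issues per worker, assuming a vLLM OpenAI-compatible server is already listening on the port scheme 800{gpu_id} used above; the model name and prompt are placeholders:

```python
# Minimal sketch of the per-worker chat request in get_api_results (placeholder prompt/model name).
from openai import OpenAI

gpu_id = 0
client = OpenAI(api_key="EMPTY", base_url=f"http://localhost:800{gpu_id}/v1")

response = client.chat.completions.create(
    model="Qwen-7B-Instruct-yarn-example",  # must match the served model name of the vLLM instance
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<long-context QA prompt built by the preprocess scripts>"},
    ],
    temperature=0.7,  # the eval scripts use 0 for greedy decoding; sampling uses a higher value
    n=4,              # number of reasoning paths sampled per prompt (sample_num in the script)
    max_tokens=512,
)
preds = [choice.message.content for choice in response.choices]
```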
/evaltoolkits/step2_extract_preds_from_raw.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | import json_repair
4 | import argparse
5 | import jsonlines
6 |
7 | from typing import List
8 | from tqdm import tqdm
9 | from utils import load_jsonl_file
10 |
11 | def parse_args(args=None):
12 | parser = argparse.ArgumentParser()
13 | parser.add_argument('--path_to_src_file', type=str, default=None)
14 | return parser.parse_args(args)
15 |
16 | def extract_pred_list(raw_pred_list:List[str]):
17 | extracted_pred_list = []
18 | for pred in raw_pred_list:
19 | if pred.startswith("```json"):
20 |             # remove the beginning and ending ```json
21 | pred = pred.replace("```json", "").replace("```", "")
22 | pred = pred.strip()
23 | if pred.startswith("{"):
24 | pred = pred.strip()
25 | try:
26 | content = json_repair.loads(pred)
27 | if type(content)==list:
28 | content=content[0]
29 | content = content["answer"]
30 | extracted_pred_list.append(str(content))
31 | except Exception as e:
32 | # print(e, pred)
33 | # try to extract the answer from the raw pred, if failed, append the raw pred
34 | # use re to extract the content after "answer: " and before "." (inclusive)
35 | try:
36 | #content =re.findall(r'"answer": (.+?)(?=\n|$)', pred)[0].strip()
37 | #print(content)
38 | #content = content.strip('\'"[]')
39 | pattern = r'"answer": "([^"]+)"'
40 | match = re.search(pattern, pred)
41 | content = match.group(1)
42 | #print(content)
43 | # print(f"Extracted re: {content}")
44 | extracted_pred_list.append(content)
45 | except Exception as e2:
46 | extracted_pred_list.append(pred)
47 | else:
48 | # extract plain text format response
49 | # print("extracting plain text format response")
50 | '''
51 | try:
52 | content = pred.split("Answer:")[1].split("Reasoning:")[0]
53 | except:
54 | try:
55 | content = pred.split("Answer:")[1]
56 | except:
57 | content = pred
58 | try:
59 | content = pred.split("Reasoning:")[0]
60 | except:
61 | content = pred
62 | '''
63 | try:
64 | content=pred.split("Answer:")[1]
65 | #content=pred.split("Reasoning:")[1]
66 | except:
67 | try:
68 | content=pred.split("Reasoning:")[1]
69 | except:
70 | try:
71 | #print("Pred:",pred)
72 | content=pred.split('\n')[0]
73 | #print("Content:",content)
74 | except:
75 | content=pred
76 | try:
77 | content=content.split("\n")[0]
78 | except:
79 | content=content
80 | extracted_pred_list.append(content)
81 | return extracted_pred_list
82 |
83 |
84 | def main():
85 | args = parse_args()
86 | data_list = load_jsonl_file(args.path_to_src_file)
87 | for data in tqdm(data_list, desc="Extracting preds"):
88 | extracted_pred_list = extract_pred_list(data["pred"])
89 | data["extracted_pred_list"] = extracted_pred_list
90 | path_to_tgt_file = args.path_to_src_file.replace(".jsonl", "_eval.jsonl")
91 | with jsonlines.open(path_to_tgt_file, mode='w') as writer:
92 | writer.write_all(data_list)
93 |
94 | if __name__ == "__main__":
95 | main()
96 |
97 |
98 |
99 |
--------------------------------------------------------------------------------
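A quick illustration of the two branches in extract_pred_list, using hand-written raw predictions: JSON-style outputs (code fences are stripped first) are repaired with json_repair and reduced to their "answer" field, while plain-text outputs fall back to the text after "Answer:" on its first line.

```python
# Hand-written raw predictions (illustrative), covering the JSON and plain-text branches.
from step2_extract_preds_from_raw import extract_pred_list

raw_preds = [
    '{"reasoning": "Paragraph 2 says the tower is in Paris.", "answer": "Paris"}',
    "Reasoning: The passage places the tower in Paris.\nAnswer: Paris",
]
print(extract_pred_list(raw_preds))  # ['Paris', ' Paris'] (second keeps the leading space)
```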
/evaltoolkits/step3_eval_f1.py:
--------------------------------------------------------------------------------
1 | import json
2 | import json_repair
3 | import argparse
4 | import numpy as np
5 | import jsonlines
6 |
7 | from pathlib import Path
8 | from typing import List
9 | from tqdm import tqdm
10 |
11 | from utils import load_jsonl_file
12 | from metrics import (
13 | qa_f1_score,
14 | rouge_zh_score,
15 | qa_f1_zh_score,
16 | rouge_score,
17 | classification_score,
18 | retrieval_score,
19 | retrieval_zh_score,
20 | count_score,
21 | code_sim_score,
22 | qa_recall_score,
23 | babiq3_score,
24 | ruler_score,
25 | babi_score,
26 | accuracy_score
27 | )
28 |
29 | dataset2metric = {
30 | "narrativeqa": qa_f1_score,
31 | "qasper": qa_f1_score,
32 | "Ruler": ruler_score,
33 | "MQA-Medium": accuracy_score,
34 | "MQA-Medium-v2": accuracy_score,
35 | "babiq3": qa_f1_score,
36 | "Babi": babi_score,
37 | "Babiq3": babiq3_score,
38 | "MQA": accuracy_score,
39 | "ICL": accuracy_score,
40 | "LIL": accuracy_score,
41 | "LSDU": accuracy_score,
42 | "NIAH": accuracy_score,
43 | "BABILong": accuracy_score,
44 | "LHU": accuracy_score,
45 | "CRU": accuracy_score,
46 | "SQA": accuracy_score,
47 | "SQA-Medium-v2": accuracy_score,
48 | "multifieldqa": qa_f1_score,
49 | "multifieldqa_en": qa_f1_score,
50 | "multifieldqa_zh": qa_f1_zh_score,
51 | "hotpotqa": qa_f1_score,
52 | "2wikimqa": qa_f1_score,
53 | "musique": qa_f1_score,
54 | "dureader": rouge_zh_score,
55 | "gov_report": rouge_score,
56 | "gov": rouge_score,
57 | "qmsum": rouge_score,
58 | "multi_news": rouge_score,
59 | "multi": rouge_score,
60 | "vcsum": rouge_zh_score,
61 | "trec": classification_score,
62 | "triviaqa": qa_f1_score,
63 | "samsum": rouge_score,
64 | "lsht": classification_score,
65 | "passage_retrieval_en": retrieval_score,
66 | "passage_count": count_score,
67 | "passage_retrieval_zh": retrieval_zh_score,
68 | "lcc": code_sim_score,
69 | "repobench-p": code_sim_score,
70 | }
71 |
72 | def parse_args(args=None):
73 | parser = argparse.ArgumentParser()
74 | parser.add_argument('--path_to_src_file', type=str, default=None)
75 | return parser.parse_args(args)
76 |
77 | def get_score_list(dataset:str, predictions:List[str], answers:List[str],**kwargs) -> List[float]:
78 | score_list: List[float] = []
79 | for pred in predictions:
80 | if dataset=="Ruler":
81 | score=1
82 | else:
83 | score=0
84 | for answer in answers:
85 | if dataset=="Ruler":
86 | score = min(dataset2metric[dataset](pred, answer,**kwargs),score)
87 | else:
88 | score = max(dataset2metric[dataset](pred, answer,**kwargs),score)
89 | score_list.append(score)
90 | return score_list
91 |
92 | def main():
93 | args = parse_args()
94 | data_list = load_jsonl_file(args.path_to_src_file)
95 | best_score_list = []
96 | file_name = Path(args.path_to_src_file).name
97 | dataset = file_name.split('.')[0].split("-")[0]
98 | print(f"Eval {dataset}")
99 | for data in tqdm(data_list, desc="Calculating F1 score"):
100 | extracted_pred_list:List[str] = data["extracted_pred_list"]
101 | answers = data["answers"]
102 |         if not isinstance(answers, list):
103 | answers=[answers]
104 | if "all_classes" in data.keys():
105 | all_classes=data['all_classes']
106 | else:
107 | all_classes=[]
108 | score_list = get_score_list(dataset, extracted_pred_list, answers,all_classes=all_classes)
109 | best_score_in_this_data = max(score_list)
110 | best_score_list.append(best_score_in_this_data)
111 | data["f1_score_list"] = score_list
112 | final_score = np.mean(best_score_list) # *100 and round to 2 decimal places
113 | final_score = round(final_score*100, 2)
114 | print(f"Final score: {final_score}")
115 | with jsonlines.open(args.path_to_src_file, mode='w') as writer:
116 | writer.write_all(data_list)
117 | data_list_noinstr = [{k:v for k,v in data.items() if k!="instruction"} for data in data_list]
118 | with jsonlines.open(args.path_to_src_file.replace(".jsonl","_noinstr.jsonl"), mode='w') as writer:
119 | writer.write_all(data_list_noinstr)
120 |
121 | # add to result.json, overwrite the value if it already exists
122 | # check if result.json exists
123 | result_path = Path(args.path_to_src_file).parent / "result.json"
124 | if result_path.exists():
125 | with open(result_path, 'r') as f:
126 | result = json.load(f)
127 | else:
128 | result = {}
129 | result[file_name] = final_score
130 | with open(result_path, 'w') as f:
131 | json.dump(result, f, ensure_ascii=False, indent=4)
132 | print(f"Result saved in {result_path}")
133 |
134 | if __name__ == '__main__':
135 | main()
136 |
137 |
138 |
--------------------------------------------------------------------------------
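The scoring above is a two-stage max: for each sampled prediction, get_score_list takes the best score over the reference answers (min for Ruler), and main then keeps the best prediction per example before averaging into result.json. A toy walk-through with the musique metric (qa_f1_score):

```python
# Toy walk-through of the per-example aggregation (illustrative inputs).
from step3_eval_f1 import get_score_list

preds = ["Paris", "the Louvre"]        # extracted_pred_list for one example
answers = ["Paris", "Paris, France"]   # reference answers

score_list = get_score_list("musique", preds, answers, all_classes=[])
print(score_list)        # [1.0, 0] -- first prediction matches, second does not
print(max(score_list))   # the value appended to best_score_list for this example
```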
/evaltoolkits/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | import string
4 | import json_repair
5 |
6 | from pathlib import Path
7 | from typing import Union, List
8 |
9 | # from metrics import normalize_answer
10 | def extract_number(text):
11 | match = re.search(r'\[\[([0-9]*\.?[0-9]+)\]\]', text)
12 | if match:
13 | return float(match.group(1))
14 | match = re.search(r'\[([0-9]*\.?[0-9]+)\]', text)
15 | if match:
16 | return float(match.group(1))
17 | return 0.0
18 |
19 | def normalize_answer(s):
20 | """Lower text and remove punctuation, articles and extra whitespace."""
21 |
22 | def remove_articles(text):
23 | return re.sub(r"\b(a|an|the)\b", " ", text)
24 |
25 | def white_space_fix(text):
26 | return " ".join(text.split())
27 |
28 | def remove_punc(text):
29 | exclude = set(string.punctuation)
30 | return "".join(ch for ch in text if ch not in exclude)
31 |
32 | def lower(text):
33 | return text.lower()
34 |
35 | return white_space_fix(remove_articles(remove_punc(lower(s))))
36 |
37 | def load_jsonl_file(path_to_file: Union[str, Path]):
38 | data_list = []
39 | error_cnt = 0
40 | with open(path_to_file) as f:
41 | for idx, line in enumerate(f):
42 | try:
43 | data = json.loads(line)
44 | data_list.append(data)
45 | except Exception as e:
46 | error_cnt += 1
47 | print(f"Failed loading line {idx}, error: {e}")
48 | print(line)
49 | print(f"Failed loading {error_cnt} lines, total {len(data_list)} lines loaded")
50 | return data_list
51 |
52 | def preprocess_pred_for_json_repair(pred: str):
53 | escaped_str = re.sub(
54 | r'(?<="reasoning": ")(.*?)(?="\s*,\s*\n\s*"answer":)',
55 |         lambda match: re.sub(r'(?
66 | def verify_fact_list(fact_list: List[str], instruction: str) -> bool:
67 |     # strip punctuation from the instruction
68 | instruction_cleaned = normalize_answer(instruction)
69 |
70 | for fact in fact_list:
72 |         # strip punctuation from the fact
73 |         fact_cleaned = normalize_answer(fact)
74 |         # compare the normalized fact against the normalized instruction
74 | if fact_cleaned not in instruction_cleaned:
75 | # print(fact)
76 | return False
77 | return True
78 |
79 | def check_pred_fact_consistency(pred: str, instruction: str):
80 |
81 | processed_pred = preprocess_pred_for_json_repair(pred)
82 | content = json_repair.loads(processed_pred)
83 | if not isinstance(content, dict) or not content or 'reasoning' not in content or len(content) > 2 or type(content['reasoning']) != str:
84 | return False
85 | fact_list = extract_fact_list(content['reasoning'])
86 | if len(fact_list) > 0 and verify_fact_list(fact_list, instruction):
87 | return True
88 | return False
--------------------------------------------------------------------------------
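The grounding check used during self-sampling reduces to simple text matching: every fact cited in the reasoning must appear, after normalize_answer, as a substring of the normalized instruction. A self-contained sketch of that idea (extract_fact_list is truncated above, so the fact list here is hand-written):

```python
# Minimal sketch of grounding-via-text-matching, mirroring verify_fact_list (illustrative inputs).
from utils import normalize_answer

def facts_grounded(fact_list, instruction):
    instruction_cleaned = normalize_answer(instruction)
    return all(normalize_answer(fact) in instruction_cleaned for fact in fact_list)

instruction = "Paragraph 1: The Eiffel Tower is located in Paris.\n\nQuestion: Where is the Eiffel Tower?"
facts = ["The Eiffel Tower is located in Paris."]
print(facts_grounded(facts, instruction))  # True: the normalized fact is a substring of the context
```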
/log/README.md:
--------------------------------------------------------------------------------
1 | This is a folder for log saving.
--------------------------------------------------------------------------------
/pics/combined_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lemon-prog123/LongRePS/76c6f49e9ab428fb235b32a4f087652642939a82/pics/combined_plot.png
--------------------------------------------------------------------------------
/pics/llama.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lemon-prog123/LongRePS/76c6f49e9ab428fb235b32a4f087652642939a82/pics/llama.png
--------------------------------------------------------------------------------
/pics/main_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lemon-prog123/LongRePS/76c6f49e9ab428fb235b32a4f087652642939a82/pics/main_table.png
--------------------------------------------------------------------------------
/preprocess_lbv1.py:
--------------------------------------------------------------------------------
1 | import json
2 | import jsonlines
3 | import json_repair
4 | import pandas as pd
5 | import numpy as np
6 | import torch
7 | import shutil
8 | import glob
9 | import re
10 |
11 | from typing import List
12 | from tqdm import tqdm
13 | from pathlib import Path
14 | from transformers import AutoTokenizer, AutoModelForCausalLM
15 | from datasets import Dataset, load_dataset
16 | from config.prompt import prompt_lbv1_cot,prompt_lbv1_nocot,prompt_cot
17 |
18 | def construct_cot_nocot_split(split: str):
19 | data_list=load_dataset('THUDM/LongBench',split, split='test')
20 | new_cot_data_list = []
21 | new_nocot_data_list = []
22 | for data in data_list:
23 | context = data["context"]
24 | question = data["input"]
25 | answers = data["answers"]
26 | all_classes=data['all_classes']
27 | id = data["_id"]
28 | output = answers[0]
29 | instruction_cot = prompt_lbv1_cot.format(context=context, question=question)
30 | instruction_nocot = prompt_lbv1_nocot.format(context=context, question=question)
31 | new_cot_data_list.append({"id": id, "question":question,"instruction": instruction_cot, "answers": answers,"all_classes":all_classes ,"output": output, "system": "You are a helpful assistant."})
32 | new_nocot_data_list.append({"id": id,"question":question,"instruction": instruction_nocot, "answers": answers, "all_classes":all_classes,"output": output, "system": "You are a helpful assistant."})
33 | print(f"size of {split} new_cot_data_list: {len(new_cot_data_list)}")
34 | print(f"size of {split} new_nocot_data_list: {len(new_nocot_data_list)}")
35 | with jsonlines.open(f"dataset/longbenchv1/{split}_cot.jsonl", 'w') as writer:
36 | writer.write_all(new_cot_data_list)
37 | with jsonlines.open(f"dataset/longbenchv1/{split}_nocot.jsonl", 'w') as writer:
38 | writer.write_all(new_nocot_data_list)
39 | print(f"Finished writing {split} dataset")
40 | return
41 |
42 |
43 |
44 | from datasets import load_dataset
45 | datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa","musique"]
46 | for dataset in datasets:
47 | construct_cot_nocot_split(dataset)
48 |
--------------------------------------------------------------------------------
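Each line of the generated dataset/longbenchv1/{split}_cot.jsonl (and _nocot.jsonl) is one evaluation example with the fields written above; a quick sanity check after running the script:

```python
# Sanity check on one generated LongBench v1 split (run after preprocess_lbv1.py).
import jsonlines

with jsonlines.open("dataset/longbenchv1/musique_cot.jsonl") as reader:
    sample = next(iter(reader))
print(sorted(sample.keys()))  # id, question, instruction, answers, all_classes, output, system
```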
/preprocess_lbv2.py:
--------------------------------------------------------------------------------
1 | import json
2 | import jsonlines
3 | import json_repair
4 | import pandas as pd
5 | import numpy as np
6 | import torch
7 | import shutil
8 | import glob
9 | import re
10 |
11 | from typing import List
12 | from tqdm import tqdm
13 | from pathlib import Path
14 | from transformers import AutoTokenizer, AutoModelForCausalLM
15 | from datasets import Dataset, load_dataset
16 | from config.prompt import prompt_lbv2_cot,prompt_lbv2_nocot
17 |
18 | def construct_cot_nocot_split(filtered_data,split="MQA"):
19 | new_cot_data_list = []
20 | new_nocot_data_list = []
21 | for item in filtered_data:
22 | context = item["context"]
23 | question=item['question']
24 | _id=item['_id']
25 | difficulty=item['difficulty']
26 | instruction_cot=prompt_lbv2_cot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D'])
27 | instruction_nocot=prompt_lbv2_nocot.format(context=context, question=question,choice_A=item['choice_A'],choice_B=item['choice_B'],choice_C=item['choice_C'],choice_D=item['choice_D'])
28 | new_cot_data_list.append({"id": id, "instruction": instruction_cot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."})
29 | new_nocot_data_list.append({"id": id, "instruction": instruction_nocot, "output": item['answer'], "id":_id,"difficulty":difficulty,"question":item['question'],"num_tokens":item['token_num'],"system": "You are a helpful assistant."})
30 |
31 | print(f"size of new_cot_data_list: {len(new_cot_data_list)}")
32 | print(f"size of new_nocot_data_list: {len(new_nocot_data_list)}")
33 | with jsonlines.open(f"dataset/longbenchv2/{split}_cot.jsonl", 'w') as writer:
34 | writer.write_all(new_cot_data_list)
35 | with jsonlines.open(f"dataset/longbenchv2/{split}_nocot.jsonl", 'w') as writer:
36 | writer.write_all(new_nocot_data_list)
37 |
38 |
39 |
40 | split_list=["Single-Document QA","Multi-Document QA"]
41 | split_tag=["SQA","MQA"]
42 | dataset=load_dataset('Lemon123prog/Longmix-LongRePS',split='test')
43 |
44 | for (split,tag) in zip(split_list,split_tag):
45 | filter_data=[]
46 | for data in dataset:
47 |         if data['domain']==split and data['token_num']<105*1024: # guard against a bug with over-long samples
48 | filter_data.append(data)
49 | construct_cot_nocot_split(filter_data,split=tag)
50 |
51 |
--------------------------------------------------------------------------------
/preprocess_train.py:
--------------------------------------------------------------------------------
1 | from datasets import load_dataset
2 | import jsonlines
3 |
4 |
5 | def preprocess(model:str):
6 | dataset = load_dataset(f"Lemon123prog/{model}-LongRePS")
7 | warmup_data=dataset['warmup'].to_list()
8 | orm_data=dataset['train_orm'].to_list()
9 | prm_data=dataset['train_prm'].to_list()
10 |
11 | with jsonlines.open(f"./data/{model}_warmup.jsonl", 'w') as writer:
12 | writer.write_all(warmup_data)
13 |
14 | with jsonlines.open(f"./data/musique-{model}_orm_train.jsonl", 'w') as writer:
15 | writer.write_all(orm_data)
16 |
17 | with jsonlines.open(f"./data/musique-{model}_prm_train.jsonl", 'w') as writer:
18 | writer.write_all(prm_data)
19 |
20 |
21 | preprocess("Llama-3.1-8B")
22 | preprocess("Qwen-2.5-7B")
--------------------------------------------------------------------------------
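The training scripts below reference these files by dataset name (e.g. Llama-3.1-8B_warmup), which LLaMA-Factory resolves through data/dataset_info.json. A hedged sketch of registering one of the generated files; the exact schema should be checked against the dataset_info.json shipped in data/ and the LLaMA-Factory documentation:

```python
# Hedged sketch: register a generated jsonl file in data/dataset_info.json (verify the schema
# against the copy shipped in data/ -- the column mapping below follows LLaMA-Factory's
# alpaca-style convention and is an assumption, not taken from this repo).
import json

entry = {
    "Llama-3.1-8B_warmup": {
        "file_name": "Llama-3.1-8B_warmup.jsonl",
        "columns": {"prompt": "instruction", "response": "output", "system": "system"},
    }
}

with open("data/dataset_info.json") as f:
    info = json.load(f)
info.update(entry)
with open("data/dataset_info.json", "w") as f:
    json.dump(info, f, ensure_ascii=False, indent=4)
```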
/requirements.txt:
--------------------------------------------------------------------------------
1 | vllm==0.6.0
2 | vllm-flash-attn==2.6.1
3 | deepspeed==0.15.4
4 | openai
5 | jsonlines
6 | json-repair
7 | fuzzywuzzy
8 | jieba
9 | rouge
--------------------------------------------------------------------------------
/scripts/llama_sft.sh:
--------------------------------------------------------------------------------
1 | model_path="/mnt/xiyu/LongRePS/saves/Llama3.1-8B/full/Llama-3.1-8B_warmup_train_lr1e-5_maxlen16k_2025-03-13-22-43-49/checkpoint-10"
2 | template="llama3"
3 | learning_rate=5e-6
4 | dataset="Llama-3.1-8B_sample30_thresh1.0_v2_prm"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 |
10 | llamafactory-cli train \
11 | --stage sft \
12 | --do_train True \
13 | --model_name_or_path ${model_path} \
14 | --preprocessing_num_workers 16 \
15 | --finetuning_type full \
16 | --template ${template} \
17 | --flash_attn auto \
18 | --dataset_dir data \
19 | --dataset ${dataset} \
20 | --cutoff_len 16384 \
21 | --learning_rate ${learning_rate} \
22 | --num_train_epochs 2 \
23 | --max_samples 100000 \
24 | --per_device_train_batch_size 1 \
25 | --gradient_accumulation_steps 4 \
26 | --lr_scheduler_type constant_with_warmup \
27 | --max_grad_norm 1.0 \
28 | --logging_steps 5 \
29 | --save_strategy epoch \
30 | --warmup_steps 5 \
31 | --packing False \
32 | --save_only_model True \
33 | --report_to none \
34 | --output_dir ${output_path} \
35 | --bf16 True \
36 | --plot_loss True \
37 | --ddp_timeout 180000000 \
38 | --optim adamw_torch \
39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------
/scripts/llama_warmup.sh:
--------------------------------------------------------------------------------
1 | model_path="../Model/meta-llama/Llama-3.1-8B"
2 | template="llama3"
3 | learning_rate=1e-5
4 | dataset="Llama-3.1-8B_warmup"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Llama3.1-8B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 |
10 | llamafactory-cli train \
11 | --stage sft \
12 | --do_train True \
13 | --model_name_or_path ${model_path} \
14 | --preprocessing_num_workers 16 \
15 | --finetuning_type full \
16 | --template ${template} \
17 | --flash_attn auto \
18 | --dataset_dir data \
19 | --dataset ${dataset} \
20 | --cutoff_len 16384 \
21 | --learning_rate ${learning_rate} \
22 | --num_train_epochs 2 \
23 | --max_samples 100000 \
24 | --per_device_train_batch_size 1 \
25 | --gradient_accumulation_steps 4 \
26 | --lr_scheduler_type constant_with_warmup \
27 | --max_grad_norm 1.0 \
28 | --logging_steps 5 \
29 | --save_strategy epoch \
30 | --warmup_steps 5 \
31 | --packing False \
32 | --save_only_model True \
33 | --report_to none \
34 | --output_dir ${output_path} \
35 | --bf16 True \
36 | --plot_loss True \
37 | --ddp_timeout 180000000 \
38 | --optim adamw_torch \
39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------
/scripts/lora_sft.sh:
--------------------------------------------------------------------------------
1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1"
2 | template="qwen"
3 | learning_rate=5e-5
4 | dataset="Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Qwen2.5-32B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 | cp /mnt/xiyu/LongRePS/scripts/lora_sft.sh ${output_path}
10 |
11 | llamafactory-cli train \
12 | --stage sft \
13 | --do_train True \
14 | --model_name_or_path ${model_path} \
15 | --preprocessing_num_workers 16 \
16 | --finetuning_type lora \
17 | --lora_rank 128 \
18 | --lora_alpha 128 \
19 | --lora_dropout 0.05 \
20 | --lora_target all \
21 | --template ${template} \
22 | --flash_attn auto \
23 | --dataset_dir data \
24 | --dataset ${dataset} \
25 | --cutoff_len 16384 \
26 | --learning_rate ${learning_rate} \
27 | --num_train_epochs 2 \
28 | --max_samples 100000 \
29 | --per_device_train_batch_size 1 \
30 | --gradient_accumulation_steps 4 \
31 | --lr_scheduler_type cosine \
32 | --max_grad_norm 1.0 \
33 | --logging_steps 5 \
34 | --save_strategy epoch \
35 | --packing False \
36 | --save_only_model True \
37 | --report_to none \
38 | --output_dir ${output_path} \
39 | --bf16 True \
40 | --plot_loss True \
41 | --ddp_timeout 180000000 \
42 | --optim adamw_torch \
43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------
/scripts/merge_config.yaml:
--------------------------------------------------------------------------------
1 | model_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-warmup-epoch1
2 | adapter_name_or_path: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen-2.5-32B_sample30_temp0.7_thresh1.0_checkstage3_prm_train_lr5e-5_maxlen16k_2025-03-30-17-41-39/checkpoint-88
3 | template: qwen
4 | finetuning_type: lora
5 |
6 | export_dir: /mnt/xiyu/LongRePS/saves/Qwen2.5-32B/lora/Qwen2.5-32B-prm-checkstage3-epoch2
7 | export_size: 2
8 | export_device: cpu
9 | export_legacy_format: false
--------------------------------------------------------------------------------
/scripts/preprocess_lb.sh:
--------------------------------------------------------------------------------
1 | mkdir -p dataset/longbenchv1
2 | mkdir -p dataset/longbenchv2
3 |
4 | python preprocess_lbv1.py
5 | python preprocess_lbv2.py
--------------------------------------------------------------------------------
/scripts/qwen14b_lora.yaml:
--------------------------------------------------------------------------------
1 | model_name_or_path: /mnt/xiyu/Model/Qwen/Qwen2.5-14B
2 |
3 | stage: sft
4 | do_train: true
5 | finetuning_type: lora
6 | lora_target: all
7 |
8 | dataset: Qwen-2.5-14B_warmup
9 | dataset_dir: data
10 | template: qwen
11 | cutoff_len: 16384
12 | max_samples: 100000
13 | overwrite_cache: true
14 | preprocessing_num_workers: 16
15 |
16 | output_dir: saves/Qwen2.5-14B/lora/sft
17 | logging_steps: 5
18 | save_strategy: epoch
19 | plot_loss: true
20 | overwrite_output_dir: true
21 |
22 | per_device_train_batch_size: 1
23 | gradient_accumulation_steps: 4
24 | learning_rate: 1e-5
25 | num_train_epochs: 2.0
26 | lr_scheduler_type: constant_with_warmup
27 | warmup_steps: 5
28 | bf16: true
29 | ddp_timeout: 180000000
--------------------------------------------------------------------------------
/scripts/qwen_lora.sh:
--------------------------------------------------------------------------------
1 | model_path="/mnt/xiyu/Model/Qwen/Qwen2.5-14B"
2 | template="qwen"
3 | learning_rate=1e-5
4 | dataset="Qwen-2.5-7B_orm"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Qwen2.5-14B/lora/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 | cp /mnt/xiyu/LongRePS/scripts/qwen_lora.sh ${output_path}
10 |
11 | llamafactory-cli train \
12 | --stage sft \
13 | --do_train True \
14 | --model_name_or_path ${model_path} \
15 | --preprocessing_num_workers 16 \
16 | --finetuning_type lora \
17 | --lora_rank 128 \
18 | --lora_alpha 128 \
19 | --lora_dropout 0.05 \
20 | --lora_target all \
21 | --template ${template} \
22 | --flash_attn auto \
23 | --dataset_dir data \
24 | --dataset ${dataset} \
25 | --cutoff_len 16384 \
26 | --learning_rate ${learning_rate} \
27 | --num_train_epochs 2 \
28 | --max_samples 100000 \
29 | --per_device_train_batch_size 1 \
30 | --gradient_accumulation_steps 4 \
31 | --lr_scheduler_type cosine \
32 | --max_grad_norm 1.0 \
33 | --logging_steps 5 \
34 | --save_strategy epoch \
35 | --packing False \
36 | --save_only_model True \
37 | --report_to none \
38 | --output_dir ${output_path} \
39 | --bf16 True \
40 | --plot_loss True \
41 | --ddp_timeout 180000000 \
42 | --optim adamw_torch \
43 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------
/scripts/qwen_sft.sh:
--------------------------------------------------------------------------------
1 | model_path="/mnt/xiyu/LongRePS/saves/Qwen2.5-7B/full/Qwen-2.5-7B_warmup_train_lr1e-5_maxlen16k_2025-03-26-17-19-47/checkpoint-18"
2 | template="qwen"
3 | learning_rate=5e-6
4 | dataset="Qwen-2.5-7B_sample100_thresh1.0_yarn_checkstage3_prm"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 |
10 | llamafactory-cli train \
11 | --stage sft \
12 | --do_train True \
13 | --model_name_or_path ${model_path} \
14 | --preprocessing_num_workers 16 \
15 | --finetuning_type full \
16 | --template ${template} \
17 | --flash_attn auto \
18 | --dataset_dir data \
19 | --dataset ${dataset} \
20 | --cutoff_len 16384 \
21 | --learning_rate ${learning_rate} \
22 | --num_train_epochs 2 \
23 | --max_samples 100000 \
24 | --per_device_train_batch_size 1 \
25 | --gradient_accumulation_steps 4 \
26 | --lr_scheduler_type constant_with_warmup \
27 | --max_grad_norm 1.0 \
28 | --logging_steps 5 \
29 | --save_strategy epoch \
30 | --warmup_steps 5 \
31 | --packing False \
32 | --save_only_model True \
33 | --report_to none \
34 | --output_dir ${output_path} \
35 | --bf16 True \
36 | --plot_loss True \
37 | --ddp_timeout 180000000 \
38 | --optim adamw_torch \
39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------
/scripts/qwen_warmup.sh:
--------------------------------------------------------------------------------
1 | model_path="../Model/Qwen/Qwen2.5-7B"
2 | template="qwen"
3 | learning_rate=1e-5
4 | dataset="Qwen-2.5-7B_warmup"
5 | echo "Dataname: ${dataset}"
6 |
7 | output_path="saves/Qwen2.5-7B/full/${dataset}_train_lr${learning_rate}_maxlen16k_"$(date -d "+8 hours" +"%Y-%m-%d-%H-%M-%S")
8 | mkdir -p ${output_path}
9 |
10 | llamafactory-cli train \
11 | --stage sft \
12 | --do_train True \
13 | --model_name_or_path ${model_path} \
14 | --preprocessing_num_workers 16 \
15 | --finetuning_type full \
16 | --template ${template} \
17 | --flash_attn auto \
18 | --dataset_dir data \
19 | --dataset ${dataset} \
20 | --cutoff_len 16384 \
21 | --learning_rate ${learning_rate} \
22 | --num_train_epochs 2 \
23 | --max_samples 100000 \
24 | --per_device_train_batch_size 1 \
25 | --gradient_accumulation_steps 4 \
26 | --lr_scheduler_type constant_with_warmup \
27 | --max_grad_norm 1.0 \
28 | --logging_steps 5 \
29 | --save_strategy epoch \
30 | --warmup_steps 5 \
31 | --packing False \
32 | --save_only_model True \
33 | --report_to none \
34 | --output_dir ${output_path} \
35 | --bf16 True \
36 | --plot_loss True \
37 | --ddp_timeout 180000000 \
38 | --optim adamw_torch \
39 | --deepspeed cache/ds_z3_config.json > ${output_path}/output.log 2>&1
--------------------------------------------------------------------------------