├── .gitignore
├── figures
│   ├── framework.jpg
│   ├── overview.jpg
│   └── main_results.jpg
├── code
│   ├── passk_adv.py
│   └── maze_verifier.py
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store

--------------------------------------------------------------------------------
/figures/framework.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/framework.jpg

--------------------------------------------------------------------------------
/figures/overview.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/overview.jpg

--------------------------------------------------------------------------------
/figures/main_results.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/main_results.jpg

--------------------------------------------------------------------------------
/code/passk_adv.py:
--------------------------------------------------------------------------------
from collections import defaultdict

import numpy as np
import torch
from scipy.special import comb


def calc_adv(val, k):
    """Analytical Pass@k advantages for one group of rollouts.

    val: array of 0/1 outcome rewards for the group; k: the k in Pass@k.
    """
    c = len(np.where(val == 1)[0])           # number of positive (correct) rollouts
    n = len(val)                             # group size, i.e. N_rollout
    rho = 1 - comb(n - c, k) / comb(n, k)    # group-level Pass@k reward (R_bar^group)
    sigma = np.sqrt(rho * (1 - rho))         # group-level standard deviation
    adv_p = (1 - rho) / (sigma + 1e-6)       # advantage of positive responses
    adv_n = (1 - rho - comb(n - c - 1, k - 1) / comb(n - 1, k - 1)) / (sigma + 1e-6)  # advantage of negative responses
    new_val = np.where(val == 1, adv_p, val)
    new_val = np.where(new_val == 0, adv_n, new_val)
    return new_val


def compute_advantage(token_level_rewards, response_mask, index, K):
    # Outcome reward of each response (sum of its token-level rewards).
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    uid2sid = defaultdict(list)

    with torch.no_grad():
        bsz = scores.shape[0]
        # Group responses by the uid of their prompt.
        for i in range(bsz):
            id2score[index[i]].append(scores[i].detach().item())
            uid2sid[index[i]].append(i)
        # Replace each response's reward with its analytical Pass@K advantage.
        for uid in id2score.keys():
            reward = np.array(id2score[uid])
            adv = calc_adv(reward, K)
            for i in range(len(uid2sid[uid])):
                scores[uid2sid[uid][i]] = adv[i]

    # Broadcast each per-response advantage over its response tokens.
    scores = scores.unsqueeze(-1) * response_mask

    # The same tensor serves as both advantages and returns.
    return scores, scores
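
if __name__ == "__main__":
    # Minimal sanity check (illustrative values only, not part of the training pipeline):
    # a group of 8 rollouts, 3 of them correct, trained with Pass@4.
    rewards = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)
    print(calc_adv(rewards, k=4))
    # All positive responses share one advantage value (A_pos) and all negative
    # responses share another (A_neg), matching the closed form in the README.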
--------------------------------------------------------------------------------
/code/maze_verifier.py:
--------------------------------------------------------------------------------
def extract_answer_maze(s):
    # Extract the final answer string from a model response.
    # NOTE: the two delimiter strings were lost when this file was rendered;
    # '<answer>' / '</answer>' below are assumed placeholders for the original tags.
    if ('<answer>' in s and '</answer>' in s):
        s = s.split('<answer>')[-1].strip().split('</answer>')[0].strip()
    ans = s.split("boxed")
    if len(ans) == 1:
        return s
    ans = ans[-1]
    if len(ans) == 0:
        return ""
    try:
        if ans[0] == "{":
            # Take the brace-balanced content of \boxed{...}.
            stack = 1
            a = ""
            for c in ans[1:]:
                if c == "{":
                    stack += 1
                    a += c
                elif c == "}":
                    stack -= 1
                    if stack == 0:
                        break
                    a += c
                else:
                    a += c
        else:
            a = ans.split("$")[0].strip()
    except Exception:
        return ""
    return a


def compute_score(solution, maze):
    # The maze is a grid containing 'S' (start), 'E' (exit) and '*' (wall) cells.
    # The extracted solution is a sequence of moves (L/R/U/D); it scores 1 only if
    # it ends on 'E' without leaving the grid or stepping on a wall.
    solution = extract_answer_maze(solution)
    solution = solution.upper()
    maze = maze.strip().split('\n')
    n = len(maze)
    m = len(maze[0])

    def find_st(maze):
        for i in range(n):
            for j in range(m):
                if (maze[i][j] == 'S'):
                    return i, j
        assert(False)

    x, y = find_st(maze)
    for step in solution:
        if (step == 'L'):
            y = y - 1
        elif (step == 'R'):
            y = y + 1
        elif (step == 'U'):
            x = x - 1
        elif (step == 'D'):
            x = x + 1
        else:
            continue

        if (x < 0 or x >= n):
            return 0
        if (y < 0 or y >= m):
            return 0
        if (maze[x][y] == '*'):
            return 0

    if (maze[x][y] == 'E'):
        return 1
    else:
        return 0
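
if __name__ == "__main__":
    # Illustrative check on a tiny hand-written maze (not taken from the released dataset):
    # 'S' = start, 'E' = exit, '*' = wall, '.' = free cell.
    toy_maze = "S.*\n..*\n.E*"
    print(compute_score(r"The moves are \boxed{DDR}", toy_maze))  # 1: reaches the exit
    print(compute_score(r"\boxed{RR}", toy_maze))                 # 0: walks into a wall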
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Pass@k Training for Adaptively Balancing Exploration and Exploitation of LRMs

[![Paper](https://img.shields.io/badge/paper-5f16a8?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.10751)
[![Code](https://img.shields.io/badge/Code-3858bf?style=for-the-badge&logo=github)](https://github.com/RUCAIBox/Passk_Training/tree/main/code)
[![Dataset](https://img.shields.io/badge/Datasets-4d8cd8?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/RUC-AIBOX/Passk_Training_Maze)

# Introduction

Reinforcement Learning with Verifiable Rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation, causing the policy to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial.

In prior work, although Pass@k has been used as an evaluation metric, its connection to the exploration ability of LLMs remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., **Pass@k training**) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we take an initial step toward advantage design for RLVR, showing promising results and pointing to a potential future direction.

![](./figures/overview.jpg)

# Pass@k Training

Given a question $x$, the policy model rolls out $k$ responses through a specific decoding strategy or search algorithm (e.g., a sampling-based decoding strategy or Monte Carlo Tree Search). The $i$-th sampled response $\hat{y}\_i$ receives a reward $R\_i$ from the verifier. Based on this, the Pass@k metric is defined as the expected maximum reward over the $k$ sampled responses. Formally, it can be computed as

$$
\text{Pass@k} = \mathbb{E}\_{(x,y)\sim D,\\{\hat{y}\_i\\}_{i=1}^{k}\sim \pi\_\theta(\cdot|x)}\left[\max\left(R_1, \dots, R_k\right)\right].
$$

The Pass@k metric is used as the reward function in the RLVR training process. To improve the effectiveness and efficiency of Pass@k training, we compute the analytical solution of the advantage values of positive and negative responses as follows,

$$
\bar{R}^{\text{group}}=1-\frac{\binom{N\_\text{neg}}{k}}{\binom{N\_\text{rollout}}{k}}, \quad \sigma^\text{group}=\sqrt{\bar{R}^\text{group}\times\left(1-\bar{R}^\text{group}\right)},
$$

$$
\hat{A}\_{\text{pos}}=\frac{1-\bar{R}^{\text{group}}}{\sigma^{\text{group}}}, \quad \hat{A}\_{\text{neg}}=\left(1-\bar{R}^\text{group}-\frac{\binom{N\_\text{neg}-1}{k-1}}{\binom{N\_\text{rollout}-1}{k-1}}\right)\times\left(\sigma^\text{group}\right)^{-1}.
$$

The implementation of **Pass@k Training with Analytical Derivation** can be found in [`code/passk_adv.py`](code/passk_adv.py), which computes the advantage value of each response and is adapted to the verl framework.
In addition, the verifier for the Maze task can be found in [`code/maze_verifier.py`](code/maze_verifier.py). A small worked example of the advantage computation is sketched below.
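
As a concrete illustration, the snippet below evaluates these closed-form quantities for a single group of rollouts. It is a minimal sketch that mirrors `calc_adv` in [`code/passk_adv.py`](code/passk_adv.py); the group size, number of negative responses, and value of $k$ are illustrative.

```python
import numpy as np
from scipy.special import comb

# Illustrative group: 8 rollouts, 5 of them negative, trained with Pass@4.
n_rollout, n_neg, k = 8, 5, 4

r_group = 1 - comb(n_neg, k) / comb(n_rollout, k)   # group-level Pass@k reward
sigma_group = np.sqrt(r_group * (1 - r_group))      # group-level standard deviation
a_pos = (1 - r_group) / sigma_group                 # advantage of every positive response
a_neg = (1 - r_group - comb(n_neg - 1, k - 1) / comb(n_rollout - 1, k - 1)) / sigma_group

print(a_pos, a_neg)
```

Because every positive response in a group shares $\hat{A}\_{\text{pos}}$ and every negative response shares $\hat{A}\_{\text{neg}}$, the advantages are obtained in closed form, avoiding the variance of estimating Pass@k by repeatedly sampling $k$-subsets of the rollouts.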

![](./figures/framework.jpg)

# Key Insights

+ Compared to Pass@1 training, **Pass@k training significantly enhances the exploration ability of LLMs, improving Pass@k performance while maintaining Pass@1**. Among its three progressive variants, bootstrap sampling offers higher training efficiency than full sampling, and analytical derivation serves as its theoretical asymptotic form, mitigating the variance introduced by sampling.

+ Compared to baseline methods, Pass@k training is both robust to different values of $k$ and generalizable across domains and tasks. **Enhancing the exploration ability of LLMs helps improve their exploitation through continual training**, enabling a 7B LLM to surpass powerful LLMs (e.g., GPT-4o and Claude-3.7) and highlighting the practical value of Pass@k training.

+ Pass@k training with analytical derivation, which directly designs the advantage function, can be viewed as a form of implicit reward design. Following this idea, our empirical experiments suggest that **implicit reward design allows finer-grained control over optimization**, such as focusing on harder problems or improving training efficiency, without complex theoretical derivations, making it a promising direction for future RLVR development.

# Key Performance

The benefits of Pass@k training transfer to the Pass@1 performance of LLMs, and this transfer is not affected by model scale (e.g., 7B or 32B), model architecture (e.g., dense or MoE), model family (i.e., Qwen or Seed), or downstream task (natural language or multi-modal tasks).

![](./figures/main_results.jpg)

# Reference

Please kindly cite our report if it is helpful for your research.

```
@article{Passk_Training,
  title={Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models},
  author={Chen, Zhipeng and Qin, Xiaobo and Wu, Youbin and Ling, Yue and Ye, Qinghao and Zhao, Wayne Xin and Shi, Guang},
  journal={arXiv preprint arXiv:2508.10751},
  year={2025}
}
```

--------------------------------------------------------------------------------