├── .gitignore
├── figures
│   ├── framework.jpg
│   ├── overview.jpg
│   └── main_results.jpg
├── code
│   ├── passk_adv.py
│   └── maze_verifier.py
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store

--------------------------------------------------------------------------------
/figures/framework.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/framework.jpg

--------------------------------------------------------------------------------
/figures/overview.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/overview.jpg

--------------------------------------------------------------------------------
/figures/main_results.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RUCAIBox/Passk_Training/HEAD/figures/main_results.jpg

--------------------------------------------------------------------------------
/code/passk_adv.py:
--------------------------------------------------------------------------------
from collections import defaultdict

import numpy as np
import torch
from scipy.special import comb


def calc_adv(val, k):
    """Analytical Pass@k advantages for one group of rollouts.

    val: array of 0/1 outcome rewards for the group; k: the k in Pass@k.
    """
    c = len(np.where(val == 1)[0])           # number of positive (correct) rollouts
    n = len(val)                             # group size, i.e. N_rollout
    rho = 1 - comb(n - c, k) / comb(n, k)    # group-level Pass@k reward (R_bar^group)
    sigma = np.sqrt(rho * (1 - rho))         # group-level standard deviation
    adv_p = (1 - rho) / (sigma + 1e-6)       # advantage of positive responses
    adv_n = (1 - rho - comb(n - c - 1, k - 1) / comb(n - 1, k - 1)) / (sigma + 1e-6)  # advantage of negative responses
    new_val = np.where(val == 1, adv_p, val)
    new_val = np.where(new_val == 0, adv_n, new_val)
    return new_val


def compute_advantage(token_level_rewards, response_mask, index, K):
    # Outcome reward of each response (sum of its token-level rewards).
    scores = token_level_rewards.sum(dim=-1)

    id2score = defaultdict(list)
    uid2sid = defaultdict(list)

    with torch.no_grad():
        bsz = scores.shape[0]
        # Group responses by the uid of their prompt.
        for i in range(bsz):
            id2score[index[i]].append(scores[i].detach().item())
            uid2sid[index[i]].append(i)
        # Replace each response's reward with its analytical Pass@K advantage.
        for uid in id2score.keys():
            reward = np.array(id2score[uid])
            adv = calc_adv(reward, K)
            for i in range(len(uid2sid[uid])):
                scores[uid2sid[uid][i]] = adv[i]

    # Broadcast each per-response advantage over its response tokens.
    scores = scores.unsqueeze(-1) * response_mask

    # The same tensor serves as both advantages and returns.
    return scores, scores
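
if __name__ == "__main__":
    # Minimal sanity check (illustrative values only, not part of the training pipeline):
    # a group of 8 rollouts, 3 of them correct, trained with Pass@4.
    rewards = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=float)
    print(calc_adv(rewards, k=4))
    # All positive responses share one advantage value (A_pos) and all negative
    # responses share another (A_neg), matching the closed form in the README.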
--------------------------------------------------------------------------------
/code/maze_verifier.py:
--------------------------------------------------------------------------------
def extract_answer_maze(s):
    # Extract the final answer string from a model response.
    # NOTE: the two delimiter strings were lost when this file was rendered;
    # '<answer>' / '</answer>' below are assumed placeholders for the original tags.
    if ('<answer>' in s and '</answer>' in s):
        s = s.split('<answer>')[-1].strip().split('</answer>')[0].strip()
    ans = s.split("boxed")
    if len(ans) == 1:
        return s
    ans = ans[-1]
    if len(ans) == 0:
        return ""
    try:
        if ans[0] == "{":
            # Take the brace-balanced content of \boxed{...}.
            stack = 1
            a = ""
            for c in ans[1:]:
                if c == "{":
                    stack += 1
                    a += c
                elif c == "}":
                    stack -= 1
                    if stack == 0:
                        break
                    a += c
                else:
                    a += c
        else:
            a = ans.split("$")[0].strip()
    except Exception:
        return ""
    return a


def compute_score(solution, maze):
    # The maze is a grid containing 'S' (start), 'E' (exit) and '*' (wall) cells.
    # The extracted solution is a sequence of moves (L/R/U/D); it scores 1 only if
    # it ends on 'E' without leaving the grid or stepping on a wall.
    solution = extract_answer_maze(solution)
    solution = solution.upper()
    maze = maze.strip().split('\n')
    n = len(maze)
    m = len(maze[0])

    def find_st(maze):
        for i in range(n):
            for j in range(m):
                if (maze[i][j] == 'S'):
                    return i, j
        assert(False)

    x, y = find_st(maze)
    for step in solution:
        if (step == 'L'):
            y = y - 1
        elif (step == 'R'):
            y = y + 1
        elif (step == 'U'):
            x = x - 1
        elif (step == 'D'):
            x = x + 1
        else:
            continue

        if (x < 0 or x >= n):
            return 0
        if (y < 0 or y >= m):
            return 0
        if (maze[x][y] == '*'):
            return 0

    if (maze[x][y] == 'E'):
        return 1
    else:
        return 0
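
if __name__ == "__main__":
    # Illustrative check on a tiny hand-written maze (not taken from the released dataset):
    # 'S' = start, 'E' = exit, '*' = wall, '.' = free cell.
    toy_maze = "S.*\n..*\n.E*"
    print(compute_score(r"The moves are \boxed{DDR}", toy_maze))  # 1: reaches the exit
    print(compute_score(r"\boxed{RR}", toy_maze))                 # 0: walks into a wall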
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Pass@k Training for Adaptively Balancing Exploration and Exploitation of LRMs

[![Paper](https://img.shields.io/badge/paper-5f16a8?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.10751)
[![Code](https://img.shields.io/badge/Code-3858bf?style=for-the-badge&logo=github)](https://github.com/RUCAIBox/Passk_Training/tree/main/code)
[![Dataset](https://img.shields.io/badge/Datasets-4d8cd8?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/RUC-AIBOX/Passk_Training_Maze)

# Introduction

Reinforcement Learning with Verifiable Rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation, causing the policy to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial.

In prior work, although Pass@k has been used as an evaluation metric, its connection to the exploration ability of LLMs remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., **Pass@k training**) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we take an initial step toward advantage design for RLVR, showing promising results and pointing to a potential future direction.

![](./figures/overview.jpg)

# Pass@k Training

Given a question $x$, the policy model rolls out $k$ responses through a specific decoding strategy or search algorithm (e.g., a sampling-based decoding strategy or Monte Carlo Tree Search). The $i$-th sampled response $\hat{y}\_i$ receives a reward $R\_i$ from the verifier. Based on this, the Pass@k metric is defined as the expected maximum reward over the $k$ sampled responses. Formally, it can be computed as

$$
\text{Pass@k} = \mathbb{E}\_{(x,y)\sim D,\\{\hat{y}\_i\\}_{i=1}^{k}\sim \pi\_\theta(\cdot|x)}\left[\max\left(R_1, \dots, R_k\right)\right].
$$

The Pass@k metric is used as the reward function in the RLVR training process. To improve the effectiveness and efficiency of Pass@k training, we compute the analytical solution of the advantage values of positive and negative responses as follows,

$$
\bar{R}^{\text{group}}=1-\frac{\binom{N\_\text{neg}}{k}}{\binom{N\_\text{rollout}}{k}}, \quad \sigma^\text{group}=\sqrt{\bar{R}^\text{group}\times\left(1-\bar{R}^\text{group}\right)},
$$

$$
\hat{A}\_{\text{pos}}=\frac{1-\bar{R}^{\text{group}}}{\sigma^{\text{group}}}, \quad \hat{A}\_{\text{neg}}=\left(1-\bar{R}^\text{group}-\frac{\binom{N\_\text{neg}-1}{k-1}}{\binom{N\_\text{rollout}-1}{k-1}}\right)\times\left(\sigma^\text{group}\right)^{-1}.
$$

The implementation of **Pass@k Training with Analytical Derivation** can be found in [`code/passk_adv.py`](code/passk_adv.py), which computes the advantage value of each response and is adapted to the verl framework.
In addition, the verifier for the Maze task can be found in [`code/maze_verifier.py`](code/maze_verifier.py). A small worked example of the advantage computation is sketched below.
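
As a concrete illustration, the snippet below evaluates these closed-form quantities for a single group of rollouts. It is a minimal sketch that mirrors `calc_adv` in [`code/passk_adv.py`](code/passk_adv.py); the group size, number of negative responses, and value of $k$ are illustrative.

```python
import numpy as np
from scipy.special import comb

# Illustrative group: 8 rollouts, 5 of them negative, trained with Pass@4.
n_rollout, n_neg, k = 8, 5, 4

r_group = 1 - comb(n_neg, k) / comb(n_rollout, k)   # group-level Pass@k reward
sigma_group = np.sqrt(r_group * (1 - r_group))      # group-level standard deviation
a_pos = (1 - r_group) / sigma_group                 # advantage of every positive response
a_neg = (1 - r_group - comb(n_neg - 1, k - 1) / comb(n_rollout - 1, k - 1)) / sigma_group

print(a_pos, a_neg)
```

Because every positive response in a group shares $\hat{A}\_{\text{pos}}$ and every negative response shares $\hat{A}\_{\text{neg}}$, the advantages are obtained in closed form, avoiding the variance of estimating Pass@k by repeatedly sampling $k$-subsets of the rollouts.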

![](./figures/framework.jpg)

# Key Insights

+ Compared to Pass@1 training, **Pass@k training significantly enhances the exploration ability of LLMs, improving Pass@k performance while maintaining Pass@1**. Among its three progressive variants, bootstrap sampling offers higher training efficiency than full sampling, and analytical derivation serves as its theoretical asymptotic form, mitigating the variance introduced by sampling.

+ Compared to baseline methods, Pass@k training is both robust to different values of $k$ and generalizable across domains and tasks. **Enhancing the exploration ability of LLMs helps improve their exploitation through continual training**, enabling a 7B LLM to surpass powerful LLMs (e.g., GPT-4o and Claude-3.7) and highlighting the practical value of Pass@k training.

+ Pass@k training with analytical derivation, which directly designs the advantage function, can be viewed as a form of implicit reward design. Following this idea, our empirical experiments suggest that **implicit reward design allows finer-grained control over optimization**, such as focusing on harder problems or improving training efficiency, without complex theoretical derivations, making it a promising direction for future RLVR development.

# Key Performance

The benefits of Pass@k training transfer to the Pass@1 performance of LLMs, and this transfer is not affected by model scale (e.g., 7B or 32B), model architecture (e.g., dense or MoE), model family (i.e., Qwen or Seed), or downstream task (natural language or multi-modal tasks).

![](./figures/main_results.jpg)

# Reference

Please kindly cite our report if it is helpful for your research.

```
@article{Passk_Training,
  title={Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models},
  author={Chen, Zhipeng and Qin, Xiaobo and Wu, Youbin and Ling, Yue and Ye, Qinghao and Zhao, Wayne Xin and Shi, Guang},
  journal={arXiv preprint arXiv:2508.10751},
  year={2025}
}
```

--------------------------------------------------------------------------------