# A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

Jialun Zhong<sup>1,4∗</sup>, Wei Shen<sup>2∗</sup>, Yanzeng Li<sup>1</sup>, Songyang Gao<sup>2</sup>, Hua Lu<sup>3</sup>, Yicheng Chen<sup>4</sup>, Yang Zhang<sup>4</sup>, Jinjie Gu<sup>4</sup>, Wei Zhou<sup>4</sup>, Lei Zou<sup>1†</sup>

<sup>1</sup>Peking University · <sup>2</sup>Fudan University · <sup>3</sup>Huazhong University of Science and Technology · <sup>4</sup>Ant Group

[arxiv]

😄 Contributions are welcome: please recommend missing papers via **`Issues`** or **`Pull Requests`**.

## Paper List

### 🔍 Preference Collection

#### Human Preference

* Deep Reinforcement Learning from Human Preferences `2017` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf)]
* Batch Active Preference-Based Learning of Reward Functions `2018` [[CoRL](https://proceedings.mlr.press/v87/biyik18a/biyik18a.pdf)]
* Reward learning from human preferences and demonstrations in Atari `2018` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2018/file/8cbe9ce23f42628c98f80fa0fac8b19a-Paper.pdf)]
* Active Preference-Based Gaussian Process Regression for Reward Learning `2020` [[RSS](https://www.roboticsproceedings.org/rss16/p041.pdf)]
* Information Directed Reward Learning for Reinforcement Learning `2021` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2021/file/1fa6269f58898f0e809575c9a48747ef-Paper.pdf)]
* PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training `2021` [[ICML](https://proceedings.mlr.press/v139/lee21i/lee21i.pdf)]
* Improving alignment of dialogue agents via targeted human judgements `2022` [[arxiv](https://arxiv.org/pdf/2209.14375)]
* Training language models to follow instructions with human feedback `2022` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)]
* SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning `2022` [[ICLR](https://openreview.net/pdf?id=TfhfZLQ2EJO)]
* Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback `2022` [[arxiv](https://arxiv.org/pdf/2204.05862)]
* Active Reward Learning from Multiple Teachers `2023` [[AAAI Workshop](https://ceur-ws.org/Vol-3381/48.pdf)]
* RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback `2023` [[ICML Workshop](https://openreview.net/pdf?id=JvkZtzJBFQ)]
* Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback `2023` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2023/file/99766cda865be123d55a1d9666c7b9fc-Paper-Conference.pdf)]
* Fine-Grained Human Feedback Gives Better Rewards for Language Model Training `2023` [[NeurIPS](https://papers.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf)]
* Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback `2024` [[ICLR](https://openreview.net/pdf?id=WesY0H9ghM)]
* HelpSteer2: Open-source dataset for training top-performing reward models `2024` [[arxiv](https://arxiv.org/pdf/2406.08673)]
* Batch Active Learning of Reward Functions from Human Preferences `2024` [[arxiv](https://arxiv.org/pdf/2402.15757)]
* Towards Comprehensive Preference Data Collection for Reward Modeling `2024` [[arxiv](https://arxiv.org/pdf/2406.16486)]
* RLHF Workflow: From Reward Modeling to Online RLHF `2024` [[TMLR](https://openreview.net/pdf?id=a13aYUU9eU)]
* Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data `2024` [[ICML](https://openreview.net/pdf?id=bWNPx6t0sF)]
* Less is More: Improving LLM Alignment via Preference Data Selection `2025` [[arxiv](https://arxiv.org/pdf/2502.14560)]
* RLTHF: Targeted Human Feedback for LLM Alignment `2025` [[arxiv](https://arxiv.org/pdf/2502.13417)]
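
Most of the human-preference pipelines above collect pairwise comparisons, or full rankings that are expanded into pairs. A minimal sketch of that data shape in Python; the field and function names are illustrative and not tied to any particular dataset:

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One pairwise comparison, the basic unit most RLHF pipelines collect."""
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator dispreferred


def ranking_to_pairs(prompt: str, responses: list[str], ranking: list[int]) -> list[PreferencePair]:
    """Expand a full ranking over k responses into k*(k-1)/2 pairwise comparisons.

    `ranking` lists indices into `responses` from best to worst, e.g. [2, 0, 1].
    """
    pairs = []
    for i, better in enumerate(ranking):
        for worse in ranking[i + 1:]:
            pairs.append(PreferencePair(prompt, chosen=responses[better], rejected=responses[worse]))
    return pairs


# Three responses ranked r2 > r0 > r1 yield three training pairs.
pairs = ranking_to_pairs("Explain RLHF in one sentence.", ["r0", "r1", "r2"], ranking=[2, 0, 1])
assert len(pairs) == 3 and pairs[0].chosen == "r2"
```

Ranking several responses per prompt and expanding the ranking into pairs is the recipe popularized by the instruction-following RLHF work listed above.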

#### AI Preference

* Constitutional AI: Harmlessness from AI Feedback `2022` [[arxiv](https://arxiv.org/pdf/2212.08073)]
* AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback `2023` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2023/file/5fc47800ee5b30b8777fdd30abcaaf3b-Paper-Conference.pdf)]
* Aligning Large Language Models through Synthetic Feedback `2023` [[EMNLP](https://aclanthology.org/2023.emnlp-main.844.pdf)]
* RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback `2024` [[ICML](https://openreview.net/pdf?id=uydQ2W41KO)]
* UltraFeedback: Boosting Language Models with Scaled AI Feedback `2024` [[ICML](https://openreview.net/forum?id=BOorDpKHiJ)]
* SALMON: Self-Alignment with Instructable Reward Models `2024` [[ICLR](https://openreview.net/pdf?id=xJbsmB8UMx)]
* Improving Reward Models with Synthetic Critiques `2024` [[arxiv](https://arxiv.org/pdf/2405.20850)]
* Self-Generated Critiques Boost Reward Modeling for Language Models `2024` [[arxiv](https://arxiv.org/pdf/2411.16646)]
* Safer-Instruct: Aligning Language Models with Automated Preference Data `2024` [[NAACL](https://aclanthology.org/2024.naacl-long.422.pdf)]
* Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization `2024` [[EMNLP Findings](https://aclanthology.org/2024.findings-emnlp.56.pdf)]
* RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment `2024` [[ICLR](https://openreview.net/pdf?id=v3XXtxWKi6)]
* West-of-N: Synthetic Preference Generation for Improved Reward Modeling `2024` [[ICLR Workshop](https://openreview.net/pdf?id=7kNwZhMefs)]
* RMBoost: Reward Model Training With Preference-Conditional Multi-Aspect Synthetic Data Generation `2025` [[ICLR Workshop](https://openreview.net/pdf?id=pcehmKPjX5)]
* Interpreting Language Model Preferences Through the Lens of Decision Trees `2025` [[Online](https://rlhflow.github.io/posts/2025-01-22-decision-tree-reward-model/)]

### 🖥️ Reward Modeling

#### Type-Level

##### Discriminative Reward

* LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion `2023` [[ACL](https://aclanthology.org/2023.acl-long.792.pdf)]
* InternLM2 Technical Report `2024` [[arxiv](https://arxiv.org/pdf/2403.17297)]
* Advancing LLM Reasoning Generalists with Preference Trees `2024` [[arxiv](https://arxiv.org/pdf/2404.02078)]
* Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs `2024` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2024/file/71f7154547c748c8041505521ca433ab-Paper-Conference.pdf)]
* MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences `2024` [[ICLR](https://openreview.net/pdf?id=slO3xTt4CG)]
* Nemotron-4 340B Technical Report `2024` [[arxiv](https://arxiv.org/pdf/2406.11704)]
* Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts `2024` [[arxiv](https://arxiv.org/pdf/2406.12845)]
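
Discriminative reward models like those above typically attach a scalar value head to an LLM and fit it to pairwise preferences with a Bradley-Terry style objective. A minimal PyTorch sketch of that loss, assuming the (prompt, response) pairs have already been scored; the scoring model itself is left abstract:

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-modeling loss: -log sigmoid(r_chosen - r_rejected).

    Both tensors have shape (batch,) and come from the same scalar reward
    head applied to (prompt, chosen) and (prompt, rejected).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy check: the loss falls as the margin between chosen and rejected grows.
good = bradley_terry_loss(torch.tensor([2.0]), torch.tensor([-1.0]))
bad = bradley_terry_loss(torch.tensor([-1.0]), torch.tensor([2.0]))
assert good < bad
```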

##### Generative Reward

* Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena `2023` [[NeurIPS](https://openreview.net/pdf?id=uccHPGDlao)]
* Generative Judge for Evaluating Alignment `2024` [[ICLR](https://openreview.net/pdf?id=gtkFw6sZGS)]
* Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models `2024` [[EMNLP](https://aclanthology.org/2024.emnlp-main.248.pdf)]
* CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution `2024` [[arxiv](https://arxiv.org/pdf/2410.16256)]
* LLM Critics Help Catch LLM Bugs `2024` [[arxiv](https://arxiv.org/pdf/2407.00215)]
* LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback `2024` [[arxiv](https://arxiv.org/pdf/2406.14024)]
* Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge `2024` [[arxiv](https://arxiv.org/pdf/2407.19594)]
* Self-Taught Evaluators `2024` [[arxiv](https://arxiv.org/pdf/2408.02666)]
* Self-Rewarding Language Models `2024` [[ICML](https://openreview.net/pdf?id=0NphYCmgua)]
* Direct Judgement Preference Optimization `2024` [[arxiv](https://arxiv.org/pdf/2409.14664)]
* Generative Reward Models `2024` [[arxiv](https://arxiv.org/pdf/2410.12832)]
* Generative Verifiers: Reward Modeling as Next-Token Prediction `2024` [[arxiv](https://arxiv.org/pdf/2408.15240)]
* Beyond Scalar Reward Model: Learning Generative Judge from Preference Data `2024` [[arxiv](https://arxiv.org/pdf/2410.03742)]
* Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint `2024` [[ACL Findings](https://aclanthology.org/2024.findings-acl.338.pdf)]
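
Generative reward models produce the judgement as text instead of a scalar: the judge LLM is prompted with the question and candidate answers and asked for a critique plus a verdict, which is then parsed into a preference. A schematic sketch; `generate` stands for any LLM call and the prompt wording is illustrative rather than taken from a specific paper:

```python
from typing import Callable

JUDGE_TEMPLATE = """You are comparing two answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Briefly explain your reasoning, then end with exactly one line of the form:
Verdict: A
or
Verdict: B
"""


def judge_pair(question: str, answer_a: str, answer_b: str,
               generate: Callable[[str], str]) -> str:
    """Ask a judge LLM to compare two answers; return 'A' or 'B' parsed from its verdict."""
    critique = generate(JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b))
    for line in reversed(critique.strip().splitlines()):
        if line.strip().startswith("Verdict:"):
            verdict = line.strip().removeprefix("Verdict:").strip()
            if verdict in ("A", "B"):
                return verdict
    raise ValueError("judge output did not contain a parsable verdict")


# Usage with a stub "LLM" that always prefers answer B.
print(judge_pair("What is 2 + 2?", "5", "4", generate=lambda _prompt: "A is wrong.\nVerdict: B"))
```

In practice the two answers are usually judged in both orders and the verdicts aggregated, since LLM judges are known to exhibit position bias (see the Evaluation challenges later in this list).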

##### Implicit Reward

* Direct Preference Optimization: Your Language Model is Secretly a Reward Model `2023` [[NeurIPS](https://openreview.net/pdf?id=HPuSIXJaa9)]
* SLiC-HF: Sequence Likelihood Calibration with Human Feedback `2023` [[arxiv](https://arxiv.org/pdf/2305.10425)]
* A General Theoretical Paradigm to Understand Learning from Human Preferences `2023` [[arxiv](https://arxiv.org/pdf/2310.12036)]
* A Minimaximalist Approach to Reinforcement Learning from Human Feedback `2024` [[ICML](https://openreview.net/pdf?id=5kVgd2MwMY)]
* Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive `2024` [[arxiv](https://arxiv.org/pdf/2402.13228)]
* From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function `2024` [[COLM](https://openreview.net/pdf?id=kEVcNxtqXk)]
* Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs `2024` [[arxiv](https://arxiv.org/pdf/2406.18629)]
* Token-level Direct Preference Optimization `2024` [[ICML](https://openreview.net/pdf?id=1RZKuvqYCR)]
* $β$-DPO: Direct Preference Optimization with Dynamic $β$ `2024` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2024/file/ea888178abdb6fc233226d12321d754f-Paper-Conference.pdf)]
* Generalized Preference Optimization: A Unified Approach to Offline Alignment `2024` [[ICML](https://openreview.net/pdf?id=gu3nacA9AH)]
* Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation `2024` [[ICML](https://openreview.net/pdf?id=51iwkioZpn)]
* Offline Regularised Reinforcement Learning for Large Language Models Alignment `2024` [[arxiv](https://arxiv.org/pdf/2405.19107)]
* Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences `2024` [[arxiv](https://arxiv.org/pdf/2404.03715)]
* ORPO: Monolithic Preference Optimization without Reference Model `2024` [[EMNLP](https://aclanthology.org/2024.emnlp-main.626.pdf)]
* Mixed Preference Optimization: A Two-stage Reinforcement Learning with Human Feedbacks `2024` [[arxiv](https://arxiv.org/pdf/2403.19443)]
* LiPO: Listwise Preference Optimization through Learning-to-Rank `2024` [[arxiv](https://arxiv.org/pdf/2402.01878)]
* Noise Contrastive Alignment of Language Models with Explicit Rewards `2024` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2024/file/d5a58d198afa370a3dff0e1ca4fe1802-Paper-Conference.pdf)]
* SimPO: Simple Preference Optimization with a Reference-Free Reward `2024` [[NeurIPS](https://openreview.net/pdf?id=3Tzcot1LKb)]
* Direct Preference Optimization with an Offset `2024` [[ACL Findings](https://aclanthology.org/2024.findings-acl.592.pdf)]
* Statistical Rejection Sampling Improves Preference Optimization `2024` [[ICLR](https://openreview.net/pdf?id=xbjSwwrQOe)]
* sDPO: Don’t Use Your Data All at Once `2025` [[COLING Industry](https://aclanthology.org/2025.coling-industry.31.pdf)]
* Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization `2025` [[ICLR](https://openreview.net/pdf?id=CbfsKHiWEn)]
* Self-Play Preference Optimization for Language Model Alignment `2025` [[ICLR](https://openreview.net/pdf?id=a3PmRgAB5T)]
* TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights `2025` [[ICLR](https://openreview.net/pdf?id=oF6e2WwxX0)]
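
Implicit-reward methods such as DPO skip the separate reward model entirely: the reward of a response is defined as β·log(π_θ(y|x)/π_ref(y|x)), and substituting this into the Bradley-Terry objective gives a loss over policy and reference log-probabilities only. A minimal PyTorch sketch, assuming sequence-level log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    The implicit reward of a response y is beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    so no explicit reward model is ever trained.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy batch of two preference pairs (the log-probabilities are illustrative).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
                torch.tensor([-13.0, -9.0]), torch.tensor([-13.5, -8.5]))
print(loss.item())
```

Many of the variants listed above (IPO, SimPO, ORPO, β-DPO, and others) change the margin, the reference term, or the β schedule of this objective while keeping the same pairwise structure.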

#### Granularity-Level

##### Outcome Reward

TBD

##### Process Reward

* Solving math word problems with process- and outcome-based feedback `2022` [[arxiv](https://arxiv.org/pdf/2211.14275)]
* GRACE: Discriminator-Guided Chain-of-Thought Reasoning `2023` [[EMNLP Findings](https://aclanthology.org/2023.findings-emnlp.1022.pdf)]
* Making Language Models Better Reasoners with Step-Aware Verifier `2023` [[ACL](https://aclanthology.org/2023.acl-long.291.pdf)]
* Let's reward step by step: Step-Level reward model as the Navigators for Reasoning `2023` [[arxiv](https://arxiv.org/pdf/2310.10080)]
* Let’s Reinforce Step by Step `2023` [[NeurIPS Workshop](https://openreview.net/pdf?id=QkdRqpClab)]
* Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations `2024` [[ACL](https://aclanthology.org/2024.acl-long.510.pdf)]
* Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision `2024` [[EMNLP Findings](https://aclanthology.org/2024.findings-emnlp.429.pdf)]
* Improve Mathematical Reasoning in Language Models by Automated Process Supervision `2024` [[arxiv](https://arxiv.org/pdf/2406.06592)]
* Let's Verify Step by Step `2024` [[ICLR](https://openreview.net/pdf?id=v8L0pN6EOi)]
* AutoPSV: Automated Process-Supervised Verifier `2024` [[NeurIPS](https://openreview.net/pdf?id=eOAPWWOGs9)]
* Process Reward Model with Q-value Rankings `2025` [[ICLR](https://openreview.net/pdf?id=wQEdh2cgEk)]
* Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning `2025` [[ICLR](https://openreview.net/pdf?id=A6Y7AqlzLW)]
* Process Reinforcement through Implicit Rewards `2025` [[arxiv](https://arxiv.org/pdf/2502.01456)]
* The Lessons of Developing Process Reward Models in Mathematical Reasoning `2025` [[arxiv](https://arxiv.org/pdf/2501.07301)]
* AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence `2025` [[arxiv](https://arxiv.org/pdf/2502.13943)]
* Better Process Supervision with Bi-directional Rewarding Signals `2025` [[arxiv](https://arxiv.org/pdf/2503.04618)]
* An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning `2025` [[arxiv](https://www.arxiv.org/pdf/2503.02382)]
* VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data `2025` [[arxiv](https://arxiv.org/pdf/2502.06737)]
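
Process reward models score each intermediate reasoning step rather than just the final answer; a complete solution is then ranked by aggregating its step scores, with the minimum and the product being the aggregations most commonly reported in the works above. A small sketch, where `score_step` stands in for any per-step verifier:

```python
import math
from typing import Callable, Sequence


def solution_score(question: str,
                   steps: Sequence[str],
                   score_step: Callable[[str, Sequence[str]], float],
                   aggregate: str = "min") -> float:
    """Score a multi-step solution with a process reward model.

    `score_step(question, steps_so_far)` returns the probability that the
    latest step is correct given the steps before it.
    """
    step_scores = [score_step(question, steps[: i + 1]) for i in range(len(steps))]
    if aggregate == "min":
        return min(step_scores)
    if aggregate == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {aggregate}")


# Toy verifier that flags any step containing an obviously wrong equality.
toy_prm = lambda q, steps: 0.1 if "2 + 2 = 5" in steps[-1] else 0.9
print(solution_score("2 + 2 = ?", ["2 + 2 = 5", "so the answer is 5"], toy_prm))  # 0.1
```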

### 🦾 Usages

#### Data Selection

* RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment `2023` [[TMLR](https://openreview.net/pdf?id=m7p5O7zblY)]
* RRHF: Rank Responses to Align Language Models with Human Feedback without tears `2023` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2023/file/23e6f78bdec844a9f7b6c957de2aae91-Paper-Conference.pdf)]
* Reinforced Self-Training (ReST) for Language Modeling `2023` [[arxiv](https://arxiv.org/pdf/2308.08998)]
* Iterative Reasoning Preference Optimization `2024` [[NeurIPS](https://openreview.net/pdf?id=4XIKfvNYvx)]
* Filtered Direct Preference Optimization `2024` [[EMNLP](https://aclanthology.org/2024.emnlp-main.1266.pdf)]

#### Policy Training

* Fine-Grained Human Feedback Gives Better Rewards for Language Model Training `2023` [[NeurIPS](https://papers.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf)]
* Aligning Crowd Feedback via Distributional Preference Reward Modeling `2024` [[arxiv](https://arxiv.org/pdf/2402.09764)]
* Reward-Robust RLHF in LLMs `2024` [[arxiv](https://arxiv.org/pdf/2409.15360)]
* Bayesian Reward Models for LLM Alignment `2024` [[arxiv](https://arxiv.org/pdf/2402.13210)]
* Prior Constraints-based Reward Model Training for Aligning Large Language Models `2024` [[CCL](https://aclanthology.org/2024.ccl-1.107.pdf)]
* ODIN: Disentangled Reward Mitigates Hacking in RLHF `2024` [[ICML](https://openreview.net/pdf?id=zcIV8OQFVF)]
* Disentangling Length from Quality in Direct Preference Optimization `2024` [[ACL Findings](https://aclanthology.org/2024.findings-acl.297.pdf)]
* WARM: On the Benefits of Weight Averaged Reward Models `2024` [[ICML](https://openreview.net/pdf?id=s7RDnNUJy6)]
* Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble `2024` [[arxiv](https://arxiv.org/pdf/2401.16635)]
* RRM: Robust Reward Model Training Mitigates Reward Hacking `2025` [[ICLR](https://openreview.net/pdf?id=88AS5MQnmC)]
* Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment `2025` [[arxiv](https://arxiv.org/pdf/2501.09620)]

#### Inference

* Let's reward step by step: Step-Level reward model as the Navigators for Reasoning `2023` [[arxiv](https://arxiv.org/pdf/2310.10080)]
* ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search `2024` [[arxiv](https://arxiv.org/pdf/2406.03816)]
* Advancing Process Verification for Large Language Models via Tree-Based Preference Learning `2024` [[arxiv](https://arxiv.org/pdf/2407.00390)]
* Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning `2025` [[ICLR](https://openreview.net/pdf?id=A6Y7AqlzLW)]
* Process Reward Models for LLM Agents: Practical Framework and Directions `2025` [[arxiv](https://arxiv.org/pdf/2502.10325)]
* Reward-Guided Speculative Decoding for Efficient LLM Reasoning `2025` [[arxiv](https://arxiv.org/pdf/2501.19324)]
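
Several of the usages above reduce to the same primitive: sample multiple candidates, score them with a reward model, and either keep the single best at inference time (best-of-N reranking) or keep the top slice as fine-tuning data (reward-ranked or rejection-sampling selection in the spirit of RAFT and ReST). A minimal sketch with placeholder `sample` and `reward` callables:

```python
from typing import Callable, Sequence


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Inference-time reranking: draw n samples and return the highest-reward one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


def top_k_for_training(prompts: Sequence[str],
                       sample: Callable[[str], str],
                       reward: Callable[[str, str], float],
                       n: int = 8, k: int = 1) -> list[tuple[str, str]]:
    """Reward-ranked data selection: keep the k best samples per prompt for fine-tuning."""
    kept = []
    for p in prompts:
        candidates = sorted((sample(p) for _ in range(n)),
                            key=lambda y: reward(p, y), reverse=True)
        kept.extend((p, y) for y in candidates[:k])
    return kept
```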

### 🛠️ Applications

#### Harmless Dialogue

*Dialogue*

* Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback `2022` [[arxiv](https://arxiv.org/pdf/2204.05862)]
* Constitutional AI: Harmlessness from AI Feedback `2022` [[arxiv](https://arxiv.org/pdf/2212.08073)]
* Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue `2023` [[arxiv](https://arxiv.org/pdf/2308.03549)]
* HuatuoGPT, Towards Taming Language Models To Be a Doctor `2023` [[EMNLP Findings](https://aclanthology.org/2023.findings-emnlp.725.pdf)]
* Empathy Level Alignment via Reinforcement Learning for Empathetic Response Generation `2024` [[arxiv](https://arxiv.org/pdf/2408.02976)]
* RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback `2024` [[ICML](https://openreview.net/pdf?id=uydQ2W41KO)]
* Safe RLHF: Safe Reinforcement Learning from Human Feedback `2024` [[ICLR](https://openreview.net/pdf?id=TyFrPOKYXw)]
* Deliberative Alignment: Reasoning Enables Safer Language Models `2024` [[arxiv](https://arxiv.org/pdf/2412.16339)]
* Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression `2025` [[arxiv](https://arxiv.org/pdf/2501.12698)]

#### Logical Reasoning

*Math*

* Training Verifiers to Solve Math Word Problems `2022` [[arxiv](https://arxiv.org/pdf/2110.14168)]
* Solving math word problems with process- and outcome-based feedback `2022` [[arxiv](https://arxiv.org/pdf/2211.14275)]
* WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct `2023` [[arxiv](https://arxiv.org/pdf/2308.09583)]
* Let's Verify Step by Step `2024` [[ICLR](https://openreview.net/pdf?id=v8L0pN6EOi)]
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models `2024` [[arxiv](https://arxiv.org/pdf/2402.03300)]
* Improve Mathematical Reasoning in Language Models by Automated Process Supervision `2024` [[arxiv](https://arxiv.org/pdf/2406.06592)]
* Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations `2024` [[ACL](https://aclanthology.org/2024.acl-long.510.pdf)]
* The Lessons of Developing Process Reward Models in Mathematical Reasoning `2025` [[arxiv](https://arxiv.org/pdf/2501.07301)]
* Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning `2025` [[arxiv](https://arxiv.org/pdf/2502.14361)]

*Code*

* Let's reward step by step: Step-Level reward model as the Navigators for Reasoning `2023` [[arxiv](https://arxiv.org/pdf/2310.10080)]
* Applying RLAIF for Code Generation with API-usage in Lightweight LLMs `2024` [[arxiv](https://arxiv.org/pdf/2406.20060)]
* Process Supervision-Guided Policy Optimization for Code Generation `2024` [[arxiv](https://arxiv.org/pdf/2410.17621)]
* Performance-Aligned LLMs for Generating Fast Code `2024` [[arxiv](https://arxiv.org/pdf/2404.18864)]
* Policy Filtration in RLHF to Fine-Tune LLM for Code Generation `2024` [[arxiv](https://arxiv.org/pdf/2409.06957)]
* LLM Critics Help Catch LLM Bugs `2024` [[arxiv](https://arxiv.org/pdf/2407.00215)]

#### Retrieval & Recommendation

*Retrieval*

* Enhancing Generative Retrieval with Reinforcement Learning from Relevance Feedback `2023` [[EMNLP](https://aclanthology.org/2023.emnlp-main.768.pdf)]
* When Search Engine Services meet Large Language Models: Visions and Challenges `2024` [[arxiv](https://arxiv.org/pdf/2407.00128)]
* Syntriever: How to Train Your Retriever with Synthetic Data from LLMs `2025` [[arxiv](https://arxiv.org/pdf/2502.03824)]
* RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision `2025` [[arxiv](https://arxiv.org/pdf/2502.13957)]
* DeepRAG: Thinking to Retrieval Step by Step for Large Language Models `2025` [[arxiv](https://arxiv.org/pdf/2502.01142)]

*Recommendation*

* Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling `2024` [[SIGIR](https://dl.acm.org/doi/pdf/10.1145/3626772.3657767)]
* RLRF4Rec: Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking `2024` [[arxiv](https://arxiv.org/pdf/2410.05939)]
* Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward `2025` [[AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/28777)]

#### Other Applications

*Text to Audio*

* MusicRL: Aligning Music Generation to Human Preferences `2024` [[ICML](https://openreview.net/pdf?id=EruV94XRDs)]
* BATON: Aligning Text-to-Audio Model Using Human Preference Feedback `2024` [[IJCAI](https://www.ijcai.org/proceedings/2024/0502.pdf)]
* Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models `2024` [[arxiv](https://arxiv.org/pdf/2405.14632)]

*Text to Image*

* Aligning Text-to-Image Models using Human Feedback `2023` [[arxiv](https://arxiv.org/pdf/2302.12192)]
* ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation `2023` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2023/file/33646ef0ed554145eab65f6250fab0c9-Paper-Conference.pdf)]
* DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models `2023` [[arxiv](https://arxiv.org/pdf/2305.16381)]

*Text to Video*

* InstructVideo: Instructing Video Diffusion Models with Human Feedback `2024` [[CVPR](https://openaccess.thecvf.com/content/CVPR2024/papers/Yuan_InstructVideo_Instructing_Video_Diffusion_Models_with_Human_Feedback_CVPR_2024_paper.pdf)]
* Boosting Text-to-Video Generative Model with MLLMs Feedback `2024` [[NeurIPS](https://openreview.net/pdf/4c9eebaad669788792e0a010be4031be5bdc426e.pdf)]
* Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models `2025` [[arxiv](https://arxiv.org/pdf/2502.06812)]

*Robotics*

* Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models `2023` [[ICRA](https://ieeexplore.ieee.org/document/10161081)]
* Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models `2023` [[arxiv](https://arxiv.org/pdf/2311.02379)]
* Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning `2024` [[ICLR](https://openreview.net/pdf?id=N0I2RtD8je)]

*Game*

* DIP-RL: Demonstration-Inferred Preference Learning in Minecraft `2023` [[arxiv](https://arxiv.org/pdf/2307.12158)]
* Process Reward Models for LLM Agents: Practical Framework and Directions `2025` [[arxiv](https://arxiv.org/pdf/2502.10325)]

### 💯 Evaluation

#### Benchmarks

* RewardBench: Evaluating Reward Models for Language Modeling `2024` [[arxiv](https://arxiv.org/pdf/2403.13787)] [[Leaderboard](https://hf.co/spaces/allenai/reward-bench)]
* RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style `2024` [[arxiv](https://arxiv.org/pdf/2410.16184)]
* RMB: comprehensively benchmarking reward models in LLM alignment `2024` [[arxiv](https://arxiv.org/pdf/2410.09893)]
* VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models `2024` [[arxiv](https://arxiv.org/pdf/2411.17451)] [[Leaderboard](https://huggingface.co/spaces/MMInstruction/VL-RewardBench)]
* How to Evaluate Reward Models for RLHF `2024` [[arxiv](https://arxiv.org/pdf/2410.14872)] [[Leaderboard](https://huggingface.co/spaces/lmarena-ai/preference-proxy-evaluations)]
* ProcessBench: Identifying Process Errors in Mathematical Reasoning `2024` [[arxiv](https://arxiv.org/pdf/2412.06559)]
* RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment `2024` [[arxiv](https://arxiv.org/pdf/2412.13746)]
* M-RewardBench: Evaluating Reward Models in Multilingual Settings `2024` [[arxiv](https://arxiv.org/abs/2410.15522)]
* MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? `2024` [[arxiv](https://arxiv.org/pdf/2407.04842)] [[Leaderboard](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard)]
* PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models `2025` [[arxiv](https://arxiv.org/pdf/2501.03124)]
* Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models `2025` [[arxiv](https://arxiv.org/abs/2502.14191)]
* VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models `2025` [[arxiv](https://arxiv.org/pdf/2503.07478)]
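
Most of these benchmarks ultimately report pairwise accuracy: the fraction of labelled (prompt, chosen, rejected) triples for which the reward model scores the chosen response strictly higher. A minimal sketch with a placeholder `reward` callable:

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(pairs: Iterable[Tuple[str, str, str]],
                      reward: Callable[[str, str], float]) -> float:
    """Fraction of (prompt, chosen, rejected) triples the reward model ranks correctly."""
    total, correct = 0, 0
    for prompt, chosen, rejected in pairs:
        total += 1
        correct += reward(prompt, chosen) > reward(prompt, rejected)
    return correct / max(total, 1)


# Toy check with a length-based "reward model", a well-known failure mode (length bias).
toy_reward = lambda prompt, response: float(len(response))
print(pairwise_accuracy([("q", "short right answer", "a much longer wrong answer")], toy_reward))  # 0.0
```

Benchmarks such as RM-Bench above additionally probe robustness to style and to subtle content changes, which a single accuracy number can hide.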

### 🤺 Challenges

#### Data

* Fine-Tuning Language Models from Human Preferences `2019` [[arxiv](https://arxiv.org/abs/1909.08593)]
* The Expertise Problem: Learning from Specialized Feedback `2022` [[arxiv](https://arxiv.org/abs/2211.06519)]
* Active Reward Learning from Multiple Teachers `2023` [[AAAI Workshop](https://ceur-ws.org/Vol-3381/48.pdf)]
* Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models `2024` [[ICLR](https://openreview.net/forum?id=dKl6lMwbCy)]

#### Training

* Defining and Characterizing Reward Hacking `2022` [[NeurIPS](https://openreview.net/pdf?id=yb3HOXO3lX2)]
* A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift `2023` [[arxiv](https://arxiv.org/pdf/2311.14743)]
* Scaling Laws for Reward Model Overoptimization `2023` [[ICML](https://proceedings.mlr.press/v202/gao23h/gao23h.pdf)]
* Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback `2023` [[EMNLP Findings](https://aclanthology.org/2023.findings-emnlp.188.pdf)]
* Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack `2024` [[arxiv](https://arxiv.org/pdf/2410.06491)]
* Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models `2024` [[arxiv](https://arxiv.org/pdf/2406.10162)]
* Language Models Learn to Mislead Humans via RLHF `2024` [[arxiv](https://arxiv.org/pdf/2409.12822)]
* Towards Understanding Sycophancy in Language Models `2024` [[ICLR](https://openreview.net/pdf?id=tvhaxkMKAn)]
* Reward Model Ensembles Help Mitigate Overoptimization `2024` [[ICLR](https://openreview.net/pdf?id=dcjtMYkpXx)]
* Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification `2024` [[NeurIPS](https://proceedings.neurips.cc/paper_files/paper/2024/file/1a8189929f3d7bd6183718f42c3f4309-Paper-Conference.pdf)]
* Spontaneous Reward Hacking in Iterative Self-Refinement `2024` [[arxiv](https://arxiv.org/pdf/2407.04549)]
* Confronting Reward Model Overoptimization with Constrained RLHF `2024` [[ICLR](https://openreview.net/pdf?id=gkfUvn0fLU)]
* Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? `2025` [[ICLR](https://openreview.net/pdf?id=Cnwz9jONi5)]
* Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking `2025` [[ICLR](https://openreview.net/pdf?id=msEr27EejF)]
* RRM: Robust Reward Model Training Mitigates Reward Hacking `2025` [[ICLR](https://openreview.net/pdf?id=88AS5MQnmC)]
* The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking `2025` [[arxiv](https://arxiv.org/pdf/2501.19358)]
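
A recurring mitigation discussed in the overoptimization and reward-hacking papers above is to penalize divergence from a reference policy, optimizing roughly r(x, y) - β·KL(π_θ‖π_ref) rather than the raw reward (several of the listed works also study where this penalty falls short). A per-token sketch of the shaped reward as it is commonly implemented, assuming log-probabilities from both models are available:

```python
import torch


def kl_shaped_rewards(reward_score: float,
                      policy_logps: torch.Tensor,
                      ref_logps: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Per-token rewards for KL-regularized RLHF.

    Each response token is penalized by beta * (log pi_theta - log pi_ref);
    the scalar reward-model score is added at the final token.
    """
    rewards = -beta * (policy_logps - ref_logps)  # shape: (response_len,)
    rewards[-1] = rewards[-1] + reward_score
    return rewards


# Toy example: 4 response tokens, reward-model score 1.5.
print(kl_shaped_rewards(1.5, torch.tensor([-1.0, -2.0, -0.5, -1.2]),
                             torch.tensor([-1.1, -1.8, -0.6, -1.0])))
```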

#### Evaluation

* An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers `2024` [[arxiv](https://arxiv.org/pdf/2403.02839)]
* OffsetBias: Leveraging Debiased Data for Tuning Evaluators `2024` [[EMNLP Findings](https://aclanthology.org/2024.findings-emnlp.57.pdf)]
* Preference Leakage: A Contamination Problem in LLM-as-a-judge `2025` [[arxiv](https://arxiv.org/pdf/2502.01534)]

### 📊 Analysis

* The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models `2024` [[EMNLP](https://aclanthology.org/2024.emnlp-main.174.pdf)]
* Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking `2024` [[COLM](https://openreview.net/pdf?id=5u1GpUkKtG)]
* Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives `2024` [[arxiv](https://arxiv.org/pdf/2411.04991)]
* RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs `2024` [[arxiv](https://arxiv.org/pdf/2404.08555)]
* Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective `2024` [[arxiv](https://arxiv.org/pdf/2404.04626)]
* Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study `2024` [[ICML](https://openreview.net/forum?id=6XH8R7YrSk)]
* Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? `2025` [[ICLR](https://openreview.net/pdf/7aa9cdaa8ae2a1fe57278fed0f70bed213ce9381.pdf)]
* All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning `2025` [[arxiv](https://arxiv.org/pdf/2503.01067)]
* Reward Models Identify Consistency, Not Causality `2025` [[arxiv](https://arxiv.org/pdf/2502.14619)]
* What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-boosted Mathematical Reasoning `2025` [[AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/34663/36818)]

## Resources

### 🌏 Blogs

* Illustrating Reinforcement Learning from Human Feedback (RLHF) [[Link](https://huggingface.co/blog/rlhf)]
* Why reward models are key for alignment [[Link](https://www.interconnects.ai/p/why-reward-models-matter)]
* Reward Hacking in Reinforcement Learning [[Link](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/)]

### 📚 Prior Surveys

* A Survey on Interactive Reinforcement Learning: Design Principles and Open Challenges `2021` [[arxiv](https://arxiv.org/pdf/2105.12949)]
* Reinforcement Learning With Human Advice: A Survey `2021` [[Frontiers Robotics AI](https://doi.org/10.3389/frobt.2021.584075)]
* AI Alignment: A Comprehensive Survey `2023` [[arxiv](https://arxiv.org/pdf/2310.19852)]
* A Survey of Reinforcement Learning from Human Feedback `2023` [[arxiv](https://arxiv.org/pdf/2312.14925)]
* Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback `2023` [[TMLR](https://openreview.net/pdf?id=bx24KpJ4Eb)]
* Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities `2024` [[JAIR](https://jair.org/index.php/jair/article/view/15348/27006)]
* Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods `2024` [[arxiv](https://arxiv.org/pdf/2404.00282)]
* A Survey on Human Preference Learning for Large Language Models `2024` [[arxiv](https://arxiv.org/pdf/2406.11191)]
* A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More `2024` [[arxiv](https://arxiv.org/pdf/2407.16216)]
* Reinforcement Learning Enhanced LLMs: A Survey `2024` [[arxiv](https://arxiv.org/pdf/2412.10400)]
* Towards a Unified View of Preference Learning for Large Language Models: A Survey `2024` [[arxiv](https://arxiv.org/pdf/2409.02795)]
* A Survey on Post-training of Large Language Models `2025` [[arxiv](https://arxiv.org/pdf/2503.06072)]