├── Notes
│   ├── Intro_RLHF.md
│   ├── R1_reasoning.md
│   └── Reward_Hacking.md
├── README.md
├── Slides
│   └── Intro_RLHF_Reading_Group.pdf
└── images
    └── bt_model_rlhf_workflow.png
/Notes/Intro_RLHF.md:
--------------------------------------------------------------------------------
1 | # RLHF Algorithms
2 | A brief and partial summary of RLHF algorithms.
   3 | This page collects the papers and useful blogs on RLHF covered in my reading group presentation on RLHF algorithms. Please find the slides [here](/Slides/Intro_RLHF_Reading_Group.pdf).
4 |
5 | ## Why RLHF?
   6 | LLMs pre-trained on large text corpora can exhibit unintended behaviors such as hallucination, bias/toxicity, or failure to follow instructions.
   7 | - Misalignment: the language modeling objective (next-token prediction) differs from the objective of reflecting human values (helpful, honest, harmless).
8 |
   9 | RLHF is proposed to align a model trained on a general corpus with complex human values.
  10 | - Use human feedback on generated text as a measure of performance, and turn that feedback into a training signal (loss) for optimizing the model.
11 | - Use methods from RL to directly optimize a language model with human feedback.
12 |
13 | ## Learning from (Human/AI) Preference Feedback
14 |
15 | 1. Preference Reward Modeling
  16 |    - Builds a reward model from human preference data and then optimizes the policy against it with RL, typically using the PPO (Proximal Policy Optimization) algorithm.
17 | - Computationally expensive and sensitive to hyper-parameter selection.
18 |
19 | 2. Direct Preference Optimization
  20 |    - Views preference optimization as offline RL with an implicit reward model.
  21 |    - Starting from DPO (Direct Preference Optimization), the many DPO variants mainly adjust its loss function, with ongoing fixes that make it more RL-like and address known weaknesses.
22 |
23 | ## Bradley-Terry model
24 | Assumption of most RLHF algorithms: the preference signal can be modeled using the reward-based Bradley-Terry model.
25 |
  26 |
  27 | ![Bradley-Terry reward model and the RLHF workflow](/images/bt_model_rlhf_workflow.png)
  28 | *Figure from: RLHF Workflow: From Reward Modeling to Online RLHF.*
  29 |
  30 |
  31 | * For the Bradley-Terry model, the reward-maximization approach is limited by the nature of “point-wise” rewards (a scalar score for a single response to input x), which fail to express complex intransitive or cyclic preference relations. [[DNO](https://arxiv.org/abs/2404.03715)]
32 |
33 |
34 | ## Online RL
35 | - [Training language models to follow instructions with human feedback.](https://arxiv.org/abs/2203.02155)
36 | - (Summary) [RLHF Workflow: From Reward Modeling to Online RLHF.](https://arxiv.org/pdf/2405.07863v1)
37 | - (Summary) [Secrets of RLHF in Large Language Models Part I: PPO.](https://arxiv.org/abs/2307.04964)
38 | - [Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint](https://arxiv.org/abs/2312.11456)
39 | - (Algorithm) PPO: [Proximal Policy Optimization Algorithms.](https://arxiv.org/abs/1707.06347)
40 | - (Algorithm) RLOO: [Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs.](https://arxiv.org/abs/2402.14740)
41 | - (Algorithm) GRPO: [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.](https://arxiv.org/pdf/2402.03300)
42 | - (Algorithm) [ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models](https://arxiv.org/abs/2310.10505)
43 | - (Algorithm) [MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences](https://arxiv.org/abs/2402.08925)
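Roughly speaking, these online methods share the policy-gradient recipe and differ mainly in how the advantage is estimated (learned critic for PPO, leave-one-out baseline for RLOO, group-normalized rewards for GRPO). Below is a minimal, illustrative sketch (not any specific implementation; function and variable names are mine) of the GRPO-style advantage and the PPO clipped surrogate:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within a group of G responses
    sampled for the same prompt, avoiding a learned value function."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate on the log-probabilities of the sampled responses."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: one prompt, a group of G = 4 sampled responses with scalar rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = grpo_advantages(rewards)  # zero-mean, roughly unit-std within the group
```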
44 |
45 | ## Offline RL
46 | - DPO: [Direct Preference Optimization: Your Language Model is Secretly a Reward Model.](https://arxiv.org/abs/2305.18290)
47 | - RPO: [Iterative Reasoning Preference Optimization.](https://arxiv.org/abs/2404.19733)
48 | - Additional NLL loss.
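For reference, the standard DPO objective with reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$ is

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

which corresponds to an implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ (up to a prompt-dependent constant) plugged into the Bradley-Terry model above.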
49 |
50 | **Beyond Bradley-Terry Model**
51 | - [KTO: Model Alignment as Prospect Theoretic Optimization.](https://arxiv.org/abs/2402.01306)
52 | - DNO: [Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences.](https://arxiv.org/abs/2404.03715)
53 |
  54 | **Length Bias**
55 | - R-DPO: [Disentangling Length from Quality in Direct Preference Optimization.](https://arxiv.org/abs/2403.19159)
56 | - [SimPO: Simple Preference Optimization with a Reference-Free Reward.](https://arxiv.org/abs/2405.14734)
  57 | See our related discussion on [reward hacking](https://yihe-deng.notion.site/Spurious-Correlation-Shortcut-Learning-and-Reward-Hacking-163ab2d2c1fb808bbfd7c6a17b01a39d).
58 |
59 | **Step-wise (Process) Reward**
60 | - [Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs](https://arxiv.org/abs/2406.18629)
61 | - [ReST-MCTS∗: LLM Self-Training via Process Reward Guided Tree Search.](https://arxiv.org/abs/2406.03816)
62 | - [Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning](https://arxiv.org/abs/2410.22304)
63 |
64 | Related to Process Reward Model (PRM)
65 | - [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050)
66 | - [Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https://arxiv.org/abs/2312.08935)
67 |
  68 | Related to Monte Carlo Tree Search (MCTS)
69 | - [LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning.](https://arxiv.org/abs/2410.02884)
70 | - [Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning.](https://arxiv.org/abs/2405.00451)
71 |
72 | **Iterative DPO (Multiple Iterations)**
73 | - [SAIL: Self-Improving Efficient Online Alignment of Large Language Models](https://arxiv.org/abs/2406.15567)
74 | - GSHF: [Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint](https://arxiv.org/pdf/2312.11456)
75 | - Self-Reward: [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)
76 | - SPIN: [Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335)
77 | - [Building Math Agents with Multi-Turn Iterative Preference Learning.](https://arxiv.org/abs/2409.02392)
78 |
79 | **List Ranking: Beyond Pairwise Preference**
80 | - [RRHF: Rank Responses to Align Language Models with Human Feedback without tears.](https://arxiv.org/abs/2304.05302)
81 | - [Preference Ranking Optimization for Human Alignment.](https://arxiv.org/abs/2306.17492)
82 | - [LiPO: Listwise Preference Optimization through Learning-to-Rank.](https://arxiv.org/abs/2402.01878)
83 |
84 | **Preference Data Construction**
  85 | Self-training: can we obtain preference signals other than human/AI labeling for each data pair?
86 | - [Self-Consistency Preference Optimization.](https://arxiv.org/abs/2411.04109)
87 | - (Multi-modal) [Aligning modalities in vision large language models via preference fine-tuning.](https://arxiv.org/abs/2402.11411)
88 | - (Multi-modal) [Enhancing Large Vision Language Models with Self-Training on Image Comprehension.](https://arxiv.org/abs/2405.19716)
89 |
90 | **SFT**
91 | - [RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment](https://arxiv.org/abs/2304.06767)
92 |
93 | ## Useful Blogs
94 | - [Policy Gradient Algorithms](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/)
95 | - [Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/)
96 | - [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf)
97 | - [Proximal Policy Optimization (PPO)](https://huggingface.co/blog/deep-rl-ppo)
  98 | - [A Partial Overview of LLM + RL(HF)](https://zhuanlan.zhihu.com/p/1686790674?utm_psn=1833144248435879936) (in Chinese)
99 | - [Advanced Tricks for Training Large Language Models with Proximal Policy Optimization](https://www.notion.so/eb7b2d1891f44b3a84e7396d19d39e6f?v=01bcb084210149488d730064cbabc99f)
100 | - [Can Better Cold-Start Strategies Improve RL Training for LLMs?](https://tangible-polo-203.notion.site/)
101 |
--------------------------------------------------------------------------------
/Notes/R1_reasoning.md:
--------------------------------------------------------------------------------
1 | # R1 Reasoning
   2 | [DeepSeek-R1 tech report](https://arxiv.org/abs/2501.12948) explores how reinforcement learning (RL) can spark advanced chain-of-thought reasoning in language models. Notably, DeepSeek-R1 introduced the concept of a sudden “**aha moment**” during RL training with verifiable rewards, in which the model rapidly develops more sophisticated reasoning and self-reflection. This discovery, along with many others, has inspired a wave of open-source reproduction and extension projects.
3 |
   4 | Below, I summarize key notes and insights from these follow-up efforts. For a deeper dive into DeepSeek-R1 itself and the blogs/papers below, please refer to the original sources for a complete read.
5 |
6 | ## Open-Source Reproduction Projects
7 | [**Open-R1**](https://github.com/huggingface/open-r1/)
8 | - HuggingFace's open reproduction of the DeepSeek-R1 project, aiming to replicate the entire pipeline described in the original DeepSeek-R1 tech report.
   9 | - Includes model training (SFT and GRPO) and evaluation on tasks like AIME 2024, MATH-500, and code generation benchmarks.
10 |
11 | [**Tiny-Zero**](https://github.com/Jiayi-Pan/TinyZero)
  12 | - Reproduces DeepSeek R1-Zero on the countdown and multiplication tasks, using `Qwen2.5-3B-Instruct`, which developed self-verification through RL.
  13 | - With PPO, GRPO, and PRIME, the team found that long CoT emerges in all cases and that all of these algorithms seem to work well.
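Common to these R1-Zero-style reproductions is a rule-based, verifiable reward: the final answer is checked against ground truth (often together with a format check) instead of scoring responses with a learned reward model. A minimal sketch, with hypothetical tag conventions and weights:

```python
import re

def rule_based_reward(response: str, ground_truth: str,
                      format_weight: float = 0.1) -> float:
    """Verifiable reward = accuracy check + format check
    (hypothetical tags and weights, for illustration only)."""
    # Format reward: the response should wrap reasoning and answer in tags.
    has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                                response, flags=re.DOTALL))
    # Accuracy reward: exact match between the extracted answer and the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    correct = match is not None and match.group(1).strip() == ground_truth.strip()
    return float(correct) + format_weight * float(has_format)
```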
14 |
15 |
16 |
17 | ## Blogs
18 | [**DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL**](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)
19 | - Github repo: https://github.com/agentica-project/deepscaler
20 | - Authors present [**DeepScaleR-1.5B-Preview**](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview), a 1.5B-parameter language model finetuned via RL on math reasoning tasks. It surpasses O1-Preview on AIME2024 and performs strongly across multiple competition-level math benchmarks.
21 | - The work builds on `DeepSeek-R1-Distill-Qwen-1.5B`.
22 | - **Iterative Context Lengthening** (“Think Shorter, Then Longer”)
23 | - A critical challenge is the long outputs for math reasoning, which slow training and can lead to repetitive, unhelpful content. The team considers a staged approach with GRPO:
24 | - Begin with an 8K context window to keep outputs shorter and focus on correct reasoning.
25 | - Increase to 16K once the model plateaus and starts hitting the 8K limit.
26 | - Finally, increase to 24K to fully exploit longer chain-of-thought once the model has learned to reason more concisely.
27 |
28 | [**Open-Reasoner-Zero**](https://yasminezhang.notion.site/Open-Reasoner-Zero-19e12cf72d418007b9cdebf44b0e7903)
29 | - Github repo: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero, [PDF](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf)
30 | - The authors collect and clean a dataset of 57k reasoning and math problems, combining open-source math and logic tasks and some synthetic samples.
31 | - They show that scaling the dataset (e.g., going from ~7k to ~57k samples) significantly improves both response correctness and response length (i.e., deeper reasoning steps).
32 | - As training proceeds, models’ response lengths grow steadily, mirroring prior observations from systems like DeepSeek-R1-Zero.
  33 | - **Emergent “Aha Moments”**
34 | - "Step moment": the report highlights “step-function” jumps, where the model suddenly transitions to substantially better performance and longer, more thorough solutions.
35 | - The authors also observe increased average correct reflection length once the model passes certain training thresholds (Figure 7 of the [PDF](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf)).
36 |
37 | [**Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead**](https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4cf04b2f175)
38 | - Github repo: https://github.com/RLHFlow/Online-DPO-R1
39 | - Rule-based approaches like iterative DPO and RAFT can substantially enhance a capable base model’s performance on challenging math benchmarks, while remaining simpler to implement than full PPO training.
40 | - While iterative DPO is simpler to implement and yields strong gains, PPO-based models still reach the highest overall performance. Nonetheless, the gap between DPO and PPO narrows substantially if the model is **warm-started with SFT**.
41 | - **Importance of a strong base model**: attempts to apply iterative DPO on `Llama-3-8B-Instruct` did not yield significant improvements. The authors suggest that Qwen2.5’s stronger math pre-training allows rule-based RL to succeed.
42 | - **“Aha Moments” and self-reflection**
  43 |     - The authors did not observe a clear “aha moment” with small-scale open-source models. Instead, they note that Qwen2.5-Math-7B-Base already uses reflective reasoning patterns, and neither iterative DPO nor PPO specifically increases their frequency.
44 |
45 | [**There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study**](https://oatllm.notion.site/oat-zero)
46 | - Github repo: https://github.com/sail-sg/oat-zero
47 | - The study challenges the common observation that this "**Aha Moment**" occurs during training, discovering that
48 | - Aha moment at epoch 0: Self-reflection patterns were observed in base models from the start, without requiring any post-training. This suggests the "Aha moment" is present right from the beginning.
49 | - **Superficial Self-Reflection** (SSR): While some models show self-reflection behavior, many of these are not effective and do not lead to correct answers.
50 | - Response length and RL dynamics: The increasing response length often associated with the "aha moment" is not linked to emergent self-reflection but results from RL optimizing rule-based reward functions. The paper demonstrates that response length changes are a consequence of RL dynamics and not necessarily related to improved reasoning.
51 |
52 | ## Relevant Papers
53 | [**Demystifying Long Chain-of-Thought Reasoning in LLMs**](https://arxiv.org/abs/2502.03373)
54 | - A detailed analysis of long CoT reasoning in LLMs. These long CoTs often include branching, backtracking, error-checking, and other advanced features similar to deeper analytical thinking.
55 | - SFT with long CoTs
56 | - Fine-tuning a base model on long CoT data consistently outperforms short CoT training and continues improving with *more data*.
57 | - Reward shaping for stable length growth
58 | - Simply relying on a verifiable reward can lead to unstable or explosive **CoT length growth**, causing outputs to exceed the context window and degrade performance.
  59 |     - The authors introduce a cosine reward function (varying with answer correctness and sequence length) to help stabilize length scaling; a rough sketch follows after this list.
60 | - In addition, applying a repetition penalty mitigates **reward hacking** where models add repeated content just to accumulate more reward.
61 | - Importance of **verifiable signals**
62 | - RL becomes more stable and effective when the model’s final answers can be verified using ground-truth signals.
63 | - RL from base models vs. long-CoT–SFT models
64 | - Training RL directly from smaller base models sometimes fails to spark the “aha” behaviors.
65 | - Initializing RL with a model already tuned for long CoT consistently yields more robust exploration of longer reasoning paths.
66 | - Latent abilities and data origins
67 | - The paper also suggests that capabilities like error-correction **may already exist as latent skills in base models** (due to large-scale pretraining on forum discussions and step-by-step solutions).
68 | - RL and SFT can bring these “buried” skills to the surface with proper rewards, penalties, and prompts.
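As a purely illustrative sketch of the cosine length reward idea referenced above (not the paper's exact formulation; endpoint values and names are hypothetical): the reward interpolates between a short-length value and a long-length value with a cosine schedule, so correct answers are rewarded more when concise while incorrect answers are penalized less when the model keeps reasoning, with an optional repetition penalty subtracted on top.

```python
import math

def cosine_length_reward(correct: bool, length: int, max_len: int,
                         r_correct=(2.0, 1.0), r_wrong=(-2.0, -1.0),
                         repetition_penalty: float = 0.0) -> float:
    """Illustrative length-dependent reward (hypothetical endpoint values).

    Correct answers: reward decays from r_correct[0] (short) to r_correct[1] (long),
    so padding a correct answer is not rewarded.
    Wrong answers: the penalty eases from r_wrong[0] (short) to r_wrong[1] (long),
    encouraging the model to keep reasoning rather than stop early.
    """
    t = min(length, max_len) / max_len
    cos_weight = 0.5 * (1.0 + math.cos(math.pi * t))  # 1 at length 0 -> 0 at max_len
    short_val, long_val = r_correct if correct else r_wrong
    reward = long_val + (short_val - long_val) * cos_weight
    return reward - repetition_penalty
```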
69 |
70 | [**LIMO: Less is More for Reasoning**](https://arxiv.org/abs/2502.03387)
71 | - The authors challenge the assumption that complex mathematical reasoning requires massive amounts of fine-tuning data. Instead, they show that using fewer than 1k carefully curated training examples can dramatically enhance a model’s ability to solve advanced math problems.
72 | - *The LIMO Hypothesis*: In foundation models that already have extensive domain knowledge “baked in” from pre-training, sophisticated reasoning capabilities can emerge with only a small number of carefully designed demonstrations.
73 | - Related: [**LIMR: Less is More for RL Scaling**](https://arxiv.org/abs/2502.11886)
74 |
75 | [**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**](https://arxiv.org/abs/2502.14768)
76 | - Authors use Knights and Knaves (K&K) puzzles as controlled data that allows precise measurement of logical reasoning performance.
77 | - They generate Knights and Knaves puzzles of varying complexity (2–8 inhabitants), ensuring each puzzle has a unique, verifiable solution.
78 | - This procedural approach eliminates ambiguity and makes reward-checking straightforward.
79 | - Modified REINFORCE++ algorithm
80 | - Instead of PPO or GRPO, the authors use a modified REINFORCE approach.
  81 |     - Key tweaks include a KL-divergence penalty term to avoid drifting too far from the base model’s distribution and a specialized KL estimator (a generic sketch follows after this list).
82 | - Emergent reasoning behaviors
  83 |     - As training progresses, the model begins to show verification, multi-path exploration, and self-reflection in its responses.
84 | - Response length tends to grow over training, but **length alone is not a reliable indicator** of improved reasoning accuracy.
85 | - Generalization beyond training data
86 | - Despite training on only 5K synthetic logic puzzles, the model demonstrates a transfer effect to tougher math benchmarks like AIME.
87 | - The authors note it outperforms a simple SFT approach, which tends to memorize specific formats rather than develop robust reasoning.
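As a generic sketch of the kind of tweak mentioned above (a REINFORCE-style loss with a KL penalty toward the reference model, using the commonly used unbiased k3-style KL estimator); this is an illustration, not the paper's exact algorithm:

```python
import torch

def reinforce_kl_loss(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                      advantages: torch.Tensor, kl_coef: float = 0.01) -> torch.Tensor:
    """REINFORCE-style policy-gradient loss with a per-token KL penalty toward
    a frozen reference model (generic illustration, hypothetical names).

    logp_policy / logp_ref: log-probs of the sampled tokens under the current
    policy and the reference model; advantages: reward-derived advantage
    estimates, treated as constants w.r.t. the policy parameters."""
    # k3 estimator of KL(pi_theta || pi_ref): exp(log_ratio) - log_ratio - 1,
    # with log_ratio = log pi_ref - log pi_theta evaluated on policy samples.
    log_ratio = logp_ref - logp_policy
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    pg_loss = -(logp_policy * advantages.detach()).mean()
    return pg_loss + kl_coef * kl.mean()
```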
88 |
89 | ---
90 |
91 | ### More related works:
92 |
93 | **Technical reports**
94 |
95 | [**Kimi k1.5: Scaling Reinforcement Learning with LLMs**](https://arxiv.org/abs/2501.12599)
96 |
97 | [**Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement**](https://arxiv.org/abs/2409.12122)
98 |
99 | **Test-time scaling**
100 |
101 | [**Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems**](https://arxiv.org/abs/2412.09413)
102 |
103 | [**s1: Simple test-time scaling**](https://arxiv.org/abs/2501.19393)
104 |
105 | [**Scaling Test-Time Compute Without Verification or RL is Suboptimal**](https://arxiv.org/abs/2502.12118)
106 |
107 | [**Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling**](https://arxiv.org/abs/2502.06703)
108 |
109 | [**Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach**](https://arxiv.org/abs/2502.05171)
110 |
111 | [**Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought**](https://arxiv.org/abs/2501.04682)
112 |
113 | [**A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods**](https://arxiv.org/abs/2502.01618)
114 |
115 | **Concise/Efficient reasoning**
116 |
117 | [**Self-Training Elicits Concise Reasoning in Large Language Models**](https://arxiv.org/abs/2502.20122)
118 |
119 | [**LightThinker: Thinking Step-by-Step Compression**](https://arxiv.org/abs/2502.15589)
120 |
121 | [**Training Language Models to Reason Efficiently**](https://arxiv.org/abs/2502.04463)
122 |
123 | **Math reasoning**
124 |
125 | [**Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning**](https://arxiv.org/abs/2502.06781)
126 |
127 | [**Self-Rewarding Correction for Mathematical Reasoning**](https://arxiv.org/abs/2502.19613)
128 |
129 | **More**
130 |
131 | [**Thinking Preference Optimization**](https://arxiv.org/abs/2502.13173)
132 |
133 | [**On the Emergence of Thinking in LLMs I: Searching for the Right Intuition**](https://arxiv.org/abs/2502.06773)
134 |
135 | ## Surveys
136 | [**From System 1 to System 2: A Survey of Reasoning Large Language Models**](https://arxiv.org/abs/2502.17419)
137 |
138 | ---
139 |
140 | Thanks to [Dang Nguyen](https://hsgser.github.io/) and [Isha Puri](https://ishapuri.github.io/) for pointing to some of the relevant papers!
141 |
--------------------------------------------------------------------------------
/Notes/Reward_Hacking.md:
--------------------------------------------------------------------------------
1 | # Reward Hacking, Shortcut Learning, and Spurious Correlation
2 |
3 | **Authors:** [**Yihe Deng**](https://www.notion.so/Yihe-Deng-167ab2d2c1fb80b3a76dfb120f716c84?pvs=21) ([twitter](https://x.com/Yihe__Deng)), [**Yu Yang**](https://sites.google.com/g.ucla.edu/yuyang/home) ([twitter](https://x.com/YuYang_i))
4 |
5 | **Date:** 12/23/2024
6 |
7 | Thanks to [Fan Yin](https://fanyin3639.github.io/) for providing invaluable feedback on an earlier draft of this blog.
8 |
9 | > *Spurious correlation or shortcut learning ([Geirhos et al. 2020](https://arxiv.org/abs/2004.07780)) in classification tasks is closely related to reward hacking.*
10 | > — [**Reward Hacking in Reinforcement Learning**](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/#rl-algorithm-improvement)
11 |
12 | This note focuses on the connection between reward hacking and spurious correlation. **Please see the [blog](https://yihe-deng.notion.site/) for detailed discussion and a comprehensive references list.**
13 |
14 | ## 1. Introduction
15 |
16 | ### Reward Hacking (Classic RL Context)
17 | - **Definition**: Occurs when RL agents exploit flaws in the reward function to produce unintended behaviors (e.g., looping for infinite points, crashing the environment).
18 | - **Key Point**: Misalignment between the reward function and the true goal leads to perverse strategies.
19 |
20 | ### Data-Induced Reward Hacking
21 | - In **preference-based** training (e.g., RLHF, LLM alignment), reward hacking can take the form of **spurious correlations**: the model learns superficial patterns (e.g., lengthier text = better) rather than genuine quality.
22 | - Reflects **Goodhart’s Law**: overoptimizing a proxy (human preference data) invalidates that proxy’s utility.
23 |
24 | ## 2. Spurious Correlation
25 |
  26 | A **spurious correlation** arises when a superficial feature is strongly associated with the target label but not causally related to it. Classic examples include:
27 | - **Vision**: Background colors, water vs. land, or demographic attributes that overshadow the core features.
28 | - **Language**: Negation words, certain templates, or keywords (e.g., “safe,” “ethical”) correlated with higher or lower labels.
29 |
30 | Spurious correlations often lead to **shortcut learning**, degrading model robustness, especially for minority or out-of-distribution examples.
31 |
32 | ## 3. How Preference Data Can Contain Spurious Correlations
33 |
34 | ### Length as a Proxy for Quality
35 | - **Longer Responses** often earn higher human ratings for thoroughness or politeness.
36 | - **DPO (Direct Preference Optimization)** can inadvertently overfit to “longer = better,” causing the model to produce excessively long or rambling outputs.
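A quick diagnostic is to measure how often the chosen response is simply the longer one in a preference dataset; a minimal sketch, assuming each pair is a dict with hypothetical keys `chosen` and `rejected`:

```python
def chosen_longer_rate(pairs) -> float:
    """Fraction of preference pairs whose chosen response is longer than the
    rejected one (hypothetical dict keys 'chosen' / 'rejected')."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return longer / max(len(pairs), 1)

# A rate far above 0.5 suggests length may be acting as a spurious proxy for quality.
```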
37 |
38 | ### “Keyword Hacking” in Safety and Alignment
39 | - **Refusal Tokens** (e.g., “I apologize,” “I cannot”) might be overused if preference data consistently labels these as safer or more responsible responses.
40 | - Shallow alignment becomes easy to bypass and does not address deeper generative behavior.
41 |
42 | ### Other Potential Correlations
43 | - **Formatting Bias**: Bullet points, special tokens (emojis, exclamation marks), or bold text might inflate user ratings.
44 | - **Confidence Tone**: Overly confident or enthusiastic language can be misconstrued as correctness or helpfulness.
45 | - **Positivity Bias**: Cheerful or agreeable responses may be preferred, even when a neutral or critical stance is more accurate.
46 |
47 | These spurious features mirror the phenomenon seen in image or text classification, where superficial cues overshadow the true underlying quality or correctness.
48 |
49 |
50 | ## 4. Further Questions
51 |
52 | ### Does Negative Data Accelerate Shortcut Learning?
53 | - Including clearly dispreferred samples (shorter, impolite, etc.) may reinforce superficial features that separate negative from positive examples.
  54 | - **DPO vs. SFT** comparisons (training with vs. without explicit negative examples) can help test whether this effect occurs.
55 |
56 | ### Will Iterative Self-Training Amplify Spurious Correlations?
57 | - Multi-round self-improvement cycles can **amplify** existing biases if the model keeps learning from its own biased outputs.
58 | - Without careful monitoring or correction, spurious signals may escalate in each iteration.
59 |
60 |
61 | ## 5. Potential Mitigations
62 |
63 | ### Algorithmic Approaches
64 | - [**R-DPO**](https://arxiv.org/abs/2403.19159): Adjusts per-example learning rates to balance out length or other spurious signals.
65 | - [**SimPO**](https://arxiv.org/pdf/2405.14734): Uses regularization to discourage overly long or superficial responses.
66 | - [**Reward Model Ensembles**](https://arxiv.org/abs/2312.09244): Combines multiple RMs for a more conservative reward signal, though shared biases can still persist.
67 | - [**ODIN**](https://arxiv.org/abs/2402.07319): Separates reward signals into length-correlated and length-independent heads to discourage length-based exploitation.
68 |
69 | ### Data-Centric Approaches
70 | - [**RRM**](https://arxiv.org/abs/2409.13156) (Robust Reward Model Training): Augments or balances chosen vs. rejected responses to disrupt spurious correlations.
71 | - **Group-Based Methods** (e.g., GroupDRO, PDE, JTT): Identify spurious attributes (like backgrounds in vision tasks or style tokens in text) and balance them during training.
72 | - [**PDE**](https://arxiv.org/abs/2306.04949) (Progressive Data Expansion): Start with balanced subsets before adding potentially imbalanced data, preserving early robustness.
73 |
74 | ### Theoretical Insights
75 | - Spurious features often dominate because they are **easier to learn** or highly correlated with the label.
76 | - Distributionally robust strategies and balanced sampling can mitigate these shortcuts by limiting reliance on the superficial signals.
77 |
78 |
79 | ## 6. Conclusion
80 |
81 | Reward hacking and spurious correlation are deeply intertwined challenges, underscoring the gap between *true objectives* (e.g., factual correctness, genuine helpfulness) and *over-optimized proxies* (length, specific tokens, or superficial style). While recent advancements (e.g., R-DPO, ODIN, RRM) tackle these shortcuts algorithmically and through data manipulation, fully robust alignment requires:
82 | 1. **Better understanding** of which artifacts are truly spurious.
83 | 2. **Iterative monitoring** of model outputs and self-training loops.
84 | 3. **Data balancing and annotation** strategies that reduce reliance on superficial cues.
85 |
86 | > **When Good Writing Hides Bad Content**
87 | > High-quality formatting and engaging style can mask inaccuracies or shallow reasoning. The real challenge is defining (and enforcing) robust criteria for “good” responses across diverse user needs without encouraging misleading superficial correlations.
88 |
89 |
90 | ## Selected References
91 |
92 | **Spurious Correlation**
93 |
94 | Alain et al. [Variance Reduction in SGD by Distributed Importance Sampling](https://arxiv.org/abs/1511.06481). 2015.
95 |
96 | Ding et al. [Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation.](https://arxiv.org/abs/2307.07907) NeurIPS, 2023.
97 |
98 | Li et al. [A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others.](https://arxiv.org/abs/2212.04825) CVPR, 2023.
99 |
100 | McCoy et al. [Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference.](https://arxiv.org/abs/1902.01007) ACL, 2019.
101 |
102 | **How Preference Data Can Contain Spurious Signals**
103 |
 104 | Park et al. [Disentangling Length from Quality in Direct Preference Optimization.](https://arxiv.org/abs/2403.19159) 2024.
 105 |
 106 | Meng et al. [SimPO: Simple Preference Optimization with a Reference-Free Reward.](https://arxiv.org/pdf/2405.14734) 2024.
107 |
108 | Qi et al. [Safety Alignment Should Be Made More Than Just a Few Tokens Deep.](https://arxiv.org/abs/2406.05946) 2024.
109 |
110 | Zhang et al. [From Lists to Emojis: How Format Bias Affects Model Alignment.](https://arxiv.org/abs/2409.11704) 2024.
111 |
 112 | Park et al. [OffsetBias: Leveraging Debiased Data for Tuning Evaluators.](https://arxiv.org/abs/2407.06551) 2024.
113 |
114 | **Further Questions**
115 |
116 | Ren et al. [Bias Amplification in Language Model Evolution: An Iterated Learning Perspective.](https://arxiv.org/pdf/2404.04286) 2024.
117 |
118 | **Potential Mitigations**
119 |
120 | Eisenstein et al. [Helping or Herding? Reward Model Ensembles Mitigate But Do Not Eliminate Reward Hacking.](https://arxiv.org/abs/2312.09244) 2023.
121 |
122 | Gao et al. [Scaling Laws for Reward Model Overoptimization.](https://arxiv.org/abs/2210.10760) ICML, 2023.
123 |
 124 | Chen et al. [ODIN: Disentangled Reward Mitigates Hacking in RLHF.](https://arxiv.org/abs/2402.07319) 2024.
125 |
126 | Wang et al. [Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards.](https://arxiv.org/abs/2402.18571) 2024.
127 |
128 | Liu et al. [RRM: Robust Reward Model Training Mitigates Reward Hacking.](https://arxiv.org/abs/2409.13156) 2024.
129 |
130 | Ramé et al. [WARM: On the Benefits of Weight Averaged Reward Models](https://arxiv.org/abs/2401.12187). ICML, 2024.
131 |
132 | **Previous approaches to spurious correlations**
133 |
134 | Nam et al. [Learning From Failure: De-biasing Classifier from Biased Classifier.](https://arxiv.org/abs/2007.02561) NeurIPS, 2020.
135 |
136 | Liu et al. [Just Train Twice: Improving Group Robustness without Training Group Information.](http://arxiv.org/abs/2107.09044) ICML, 2021.
137 |
138 | Creager et al. [Environment Inference for Invariant Learning.](https://proceedings.mlr.press/v139/creager21a/creager21a.pdf) ICML, 2021.
139 |
140 | Yang et al. [Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias.](https://arxiv.org/abs/2305.18761) AISTATS, 2024.
141 |
142 | Sagawa et al. [Distributionally Robust Neural Networks.](https://arxiv.org/abs/1911.08731) ICLR, 2019.
143 |
144 | Kirichenko et al. [Last Layer Re-training is Sufficient for Robustness to Spurious Correlations.](https://arxiv.org/abs/2204.02937) ICLR, 2023.
145 |
146 | Deng et al. [Robust Learning with Progressive Data Expansion Against Spurious Correlation.](https://arxiv.org/abs/2306.04949) NeurIPS, 2023.
147 |
148 | Sagawa et al. [An Investigation of Why Overparameterization Exacerbates Spurious Correlations.](https://arxiv.org/abs/2005.04345) ICML, 2020.
149 |
150 | Lin et al. [Spurious Feature Diversification Improves Out-of-Distribution Generalization.](https://arxiv.org/abs/2309.17230) ICLR, 2024.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # RL(HF) and LLM Reasoning Summary Notes
2 |
3 | Welcome to my collection of study notes related to **Reinforcement Learning (Human Feedback)** and **Large Language Model Reasoning**. This repository is an evolving resource, containing literature references and personal summaries of the topics I’m currently exploring.
4 |
   5 | My notes draw on publicly available information from arXiv; all opinions are my own.
6 |
7 |
8 | ## Overview
9 |
  10 | - **Purpose**: The aim of this repository is to share my ongoing learning process and to provide an indexed collection of research articles and conceptual overviews on topics related to RL(HF) and LLM reasoning.
11 | - **Scope**: Notes include brief summaries, personal annotations, and relevant citations.
12 |
13 |
14 | #### Repository Structure
15 | ```
16 | .
17 | ├── Notes
18 | │ ├── Intro_RLHF.md
19 | │ ├── Reward_Hacking.md
20 | │ ├── R1_reasoning.md
21 | │ └── ...
22 | └── README.md
23 |
24 | ```
25 |
26 | ## Index of Notes
27 |
28 | Below is an index of the Markdown files currently available in the `Notes` folder, along with brief descriptions.
29 |
30 | 1. **[Intro_RLHF.md](Notes/Intro_RLHF.md)**
  31 |    *A brief and partial summary of RLHF algorithms as of the end of 2024. It collects the papers and useful blogs on RLHF covered in my reading group presentation; please find the slides [here](/Slides/Intro_RLHF_Reading_Group.pdf).*
32 |
33 | 2. **[Reward_Hacking.md](Notes/Reward_Hacking.md)**
34 |
35 | 3. **[R1_reasoning.md](Notes/R1_reasoning.md)**
36 |
37 | ## Getting Involved
38 |
39 | - **Suggestions & Feedback**: If you have ideas on improving the notes, please open an issue or share a pull request.
40 |
41 | All kinds of contributions—small fixes or new references—are most welcome :) Thanks!
42 |
--------------------------------------------------------------------------------
/Slides/Intro_RLHF_Reading_Group.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yihedeng9/rlhf-summary-notes/ba924f487ba24d9c63accdb7214e659dbbb8226b/Slides/Intro_RLHF_Reading_Group.pdf
--------------------------------------------------------------------------------
/images/bt_model_rlhf_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yihedeng9/rlhf-summary-notes/ba924f487ba24d9c63accdb7214e659dbbb8226b/images/bt_model_rlhf_workflow.png
--------------------------------------------------------------------------------