├── images
│   └── tweet_mark_chen.png
├── self-improvement_self-evolvement.md
└── README.md
/images/tweet_mark_chen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jzhou316/Post-DeepSeek-R1_LLM-RL/HEAD/images/tweet_mark_chen.png
--------------------------------------------------------------------------------
/self-improvement_self-evolvement.md:
--------------------------------------------------------------------------------
1 | ## Recent Work on Self-Improvement of LLMs
2 |
3 | [(2022 Oct) Large Language Models Can Self-Improve](https://arxiv.org/abs/2210.11610)
4 |
5 | [(2023 Mar) Self-Refine: Iterative Refinement with Self-Feedback](https://arxiv.org/abs/2303.17651)
6 |
7 | [(2024 Jan) Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)
8 |
9 | [(2024 Apr) Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https://arxiv.org/abs/2404.12253)
10 |
11 | [(2024 Apr) A Survey on Self-Evolution of Large Language Models](https://arxiv.org/abs/2404.14387)
12 |
13 | - **Survey** paper that conceptualizes LLM self-evolution as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation.
14 |
15 | [(2024 June) Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335)
16 |
17 | [(2024 Sept; ICLR 2025) Training Language Models to Self-Correct via Reinforcement Learning](https://arxiv.org/abs/2409.12917)
18 |
19 | - Google DeepMind
20 | - OpenReview Discussion: https://openreview.net/forum?id=CjwERcAU7w
21 |
22 | [(2024 Oct) Techniques for Self-Improving LLM Evals](https://arize.com/blog/techniques-for-self-improving-llm-evals/)
23 |
24 | [(2024 Nov) Self-Consistency Preference Optimization](https://arxiv.org/abs/2411.04109)
25 |
26 | [(2024 Nov) Self-Evolved Reward Learning for LLMs](https://arxiv.org/abs/2411.00418)
27 |
28 | - Learning from self-feedback with a reward model (RM)
29 |
30 | [(2024 Nov) Preference Optimization for Reasoning with Pseudo Feedback](https://arxiv.org/abs/2411.16345)
31 |
32 | [(2024 Dec) AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement](https://arxiv.org/abs/2412.06176)
33 |
34 | - Inference-time iteration for code generation
35 |
36 | [(2024 Dec) Self-Improvement in Language Models: The Sharpening Mechanism](https://arxiv.org/abs/2412.01951)
37 |
38 | [(2024 Dec) A Survey on LLM Inference-Time Self-Improvement](https://arxiv.org/pdf/2412.14352)
39 |
40 | - **Survey** paper for inference-time self-improvement
41 | - Collection of prior papers: https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement
42 |
43 | [(2024 Dec) Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models](https://arxiv.org/abs/2412.02674)
44 |
45 | [(2025 Jan) rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)
46 |
47 | - Specific to math reasoning; iteratively uses Monte Carlo Tree Search and process reward models
48 |
49 | [(2025 Feb) Self-rewarding correction for mathematical reasoning](https://arxiv.org/abs/2502.19613)
50 |
51 | - Mathematical reasoning
52 |
53 | [(2025 Feb) Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data](https://arxiv.org/abs/2502.05400)
54 |
55 | Haoyan Yang, et al.
56 |
57 | [(2025 Mar) Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models](https://arxiv.org/abs/2503.04813)
58 |
59 | [(2025 Apr - ICLR 2025 Workshop) Scaling Self-Improving Foundation Models](https://sites.google.com/berkeley.edu/selfimprovingfoundationmodels/home)
60 |
61 | - ICLR 2025 **Workshop**
62 | - Accepted papers around the topic [here](https://sites.google.com/berkeley.edu/selfimprovingfoundationmodels/accepted-papers)
63 |
64 | [(2025 Apr) A Self-Improving Coding Agent](https://arxiv.org/abs/2504.15228)
65 |
66 | [(2025 Apr) Collaborative Reasoner : Self-Improving Social Agents with Synthetic Conversations](https://ai.meta.com/research/publications/collaborative-reasoner-self-improving-social-agents-with-synthetic-conversations/)
67 |
68 | [(2025 May ICML Position) Position: Truly Self-Improving Agents Require Intrinsic Metacognitive Learning](https://openreview.net/forum?id=4KhDd0Ozqe)
69 |
70 | [(2025 May) Can Large Reasoning Models Self-Train?](https://arxiv.org/abs/2505.21444)
71 |
72 | - Uses majority voting to obtain rewards, replacing an external verifier (a toy version is sketched below)
73 | - With long training, learning collapses to consistent but shortcut answers
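
A minimal sketch of the majority-voting pseudo-reward idea described above (my own illustration, not the paper's code): each sampled answer is rewarded for agreeing with the group's majority answer, so no external verifier or labels are needed.

```python
# Toy illustration of majority-vote self-rewards (assumed formulation, not the paper's code).
from collections import Counter

def majority_vote_rewards(final_answers: list[str]) -> list[float]:
    """Reward 1.0 for each sampled answer that matches the group's majority answer."""
    majority_answer, _ = Counter(final_answers).most_common(1)[0]
    return [1.0 if ans == majority_answer else 0.0 for ans in final_answers]

# Example: 8 rollouts for one question; the majority answer "42" acts as a pseudo-label.
print(majority_vote_rewards(["42", "42", "41", "42", "7", "42", "42", "41"]))
# -> [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```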
74 |
75 | [(2025 May) Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO](https://arxiv.org/abs/2505.22453)
76 |
77 | - Very similar to the above with majority-voting rewards, but applied to multimodal models
78 |
79 | [(2025 May) UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents](https://arxiv.org/abs/2505.21496)
80 |
81 | - Autonomous Agents
82 |
83 | [(2025 May) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution](https://arxiv.org/abs/2505.20286)
84 |
85 | - Autonomous Agents
86 |
87 | [(2025 May) Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents](https://arxiv.org/abs/2505.22954)
88 |
89 | - Autonomous Agents
90 |
91 | [(2025 May) DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning](https://arxiv.org/abs/2505.15734)
92 |
93 | - Using multi-agent debate to create reasoning traces for training reasoning models
94 |
95 | [(2025 May) AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)
96 |
97 | - From Google: An evolutionary coding agent powered by large language models for general-purpose algorithm discovery and optimization.
98 | - [The Rise of Self-Evolving AI: Google’s AlphaEvolve is Just the Beginning](https://www.linkedin.com/pulse/rise-self-evolving-ai-googles-alphaevolve-just-beginning-reddy-oqojc)
99 | - [Meet SELF-DISCOVER: Google DeepMind’s New Method for LLM Reasoning](https://jrodthoughts.medium.com/meet-self-discover-google-deepminds-new-method-for-llm-reasoning-4f3fdc547926)
100 |
101 |
102 | [(2025) Language Modeling by Language Models](https://arxiv.org/abs/2506.20249)
103 |
104 | - Using LLM agents to discover **LM architectures**
105 | - Automated pipeline for self-improvement on the architecture exploration problem
106 |
107 | [(2025 Sept) Autonomous Code Evolution Meets NP-Completeness](https://arxiv.org/abs/2509.07367v1)
108 |
109 | - Follows AlphaEvolve; from NVIDIA.
110 |
111 |
112 |
113 | [(2025 May) Latent Principle Discovery for Language Model Self-Improvement](https://arxiv.org/abs/2505.16927)
114 |
115 | [(2025 May) Self-Evolving Curriculum for LLM Reasoning](https://arxiv.org/abs/2505.14970)
116 |
117 | - Curriculum learning approach with RL for training data selection
118 |
119 | [(2025 June) Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning](https://arxiv.org/pdf/2506.08745)
120 |
121 | - Measures how often intermediate reasoning steps lead to the same final answer, summarized as a "consistency" metric over the reasoning trajectory
122 | - Also measures how often the final answer suddenly changes at later reasoning steps, as a "volatility" metric (both metrics are sketched below)
123 | - Observes a clear separation of these two metrics between trajectories leading to correct vs. incorrect final answers
124 | - Includes these trajectory statistics in the reward, plus a "curiosity" reward that encourages diversity -> no external reward is needed, since during training the reward depends only on the sampled trajectories and their final answers
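
A rough sketch of how the two trajectory statistics could be computed (my reading of the paper, with assumed definitions): `consistency` counts how many intermediate answers already match the final answer, and `volatility` counts late answer switches.

```python
# Toy versions of the "consistency" and "volatility" trajectory statistics (assumed definitions).
def consistency(intermediate_answers: list[str], final_answer: str) -> float:
    """Fraction of intermediate steps whose extracted answer already equals the final answer."""
    if not intermediate_answers:
        return 0.0
    return sum(a == final_answer for a in intermediate_answers) / len(intermediate_answers)

def volatility(intermediate_answers: list[str], late_fraction: float = 0.5) -> float:
    """Rate of answer switches in the late part of the trajectory."""
    late = intermediate_answers[int(len(intermediate_answers) * late_fraction):]
    if len(late) < 2:
        return 0.0
    return sum(late[i] != late[i - 1] for i in range(1, len(late))) / (len(late) - 1)

steps = ["12", "15", "15", "15", "15", "15"]   # answers extracted after each reasoning step
print(consistency(steps, final_answer="15"), volatility(steps))  # -> 0.833..., 0.0
```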
125 |
126 |
127 | [(2025 June) Self-Adapting Language Models](https://arxiv.org/pdf/2506.10943)
128 |
129 | - _Joe: Something I've been thinking about doing. Very excited about this direction._
130 |
131 | [(2025 Aug) Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement](https://arxiv.org/abs/2508.00410)
132 |
133 | - Paraphrases questions and rewards answers that are consistent across paraphrases (via voting)
134 | - _Joe: similar to consistency-based self-rewarding but with a tweak_
135 |
136 | [(2025 Sept) Self-Evolving LLMs via Continual Instruction Tuning](https://www.arxiv.org/abs/2509.18133)
137 |
138 |
139 | ### Zero
140 |
141 | _Joe: very interested in this._
142 |
143 | [(2025 May) Absolute Zero: Reinforced Self-play Reasoning with Zero Data](https://arxiv.org/abs/2505.03335)
144 |
145 | [(2025 Aug) R-Zero: Self-Evolving Reasoning LLM from Zero Data](https://arxiv.org/abs/2508.05004)
146 |
147 |
148 |
149 | ### Self-Exploration
150 |
151 | Like humans, who can explore environments and gain new knowledge in unfamiliar situations, future models/agents will need to do the same: interactively exploring and exploiting their environment for continual learning.
152 |
153 | [(2024) NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild](https://www.nnetnav.dev/)
154 |
155 |
156 | ### Systems and Code
157 |
158 | [(2023) metamorph](https://github.com/victorb/metamorph/)
159 |
160 | - An early attempt to use GPT to improve code
161 | - "An experiment in letting GPT-4 edit a program by itself. What program? The program that lets GPT-4 edit itself."
162 |
163 | [(2025 May) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents](https://github.com/jennyzzt/dgm)
164 |
165 | - Coding agent for the paper
166 |
167 | [OpenEvolve](https://github.com/codelion/openevolve)
168 | - An evolutionary coding agent
169 | - Based on AlphaEvolve research
170 | - Serves as both a research platform for evolutionary AI and a practical tool for automated code optimization.
171 |
172 | ### Collection
173 |
174 | [(Paper Collection) Awesome Label-Free Reinforcement Learning with Verifiable Rewards](https://github.com/QingyangZhang/Label-Free-RLVR/)
175 |
176 | - RLVR
177 |
178 | [(2025 Aug) A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence](https://arxiv.org/abs/2507.21046)
179 |
180 | [Stanford CS329A Self-Improving AI Agents](https://cs329a.stanford.edu/)
181 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Post-DeepSeek-R1
2 | Resources and research after DeepSeek-R1, around test-time computing, the resurgence of RL, and new LLM learning/application paradigms.
3 |
4 |
5 | > This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
6 |
7 | -- From [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
8 |
9 |
10 |
![Tweet from Mark Chen](images/tweet_mark_chen.png)
11 |
12 |
13 | -- From [Mark Chen](https://x.com/markchen90/status/1884303237186216272), OpenAI Chief Research Officer
14 |
15 | ---
16 |
17 | ### Table of Contents
18 |
19 | - [DeepSeek-R1 Reproduction](#deepseek-r1-reproduction-popular-and-fast-ones)
20 | - [R1-like RL Reproduction for More Scenarios](#r1-like-rl-reproduction-for-more-scenarios)
21 | - [Tools](#tools)
22 | - [LLM + RL with/for X](#llm--rl-withfor-x)
23 | - [Literature](#literature)
24 | - [Test-time Scaling](#test-time-scaling)
25 | - [Process Reward](#process-reward-after-o1)
26 | - [Multimodal, Image Generation](#multimodal-image-generation)
27 | - [RL for Different Ways of Generation](#rl-for-different-ways-of-generation)
28 | - [Improve Long CoT for Reasoning](#improve-long-cot-for-reasoning)
29 | - [Understanding R1 and RL + LLMs, Tricks to Train RL](#understanding-r1-and-rl--llms-tricks-to-train-rl)
30 | - [Efficiency](#efficiency)
31 |
32 | ---
33 |
34 | ## DeepSeek-R1 Reproduction ("popular" and fast ones)
35 |
36 |
37 | #### [Simple Reinforcement Learning for Reasoning](https://github.com/hkust-nlp/simpleRL-reason?tab=readme-ov-file#simple-reinforcement-learning-for-reasoning) (HKUST)
38 |
39 |
40 | - Rule-based reward (no MCTS and reward models); a toy example of such a reward is sketched after this list
41 | - Uses PPO rather than GRPO
42 | - Trains small models (7B) on limited data (8K examples)
43 | - Starting from Qwen2.5-Math-7B (base model), performs RL on it directly, achieving surprisingly strong results
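
Below is a toy example of what such a rule-based reward can look like (a common R1-style pattern, not necessarily simpleRL's exact implementation): a small format reward for the `<think>...</think>` template plus a correctness reward from exact answer matching; no reward model or MCTS involved.

```python
# Toy R1-style rule-based reward: format check + exact-match correctness (illustrative only).
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    reward = 0.0
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1                                   # format reward
    boxed = re.search(r"\\boxed\{([^}]*)\}", response)  # final answer in \boxed{...}
    if boxed and boxed.group(1).strip() == gold_answer.strip():
        reward += 1.0                                   # correctness reward
    return reward

print(rule_based_reward("<think>2+2=4</think> The answer is \\boxed{4}.", "4"))  # -> 1.1
```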
44 |
45 |
46 |

47 |
48 |
49 | > Training dynamics of our Qwen2.5-SimpleRL-Zero training starting from the Qwen2.5-Math-7B, without SFT or reward models.
50 |
51 |
52 | #### [DeepScaleR](https://github.com/agentica-project/deepscaler/tree/main?tab=readme-ov-file#deepscaler) (Berkeley)
53 |
54 | - Aimed to democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale
55 | - Iteratively scaling Deepseek's GRPO algorithm from 8K→16K→24K context length for thinking
56 | - Trained on top of [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) (_Joe: so the initial model is already capable of deep thinking; it would be better if we could do this from base models_)
57 | - Heavily based on a modified fork of [veRL](https://github.com/volcengine/verl), an open-source RLHF library
58 | - Good insight and training recipe: error cases initially have longer CoTs, so the context length for thinking is gradually extended during training (_Joe: a sort of curriculum learning for RL_)
59 |
60 | 
61 |
62 | *Figure 1: DeepScaleR 1.5B model's Pass@1 accuracy on AIME2024 as RL training progresses. At step 1040 and 1520, the context length is extended to 16K and 24K. For more details, see our [blog post](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2).*
63 |
64 |
65 | #### [Open R1](https://github.com/huggingface/open-r1?tab=readme-ov-file#open-r1) (Hugging Face)
66 |
67 | - Fully open reproduction of DeepSeek-R1
68 | - [Blog post](https://huggingface.co/blog/open-r1)
69 |
70 |
71 | #### [TinyZero](https://github.com/Jiayi-Pan/TinyZero)
72 |
73 | - A reproduction of DeepSeek-R1-Zero in countdown and multiplication tasks
74 | - Through RL, the 3B base LM develops self-verification and search abilities all on its own
75 | - Fails to learn reasoning with Qwen2.5-0.5B base
76 | - Works with [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) model
77 | - Experiment run based on [veRL](https://github.com/volcengine/verl)
78 |
79 | #### [Mini-R1](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/mini-deepseek-r1-aha-grpo.ipynb)
80 |
81 | - A minimal single notebook that tries to reproduce the DeepSeek-R1 "reasoning" results on a single task (the Countdown Game)
82 | - Uses GRPO and Q-Lora, also with the [TRL](https://huggingface.co/docs/trl/en/index) library
83 | - Starting with the [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) model (models > 1.5B suggested) (_Joe: Yes, the starting model needs to have certain capabilities_)
84 | - Good learning material with code
85 |
86 |
87 | #### [Oat-Zero](https://github.com/sail-sg/oat-zero?tab=readme-ov-file#there-may-not-be-aha-moment-in-r1-zero-like-training--a-pilot-study)
88 |
89 | There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study
90 |
91 | - Aha moment (such as self-reflection patterns) may already exist in the base model.
92 | - Base models' responses already contain Superficial Self-Reflection (SSR), where self-reflection does not necessarily lead to correct final answers.
93 | - A closer look at R1-Zero-like RL training finds that the increasing response length is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions.
94 |
95 | #### [Open Reasoner Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero?tab=readme-ov-file#open-reasoner-zero)
96 |
97 | - An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
98 | - Uses PPO (instead of GRPO; some [discussions](https://x.com/rosstaylor90/status/1892664646890312125))
99 |
100 |
101 | #### [Colab Reproductions with Unsloth](https://unsloth.ai/blog/r1-reasoning)
102 |
103 | - One GPU with GRPO (worth trying when resource-constrained)
104 | - Experience the "aha moment" for [free on Colab](https://x.com/danielhanchen/status/1887564724071768529) (seems easy to play with)
105 |
106 |
107 | ### Online Materials, Discussions
108 |
109 | - Video tutorials from Sasha Rush on [o1-like test-time scaling](https://github.com/srush/awesome-o1) and [DeepSeek](https://www.youtube.com/watch?v=KtBcIDtS13M)
110 | - [Some takeaways from the R1, DeepSeek-V3 and GRPO papers](https://x.com/Dan_Jeffries1/status/1881679981849215080) (twitter)
111 |
112 |
113 | ### Other RL Trained Models
114 |
115 | [(2025 Mar) QwQ-32B: Embracing the Power of Reinforcement Learning](https://qwenlm.github.io/blog/qwq-32b/)
116 |
117 |
118 |
119 | ## R1-like RL Reproduction for More Scenarios
120 |
121 | ### Tools
122 |
123 | - RL libraries:
124 | - [veRL](https://github.com/volcengine/verl) (seems most popular as of Mar 2025). Check this [list](https://github.com/volcengine/verl?tab=readme-ov-file#awesome-work-using-verl) of R1 followup works
125 | - [TRL](https://huggingface.co/docs/trl/en/index)
126 | - Inference: [vLLM](https://github.com/vllm-project/vllm) seems a must to speed up inference
127 | - Starting models: [Qwen2.5](https://github.com/QwenLM/Qwen2.5) (base, instruct, R1-distilled, math) seems most popular (as of Mar 2025) (why? some [empirical answers](https://arxiv.org/abs/2503.01307)); both 3B and 7B models have been made to work, and 0.5B is a bit weaker but can also learn
128 | - RL algorithms: [GRPO](https://arxiv.org/abs/2402.03300), [PPO](https://arxiv.org/pdf/1707.06347) (some dispute on whether GRPO is a must, [here](https://github.com/ZihanWang314/ragen?tab=readme-ov-file#-ragen-training-agents-by-reinforcing-reasoning-) and [here](https://x.com/finbarrtimbers/status/1899118175830397322)); a minimal GRPO training sketch follows this list
129 |   - some tutorials [here](https://anukriti-ranjan.medium.com/preference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090#:~:text=GRPO%2C%20from%20DeepSeek%20AI%2C%20is,making%20it%20lighter%20and%20faster.) and [here](https://huggingface.co/blog/NormalUhr/grpo)
130 | - GPU resources: see the other reproductions, and discussion e.g. [here](https://github.com/huggingface/open-r1/issues/100)
131 | - One GPU with GRPO on [Colab](https://unsloth.ai/blog/r1-reasoning)
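
For reference, a minimal GRPO fine-tuning sketch using TRL's `GRPOTrainer` (argument names may vary slightly across TRL versions; the dataset, model, and reward below are placeholders, not a recommended recipe):

```python
# Minimal GRPO sketch with TRL (placeholder dataset/model/reward; check your TRL version's API).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy rule-based reward: encourage a <think>...</think> section in each completion.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")          # any prompt dataset works
args = GRPOConfig(output_dir="qwen2.5-0.5b-grpo", num_generations=8, max_completion_length=512)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",                        # small model for a quick test
    reward_funcs=format_reward,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```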
132 |
133 | ### LLM + RL with/for X
134 |
135 |
136 | #### [RAGEN: Training Agents by Reinforcing Reasoning](https://github.com/ZihanWang314/ragen?tab=readme-ov-file#-ragen-training-agents-by-reinforcing-reasoning-)
137 | RL + LLM applied to **agents**
138 | - Using PPO instead of GRPO
139 |
140 | #### [Logic-RL](https://github.com/Unakar/Logic-RL?tab=readme-ov-file#logic-rl)
141 | RL + LLM applied with **synthetic logic puzzles** with controllable complexity and straightforward answer verification
142 |
143 | #### [Teaching Language Models to Critique via Reinforcement Learning](https://github.com/HKUNLP/critic-rl?tab=readme-ov-file#-teaching-language-models-to-critique-via-reinforcement-learning-)
144 | RL + LLM applied to **coding**
145 | - Train with GRPO using verifiable rewards from sandbox execution
146 |
147 | #### [Code-R1: Reproducing R1 for Code with Reliable Rewards](https://github.com/ganler/code-r1?tab=readme-ov-file#code-r1-reproducing-r1-for-code-with-reliable-rewards)
148 | RL + LLM applied to **coding**
149 |
150 | #### [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework](https://github.com/hiyouga/EasyR1)
151 | RL + LLM applied to **multimodality** (such as VLMs)
152 |
153 | #### [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning](https://arxiv.org/abs/2503.05379)
154 | RL + LLM applied to **multimodality**
155 |
156 | - For the specific task of emotion recognition, with visual and audio signals (videos)
157 | - Learning with a 0.5B model
158 |
159 | #### [Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?](https://arxiv.org/abs/2505.09439)
160 | RL + LLM applied to **multimodality**
161 |
162 | - Audio LLM, fine-tuned with GRPO
163 |
164 | #### [SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning](https://arxiv.org/abs/2504.20024)
165 | RL + LLM applied to **multimodality**
166 |
167 | - Spatial reasoning with 3D augmented input parsed from images
168 |
169 | #### [Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning](https://github.com/PeterGriffinJin/Search-R1?tab=readme-ov-file#search-r1-train-your-llms-to-reason-and-call-a-search-engine-with-reinforcement-learning)
170 |
171 | RL + LLM applied to **retrieval** (interleaved with generation/reasoning)
172 | - Tested on NQ dataset, retrieving from Wikipedia
173 |
174 | #### [ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning](https://github.com/Agent-RL/ReSearch)
175 | RL + LLM applied to **retrieval** (RAG)
176 | - Trained with HotpotQA data
177 |
178 | #### [DeepRetrieval - Hacking Search Engines & Retrievers with LLM + RL](https://github.com/pat-jj/DeepRetrieval)
179 | RL + LLM applied to **retrieval**
180 | - Tested on literature mining, publication search and trial search tasks
181 |
182 |
183 | ---
184 |
185 | ## Literature
186 |
187 | Here is a collection of papers of different topics and flavors. They are not (cannot be) exhaustive, but grouped based on their themes to give some sense of different types of research and problems in the space.
188 |
189 | _Joe: I marked each paper with year and month, due to the extremely fast pace of research in this exploding domain_
190 |
191 |
192 | ### Test-time Scaling
193 |
194 | [(2024 Aug) Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](https://arxiv.org/abs/2408.03314)
195 |
196 | -> Test-time scaling for math
197 |
198 | - Includes search strategies such as Best-of-N, beam search, and beam search with lookahead
199 | - Involves process reward model (PRM) and revision models
200 |
201 | (2024 Nov) Deliberative Alignment: Reasoning Enables Safer Language Models
202 |
203 | -> Test-time scaling for safety
204 |
205 | [(2025 Jan) s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393)
206 |
207 | -> Test-time scaling for reasoning
208 |
209 | - Collected 1K datapoints from diverse datasets and their reasoning traces (from Google Gemini Flash Thinking API), and then a pipeline of quality control and filtering
210 | - Fine-tunes Qwen2.5-32B-Instruct on the 1K datapoints; training takes just 26 minutes on 16 NVIDIA H100 GPUs
211 | - Controls test-time compute in the sequential-generation setting (as opposed to parallel methods like search or best-of-N): reasoning length is controlled by forcing "Final Answer:" to stop thinking early, or appending "Wait" to extend it (see the sketch after this list)
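
A rough sketch of the "budget forcing" idea, as I understand it from the paper (the `generate` callable is a hypothetical helper standing in for any LLM completion API, not a real library function): appending "Wait" suppresses stopping and extends thinking, while forcing "Final Answer:" cuts thinking short.

```python
# Hypothetical sketch of s1-style budget forcing; `generate(prompt, stop=...)` is a stand-in
# for any text-completion call and is NOT a real library API.
def budget_forced_generate(generate, prompt: str, min_think_tokens: int = 256,
                           max_extensions: int = 2) -> str:
    thinking = generate(prompt, stop="Final Answer:")
    for _ in range(max_extensions):                  # extend thinking if it ended too early
        if len(thinking.split()) >= min_think_tokens:
            break
        thinking += "\nWait"                         # nudge the model to keep reasoning
        thinking += generate(prompt + thinking, stop="Final Answer:")
    # Force the end-of-thinking delimiter to obtain the answer.
    return thinking + "\nFinal Answer:" + generate(prompt + thinking + "\nFinal Answer:")
```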
212 |
213 | [(2025 Feb) S∗: Test Time Scaling for Code Generation](https://arxiv.org/pdf/2502.14382)
214 |
215 | -> Test-time scaling for coding
216 |
217 | [(2025 Feb) Teaching Language Models to Critique via Reinforcement Learning](https://arxiv.org/abs/2502.03492)
218 |
219 |
220 | -> Test-time scaling for coding
221 |
222 | > [!note]
223 | > _Joe: If we think about test time computing promoted by OpenAI o1, Deepmind [AlphaCode](https://deepmind.google/discover/blog/competitive-programming-with-alphacode/) in 2022 already used test-time scaling to do a lot of sampling and selection to boost the performance of competitive coding._
224 |
225 | [(2025 Feb) Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers](https://arxiv.org/abs/2502.20379)
226 |
227 | -> Test-time scaling with multiple agents (LLMs) for verification
228 |
229 | [(2025 Mar) Chain-of-Retrieval Augmented Generation](https://arxiv.org/abs/2501.14342)
230 |
231 | -> Test-time scaling for RAG
232 |
233 | - Designs ways to scale up inference computation for RAG, such as decomposing the question into modular sub-questions and retrieving iteratively
234 | - _Joe: this is a recurring theme of current research on test-time scaling for X: design ways to increase inference computation, whether it be long CoT, search, verification, etc._
235 |
236 | [(2025 Mar) Remasking Discrete Diffusion Models with Inference-Time Scaling](https://arxiv.org/abs/2503.00307)
237 |
238 | -> Test-time scaling for discrete diffusion models for texts
239 |
240 |
241 | #### Scaling Laws (all kinds of)
242 |
244 |
245 | [(2024 Feb) Scaling Laws for Downstream Task Performance in Machine Translation](https://arxiv.org/abs/2402.04177)
246 |
247 | -> Scaling behavior in a transfer learning setting
248 |
249 | [(2025 Feb) Distillation Scaling Laws](https://arxiv.org/abs/2502.08606)
250 |
251 | -> Scaling behavior for knowledge distillation
252 |
253 | [(2025 Feb) Distributional Scaling Laws for Emergent Capabilities](https://arxiv.org/abs/2502.17356)
254 |
255 | -> Studies emergent capabilities across multiple training runs with different random seeds
256 |
257 | - Training experiments with Qwen2.5-0.5B and Qwen2.5-1.5B
258 |
259 |
260 |
261 | ### Process Reward (after o1)
262 |
263 | [(2025 Feb) Process Reinforcement through Implicit Rewards](https://arxiv.org/abs/2502.01456)
264 |
265 |
266 | ### Multimodal, Image Generation
267 |
268 | [(2025 Jan) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step](https://arxiv.org/abs/2501.13926)
269 |
270 | [(2025 Mar) ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning](https://arxiv.org/abs/2503.19312)
271 |
272 | [(2025 Apr) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning](https://arxiv.org/abs/2504.20024)
273 |
274 | - Spatial reasoning from vision inputs, augmented with parsed 3D structures
275 |
276 |
277 | [(2025 May) Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?](https://arxiv.org/abs/2505.09439)
278 |
279 |
280 |
281 | ### RL for Different Ways of Generation
282 |
283 | [(2025 Feb) Self-rewarding correction for mathematical reasoning](https://arxiv.org/pdf/2502.19613)
284 |
285 |
286 | -> Self-corrections trained with RL during generation
287 |
288 | [(2025 Mar) Reinforcement Learning for Long-Horizon Interactive LLM Agents](https://arxiv.org/pdf/2502.01600)
289 |
290 | -> RL (LOOP, a data- and memory-efficient variant of proximal policy optimization) for long-horizon interactive **agents** ([AppWorld](https://appworld.dev/))
291 |
292 |
293 |
294 | ### Improve Long CoT for Reasoning
295 |
296 | [(2025 Mar) START: Self-taught Reasoner with Tools](https://arxiv.org/abs/2503.04625)
297 |
298 | -> Integrates tool usage with reasoning, with controlled hint insertion and rejection sampling for training
299 | - Tool usage (writing Python code) inside reasoning
300 | - Enhance tool usage by injecting hint sequences in CoT during training, such as "Wait", "Maybe I can use Python" at various places based on heuristics
301 | - Interleave Python code + executor with reasoning
302 | - Rejection sampling fine-tuning (RFT)
303 | - _Joe: this uses rejection sampling (you can call it RL, as in the [Llama 2 paper](https://arxiv.org/abs/2307.09288)). The paper is not well polished (e.g., small things like in-text citation formats)._
304 |
305 | [(2025 Feb) LIMO: Less is More for Reasoning](https://arxiv.org/abs/2502.03387)
306 |
307 | - 817 curated training samples
308 | - Fine-tune Qwen2.5-32B-Instruct with SFT
309 |
310 | [(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free](https://arxiv.org/abs/2504.07986)
311 |
312 | - Categorize the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
313 | - Finds that wrong reasoning often results in much longer generations, with more reflecting and transition thoughts
314 | - Extracts hidden states corresponding to the different behavioral steps, and constructs steering vectors to control the type of reasoning steps (see the sketch after this list)
315 | - Achieve more effective and efficient reasoning with inference-time steering
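
A minimal sketch of the general steering-vector recipe (my abstraction of the idea, not SEAL's exact calibration procedure): average the hidden states of two step categories offline, take their difference, and add it to decoding-time hidden states with a strength `alpha`.

```python
# Generic steering-vector sketch (illustrative; not SEAL's exact procedure).
import numpy as np

def steering_vector(execution_states: np.ndarray, reflection_states: np.ndarray) -> np.ndarray:
    """Each input: (num_steps, hidden_dim) hidden states collected offline per step category."""
    return execution_states.mean(axis=0) - reflection_states.mean(axis=0)

def steer(hidden_state: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift a decoding-time hidden state toward execution-style (away from reflection) steps."""
    return hidden_state + alpha * vector
```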
316 |
317 | ### Understanding R1 and RL + LLMs, Tricks to Train RL
318 |
319 | [(2025 Jan) Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://arxiv.org/abs/2501.11651)
320 |
321 | -> Tricks to scale up RL training to make it work
322 |
323 | - Encourage sample diversity through oversampling
324 | - Auxiliary loss on entropy
325 | - Penalize undesired behaviors
326 |
327 | [(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)
328 |
329 | -> Analyzing the learning dynamics of emergent reasoning with LLM + RL, across different factors such as SFT initialization, length reward design, etc.
330 |
331 | [(2025 Mar) Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs](https://arxiv.org/abs/2503.01307)
332 |
333 | -> Analyzing the behaviors of emergent reasoning from LLM + RL, across base models and training data
334 |
335 | - Why does Qwen work better than Llama? Qwen already exhibits certain reasoning behaviors before training
336 | - Priming Llama to begin RL training with data containing complex reasoning behaviors helps, even when the final answer is not correct
337 | - _Joe: somehow I don't really get the name of cognitive behaviors (and the whole title); maybe I'm naive_
338 |
339 | [(2025 Mar) Understanding R1-Zero-Like Training: A Critical Perspective](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf)
340 |
341 | -> Analyzing base models and RL
342 |
343 | [(2025, Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models](https://arxiv.org/abs/2503.02875)
344 |
345 | -> Analyzing the role of prefixes of reasoning trajectories; could also work for self-improvement
346 |
347 | - Finds low diversity in the first few generated tokens (which makes sense: the prefix is short, and the number of possible trajectories grows exponentially with length)
348 | - Samples only a short prefix and fine-tunes the model on it, without using labels (a data-construction sketch follows this list)
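
A small sketch of how such label-free prefix data could be constructed (an assumption about the recipe, not the paper's code; `tokenize` and `generations` are supplied by the caller).

```python
# Hypothetical data construction for unsupervised prefix fine-tuning: keep only the first
# few self-generated tokens as the SFT target, without any gold labels.
def build_prefix_sft_examples(prompts: list[str], generations: list[str],
                              tokenize, prefix_tokens: int = 8) -> list[dict]:
    examples = []
    for prompt, generation in zip(prompts, generations):
        prefix = tokenize(generation)[:prefix_tokens]       # short, low-diversity prefix
        examples.append({"prompt": prompt, "target": prefix})
    return examples

# Example with a whitespace "tokenizer":
print(build_prefix_sft_examples(["Q: 2+2?"], ["Let us compute step by step: 2 + 2 = 4"],
                                tokenize=str.split, prefix_tokens=4))
```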
349 |
350 |
351 | [(2025 June) Thought Anchors: Which LLM Reasoning Steps Matter?](https://arxiv.org/abs/2506.19143)
352 | -> Analysis of reasoning sentences
353 |
354 | - Breaks down the reasoning chain into single sentences, and checks their causal relations and importance to other sentences and the final answer
355 | - Summarizes a sentence taxonomy for reasoning sentences (Table 1 in Appendix A)
356 | - Also visualizes the results, with a good demo page: https://www.thought-anchors.com/
357 |
358 | [(2025 June) Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning](https://arxiv.org/pdf/2506.02867)
359 |
360 | - Mutual information is computed between hidden states (continuous vectors) at token step t and the ground-truth answer
361 | - Mutual information (MI) is not computed by estimating distributions in the Shannon-entropy form, but approximated by the Hilbert–Schmidt Independence Criterion (HSIC) with Gaussian kernels (see the sketch after this list)
362 | - MI is computed by first sampling the hidden-state vectors, then evaluating HSIC between the two resulting matrices (collections of the continuous vectors)
363 | - Token-step-t vectors are then mapped to tokens (e.g., by projection onto the vocabulary) for concrete analysis
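
A small numpy sketch of the (biased) HSIC estimator with Gaussian kernels, which can serve as the MI proxy described above (the median-heuristic bandwidth is my assumption, not necessarily the paper's choice).

```python
# Biased HSIC estimator with Gaussian kernels (illustrative; bandwidth via median heuristic).
import numpy as np

def _gaussian_gram(x: np.ndarray) -> np.ndarray:
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    sigma2 = np.median(sq_dists[sq_dists > 0]) if np.any(sq_dists > 0) else 1.0
    return np.exp(-sq_dists / (2 * sigma2))

def hsic(x: np.ndarray, y: np.ndarray) -> float:
    """x, y: (n, d_x) and (n, d_y) paired samples, e.g. hidden states vs. answer embeddings."""
    n = x.shape[0]
    k, l = _gaussian_gram(x), _gaussian_gram(y)
    h = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2
```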
364 |
365 | [(2025 Feb) Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology](https://arxiv.org/abs/2502.17026)
366 |
367 | -> Not necessarily long CoT, but built a topological graph to explain reasoning patterns.
368 |
369 | - The structured representation of reasoning could be applied elsewhere, e.g. to super long reasoning process.
370 |
371 | [(2025 Sept) Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic](https://arxiv.org/abs/2509.01363)
372 | -> Steering/task vectors for reasoning
373 |
374 | - Two identically initialized models, one trained with SFT and one with GRPO
375 | - Extracts a task vector as the difference between their parameters, which can be used to control reasoning behaviors (see the sketch after this list)
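
A minimal sketch of the task-arithmetic step (state-dict handling is my own simplification): the reasoning vector is the parameter difference between the GRPO-trained and SFT-trained checkpoints, and can be added to another compatible model's weights with a scaling factor.

```python
# Task-arithmetic sketch for a "reasoning vector" (illustrative; expects matching state dicts).
def reasoning_vector(grpo_state: dict, sft_state: dict) -> dict:
    return {name: grpo_state[name] - sft_state[name] for name in grpo_state}

def apply_reasoning_vector(base_state: dict, vector: dict, scale: float = 1.0) -> dict:
    return {name: base_state[name] + scale * vector[name] for name in base_state}
```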
376 |
377 |
378 | [(2025 Oct) First Try Matters: Revisiting the Role of Reflection in Reasoning Models](https://arxiv.org/abs/2510.08308)
379 | -> Challenges the notion that reflection in model reasoning actually performs "reflection"
380 |
381 | - Focuses on reflective behaviors in model reasoning
382 | - Finds that most reflective behaviors do not actually alter the model's reasoning, but merely confirm it
383 | - Fine-tuning on more reflective behaviors mostly enhances first-answer correctness
384 |
385 | [(2025 Sept) RL's Razor: Why Online Reinforcement Learning Forgets Less](https://arxiv.org/abs/2509.04259)
386 |
387 | - RL training incurs less forgetting than SFT
388 |
389 | [(2025 Apr) Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?](https://arxiv.org/abs/2504.13837)
390 |
391 | - Challenges the view that RL incentivizes new capabilities in the model, rather than just capitalizing on existing ones
392 |
393 | [(2025 Nov) Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs](https://arxiv.org/abs/2511.05933)
394 |
395 | - Similar research topic as above
396 |
397 |
398 | #### Data
399 |
400 | Thought Anchors: https://www.thought-anchors.com/
401 |
402 | Open Thoughts: https://github.com/open-thoughts/open-thoughts
403 |
404 | #### Training Recipe
405 |
406 | [(2025 Nov) JustRL: Scaling a 1.5B LLM with a Simple RL Recipe](https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8)
407 |
408 | - A simple RL training recipe for scaling up 1.5B LLM training
409 |
410 |
411 | #### RL Algorithms
412 |
413 | [(2025, Jun) TreeRPO: Tree Relative Policy Optimization](https://arxiv.org/abs/2506.05183)
414 |
415 | - Sampling generates a tree-structured trajectory, and rewards are collected for every node
416 | - Improves sampling efficiency, and thus training efficiency
417 |
418 | [(2025 June) Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning](https://arxiv.org/pdf/2506.08745)
419 |
420 | - Measures how often intermediate reasoning steps lead to the same final answer, summarized as a "consistency" metric over the reasoning trajectory
421 | - Also measures how often the final answer suddenly changes at later reasoning steps, as a "volatility" metric
422 | - Observes a clear separation of these two metrics between trajectories leading to correct vs. incorrect final answers
423 | - Includes these trajectory statistics in the reward, plus a "curiosity" reward that encourages diversity; also borrows the grouping idea from GRPO -> no external reward is needed, since during training the reward depends only on the sampled trajectories and their final answers
424 |
425 | [(2025, July) STeCa: Step-level Trajectory Calibration for LLM Agent Learning](https://aclanthology.org/2025.findings-acl.604/)
426 |
427 | ### Efficiency
428 |
429 | [(2024 Dec) Compressed Chain of Thought: Efficient Reasoning through Dense Representations](https://arxiv.org/abs/2412.13171)
430 |
431 | - Reasoning with continuous tokens
432 |
433 | [(2025 Feb) TokenSkip: Controllable Chain-of-Thought Compression in LLMs](https://arxiv.org/abs/2502.12067)
434 |
435 | - Filters out some "unimportant" CoT tokens based on heuristics to generate compressed CoT, then fine-tunes on the reduced trajectories (see the sketch after this list)
436 | - _Joe: similar flavor to context compression and token deletion, like LLMLingua_
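
A toy version of the token-pruning step (importance scores are assumed to come from some external scorer, e.g. an LLMLingua-style compressor; this is not the paper's pipeline).

```python
# Toy CoT compression: keep the top-scoring tokens at a target ratio, preserving order.
def compress_cot(tokens: list[str], scores: list[float], keep_ratio: float = 0.6) -> list[str]:
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]

tokens = ["So", ",", "2", "+", "2", "equals", "4"]
scores = [0.1, 0.05, 0.9, 0.8, 0.9, 0.3, 0.95]
print(compress_cot(tokens, scores))  # -> ['2', '+', '2', '4']
```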
437 |
438 | [(2025 Mar) Chain of Draft: Thinking Faster by Writing Less](https://arxiv.org/abs/2502.18600)
439 |
440 | -> _Joe: this does not use RL, but is just a simple way of prompting: limiting the reasoning step lengths with instructions in the prompt. I think we could similarly train an LLM with RL to enforce this, and/or use it as a reward, to improve efficiency during the reasoning process_
441 |
442 | -> _Joe: (a few days later) found out the following paper does that exactly lol_
443 |
444 | [(2025 Mar) L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning](https://arxiv.org/abs/2503.04697)
445 |
446 |
447 | -> _Joe: LLM + RL to encourage shorter reasoning steps. The approach conditions on special symbols in the prompt that control the number of reasoning steps, which adds another reward term (a toy length reward is sketched after this list)_
448 |
449 | - Training starts from the base model [DeepScaleR-1.5B-Preview](#deepscaler-berkeley) (using the same hyperparameters for GRPO)
450 | - Training data also from DeepScaleR-Preview-Dataset, 40K question-answer pairs drawn from AIME, AMC, Omni-Math and STILL
451 | - Training context length restricted to 4K, and testing restricted to 8K
452 | - Fine-tuned for 700 steps and further 120 steps for two different length reward formulations
453 | - Again using [VeRL](#tools) framework
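
A toy version of a length-controlled reward in this spirit (my simplification; the paper's exact formulation differs): correctness minus a penalty proportional to the gap between the generated length and the target length requested in the prompt.

```python
# Toy length-controlled reward (illustrative simplification of L1-style rewards).
def length_controlled_reward(correct: bool, num_tokens: int, target_tokens: int,
                             alpha: float = 0.001) -> float:
    return (1.0 if correct else 0.0) - alpha * abs(num_tokens - target_tokens)

print(length_controlled_reward(True, num_tokens=900, target_tokens=1000))  # -> 0.9
```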
454 |
455 | [(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)
456 |
457 | -> _Joe: see Section 4.2 for the length control with reward design. Strategy is similar to the paper above._
458 |
459 | [(2025 Feb, EMNLP 2025) LightThinker: Thinking Step-by-Step Compression](https://arxiv.org/abs/2502.15589)
460 |
461 | -> _Joe: Compresses thinking steps into a smaller set of special tokens. Trains with a special attention mask; at inference, the KV cache is reduced based on the mask structure._
462 |
463 | - Not using RL. Merging rules are based on heuristics.
464 |
465 | [(2025 Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models](https://arxiv.org/pdf/2503.02875)
466 |
467 |
468 | [(2025 Apr) Z1: Efficient Test-time Scaling with Code](https://arxiv.org/abs/2504.00810)
469 |
470 | -> Reducing reasoning token length through SFT on QwQ-32B-preview model generated data
471 | - Dataset size of 107K; SFT on Qwen-2.5-Coder-7B-Instruct with bfloat16, FSDP, and a global batch size of 128 for 2 epochs, using 8 NVIDIA A100-80G GPUs
473 | - Simple reasoning dataset analysis of trigram frequency in Section 2.1 and Appendix A.2
474 | - The biggest difference is removing `...` delimiters?
475 | - _Joe: Not quite sure about the "Shifted Thinking Window" name_
476 |
477 | [(2025 Apr) Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification](https://arxiv.org/abs/2504.05419v1)
478 |
479 | -> Probe whether the intermediate reasoning step hidden states can predict the correctness of the final answer
480 | - Can use the probe for early exit for long reasoning
481 |
482 | [(2025 Apr) ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning](https://arxiv.org/abs/2504.01296)
483 |
484 | -> Added reasoning length limit as a reward for RL
485 |
486 | [(2025 Apr) Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models](https://arxiv.org/abs/2503.16419)
487 |
488 | -> Survey
489 | - Collection of papers: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
490 |
491 | [(2025 Apr) Learning Adaptive Parallel Reasoning with Language Models](https://arxiv.org/abs/2504.15466)
492 |
493 | -> Changing the generation process to combine parallel and sequential search during generation
494 | - Similar to one of my earlier ideas: optimizing the generation process, trained directly with RL for efficiency
495 | - But it focuses on the Countdown task only, and trains a model from scratch for small-scale experiments
496 |
497 | [(2025 May) Learn to Reason Efficiently with Adaptive Length-based Reward Shaping](https://arxiv.org/abs/2505.15612)
498 |
499 | -> Reducing reasoning trajectory with different length reward shapes
500 |
501 | [(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free](https://arxiv.org/abs/2504.07986)
502 |
503 | - Categorize the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
504 | - Finds that wrong reasoning often results in much longer generations, with more reflecting and transition thoughts
505 | - Extracts hidden states corresponding to the different behavioral steps, and constructs steering vectors to control the type of reasoning steps
506 | - Achieves more effective and efficient reasoning with inference-time steering, by roughly controlling the number of reflection steps, etc.
507 |
508 | [(2025 May) AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models](https://arxiv.org/abs/2505.22662)
509 |
510 |
511 | [(2025 June) Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning](https://arxiv.org/abs/2506.05256)
512 |
513 | -> Again adds a length-related penalty to the reward for RL training, but adjusted to the difficulty of each question, measured by the pass rate over K samples (a toy version is sketched after this list)
514 | - The length reward formulation is a bit less straightforward
515 | - Doesn't show superior performance compared to previous baselines with simple length reward, such as L1-Max
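
A toy sketch of the difficulty-adaptive idea (assumed form, not the paper's exact reward): the length penalty is scaled by the pass rate over K samples, so easy questions are penalized more for long answers than hard ones.

```python
# Toy difficulty-adaptive length penalty (illustrative only).
def adaptive_length_reward(correct: bool, num_tokens: int, max_tokens: int,
                           pass_rate_at_k: float, alpha: float = 0.5) -> float:
    penalty = alpha * pass_rate_at_k * (num_tokens / max_tokens)  # harsher on easy questions
    return (1.0 if correct else 0.0) - penalty

print(adaptive_length_reward(True, num_tokens=2000, max_tokens=4000, pass_rate_at_k=0.9))
# -> 0.775
```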
516 |
517 | [(2025 June) Token-Efficient RL for LLM Reasoning](https://arxiv.org/pdf/2504.20834v4)
518 |
519 | -> Reduces resource usage when training with GRPO and LoRA
520 | - Restricts the tokens that contribute to the loss
521 | - Estimates token-level advantages, and uses replay for resampling
522 |
523 | [(2025 July) RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents](https://arxiv.org/abs/2507.22844)
524 |
525 | - RL for long-horizon reasoning with agents
526 |
527 | [(2025 Aug) Efficient Inference for Large Reasoning Models: A Survey](https://arxiv.org/abs/2503.23077)
528 |
529 | -> Survey
530 |
--------------------------------------------------------------------------------