├── images
│   └── tweet_mark_chen.png
├── self-improvement_self-evolvement.md
└── README.md

/images/tweet_mark_chen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jzhou316/Post-DeepSeek-R1_LLM-RL/HEAD/images/tweet_mark_chen.png
--------------------------------------------------------------------------------

/self-improvement_self-evolvement.md:
--------------------------------------------------------------------------------
1 | ## Recent Work on Self-Improvement of LLMs
2 | 
3 | [(2022 Oct) Large Language Models Can Self-Improve](https://arxiv.org/abs/2210.11610)
4 | 
5 | [(2023 Mar) Self-Refine: Iterative Refinement with Self-Feedback](https://arxiv.org/abs/2303.17651)
6 | 
7 | [(2024 Jan) Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)
8 | 
9 | [(2024 Apr) Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https://arxiv.org/abs/2404.12253)
10 | 
11 | [(2024 Apr) A Survey on Self-Evolution of Large Language Models](https://arxiv.org/abs/2404.14387)
12 | 
13 | - **Survey** paper on LLM self-evolution, conceptualized as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation (a minimal sketch of this loop appears further below in this list)
14 | 
15 | [(2024 June) Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335)
16 | 
17 | [(2024 Sept; ICLR 2025) Training Language Models to Self-Correct via Reinforcement Learning](https://arxiv.org/abs/2409.12917)
18 | 
19 | - Google DeepMind
20 | - OpenReview discussion: https://openreview.net/forum?id=CjwERcAU7w
21 | 
22 | [(2024 Oct) Techniques for Self-Improving LLM Evals](https://arize.com/blog/techniques-for-self-improving-llm-evals/)
23 | 
24 | [(2024 Nov) Self-Consistency Preference Optimization](https://arxiv.org/abs/2411.04109)
25 | 
26 | [(2024 Nov) Self-Evolved Reward Learning for LLMs](https://arxiv.org/abs/2411.00418)
27 | 
28 | - Learning from self-feedback with a reward model (RM)
29 | 
30 | [(2024 Nov) Preference Optimization for Reasoning with Pseudo Feedback](https://arxiv.org/abs/2411.16345)
31 | 
32 | [(2024 Dec) AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement](https://arxiv.org/abs/2412.06176)
33 | 
34 | - Inference-time iteration for code generation
35 | 
36 | [(2024 Dec) Self-Improvement in Language Models: The Sharpening Mechanism](https://arxiv.org/abs/2412.01951)
37 | 
38 | [(2024 Dec) A Survey on LLM Inference-Time Self-Improvement](https://arxiv.org/pdf/2412.14352)
39 | 
40 | - **Survey** paper for inference-time self-improvement
41 | - Collection of prior papers: https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement
42 | 
43 | [(2024 Dec) Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models](https://arxiv.org/abs/2412.02674)
44 | 
45 | [(2025 Jan) rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)
46 | 
47 | - Specific to math reasoning; iteratively uses Monte Carlo tree search and process reward models
48 | 
49 | [(2025 Feb) Self-rewarding correction for mathematical reasoning](https://arxiv.org/abs/2502.19613)
50 | 
51 | - Mathematical reasoning
52 | 
53 | [(2025 Feb) Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data](https://arxiv.org/abs/2502.05400)
54 | 
55 | Haoyan Yang, et al.
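
Going back to the self-evolution survey above: its four-phase cycle can be written as a simple control loop. Below is a minimal sketch in Python; the phase functions (`acquire`, `refine`, `update`, `evaluate`) are hypothetical placeholders to be supplied by a concrete system, not an API from any of the papers listed here.

```python
def self_evolution_loop(model, tasks, acquire, refine, update, evaluate, n_iter=3):
    """One iteration = the four phases named in the survey above."""
    for it in range(n_iter):
        experiences = acquire(model, tasks)     # 1. experience acquisition (model generates data)
        refined = refine(model, experiences)    # 2. experience refinement (filtering, scoring, editing)
        model = update(model, refined)          # 3. updating (SFT / RL / preference optimization)
        score = evaluate(model, tasks)          # 4. evaluation (decide whether to keep iterating)
        print(f"iteration {it}: kept {len(refined)} examples, eval score {score:.3f}")
    return model
```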
56 | 
57 | [(2025 Mar) Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models](https://arxiv.org/abs/2503.04813)
58 | 
59 | [(2025 Apr - ICLR 2025 Workshop) Scaling Self-Improving Foundation Models](https://sites.google.com/berkeley.edu/selfimprovingfoundationmodels/home)
60 | 
61 | - ICLR 2025 **Workshop**
62 | - Accepted papers around the topic [here](https://sites.google.com/berkeley.edu/selfimprovingfoundationmodels/accepted-papers)
63 | 
64 | [(2025 Apr) A Self-Improving Coding Agent](https://arxiv.org/abs/2504.15228)
65 | 
66 | [(2025 Apr) Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations](https://ai.meta.com/research/publications/collaborative-reasoner-self-improving-social-agents-with-synthetic-conversations/)
67 | 
68 | [(2025 May ICML Position) Position: Truly Self-Improving Agents Require Intrinsic Metacognitive Learning](https://openreview.net/forum?id=4KhDd0Ozqe)
69 | 
70 | [(2025 May) Can Large Reasoning Models Self-Train?](https://arxiv.org/abs/2505.21444)
71 | 
72 | - Majority voting to obtain rewards, in place of an external verifier (a minimal sketch of this idea appears further below in this list)
73 | - With prolonged training, learning collapses to consistent but shortcut answers
74 | 
75 | [(2025 May) Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO](https://arxiv.org/abs/2505.22453)
76 | 
77 | - Very similar to the above with majority-voting rewards, but applied to multimodal models
78 | 
79 | [(2025 May) UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents](https://arxiv.org/abs/2505.21496)
80 | 
81 | - Autonomous Agents
82 | 
83 | [(2025 May) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution](https://arxiv.org/abs/2505.20286)
84 | 
85 | - Autonomous Agents
86 | 
87 | [(2025 May) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents](https://arxiv.org/abs/2505.22954)
88 | 
89 | - Autonomous Agents
90 | 
91 | [(2025 May) DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning](https://arxiv.org/abs/2505.15734)
92 | 
93 | - Using multi-agent debate to create reasoning traces for training reasoning models
94 | 
95 | [(2025 May) AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)
96 | 
97 | - From Google: an evolutionary coding agent powered by large language models for general-purpose algorithm discovery and optimization.
98 | - [The Rise of Self-Evolving AI: Google’s AlphaEvolve is Just the Beginning](https://www.linkedin.com/pulse/rise-self-evolving-ai-googles-alphaevolve-just-beginning-reddy-oqojc)
99 | - [Meet SELF-DISCOVER: Google DeepMind’s New Method for LLM Reasoning](https://jrodthoughts.medium.com/meet-self-discover-google-deepminds-new-method-for-llm-reasoning-4f3fdc547926)
100 | 
101 | 
102 | [(2025) Language Modeling by Language Models](https://arxiv.org/abs/2506.20249)
103 | 
104 | - Using LLM agents to discover **LM architectures**
105 | - Automated pipeline for self-improvement on the architecture exploration problem
106 | 
107 | [(2025 Sept) Autonomous Code Evolution Meets NP-Completeness](https://arxiv.org/abs/2509.07367v1)
108 | 
109 | - Follow-up to AlphaEvolve. From NVIDIA.
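
Referring back to "Can Large Reasoning Models Self-Train?" and its multimodal follow-up above: the majority-voting self-reward can be summarized in a few lines. This is a hedged sketch only; the papers' actual answer extraction and reward shaping differ in their details.

```python
from collections import Counter

def majority_vote_rewards(final_answers):
    """Self-generated rewards without an external verifier: each sampled answer
    gets reward 1.0 if it matches the group's majority answer, else 0.0.
    `final_answers` are already-extracted answers (e.g. boxed strings) from
    several samples for the same question."""
    majority, _ = Counter(final_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in final_answers]

# Example: 8 samples for one question; "42" becomes the pseudo-label.
print(majority_vote_rewards(["42", "42", "41", "42", "7", "42", "42", "41"]))
# -> [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```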
110 | 
111 | 
112 | 
113 | [(2025 May) Latent Principle Discovery for Language Model Self-Improvement](https://arxiv.org/abs/2505.16927)
114 | 
115 | [(2025 May) Self-Evolving Curriculum for LLM Reasoning](https://arxiv.org/abs/2505.14970)
116 | 
117 | - Curriculum learning approach with RL for training data selection
118 | 
119 | [(2025 June) Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning](https://arxiv.org/pdf/2506.08745)
120 | 
121 | - Measures how often intermediate reasoning steps lead to the same final answer, as a "consistency" metric summarizing the reasoning trajectory
122 | - Also measures how often the final answer changes abruptly at later reasoning steps, as a "volatility" metric
123 | - Observes clear separation of these two metrics between trajectories leading to correct vs. incorrect final answers
124 | - Includes these trajectory statistics in the reward, plus a "curiosity" reward that encourages diversity -> no external reward is needed, as during training the reward depends only on the sampled trajectories and their final answers
125 | 
126 | 
127 | [(2025 June) Self-Adapting Language Models](https://arxiv.org/pdf/2506.10943)
128 | 
129 | - _Joe: Something I've been thinking about doing. Very excited about this direction._
130 | 
131 | [(2025 Aug) Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement](https://arxiv.org/abs/2508.00410)
132 | 
133 | - Paraphrases questions, and votes for consistent answers across paraphrases to obtain rewards
134 | - _Joe: similar to consistency-based self-rewarding but with a tweak_
135 | 
136 | [(2025 Sept) Self-Evolving LLMs via Continual Instruction Tuning](https://www.arxiv.org/abs/2509.18133)
137 | 
138 | 
139 | ### Zero
140 | 
141 | _Joe: very interested in this._
142 | 
143 | [(2025 May) Absolute Zero: Reinforced Self-play Reasoning with Zero Data](https://arxiv.org/abs/2505.03335)
144 | 
145 | [(2025 Aug) R-Zero: Self-Evolving Reasoning LLM from Zero Data](https://arxiv.org/abs/2508.05004)
146 | 
147 | 
148 | 
149 | ### Self-Exploration
150 | 
151 | Like human beings, who are capable of exploring their environments and gaining new knowledge in unfamiliar situations, future models/agents will need the same capability: interactively exploring and exploiting the environment for continual learning.
152 | 
153 | [(2024) NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild](https://www.nnetnav.dev/)
154 | 
155 | 
156 | ### Systems and Code
157 | 
158 | [(2023) metamorph](https://github.com/victorb/metamorph/)
159 | 
160 | - An early attempt to use GPT to improve code
161 | - "An experiment in letting GPT-4 edit a program by itself. What program? The program that lets GPT-4 edit itself."
162 | 
163 | [(2025 May) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents](https://github.com/jennyzzt/dgm)
164 | 
165 | - Coding agent for the paper
166 | 
167 | [OpenEvolve](https://github.com/codelion/openevolve)
168 | - An evolutionary coding agent
169 | - Based on the AlphaEvolve research
170 | - Serves as both a research platform for evolutionary AI and a practical tool for automated code optimization.
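
The evolutionary coding agents collected above (AlphaEvolve, the Darwin Gödel Machine, OpenEvolve) share a propose-evaluate-select loop. Below is a heavily simplified sketch of that loop, not any project's actual API; `llm_propose_patch` and `run_tests` are hypothetical callables the user would supply.

```python
import random

def evolve_program(seed_program, llm_propose_patch, run_tests,
                   population_size=8, generations=20):
    """Toy evolutionary loop: keep a small population of candidate programs,
    let an LLM mutate promising candidates, and keep the fittest."""
    population = [(run_tests(seed_program), seed_program)]
    for _ in range(generations):
        # Tournament selection: sample a few candidates, pick the best-scoring one.
        _, parent = max(random.sample(population, k=min(3, len(population))))
        child = llm_propose_patch(parent)               # LLM proposes a modified program
        population.append((run_tests(child), child))    # evaluate the child
        population = sorted(population, reverse=True)[:population_size]  # selection
    return population[0]  # (best_score, best_program)
```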
171 | 172 | ### Collection 173 | 174 | [(Paper Collection) Awesome Label-Free Reinforcement Learning with Verifiable Rewards](https://github.com/QingyangZhang/Label-Free-RLVR/) 175 | 176 | - RLVR 177 | 178 | [(2025 Aug) A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence](https://arxiv.org/abs/2507.21046) 179 | 180 | [Stanford CS329A Self-Improving AI Agents](https://cs329a.stanford.edu/) 181 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Post-DeepSeek-R1 2 | Resources and research after DeepSeek-R1, around test-time computing, resurgence of RL, and new LLM learning/application paradigms. 3 | 4 | 5 | > This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes. 6 | 7 | -- From [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) 8 | 9 |
10 | ![Tweet from Mark Chen on DeepSeek-R1](images/tweet_mark_chen.png)
11 | 
12 | 13 | -- From [Mark Chen](https://x.com/markchen90/status/1884303237186216272), OpenAI Chief Research Officer 14 | 15 | --- 16 | 17 | ### Table of Contents 18 | 19 | - [DeepSeek-R1 Reproduction](#deepseek-r1-reproduction-popular-and-fast-ones) 20 | - [R1-like RL Reproduction for More Scenarios](#r1-like-rl-reproduction-for-more-scenarios) 21 | - [Tools](#tools) 22 | - [LLM + RL with/for X](#llm--rl-withfor-x) 23 | - [Literature](#literature) 24 | - [Test-time Scaling](#test-time-scaling) 25 | - [Process Reward](#process-reward-after-o1) 26 | - [Multimodal, Image Generation](#multimodal-image-generation) 27 | - [RL for Different Ways of Generation](#rl-for-different-ways-of-generation) 28 | - [Improve Long CoT for Reasoning](#improve-long-cot-for-reasoning) 29 | - [Understanding R1 and RL + LLMs, Tricks to Train RL](#understanding-r1-and-rl--llms-tricks-to-train-rl) 30 | - [Efficiency](#efficiency) 31 | 32 | --- 33 | 34 | ## DeepSeek-R1 Reproduction ("popular" and fast ones) 35 | 36 | 37 | #### [Simple Reinforcement Learning for Reasoning](https://github.com/hkust-nlp/simpleRL-reason?tab=readme-ov-file#simple-reinforcement-learning-for-reasoning) (HKUST) 38 | 39 | 40 | - Rule-based reward (no MCTS and reward models) 41 | - Uses PPO rather than GRPO 42 | - Trains small models (7B) on limited data (8K examples) 43 | - Starting from Qwen2.5-Math-7B (base model), performs RL on it directly, achieving surprisingly strong results 44 | 45 |
46 | simplelr-reaoning-intro-figure_00 47 |
48 | 
49 | > Training dynamics of our Qwen2.5-SimpleRL-Zero training starting from the Qwen2.5-Math-7B, without SFT or reward models.
50 | 
51 | 
52 | #### [DeepScaleR](https://github.com/agentica-project/deepscaler/tree/main?tab=readme-ov-file#deepscaler) (Berkeley)
53 | 
54 | - Aimed to democratize reinforcement learning (RL) for LLMs and reproduce DeepSeek R1 and OpenAI O1/O3 at scale
55 | - Iteratively scaling DeepSeek's GRPO algorithm from 8K→16K→24K context length for thinking
56 | - Trained on top of [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) (_Joe: so the initial model is already capable of deep thinking; better if we could do this from base models_)
57 | - Heavily based on a modified fork of [veRL](https://github.com/volcengine/verl), an open-source RLHF library
58 | - Good insight and training recipe: error cases are initially the longer CoTs, so the context length for thinking is gradually extended during training (_Joe: a sort of curriculum learning for RL_)
59 | 
60 | ![](https://github.com/agentica-project/deepscaler/blob/main/figures/deepscaler.png)
61 | 
62 | *Figure 1: DeepScaleR 1.5B model's Pass@1 accuracy on AIME2024 as RL training progresses. At step 1040 and 1520, the context length is extended to 16K and 24K. For more details, see our [blog post](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2).*
63 | 
64 | 
65 | #### [Open R1](https://github.com/huggingface/open-r1?tab=readme-ov-file#open-r1) (Hugging Face)
66 | 
67 | - Fully open reproduction of DeepSeek-R1
68 | - [Blog post](https://huggingface.co/blog/open-r1)
69 | 
70 | 
71 | #### [TinyZero](https://github.com/Jiayi-Pan/TinyZero)
72 | 
73 | - A reproduction of DeepSeek-R1-Zero on countdown and multiplication tasks
74 | - Through RL, the 3B base LM develops self-verification and search abilities all on its own
75 | - Fails to learn reasoning with the Qwen2.5-0.5B base
76 | - Works with the [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) model
77 | - Experiments run based on [veRL](https://github.com/volcengine/verl)
78 | 
79 | #### [Mini-R1](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/mini-deepseek-r1-aha-grpo.ipynb)
80 | 
81 | - A minimal single notebook that tries to reproduce the DeepSeek-R1 "reasoning" results on a single task (the Countdown Game)
82 | - Uses GRPO and QLoRA, also with the [TRL](https://huggingface.co/docs/trl/en/index) library
83 | - Starts with the [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) model (models > 1.5B suggested) (_Joe: Yes, we need the starting model to already have certain capabilities_)
84 | - Good learning material with code
85 | 
86 | 
87 | #### [Oat-Zero](https://github.com/sail-sg/oat-zero?tab=readme-ov-file#there-may-not-be-aha-moment-in-r1-zero-like-training--a-pilot-study)
88 | 
89 | There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study
90 | 
91 | - The aha moment (such as self-reflection patterns) may already exist in the base model.
92 | - There is Superficial Self-Reflection (SSR) in base models' responses, where self-reflection does not necessarily lead to correct final answers.
93 | - A closer look at R1-Zero-like training via RL finds that the increasing response length is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions.
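
The last bullet above points at the rule-based rewards that most of these reproductions optimize. A minimal, hedged sketch of such a reward function; the `<think>`/`<answer>` tags and the 0.1/1.0 values are illustrative, and each repo has its own exact rules.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy R1-Zero-style reward: a small bonus for following the
    <think>...</think><answer>...</answer> format, plus a larger bonus
    when the extracted answer exactly matches the reference."""
    reward = 0.0
    if re.fullmatch(r"(?s)\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", response):
        reward += 0.1                                   # format reward
    match = re.search(r"(?s)<answer>(.*?)</answer>", response)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0                                   # correctness reward
    return reward

print(rule_based_reward("<think>3 * 4 = 12</think><answer>12</answer>", "12"))  # 1.1
```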
94 | 
95 | #### [Open Reasoner Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero?tab=readme-ov-file#open-reasoner-zero)
96 | 
97 | - An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
98 | - Uses PPO (instead of GRPO; some [discussions](https://x.com/rosstaylor90/status/1892664646890312125))
99 | 
100 | 
101 | #### [Colab Reproductions with Unsloth](https://unsloth.ai/blog/r1-reasoning)
102 | 
103 | - One GPU with GRPO (worth trying when resource-constrained)
104 | - Experience the "aha moment" for [free on Colab](https://x.com/danielhanchen/status/1887564724071768529) (seems easy to play with)
105 | 
106 | 
107 | ### Online Materials, Discussions
108 | 
109 | - Video tutorials from Sasha Rush on [o1-like test-time scaling](https://github.com/srush/awesome-o1) and [DeepSeek](https://www.youtube.com/watch?v=KtBcIDtS13M)
110 | - [Some takeaways from the R1, DeepSeek-V3 and GRPO papers](https://x.com/Dan_Jeffries1/status/1881679981849215080) (twitter)
111 | 
112 | 
113 | ### Other RL Trained Models
114 | 
115 | [(2025 Mar) QwQ-32B: Embracing the Power of Reinforcement Learning](https://qwenlm.github.io/blog/qwq-32b/)
116 | 
117 | 
118 | 
119 | ## R1-like RL Reproduction for More Scenarios
120 | 
121 | ### Tools
122 | 
123 | - RL libraries:
124 |   - [veRL](https://github.com/volcengine/verl) (seems most popular as of Mar 2025). Check this [list](https://github.com/volcengine/verl?tab=readme-ov-file#awesome-work-using-verl) of R1 follow-up works
125 |   - [TRL](https://huggingface.co/docs/trl/en/index)
126 | - Inference: [vLLM](https://github.com/vllm-project/vllm) seems a must to speed up inference
127 | - Starting models: [Qwen2.5](https://github.com/QwenLM/Qwen2.5) (base, instruct, R1-distilled, math) seems most popular (as of Mar 2025) (why? some [empirical answers](https://arxiv.org/abs/2503.01307)); both 3B and 7B models have been made to work, while 0.5B is a bit weaker but can also learn
128 | - RL algorithms: [GRPO](https://arxiv.org/abs/2402.03300), [PPO](https://arxiv.org/pdf/1707.06347) (some dispute on whether GRPO is a must, [here](https://github.com/ZihanWang314/ragen?tab=readme-ov-file#-ragen-training-agents-by-reinforcing-reasoning-) and [here](https://x.com/finbarrtimbers/status/1899118175830397322)). A minimal sketch of GRPO's group-relative advantage appears near the top of the Literature section below
129 |   - some tutorials [here](https://anukriti-ranjan.medium.com/preference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090#:~:text=GRPO%2C%20from%20DeepSeek%20AI%2C%20is,making%20it%20lighter%20and%20faster.) and [here](https://huggingface.co/blog/NormalUhr/grpo)
130 | - GPU resources: see the other reproductions, and discussions, e.g.
[here](https://github.com/huggingface/open-r1/issues/100)
131 |   - One GPU with GRPO on [Colab](https://unsloth.ai/blog/r1-reasoning)
132 | 
133 | ### LLM + RL with/for X
134 | 
135 | 
136 | #### [RAGEN: Training Agents by Reinforcing Reasoning](https://github.com/ZihanWang314/ragen?tab=readme-ov-file#-ragen-training-agents-by-reinforcing-reasoning-)
137 | RL + LLM applied to **agents**
138 | - Uses PPO instead of GRPO
139 | 
140 | #### [Logic-RL](https://github.com/Unakar/Logic-RL?tab=readme-ov-file#logic-rl)
141 | RL + LLM applied to **synthetic logic puzzles** with controllable complexity and straightforward answer verification
142 | 
143 | #### [Teaching Language Models to Critique via Reinforcement Learning](https://github.com/HKUNLP/critic-rl?tab=readme-ov-file#-teaching-language-models-to-critique-via-reinforcement-learning-)
144 | RL + LLM applied to **coding**
145 | - Trained with GRPO using verifiable rewards from sandbox execution
146 | 
147 | #### [Code-R1: Reproducing R1 for Code with Reliable Rewards](https://github.com/ganler/code-r1?tab=readme-ov-file#code-r1-reproducing-r1-for-code-with-reliable-rewards)
148 | RL + LLM applied to **coding**
149 | 
150 | #### [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework](https://github.com/hiyouga/EasyR1)
151 | RL + LLM applied to **multimodality** (such as VLMs)
152 | 
153 | #### [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning](https://arxiv.org/abs/2503.05379)
154 | RL + LLM applied to **multimodality**
155 | 
156 | - For the specific task of emotion recognition, with visual and audio signals (videos)
157 | - Learning with a 0.5B model
158 | 
159 | #### [Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?](https://arxiv.org/abs/2505.09439)
160 | RL + LLM applied to **multimodality**
161 | 
162 | - Audio LLM, fine-tuned with GRPO
163 | 
164 | #### [SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning](https://arxiv.org/abs/2504.20024)
165 | RL + LLM applied to **multimodality**
166 | 
167 | - Spatial reasoning with 3D-augmented input parsed from images
168 | 
169 | #### [Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning](https://github.com/PeterGriffinJin/Search-R1?tab=readme-ov-file#search-r1-train-your-llms-to-reason-and-call-a-search-engine-with-reinforcement-learning)
170 | 
171 | RL + LLM applied to **retrieval** (interleaved with generation/reasoning)
172 | - Tested on the NQ dataset, retrieving from Wikipedia
173 | 
174 | #### [ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning](https://github.com/Agent-RL/ReSearch)
175 | RL + LLM applied to **retrieval** (RAG)
176 | - Trained with HotpotQA data
177 | 
178 | #### [DeepRetrieval - Hacking Search Engines & Retrievers with LLM + RL](https://github.com/pat-jj/DeepRetrieval)
179 | RL + LLM applied to **retrieval**
180 | - Tested on literature mining, publication search, and trial search tasks
181 | 
182 | 
183 | ---
184 | 
185 | ## Literature
186 | 
187 | Here is a collection of papers on different topics and with different flavors. It is not (and cannot be) exhaustive, but the papers are grouped by theme to give some sense of the different types of research and problems in the space.
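
Since GRPO is the workhorse algorithm in most of the reproductions and papers collected here, a minimal sketch of its central idea, the group-relative advantage, is included below for reference. This is a simplification: the full algorithm from the DeepSeekMath paper also includes the clipped policy-ratio objective and a KL penalty, which are omitted here.

```python
import statistics

def group_relative_advantages(group_rewards):
    """GRPO's core trick: instead of a learned value baseline (as in PPO),
    each sampled response is scored relative to the other responses sampled
    for the same prompt, by normalizing within the group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1e-6   # guard against a zero-variance group
    return [(r - mean) / std for r in group_rewards]

# Example: 4 sampled responses for one prompt, rewarded 1 if correct else 0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]
```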
188 | 
189 | _Joe: I marked the year and month for papers, given the extremely fast pace of research in this exploding domain_
190 | 
191 | 
192 | ### Test-time Scaling
193 | 
194 | [(2024 Aug) Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](https://arxiv.org/abs/2408.03314)
195 | 
196 | -> Test-time scaling for math
197 | 
198 | - Includes search strategies such as Best-of-N, beam search, and beam search with lookahead
199 | - Involves a process reward model (PRM) and revision models
200 | 
201 | (2024 Nov) Deliberative Alignment: Reasoning Enables Safer Language Models
202 | 
203 | -> Test-time scaling for safety
204 | 
205 | [(2025 Jan) s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393)
206 | 
207 | -> Test-time scaling for reasoning
208 | 
209 | - Collected 1K datapoints from diverse datasets along with their reasoning traces (from the Google Gemini Flash Thinking API), followed by a pipeline of quality control and filtering
210 | - Fine-tunes Qwen2.5-32B-Instruct on the 1K datapoints, with training taking just 26 minutes on 16 NVIDIA H100 GPUs
211 | - Controls test-time compute in the sequential generation setting (as opposed to parallel approaches like search or best-of-N), steering the reasoning length by inserting the tokens "Final Answer:" and "Wait"
212 | 
213 | [(2025 Feb) S∗: Test Time Scaling for Code Generation](https://arxiv.org/pdf/2502.14382)
214 | 
215 | -> Test-time scaling for coding
216 | 
217 | [(2025 Feb) Teaching Language Models to Critique via Reinforcement Learning](https://arxiv.org/abs/2502.03492)
218 | 
219 | 
220 | -> Test-time scaling for coding
221 | 
222 | > [!note]
223 | > _Joe: If we think about the test-time computing promoted by OpenAI o1, DeepMind's [AlphaCode](https://deepmind.google/discover/blog/competitive-programming-with-alphacode/) already used test-time scaling in 2022, doing a lot of sampling and selection to boost performance on competitive coding._
224 | 
225 | [(2025 Feb) Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers](https://arxiv.org/abs/2502.20379)
226 | 
227 | -> Test-time scaling with multiple agents (LLMs) for verification
228 | 
229 | [(2025 Mar) Chain-of-Retrieval Augmented Generation](https://arxiv.org/abs/2501.14342)
230 | 
231 | -> Test-time scaling for RAG
232 | 
233 | - Designs ways to scale up inference computation for RAG, such as decomposing the question into modular sub-questions and retrieving iteratively
234 | - _Joe: this is a recurring theme of current research on test-time scaling for X: design ways to increase inference computation, whether it be long CoT, search, verification, etc._
235 | 
236 | [(2025 Mar) Remasking Discrete Diffusion Models with Inference-Time Scaling](https://arxiv.org/abs/2503.00307)
237 | 
238 | -> Test-time scaling for discrete text diffusion models
239 | 
240 | 
241 | #### Scaling Laws (all kinds of)
242 | 
243 | <details><summary>
Scaling Laws</summary>
244 | 
245 | [(2024 Feb) Scaling Laws for Downstream Task Performance in Machine Translation](https://arxiv.org/abs/2402.04177)
246 | 
247 | -> Scaling behavior in a transfer learning setting
248 | 
249 | [(2025 Feb) Distillation Scaling Laws](https://arxiv.org/abs/2502.08606)
250 | 
251 | -> Scaling behavior for knowledge distillation
252 | 
253 | [(2025 Feb) Distributional Scaling Laws for Emergent Capabilities](https://arxiv.org/abs/2502.17356)
254 | 
255 | -> Emerging capabilities across multiple training runs with different random seeds
256 | 
257 | - Training experiments with Qwen2.5-0.5B and Qwen2.5-1.5B
258 | 
259 | </details>
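
For orientation, the parametric form that most scaling-law work starts from is the Chinchilla-style fit below; the entries above then adapt this kind of fit to downstream translation quality, distillation, or distributions over training runs. The symbols are the standard ones from Hoffmann et al. (2022), not taken from the papers listed here.

```latex
% Chinchilla-style parametric scaling law: irreducible loss E plus
% power-law terms in model size N (parameters) and data size D (tokens).
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```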
260 | 
261 | ### Process Reward (after o1)
262 | 
263 | [(2025 Feb) Process Reinforcement through Implicit Rewards](https://arxiv.org/abs/2502.01456)
264 | 
265 | 
266 | ### Multimodal, Image Generation
267 | 
268 | [(2025 Jan) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step](https://arxiv.org/abs/2501.13926)
269 | 
270 | [(2025 Mar) ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning](https://arxiv.org/abs/2503.19312)
271 | 
272 | [(2025 Apr) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning](https://arxiv.org/abs/2504.20024)
273 | 
274 | - Spatial reasoning from vision inputs, augmented with parsed 3D structures
275 | 
276 | 
277 | [(2025 May) Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?](https://arxiv.org/abs/2505.09439)
278 | 
279 | 
280 | 
281 | ### RL for Different Ways of Generation
282 | 
283 | [(2025 Feb) Self-rewarding correction for mathematical reasoning](https://arxiv.org/pdf/2502.19613)
284 | 
285 | 
286 | -> Self-corrections during generation, trained with RL
287 | 
288 | [(2025 Mar) Reinforcement Learning for Long-Horizon Interactive LLM Agents](https://arxiv.org/pdf/2502.01600)
289 | 
290 | -> RL (LOOP, a data- and memory-efficient variant of proximal policy optimization) for long-horizon interactive **agents** ([AppWorld](https://appworld.dev/))
291 | 
292 | 
293 | 
294 | ### Improve Long CoT for Reasoning
295 | 
296 | [(2025 Mar) START: Self-taught Reasoner with Tools](https://arxiv.org/abs/2503.04625)
297 | 
298 | -> Integrates tool usage with reasoning, with controlled hint insertion and rejection sampling for training
299 | - Tool usage (writing Python code) inside reasoning
300 | - Enhances tool usage by injecting hint sequences into the CoT during training, such as "Wait" and "Maybe I can use Python", at various places based on heuristics
301 | - Interleaves Python code + executor with reasoning
302 | - Rejection sampling fine-tuning (RFT)
303 | - _Joe: this uses rejection sampling (you can call it RL, from the [Llama2 paper](https://arxiv.org/abs/2307.09288)). And the paper was not well polished (e.g.
from small things like in-text citation formats, etc.)_
304 | 
305 | [(2025 Feb) LIMO: Less is More for Reasoning](https://arxiv.org/abs/2502.03387)
306 | 
307 | - 817 curated training samples
308 | - Fine-tunes Qwen2.5-32B-Instruct with SFT
309 | 
310 | [(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free](https://arxiv.org/abs/2504.07986)
311 | 
312 | - Categorizes the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
313 | - Analysis shows that wrong reasoning often results in much longer generations, with more reflecting and transition steps
314 | - Extracts hidden states corresponding to the different behavioral steps, and constructs steering vectors to control the type of reasoning steps
315 | - Achieves more effective and efficient reasoning with inference-time steering
316 | 
317 | ### Understanding R1 and RL + LLMs, Tricks to Train RL
318 | 
319 | [(2025 Jan) Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://arxiv.org/abs/2501.11651)
320 | 
321 | -> Tricks to scale up RL training and make it work
322 | 
323 | - Encourage sample diversity through oversampling
324 | - Auxiliary loss on entropy
325 | - Penalize undesired behaviors (a toy sketch of an objective combining these tricks appears after the "First Few Tokens" entry below)
326 | 
327 | [(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)
328 | 
329 | -> Analyzes the learning dynamics of emergent reasoning with LLM + RL, across different factors such as SFT initialization, length reward design, etc.
330 | 
331 | [(2025 Mar) Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs](https://arxiv.org/abs/2503.01307)
332 | 
333 | -> Analyzes the behaviors of emergent reasoning from LLM + RL, across base models and training data
334 | 
335 | - Why does Qwen work better than Llama? Qwen already exhibits certain reasoning behaviors before training
336 | - Priming Llama to begin RL training with data containing complex reasoning behaviors helps, even when the final answer is not correct
337 | - _Joe: somehow I don't really get the name of cognitive behaviors (and the whole title); maybe I'm naive_
338 | 
339 | [(2025 Mar) Understanding R1-Zero-Like Training: A Critical Perspective](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf)
340 | 
341 | -> Analyzes base models and RL
342 | 
343 | [(2025, Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models](https://arxiv.org/abs/2503.02875)
344 | 
345 | -> Analyzes the role of prefixes of reasoning trajectories; could also work for self-improvement
346 | 
347 | - Finds low diversity in the first few generated tokens (which makes sense, as the sequence is short and the number of possible trajectories grows exponentially with length)
348 | - Only samples a short prefix and fine-tunes the model on it, without using labels.
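
A toy illustration of two tricks from the "Advancing Language Model Reasoning..." entry above, the auxiliary entropy bonus and the penalty on undesired behaviors, written as a REINFORCE-style objective. This is a generic sketch with my own naming and arbitrary coefficients, not the paper's actual loss.

```python
import torch

def policy_loss_with_tricks(logprobs, advantages, entropy, bad_flags,
                            entropy_coef=0.01, penalty_coef=0.5):
    """REINFORCE-style loss over sampled responses, plus an entropy bonus to
    keep sampling diverse and a penalty on responses flagged as undesired
    (e.g. language mixing or endless repetition)."""
    pg_loss = -(logprobs * advantages).mean()        # policy-gradient term
    entropy_bonus = entropy_coef * entropy.mean()    # auxiliary loss on entropy
    behavior_penalty = penalty_coef * bad_flags.mean()
    return pg_loss - entropy_bonus + behavior_penalty

# Tiny example with made-up per-sample statistics for three responses:
loss = policy_loss_with_tricks(
    logprobs=torch.tensor([-1.2, -0.8, -2.0]),
    advantages=torch.tensor([1.0, -1.0, 0.5]),
    entropy=torch.tensor([2.1, 1.9, 2.4]),
    bad_flags=torch.tensor([0.0, 1.0, 0.0]),
)
print(loss)  # a scalar tensor
```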
349 | 
350 | 
351 | [(2025 June) Thought Anchors: Which LLM Reasoning Steps Matter?](https://arxiv.org/abs/2506.19143)
352 | -> Analysis of reasoning sentences
353 | 
354 | - Breaks down the reasoning chain into single sentences, and checks their causal relations and importance to other sentences and to the answer
355 | - Summarizes a sentence taxonomy for reasoning sentences (Table 1 in Appendix A)
356 | - And visualizes them, with a good demo page: https://www.thought-anchors.com/
357 | 
358 | [(2025 June) Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning](https://arxiv.org/pdf/2506.02867)
359 | 
360 | - Mutual information is computed between the hidden states (continuous vectors) at token step t and the ground-truth answer
361 | - Mutual information (MI) is not computed by estimating distributions in the Shannon-entropy sense, but estimated with the Hilbert–Schmidt Independence Criterion (HSIC) with Gaussian kernels
362 | - MI is computed by first sampling the hidden state vectors, and then computing HSIC between the two matrices (collections of the continuous vectors)
363 | - Token step t vectors are then mapped to tokens (e.g. by projection onto the vocabulary) for concrete analysis
364 | 
365 | [(2025 Feb) Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology](https://arxiv.org/abs/2502.17026)
366 | 
367 | -> Not necessarily about long CoT, but builds a topological graph to explain reasoning patterns.
368 | 
369 | - The structured representation of reasoning could be applied elsewhere, e.g. to super long reasoning processes.
370 | 
371 | [(2025 Sept) Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic](https://arxiv.org/abs/2509.01363)
372 | -> Steering/task vectors for reasoning
373 | 
374 | - Two identically initialized models, one going through SFT and one through GRPO
375 | - Extracts task vectors as the difference between their parameters, to control reasoning behaviors
376 | 
377 | 
378 | [(2025 Oct) First Try Matters: Revisiting the Role of Reflection in Reasoning Models](https://arxiv.org/abs/2510.08308)
379 | -> Challenges the notion that reflection in model reasoning actually does "reflection"
380 | 
381 | - Focused on the reflective behaviors of model reasoning
382 | - Finds that most reflective behaviors do not actually alter the model's reasoning, but merely confirm it
383 | - Fine-tuning on more reflective behaviors mostly enhances first-answer correctness
384 | 
385 | [(2025 Sept) RL's Razor: Why Online Reinforcement Learning Forgets Less](https://arxiv.org/abs/2509.04259)
386 | 
387 | - RL training incurs less forgetting than SFT
388 | 
389 | [(2025 Apr) Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?](https://arxiv.org/abs/2504.13837)
390 | 
391 | - Challenges whether RL incentivizes genuinely new capabilities in the model vs.
just capitalizing on existing capabilities
392 | 
393 | [(2025 Nov) Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs](https://arxiv.org/abs/2511.05933)
394 | 
395 | - Similar research topic to the above
396 | 
397 | 
398 | #### Data
399 | 
400 | Thought Anchors: https://www.thought-anchors.com/
401 | 
402 | Open Thoughts: https://github.com/open-thoughts/open-thoughts
403 | 
404 | #### Training Recipe
405 | 
406 | [(2025 Nov) JustRL: Scaling a 1.5B LLM with a Simple RL Recipe](https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8)
407 | 
408 | - Simple RL training recipe for scaling up training of a 1.5B LLM
409 | 
410 | 
411 | #### RL Algorithms
412 | 
413 | [(2025, Jun) TreeRPO: Tree Relative Policy Optimization](https://arxiv.org/abs/2506.05183)
414 | 
415 | - Samples to generate a tree-structured trajectory, and collects rewards for every node
416 | - Improves sampling efficiency for more efficient training
417 | 
418 | [(2025 June) Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning](https://arxiv.org/pdf/2506.08745)
419 | 
420 | - Measures how often intermediate reasoning steps lead to the same final answer, as a "consistency" metric summarizing the reasoning trajectory
421 | - Also measures how often the final answer changes abruptly at later reasoning steps, as a "volatility" metric
422 | - Observes clear separation of these two metrics between trajectories leading to correct vs. incorrect final answers
423 | - Includes these trajectory statistics in the reward, plus a "curiosity" reward that encourages diversity; also borrows the grouping idea from GRPO -> no external reward is needed, as during training the reward depends only on the sampled trajectories and their final answers
424 | 
425 | [(2025, July) STeCa: Step-level Trajectory Calibration for LLM Agent Learning](https://aclanthology.org/2025.findings-acl.604/)
426 | 
427 | ### Efficiency
428 | 
429 | [(2024 Dec) Compressed Chain of Thought: Efficient Reasoning through Dense Representations](https://arxiv.org/abs/2412.13171)
430 | 
431 | - Reasoning with continuous tokens
432 | 
433 | [(2025 Feb) TokenSkip: Controllable Chain-of-Thought Compression in LLMs](https://arxiv.org/abs/2502.12067)
434 | 
435 | - Filters out some "unimportant" CoT tokens based on heuristics to generate compressed CoT, and then fine-tunes on the reduced trajectories
436 | - _Joe: similar flavor to context compression and token deletion, like LLMLingua_
437 | 
438 | [(2025 Mar) Chain of Draft: Thinking Faster by Writing Less](https://arxiv.org/abs/2502.18600)
439 | 
440 | -> _Joe: this is not using RL, but just a simple way of prompting that limits the reasoning step lengths with instructions in the prompt. I think we could similarly train an LLM with RL to enforce this, and/or use it as a reward, to improve efficiency during the reasoning process_
441 | 
442 | -> _Joe: (a few days later) found out the following paper does exactly that lol_
443 | 
444 | [(2025 Mar) L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning](https://arxiv.org/abs/2503.04697)
445 | 
446 | 
447 | -> _Joe: LLM + RL to encourage shorter reasoning steps.
The way is to condition on special symbols in the prompt that control the number of reasoning steps, which poses another reward term_
448 | 
449 | - Training starts from the base model [DeepScaleR-1.5B-Preview](#deepscaler-berkeley) (using the same hyperparameters for GRPO)
450 | - Training data also from the DeepScaleR-Preview-Dataset: 40K question-answer pairs drawn from AIME, AMC, Omni-Math and STILL
451 | - Training context length restricted to 4K, and testing restricted to 8K
452 | - Fine-tuned for 700 steps, and a further 120 steps for two different length reward formulations
453 | - Again uses the [veRL](#tools) framework
454 | 
455 | [(2025 Feb) Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)
456 | 
457 | -> _Joe: see Section 4.2 for length control with reward design. The strategy is similar to the paper above._
458 | 
459 | [(2025, Feb) [EMNLP 2025] LightThinker: Thinking Step-by-Step Compression](https://arxiv.org/abs/2502.15589)
460 | 
461 | -> _Joe: Compresses thinking steps into a smaller set of special tokens. Trained with a special attention mask; inference uses a reduced KV cache based on the mask structure._
462 | 
463 | - Not using RL. Merging rules are based on heuristics.
464 | 
465 | [(2025 Mar) The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models](https://arxiv.org/pdf/2503.02875)
466 | 
467 | 
468 | [(2025 Apr) Z1: Efficient Test-time Scaling with Code](https://arxiv.org/abs/2504.00810)
469 | 
470 | -> Reducing reasoning token length through SFT on data generated by the QwQ-32B-Preview model
471 | - Dataset size of 107K; SFT on Qwen2.5-Coder-7B-Instruct with bfloat16, FSDP, and a global batch size of 128, for 2 epochs using 8 NVIDIA
472 | A100-80G GPUs
473 | - Simple reasoning-dataset analysis of trigram frequency in Section 2.1 and Appendix A.2
474 | - The biggest difference is removing `...` delimiters?
475 | - _Joe: Not quite sure about the "Shifted Thinking Window" name_
476 | 
477 | [(2025 Apr) Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification](https://arxiv.org/abs/2504.05419v1)
478 | 
479 | -> Probes whether the hidden states at intermediate reasoning steps can predict the correctness of the final answer
480 | - The probe can be used for early exit from long reasoning
481 | 
482 | [(2025 Apr) ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning](https://arxiv.org/abs/2504.01296)
483 | 
484 | -> Adds a reasoning length limit as a reward for RL
485 | 
486 | [(2025 Apr) Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models](https://arxiv.org/abs/2503.16419)
487 | 
488 | -> Survey
489 | - Collection of papers: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
490 | 
491 | [(2025 Apr) Learning Adaptive Parallel Reasoning with Language Models](https://arxiv.org/abs/2504.15466)
492 | 
493 | -> Changes the generation process to combine parallel and sequential search during generation
494 | - Similar to one of my earlier ideas of optimizing the generation process, which can be trained with RL directly for efficiency
495 | - But focuses on the Countdown task only, and trains a model from scratch for small-scale experiments
496 | 
497 | [(2025 May) Learn to Reason Efficiently with Adaptive Length-based Reward Shaping](https://arxiv.org/abs/2505.15612)
498 | 
499 | -> Reduces reasoning trajectory length with different length-reward shapes
500 | 
501 | [(2025 May) SEAL: Steerable Reasoning Calibration of Large Language Models for Free](https://arxiv.org/abs/2504.07986)
502 | 
503 | - Categorizes the reasoning steps into three behaviors: Execution thoughts, Reflecting thoughts, and Transition thoughts
504 | - Analysis shows that wrong reasoning often results in much longer generations, with more reflecting and transition steps
505 | - Extracts hidden states corresponding to the different behavioral steps, and constructs steering vectors to control the type of reasoning steps
506 | - Achieves more effective and efficient reasoning with inference-time steering, by roughly controlling the number of reflection steps, etc.
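
SEAL's control mechanism above is an instance of the general activation-steering recipe: take the difference between the mean hidden states of two behavior types and add a scaled copy of it at inference time. A hedged sketch of that generic recipe follows (not the paper's exact procedure; the scale `alpha` and the toy data are arbitrary).

```python
import numpy as np

def steering_vector(hidden_states_a, hidden_states_b):
    """Direction pointing from behavior B (e.g. reflection/transition steps)
    toward behavior A (e.g. execution steps), from collected hidden states."""
    return np.mean(hidden_states_a, axis=0) - np.mean(hidden_states_b, axis=0)

def steer(hidden_state, vector, alpha=4.0):
    """Nudge a hidden state along the steering direction at inference time."""
    return hidden_state + alpha * vector

# Toy example with random 8-dimensional "hidden states":
rng = np.random.default_rng(0)
v = steering_vector(rng.normal(size=(32, 8)), rng.normal(size=(32, 8)))
print(steer(rng.normal(size=8), v).shape)   # (8,)
```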
507 | 
508 | [(2025 May) AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models](https://arxiv.org/abs/2505.22662)
509 | 
510 | 
511 | [(2025 June) Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning](https://arxiv.org/abs/2506.05256)
512 | 
513 | -> Again adds a length-related penalty to the reward for RL training, but adjusted to the difficulty of each question, measured by the pass rate over K samples
514 | - The length reward formulation is a bit less straightforward
515 | - Doesn't show superior performance compared to previous baselines with a simple length reward, such as L1-Max
516 | 
517 | [(2025 June) Token-Efficient RL for LLM Reasoning](https://arxiv.org/pdf/2504.20834v4)
518 | 
519 | -> Reduces resource usage when training with GRPO and LoRA
520 | - Restricts the tokens that contribute to the loss
521 | - Estimates token-level advantages, and uses replay for resampling
522 | 
523 | [(2025 July) RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents](https://arxiv.org/abs/2507.22844)
524 | 
525 | - RL for long-horizon reasoning with agents
526 | 
527 | [(2025 Aug) Efficient Inference for Large Reasoning Models: A Survey](https://arxiv.org/abs/2503.23077)
528 | 
529 | -> Survey
--------------------------------------------------------------------------------