├── README.md
├── assets
│   ├── develope.jpg
│   ├── paper.json
│   ├── timeline.jpg
│   ├── timeline.png
│   └── timeline_2.png
└── src
    ├── list.md
    ├── main.py
    ├── table.md
    └── timeline.png
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-System2-Reasoning-LLM
2 |
3 | [Paper](http://arxiv.org/abs/2502.17419)
4 | [GitHub Repository](https://github.com/zzli2022/System2-Reasoning-LLM)
5 |
6 |
7 |
8 |
9 | ## 📢 Updates
10 |
11 | - **2025.02**: We released a survey paper "[From System 1 to System 2: A Survey of Reasoning Large Language Models](http://arxiv.org/abs/2502.17419)". Feel free to cite or open pull requests.
12 |
13 |
14 | ## 👀 Introduction
15 |
16 | Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to [our survey paper](http://arxiv.org/abs/2502.17419).
17 |
18 | Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI’s o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.
19 |
20 |
21 | 
22 |
23 | This image highlights the progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models. It shows how AI has evolved to handle a broader range of real-world challenges.
24 |
25 | 
26 | The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.
27 |
28 |
29 | ## 📒 Table of Contents
30 |
31 | - [Awesome-System2-Reasoning-LLM](#awesome-system2-reasoning-llm)
32 | - [Part 1: O1 Replication](#part-1-o1-replication)
33 | - [Part 2: Process Reward Models](#part-2-process-reward-models)
34 | - [Part 3: Reinforcement Learning](#part-3-reinforcement-learning)
35 | - [Part 4: MCTS/Tree Search](#part-4-mctstree-search)
36 | - [Part 5: Self-Training / Self-Improve](#part-5-self-training--self-improve)
37 | - [Part 6: Reflection](#part-6-reflection)
38 | - [Part 7: Efficient System2](#part-7-efficient-system2)
39 | - [Part 8: Explainability](#part-8-explainability)
40 | - [Part 9: Multimodal Agent related Slow-Fast System](#part-9-multimodal-agent-related-slow-fast-system)
41 | - [Part 10: Benchmark and Datasets](#part-10-benchmark-and-datasets)
42 | - [Part 11: Reasoning and Safety](#part-11-reasoning-and-safety)
43 | - [Part 12: R1 Driven Multimodal Reasoning Enhancement](#part-12-r1-driven-multimodal-reasoning-enhancement)
44 |
45 | ## Part 1: O1 Replication
46 |
47 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) 
48 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) 
49 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) 
50 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) 
51 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) 
52 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) 
53 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) 
54 | * DRT: Deep Reasoning Translation via Long Chain-of-Thought [[Paper]](https://arxiv.org/abs/2412.17498) 
55 | * mini-deepseek-r1 [[Blog]](https://www.philschmid.de/mini-deepseek-r1) 
56 | * Run DeepSeek R1 Dynamic 1.58-bit [[Blog]](https://unsloth.ai/blog/deepseekr1-dynamic) 
57 | * Simple Reinforcement Learning for Reasoning [[Notion]](https://hkust-nlp.notion.site/simplerl-reason) 
58 | * TinyZero [[github]](https://github.com/Jiayi-Pan/TinyZero) 
59 | * Open R1 [[github]](https://github.com/huggingface/open-r1) 
60 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) 
61 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) 
62 | * The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [[Paper]](https://arxiv.org/abs/2502.15631) 
63 | * Open-Reasoner-Zero [[Paper]](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) 
64 | * X-R1 [[github]](https://github.com/dhcode-cpp/X-R1) 
65 | * Unlock-Deepseek [[Blog]](https://mp.weixin.qq.com/s/Z7P61IV3n4XYeC0Et_fvwg) 
66 | * Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.14768) 
67 | * LMM-R1 [[github]](https://github.com/TideDra/lmm-r1) 
68 | ## Part 2: Process Reward Models
69 |
70 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) 
71 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) 
72 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) 
73 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) 
74 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) 
75 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) 
76 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) 
77 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) 
78 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) 
79 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) 
80 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) 
81 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) 
82 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) 
83 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) 
84 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. [[Paper]](https://arxiv.org/abs/2501.01290) 
85 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) 
86 | * Uncertainty-Aware Step-wise Verification with Generative Reward Models [[Paper]](https://arxiv.org/abs/2502.11250) 
87 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://www.arxiv.org/abs/2502.13943) 
88 | * Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [[Paper]](https://www.arxiv.org/abs/2502.08922) 
89 | * Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.06703) 
90 | * Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [[Paper]](https://arxiv.org/abs/2502.19328) 
91 | * Unified Reward Model for Multimodal Understanding and Generation [[Paper]](https://arxiv.org/abs/2503.05236) 
92 | * Reward Shaping to Mitigate Reward Hacking in RLHF [[Paper]](https://arxiv.org/abs/2502.18770) 
93 | * Multi-head Reward Aggregation Guided by Entropy [[Paper]](https://arxiv.org/abs/2503.20995) 
94 | * [[Paper]](https://arxiv.org/abs/2503.21295) 
95 | * Better Process Supervision with Bi-directional Rewarding Signals [[Paper]](https://arxiv.org/abs/2503.04618) 
96 | * Inference-Time Scaling for Generalist Reward Modeling [[Paper]](https://arxiv.org/abs/2504.02495) 
97 |
98 | ## Part 3: Reinforcement Learning
99 |
100 | * Improve Vision Language Model Chain-of-thought Reasoning [[Paper]](https://arxiv.org/abs/2410.16198) 
101 | * Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) 
102 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) 
103 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) 
104 | * InfAlign: Inference-aware language model alignment [[Paper]](https://arxiv.org/abs/2412.19792) 
105 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) 
106 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) 
107 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) 
108 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) 
109 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) 
110 | * Reasoning with Reinforced Functional Token Tuning [[Paper]](https://arxiv.org/abs/2502.13389) 
111 | * Value-Based Deep RL Scales Predictably [[Paper]](https://arxiv.org/abs/2502.04327) 
112 | * MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [[Paper]](https://arxiv.org/abs/2502.10391) 
113 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) 
114 | * DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [[Paper]](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) 
115 | * LIMR: Less is More for RL Scaling [[Paper]](https://arxiv.org/abs/2502.11886) 
116 | * A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [[Paper]](https://arxiv.org/abs/2502.143) 
117 | * Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [[Paper]](https://arxiv.org/abs/2502.19655) 
118 | * QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [[Paper]](https://arxiv.org/abs/2502.02584) 
119 | * Process Reinforcement through Implicit Rewards [[Paper]](https://arxiv.org/abs/2502.01456) 
120 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.21620) 
121 | * All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01067) 
122 | * R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [[Paper]](https://arxiv.org/abs/2503.05132) 
123 | * Visual-RFT: Visual Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2503.01785) 
124 | * GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [[Paper]](https://arxiv.org/abs/2503.08525) 
125 | * L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [[Paper]](https://arxiv.org/abs/2503.04697) 
126 | * Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [[Paper]](https://arxiv.org/abs/2503.16219) 
127 | * Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement [[Paper]](https://arxiv.org/abs/2503.07065) 
128 | * VLAA-Thinker [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/) 
129 | * Concise Reasoning via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05185) 
130 | * d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [[Paper]](https://dllm-reasoning.github.io/media/preprint.pdf) 
131 | * Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.05108) 
132 | * Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [[Paper]](https://arxiv.org/pdf/2504.05520) 
133 | * VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [[Paper]](https://arxiv.org/abs/2504.06958) 
134 | * SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [[Paper]](https://arxiv.org/abs/2504.11468) 
135 | * RAISE: Reinforced Adaptive Instruction Selection For Large Language Models [[Paper]](https://arxiv.org/abs/2504.07282) 
136 | * MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning [[Paper]](https://arxiv.org/abs/2504.10160) 
137 | * VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [[Paper]](https://arxiv.org/abs/2503.07523) 
138 |
139 | ## Part 4: MCTS/Tree Search
140 |
141 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) 
142 | * Fine-grained Conversational Decoding via Isotropic and Proximal Search [[Paper]](https://aclanthology.org/2023.emnlp-main.5/) 
143 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) 
144 | * AlphaZero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) 
146 | * Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [[Paper]](https://arxiv.org/abs/2309.15028) 
147 | * Look-back Decoding for Open-Ended Text Generation [[Paper]](https://aclanthology.org/2023.emnlp-main.66/) 
148 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) 
149 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) 
150 | * Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [[Paper]](https://openreview.net/forum?id=CVpuVe1N22&noteId=aTI8PGpO47) 
151 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) 
152 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) 
153 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) 
154 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) 
155 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [[Paper]](https://arxiv.org/abs/2406.07394) 
156 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) 
157 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) 
158 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) 
159 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) 
160 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) 
161 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) 
162 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) 
163 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) 
164 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) 
165 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) 
166 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) 
167 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) 
168 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) 
169 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) 
170 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) 
171 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) 
172 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) 
173 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) 
174 | * Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) 
175 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) 
176 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) 
177 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) 
178 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) 
179 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) 
180 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) 
181 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) 
182 | * Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [[Paper]](https://aclanthology.org/2024.naacl-short.42/) 
183 | * Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2502.11169) 
184 | * PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [[Paper]](https://arxiv.org/abs/2501.13007) 
185 | * ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [[Paper]](https://arxiv.org/abs/2502.12130) 
186 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) 
187 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) 
188 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) 
189 | * LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [[Paper]](https://arxiv.org/abs/2502.17925) 
190 | * Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [[Paper]](https://arxiv.org/abs/2502.02339) 
191 | * DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [[Paper]](https://arxiv.org/abs/2502.20730) 
192 | * Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [[Paper]](https://arxiv.org/abs/2502.11881) 
193 | * VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [[Paper]](https://arxiv.org/abs/2504.09130) 
194 | ## Part 5: Self-Training / Self-Improve
195 |
196 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) 
197 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) 
198 | * Large Language Models are Better Reasoners with Self-Verification [[Paper]](https://aclanthology.org/2023.findings-emnlp.167/) 
199 | * Self-Evaluation Guided Beam Search for Reasoning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/81fde95c4dc79188a69ce5b24d63010b-Abstract-Conference.html) 
200 | * Self-Refine: Iterative Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) 
201 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) 
203 | * V-STaR: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) 
204 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) 
205 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) 
206 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) 
207 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) 
208 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://openreview.net/forum?id=pTHfApDakA) 
210 | * Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [[Paper]](https://openreview.net/forum?id=dcbNzhVVQj#discussion) 
211 | * Self-Improvement in Language Models: The Sharpening Mechanism [[Paper]](https://arxiv.org/abs/2412.01951) 
212 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) 
213 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) 
214 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) 
215 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) 
216 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) 
217 | * Enabling Scalable Oversight via Self-Evolving Critic [[Paper]](https://arxiv.org/abs/2501.05727) 
218 | * S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [[Paper]](https://www.arxiv.org/abs/2502.12853) 
219 | * ProgCo: Program Helps Self-Correction of Large Language Models [[Paper]](https://arxiv.org/abs/2501.01264) 
220 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) 
221 | * Self-Training Elicits Concise Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20122) 
222 | * Language Models can Self-Improve at State-Value Estimation for Better Search [[Paper]](https://arxiv.org/abs/2503.02878) 
223 | * Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning [[Paper]](https://arxiv.org/abs/2504.08672) 
224 | * START: Self-taught Reasoner with Tools [[Paper]](https://arxiv.org/abs/2503.04625) 
225 | ## Part 6: Reflection
226 | * SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [[Paper]](https://arxiv.org/abs/2308.00436) 
227 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) 
228 | * Learning From Mistakes Makes LLM Better Reasoner [[Paper]](https://arxiv.org/abs/2310.20689) 
229 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) 
230 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) 
231 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) 
232 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.18478) 
233 | * Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [[Paper]](https://arxiv.org/abs/2411.11930) 
234 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) 
235 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) 
236 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) 
237 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) 
238 | * Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [[Paper]](https://aclanthology.org/2024.findings-emnlp.500/) 
239 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) 
240 | * RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [[Paper]](https://arxiv.org/abs/2501.11284) 
241 | * Perception in Reflection [[Paper]](https://arxiv.org/abs/2504.07165) 
242 | ## Part 7: Efficient System2
243 |
244 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) 
245 | * AutoReason: Automatic Few-Shot Reasoning Decomposition [[Paper]](https://arxiv.org/abs/2412.06975) 
246 | * DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) 
247 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) 
248 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) 
249 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) 
250 | * From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [[Paper]](https://arxiv.org/abs/2501.16207) 
251 | * MALT: Improving Reasoning with Multi-Agent LLM Training [[Paper]](https://arxiv.org/abs/2412.01928) 
252 | * Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [[Paper]](https://arxiv.org/abs/2501.18585) 
253 | * Efficient Reasoning with Hidden Thinking [[Paper]](https://arxiv.org/abs/2501.19201) 
254 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) 
255 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) 
256 | * Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [[Paper]](https://www.arxiv.org/abs/2502.13260) 
257 | * Titans: Learning to Memorize at Test Time [[Paper]](https://arxiv.org/abs/2501.00663) 
258 | * MoBA: Mixture of Block Attention for Long-Context LLMs [[Paper]](https://arxiv.org/abs/2502.13189) 
259 | * One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [[Paper]](https://arxiv.org/abs/2502.10454) 
260 | * Small Models Struggle to Learn from Strong Reasoners [[Paper]](https://arxiv.org/abs/2502.12143) 
261 | * TokenSkip: Controllable Chain-of-Thought Compression in LLMs [[Paper]](https://arxiv.org/abs/2502.12067) 
262 | * SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2502.12134) 
263 | * Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [[Paper]](https://arxiv.org/abs/2502.10428) 
264 | * Thinking Preference Optimization [[Paper]](https://arxiv.org/abs/2502.13173) 
265 | * Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [[Paper]](https://arxiv.org/abs/2502.12215) 
266 | * Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [[Paper]](https://arxiv.org/abs/2502.12929) 
267 | * CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [[Paper]](https://arxiv.org/abs/2502.07316) 
268 | * OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [[Paper]](https://arxiv.org/abs/2502.11271) 
269 | * LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.11176) 
270 | * Atom of Thoughts for Markov LLM Test-Time Scaling [[Paper]](https://arxiv.org/abs/2502.12018) 
271 | * Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [[Paper]](https://arxiv.org/abs/2502.11147) 
272 | * Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [[Paper]](https://arxiv.org/abs/2502.12855) 
273 | * Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [[Paper]](https://arxiv.org/abs/2502.08482) 
274 | * Scalable Language Models with Posterior Inference of Latent Thought Vectors [[Paper]](https://arxiv.org/abs/2502.01567) 
276 | * Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [[Paper]](https://arxiv.org/abs/2502.03275) 
277 | * LightThinker: Thinking Step-by-Step Compression [[Paper]](https://arxiv.org/abs/2502.15589) 
278 | * The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [[Paper]](https://arxiv.org/pdf/2502.17416) 
279 | * Reasoning with Latent Thoughts: On the Power of Looped Transformers [[Paper]](https://arxiv.org/pdf/2502.17416) 
281 | * Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.20332) 
282 | * Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [[Paper]](https://arxiv.org/abs/2502.11514) 
283 | * Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2502.19918) 
284 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[Paper]](https://arxiv.org/abs/2502.20238) 
285 | * MixLLM: Dynamic Routing in Mixed Large Language Models [[Paper]](https://arxiv.org/abs/2502.18482) 
286 | * PEARL: Towards Permutation-Resilient LLMs [[Paper]](https://arxiv.org/abs/2502.14628) 
287 | * Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [[Paper]](https://www.arxiv.org/abs/2502.07803) 
288 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [[Paper]](https://arxiv.org/abs/2502.19361) 
289 | * Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [[Paper]](https://arxiv.org/abs/2502.19411) 
290 | * Training Large Language Models to be Better Rule Followers [[Paper]](https://arxiv.org/abs/2502.11525) 
291 | * Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [[Paper]](https://arxiv.org/abs/2502.04644) 
292 | * CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [[Paper]](https://arxiv.org/abs/2502.21074) 
293 | * SIFT: Grounding LLM Reasoning in Contexts via Stickers [[Paper]](https://arxiv.org/abs/2502.14922) 
294 | * AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [[Paper]](https://arxiv.org/abs/2502.13943) 
295 | * How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [[Paper]](https://arxiv.org/abs/2503.01141) 
296 | * PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [[Paper]](https://arxiv.org/abs/2503.02324) 
297 | * DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [[Paper]](https://arxiv.org/abs/2503.04472) 
298 | * Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [[Paper]](https://arxiv.org/abs/2503.04691) 
299 | * Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [[Paper]](https://arxiv.org/abs/2503.09567) 
300 | * TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [[Paper]](https://arxiv.org/abs/2503.04872) 
301 | * Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [[Paper]](https://arxiv.org/abs/2503.05641) 
302 | * Entropy-based Exploration Conduction for Multi-step Reasoning [[Paper]](https://arxiv.org/abs/2503.15848) 
303 | * MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [[Paper]](https://arxiv.org/abs/2503.16212) 
304 | * Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [[Paper]](https://arxiv.org/abs/2503.16419) 
305 | * ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [[Paper]](https://arxiv.org/abs/2503.12918) 
306 | * Agent models: Internalizing Chain-of-Action Generation into Reasoning models [[Paper]](https://arxiv.org/abs/2503.06580) 
307 | * StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [[Paper]](https://arxiv.org/abs/2503.10105) 
308 | * Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [[Paper]](https://arxiv.org/abs/2503.10183) 
309 | * Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [[Paper]](https://arxiv.org/abs/2503.19877) 
310 | * Shared Global and Local Geometry of Language Model Embeddings [[Paper]](https://arxiv.org/abs/2503.21073) 
311 | * Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [[Paper]](https://arxiv.org/abs/2503.13360) 
312 | * Effectively Controlling Reasoning Models through Thinking Intervention [[Paper]](https://arxiv.org/abs/2503.24370) 
313 | * Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [[Paper]](https://arxiv.org/abs/2503.02318) 
314 | * TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning [[Paper]](https://arxiv.org/abs/2504.09641) 
315 | * Lemmanaid: Neuro-Symbolic Lemma Conjecturing [[Paper]](https://arxiv.org/abs/2504.04942) 
316 | * ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning [[Paper]](https://arxiv.org/abs/2504.06650) 
317 | * Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [[Paper]](https://arxiv.org/abs/2504.05599) 
318 | * Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [[Paper]](https://arxiv.org/abs/2504.05419) 
319 | * Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? [[Paper]](https://arxiv.org/abs/2504.06514) 
320 | * Decentralizing AI Memory: SHIMI, a Semantic Hierarchical Memory Index for Scalable Agent Reasoning [[Paper]](https://arxiv.org/abs/2504.06135) 
321 | * Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning [[Paper]](https://arxiv.org/pdf/2504.05632) 
322 | * Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [[Paper]](https://arxiv.org/abs/2504.00294) 
323 | * RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [[Paper]](https://arxiv.org/abs/2504.10081) 
324 | * Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [[Paper]](https://arxiv.org/abs/2504.11741) 
325 |
326 |
327 | ## Part 8: Explainability
328 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) 
329 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) 
330 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) 
331 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) 
332 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) 
333 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) 
334 | * Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [[Paper]](https://arxiv.org/abs/2501.04682) 
335 | * LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [[Paper]](https://arxiv.org/abs/2501.06186) 
336 | * Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [[Paper]](https://arxiv.org/abs/2502.19230) 
337 | * Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2503.11074) 
338 | * Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [[Paper]](https://arxiv.org/abs/2503.15558) 
339 | ## Part 9: Multimodal Agent related Slow-Fast System
340 |
341 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) 
342 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) 
343 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) 
344 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) 
345 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) 
346 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) 
347 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) 
348 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) 
349 | * I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [[Paper]](https://arxiv.org/abs/2502.10458) 
350 | * RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [[Paper]](https://arxiv.org/abs/2502.13957) 
351 | ## Part 10: Benchmark and Datasets
352 |
353 | * Evaluation of OpenAI o1: Opportunities and Challenges of AGI [[Paper]](https://arxiv.org/abs/2409.18486) 
354 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [[Paper]](https://arxiv.org/abs/2409.15277) 
355 | * FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [[Paper]](https://arxiv.org/abs/2411.04872) 
356 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) 
357 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) 
358 | * EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [[Paper]](https://arxiv.org/abs/2502.12466) 
359 | * SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [[Paper]](https://arxiv.org/abs/2502.14739) 
360 | * Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [[Paper]](https://arxiv.org/abs/2502.14191) 
361 | * MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [[Paper]](https://arxiv.org/abs/2502.06453) 
362 | * LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [[Paper]](https://arxiv.org/abs/2501.15089) 
363 | * Humanity's Last Exam [[Paper]](https://arxiv.org/abs/2501.14249) 
364 | * RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [[Paper]](https://openreview.net/forum?id=QEHrmQPBdd) 
365 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) 
366 | * Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [[Paper]](https://arxiv.org/abs/2502.17387) 
367 | * ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [[paper]](https://arxiv.org/abs/2502.09696) 
368 | * MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [[paper]](https://arxiv.org/abs/2502.09621) 
369 | * MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [[paper]](https://arxiv.org/abs/2502.00698) 
370 | * LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [[Paper]](https://arxiv.org/abs/2502.17848) 
371 | * BIG-Bench Extra Hard [[Paper]](https://arxiv.org/abs/2502.19187) 
372 | * MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [[paper]](https://arxiv.org/abs/2502.20808) 
373 | * MastermindEval: A Simple But Scalable Reasoning Benchmark [[paper]](https://arxiv.org/abs/2503.05891) 
374 | * DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [[paper]](https://arxiv.org/abs/2503.15793) 
375 | * V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [[github]](https://github.com/haonan3/V1) 
376 | * ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [[paper]](https://arxiv.org/abs/2503.21248) 
377 | * S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models [[paper]](https://arxiv.org/abs/2504.10368) 
378 | * When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks [[paper]](https://arxiv.org/abs/2504.02010) 
379 | * BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [[paper]](https://openai.com/index/browsecomp/) 
380 | * Mle-bench: Evaluating machine learning agents on machine learning engineering [[paper]](https://arxiv.org/abs/2410.07095) 
381 | * How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [[paper]](https://arxiv.org/abs/2403.11807) 
382 | * OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [[paper]](https://openreview.net/forum?id=tN61DTr4Ed) 
383 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [[paper]](https://arxiv.org/abs/2501.01290) 
384 | * Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [[paper]](https://arxiv.org/abs/2501.11733) 
385 | * PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [[paper]](https://arxiv.org/abs/2502.12054) 
386 | * Text2World: Benchmarking Large Language Models for Symbolic World Model Generation [[paper]](https://arxiv.org/abs/2502.13092) 
387 | * WebGames: Challenging General-Purpose Web-Browsing AI Agents [[paper]](https://arxiv.org/abs/2502.18356) 
388 | * UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.21620) 
389 | * Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [[paper]](https://openreview.net/forum?id=HjwK-Tc_Bc) 
390 | * Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots [[paper]](https://arxiv.org/abs/2405.07990) 
391 | * M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [[paper]](https://aclanthology.org/2024.acl-long.446/) 
392 | * PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [[paper]](https://aclanthology.org/2024.findings-acl.962/) 
393 | * Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation [[paper]](https://openreview.net/forum?id=t1mAXb4Cop) 
394 | * HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [[paper]](https://arxiv.org/abs/2410.12381) 
395 | * CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [[paper]](https://arxiv.org/abs/2412.12932) 
396 | * ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [[paper]](https://openreview.net/forum?id=sGpCzsfd1K) 
397 | * Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios [[paper]](https://arxiv.org/abs/2502.19973) 
398 | * EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [[paper]](https://arxiv.org/abs/2502.08859) 
399 | * Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities [[paper]](https://arxiv.org/abs/2502.11829) 
400 | * Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.04801) 
401 | * MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems [[paper]](https://arxiv.org/abs/2503.01891) 
402 | * LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? [[paper]](https://arxiv.org/abs/2503.19990) 
403 | * On the measure of intelligence [[paper]](https://arxiv.org/abs/1911.01547) 
404 | * Competition-Level Code Generation with AlphaCode [[paper]](https://arxiv.org/abs/2203.07814) 
405 | * Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [[paper]](https://aclanthology.org/2023.findings-acl.824/) 
406 | * OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [[paper]](https://openreview.net/forum?id=ayF8bEKYQy) 
407 | * Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning [[paper]](https://openreview.net/forum?id=YXnwlZe0yf) 
408 | * Let's verify step by step [[paper]](https://openreview.net/forum?id=v8L0pN6EOi) 
409 | * Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation [[paper]](https://arxiv.org/abs/2405.11430) 
411 | * LiveBench: A Challenging, Contamination-Limited LLM Benchmark [[paper]](https://openreview.net/forum?id=sKYHBTAxVa) 
412 | * JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [[paper]](https://arxiv.org/abs/2501.14851) 
413 | * MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding [[paper]](https://arxiv.org/abs/2501.18362) 
414 | * Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [[paper]](https://arxiv.org/abs/2502.15815) 
415 | * AIME 2025 [[huggingface]](https://huggingface.co/datasets/opencompass/AIME2025) 
416 | * ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning [[paper]](https://arxiv.org/abs/2502.16268) 
417 | * ProBench: Benchmarking Large Language Models in Competitive Programming [[paper]](https://arxiv.org/abs/2502.20868) 
418 | * ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [[paper]](https://arxiv.org/abs/2502.01100) 
419 | * DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization [[paper]](https://openreview.net/forum?id=2Zan4ATYsh) 
420 | * QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? [[paper]](https://arxiv.org/abs/2503.22674) 
421 | * Benchmarking Reasoning Robustness in Large Language Models [[paper]](https://arxiv.org/abs/2503.04550) 
422 | * Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges [[paper]](https://arxiv.org/abs/2502.08680) 
423 | * Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [[paper]](https://arxiv.org/abs/2502.12521) 
424 | * Rewardbench: Evaluating reward models for language modeling [[paper]](https://arxiv.org/abs/2403.13787) 
425 | * Evaluating LLMs at Detecting Errors in LLM Responses [[paper]](https://openreview.net/forum?id=dnwRScljXr) 
426 | * CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [[paper]](https://aclanthology.org/2024.findings-acl.91/) 
427 | * Judgebench: A benchmark for evaluating llm-based judges [[paper]](https://arxiv.org/abs/2410.12784) 
428 | * Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection [[paper]](https://arxiv.org/abs/2410.04509) 
429 | * Processbench: Identifying process errors in mathematical reasoning [[paper]](https://arxiv.org/abs/2412.06559) 
430 | * Medec: A benchmark for medical error detection and correction in clinical notes [[paper]](https://arxiv.org/abs/2412.19260) 
431 | * CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [[paper]](https://arxiv.org/abs/2502.16614) 
432 | * Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [[paper]](https://arxiv.org/abs/2502.19361) 
433 | * FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [[paper]](https://arxiv.org/abs/2502.20238) 
434 |
435 |
436 |
437 | ## Part 11: Reasoning and Safety
438 | * Measuring Faithfulness in Chain-of-Thought Reasoning [[Blog]](https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning) 
439 | * Deliberative Alignment: Reasoning Enables Safer Language Models [[Paper]](https://arxiv.org/abs/2412.16339) 
440 | * OpenAI trained o1 and o3 to ‘think’ about its safety policy [[Blog]](https://techcrunch.com/2024/12/22/openai-trained-o1-and-o3-to-think-about-its-safety-policy) 
441 | * Why AI Safety Researchers Are Worried About DeepSeek [[Blog]](https://time.com/7210888/deepseeks-hidden-ai-safety-warning/) 
442 | * OverThink: Slowdown Attacks on Reasoning LLMs [[Paper]](https://arxiv.org/abs/2502.02542) 
443 | * GuardReasoner: Towards Reasoning-based LLM Safeguards [[Paper]](https://arxiv.org/abs/2501.18492) 
444 | * SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [[Paper]](https://arxiv.org/abs/2502.12025) 
445 | * ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [[Paper]](https://arxiv.org/abs/2502.13458) 
447 | * H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [[Paper]](https://arxiv.org/abs/2502.12893) 
448 | * BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [[Paper]](https://arxiv.org/abs/2502.12202) 
449 | * The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [[Paper]](https://arxiv.org/abs/2502.12659) 
450 | * Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [[Blog]](https://far.ai/post/2025-02-r1-redteaming/) 
451 | * Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [[Paper]](https://arxiv.org/abs/2503.00555) 
452 | * DeepSeek-R1 Thoughtology: Let's `<think>` about LLM Reasoning [[Paper]](https://arxiv.org/abs/2504.12659) 
453 | * STAR-1: Safer Alignment of Reasoning LLMs with 1K Data [[Paper]](https://arxiv.org/abs/2504.01903) 
454 |
455 | ## Part 12: R1 Driven Multimodal Reasoning Enhancement
456 | * Open R1 Video [[github]](https://github.com/Wang-Xiaodong1899/Open-R1-Video) 
457 | * R1-Vision: Let's first take a look at the image [[github]](https://github.com/yuyq96/R1-Vision) 
458 | * MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [[paper]](https://arxiv.org/abs/2502.19634) 
459 | * Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [[github]](https://github.com/baibizhe/Efficient-R1-VLLM) 
460 | * MMR1: Advancing the Frontiers of Multimodal Reasoning [[github]](https://github.com/LengSicong/MMR1) 
461 | * Skywork-R1V: Pioneering Multimodal Reasoning with CoT [[github]](https://github.com/SkyworkAI/Skywork-R1V/tree/main) 
462 | * VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [[Blog]](https://om-ai-lab.github.io/index.html) 
463 | * Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [[paper]](https://arxiv.org/abs/2503.24376v1) 
464 | * Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [[paper]](https://arxiv.org/abs/2503.20752) 
465 | * MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.07365) 
466 | * R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [[paper]](https://arxiv.org/abs/2503.05379) 
467 | * R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [[paper]](https://arxiv.org/abs/2503.10615) 
468 | * R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [[paper]](https://arxiv.org/abs/2503.12937) 
469 | * Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [[paper]](https://arxiv.org/abs/2503.11197) 
470 | * Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [[paper]](https://arxiv.org/abs/2503.06520) 
471 | * TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [[paper]](https://arxiv.org/abs/2503.13377) 
472 | * Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [[paper]](https://arxiv.org/abs/2503.06749) 
473 | * Q-Insight: Understanding Image Quality via Visual Reinforcement Learning [[paper]](http://arxiv.org/abs/2503.22679) 
474 | * Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [[paper]](http://arxiv.org/abs/2504.02587) 
475 | * VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [[paper]](http://arxiv.org/abs/2504.07615) 
476 | * VLAA-Thinking [[github]](https://github.com/UCSC-VLAA/VLAA-Thinking/blob/main/assets/VLAA-Thinker.pdf) 
477 | * SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [[paper]](http://arxiv.org/abs/2504.07934) 
478 | * Perception-R1: Pioneering Perception Policy with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.07954) 
479 | * VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.08837) 
480 | * Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [[paper]](http://arxiv.org/abs/2504.12680) 
481 | * NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [[paper]](https://arxiv.org/abs/2504.13055v1) 
482 |
483 |
484 |
485 |
486 | ## Citation
487 | If you find this work useful, please consider citing our survey:
488 | ```bib
489 | @misc{li202512surveyreasoning,
490 |   title={From System 1 to System 2: A Survey of Reasoning Large Language Models},
491 |   author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu},
492 |   year={2025},
493 |   eprint={2502.17419},
494 |   archivePrefix={arXiv},
495 |   primaryClass={cs.AI},
496 |   url={https://arxiv.org/abs/2502.17419},
497 | }
498 | ```
499 |
500 |
501 | ## ⭐ Star History
502 |
--------------------------------------------------------------------------------
/assets/develope.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/develope.jpg
--------------------------------------------------------------------------------
/assets/paper.json:
--------------------------------------------------------------------------------
1 | {
2 | "Part 1: O1 Replication": [
3 | {
4 | "paper": "O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?",
5 | "link": "https://arxiv.org/abs/2411.16489",
6 | "venue": "arXiv",
7 | "date": "2024-11",
8 | "label": "huang2024o1"
9 | },
10 | {
11 | "paper": "O1 Replication Journey: A Strategic Progress Report -- Part 1",
12 | "link": "https://arxiv.org/abs/2410.18982",
13 | "venue": "arXiv",
14 | "date": "2024-10",
15 | "label": "qin2024o1"
16 | },
17 | {
18 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions",
19 | "link": "https://arxiv.org/abs/2411.14405",
20 | "venue": "arXiv",
21 | "date": "2024-11",
22 | "label": "zhao2024marco"
23 | },
24 | {
25 | "paper": "o1-Coder: an o1 Replication for Coding",
26 | "link": "https://arxiv.org/abs/2412.00154",
27 | "venue": "arXiv",
28 | "date": "2024-12",
29 | "label": "zhang2024o1"
30 | },
31 | {
32 | "paper": "Enhancing LLM Reasoning with Reward-guided Tree Search",
33 | "link": "https://arxiv.org/abs/2411.11694",
34 | "venue": "arXiv",
35 | "date": "2024-11",
36 | "label": "chen2024enhancing"
37 | },
38 | {
39 | "paper": "Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems",
40 | "link": "https://arxiv.org/abs/2412.09413",
41 | "venue": "arXiv",
42 | "date": "2024-12",
43 | "label": "min2024imitate"
44 | }
45 | ],
46 | "Part 2: Process Reward Models": [
47 | {
48 | "paper": "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations",
49 | "link": "https://aclanthology.org/2024.acl-long.510/",
50 | "venue": "ACL",
51 | "date": "2024-08",
52 | "label": "wang2024mathshepherd"
53 | },
54 | {
55 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search",
56 | "link": "https://openreview.net/forum?id=8rcFOqEud5",
57 | "venue": "NeurIPS",
58 | "date": "2024-12",
59 | "label": "zhang2024restmcts"
60 | },
61 | {
62 | "paper": "Let's Verify Step by Step.",
63 | "link": "https://arxiv.org/abs/2305.20050",
64 | "venue": "ICLR",
65 | "date": "2024-05",
66 | "label": "lightman2023letsverify"
67 | },
68 | {
69 | "paper": "Making Large Language Models Better Reasoners with Step-Aware Verifier",
70 | "link": "https://arxiv.org/abs/2206.02336",
71 | "venue": "arXiv",
72 | "date": "2023-06",
73 | "label": "yuan2023stepaware"
74 | },
75 | {
76 | "paper": "Improve Mathematical Reasoning in Language Models by Automated Process Supervision",
77 | "link": "https://arxiv.org/abs/2306.05372",
78 | "venue": "arXiv",
79 | "date": "2023-06",
80 | "label": "chen2023automatedprocess"
81 | },
82 | {
83 | "paper": "OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning",
84 | "link": "https://aclanthology.org/2024.findings-naacl.55/",
85 | "venue": "ACL Findings",
86 | "date": "2024-08",
87 | "label": "liu2023ovm"
88 | },
89 | {
90 | "paper": "Solving Math Word Problems with Process and Outcome-Based Feedback",
91 | "link": "https://arxiv.org/abs/2211.14275",
92 | "venue": "arXiv",
93 | "date": "2022-11",
94 | "label": "zhang2023processoutcome"
95 | },
96 | {
97 | "paper": "AutoPSV: Automated Process-Supervised Verifier",
98 | "link": "https://openreview.net/forum?id=eOAPWWOGs9",
99 | "venue": "NeurIPS",
100 | "date": "2024-12",
101 | "label": "lu2024autopsv"
102 | },
103 | {
104 | "paper": "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs",
105 | "link": "https://arxiv.org/abs/2406.18629",
106 | "venue": "arXiv",
107 | "date": "2024-06",
108 | "label": "chen2023stepdpo"
109 | },
110 | {
111 | "paper": "ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding",
112 | "link": "https://arxiv.org/abs/2501.07861",
113 | "venue": "arXiv",
114 | "date": "2025-01",
115 | "label": "li2025rearter"
116 | },
117 | {
118 | "paper": "The Lessons of Developing Process Reward Models in Mathematical Reasoning.",
119 | "link": "https://arxiv.org/abs/2501.07301",
120 | "venue": "arXiv",
121 | "date": "2025-01",
122 | "label": "gao2025lessons"
123 | },
124 | {
125 | "paper": "Outcome-Refining Process Supervision for Code Generation",
126 | "link": "https://arxiv.org/abs/2412.15118",
127 | "venue": "arXiv",
128 | "date": "2024-12",
129 | "label": "chen2024outcomerefining"
130 | },
131 | {
132 | "paper": "Free Process Rewards without Process Labels.",
133 | "link": "https://arxiv.org/abs/2412.01981",
134 | "venue": "arXiv",
135 | "date": "2024-12",
136 | "label": "yuan2024freeprocess"
137 | },
138 | {
139 | "paper": "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models.",
140 | "link": "https://arxiv.org/abs/2501.03124",
141 | "venue": "arXiv",
142 | "date": "2025-01",
143 | "label": "liu2025prmbench"
144 | },
145 | {
146 | "paper": "ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.",
147 | "link": "https://arxiv.org/abs/2501.01290",
148 | "venue": "arXiv",
149 | "date": "2025-01",
150 | "label": "zhang2024toolcomp"
151 | }
152 | ],
153 | "Part 3: Reinforcement Learning": [
154 | {
155 | "paper": "Offline Reinforcement Learning for LLM Multi-Step Reasoning",
156 | "link": "https://arxiv.org/abs/2412.16145",
157 | "venue": "arXiv",
158 | "date": "2024-12",
159 | "label": "wang2024offline"
160 | },
161 | {
162 | "paper": "ReFT: Representation Finetuning for Language Models",
163 | "link": "https://aclanthology.org/2024.acl-long.410.pdf",
164 | "venue": "ACL",
165 | "date": "2024-08",
166 | "label": "wu2024reft"
167 | },
168 | {
169 | "paper": "Deepseekmath: Pushing the limits of mathematical reasoning in open language models",
170 | "link": "https://arxiv.org/abs/2402.03300",
171 | "venue": "arXiv",
172 | "date": "2024-02",
173 | "label": "lee2024deepseekmath"
174 | },
175 | {
176 | "paper": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning",
177 | "link": "https://arxiv.org/abs/2501.12948",
178 | "venue": "arXiv",
179 | "date": "2025-01",
180 | "label": "luong2025deepseekr1"
181 | },
182 | {
183 | "paper": "Kimi k1.5: Scaling Reinforcement Learning with LLMs",
184 | "link": "https://arxiv.org/abs/2501.12599",
185 | "venue": "arXiv",
186 | "date": "2025-01",
187 | "label": "liu2025kimi"
188 | },
189 | {
190 | "paper": "Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search",
191 | "link": "https://arxiv.org/abs/2502.02508",
192 | "venue": "arXiv",
193 | "date": "2025-02",
194 | "label": "zhang2025satori"
195 | },
196 | {
197 | "paper": "Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling",
198 | "link": "https://arxiv.org/abs/2501.11651",
199 | "venue": "arXiv",
200 | "date": "2025-01",
201 | "label": "wang2025advancing"
202 | },
203 | {
204 | "paper": "Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies",
205 | "link": "https://arxiv.org/abs/2501.17030",
206 | "venue": "arXiv",
207 | "date": "2025-01",
208 | "label": "chen2025aisafety"
209 | },
210 | {
211 | "paper": "Does RLHF Scale? Exploring the Impacts From Data, Model, and Method",
212 | "link": "https://arxiv.org/abs/2412.06000",
213 | "venue": "arXiv",
214 | "date": "2024-12",
215 | "label": "lee2024rlhf"
216 | }
217 | ],
218 | "Part 4: MCTS/Tree Search": [
219 | {
220 | "paper": "Reasoning with Language Model is Planning with World Model",
221 | "link": "https://aclanthology.org/2023.emnlp-main.507/",
222 | "venue": "EMNLP",
223 | "date": "2023-12",
224 | "label": "hao2023rap"
225 | },
226 | {
227 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning",
228 | "link": "https://arxiv.org/abs/2405.00451",
229 | "venue": "arXiv",
230 | "date": "2024-05",
231 | "label": "zhou2024mctsboost"
232 | },
233 | {
234 | "paper": "Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training",
235 | "link": "https://openreview.net/forum?id=PJfc4x2jXY",
236 | "venue": "NeurIPS WorkShop",
237 | "date": "2023-12",
238 | "label": "wang2023alphazero"
239 | },
240 | {
241 | "paper": "Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding",
242 | "link": "https://openreview.net/forum?id=kh9Zt2Ldmn#discussion",
243 | "venue": "CoLM",
244 | "date": "2024-10",
245 | "label": "chen2024valuemcts"
246 | },
247 | {
248 | "paper": "Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search",
249 | "link": "https://arxiv.org/abs/2412.18319",
250 | "venue": "arXiv",
251 | "date": "2024-12",
252 | "label": "li2024mulberry"
253 | },
254 | {
255 | "paper": "Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning",
256 | "link": "https://arxiv.org/abs/2412.17397",
257 | "venue": "arXiv",
258 | "date": "2024-12",
259 | "label": "liu2024intrinsicmcts"
260 | },
261 | {
262 | "paper": "Proposing and solving olympiad geometry with guided tree search",
263 | "link": "https://arxiv.org/abs/2412.10673",
264 | "venue": "arXiv",
265 | "date": "2024-12",
266 | "label": "zhang2024geometrymcts"
267 | },
268 | {
269 | "paper": "SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models",
270 | "link": "https://arxiv.org/abs/2412.11605",
271 | "venue": "arXiv",
272 | "date": "2024-12",
273 | "label": "wang2024spar"
274 | },
275 | {
276 | "paper": "Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning",
277 | "link": "https://arxiv.org/abs/2412.09078",
278 | "venue": "arXiv",
279 | "date": "2024-12",
280 | "label": "xu2024forest"
281 | },
282 | {
283 | "paper": "SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation",
284 | "link": "https://arxiv.org/abs/2411.11053",
285 | "venue": "arXiv",
286 | "date": "2024-11",
287 | "label": "liu2024sramcts"
288 | },
289 | {
290 | "paper": "MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree",
291 | "link": "https://arxiv.org/abs/2411.15645",
292 | "venue": "arXiv",
293 | "date": "2024-11",
294 | "label": "wang2024mcnest"
295 | },
296 | {
297 | "paper": "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions",
298 | "link": "https://arxiv.org/abs/2411.14405",
299 | "venue": "arXiv",
300 | "date": "2024-11",
301 | "label": "zhang2024marcoo1"
302 | },
303 | {
304 | "paper": "GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection",
305 | "link": "https://arxiv.org/abs/2411.04459",
306 | "venue": "arXiv",
307 | "date": "2024-11",
308 | "label": "chen2024gptmcts"
309 | },
310 | {
311 | "paper": "CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models",
312 | "link": "https://arxiv.org/abs/2411.04329",
313 | "venue": "arXiv",
314 | "date": "2024-11",
315 | "label": "liu2024codetree"
316 | },
317 | {
318 | "paper": "Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination",
319 | "link": "https://arxiv.org/abs/2410.17820",
320 | "venue": "arXiv",
321 | "date": "2024-10",
322 | "label": "wang2024treeofthoughts"
323 | },
324 | {
325 | "paper": "TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling",
326 | "link": "https://arxiv.org/abs/2410.16033",
327 | "venue": "arXiv",
328 | "date": "2024-10",
329 | "label": "zhou2024treebon"
330 | },
331 | {
332 | "paper": "Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning",
333 | "link": "https://arxiv.org/abs/2410.06508",
334 | "venue": "arXiv",
335 | "date": "2024-10",
336 | "label": "li2024selfimprove"
337 | },
338 | {
339 | "paper": "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning",
340 | "link": "https://arxiv.org/abs/2410.02884",
341 | "venue": "arXiv",
342 | "date": "2024-10",
343 | "label": "yang2024llamaberry"
344 | },
345 | {
346 | "paper": "Interpretable Contrastive Monte Carlo Tree Search Reasoning",
347 | "link": "https://arxiv.org/abs/2410.01707",
348 | "venue": "arXiv",
349 | "date": "2024-10",
350 | "label": "hu2024interpretablemcts"
351 | },
352 | {
353 | "paper": "MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time",
354 | "link": "https://arxiv.org/abs/2405.16265",
355 | "venue": "arXiv",
356 | "date": "2024-05",
357 | "label": "zhang2024mindstar"
358 | },
359 | {
360 | "paper": "RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation",
361 | "link": "https://arxiv.org/abs/2409.09584",
362 | "venue": "arXiv",
363 | "date": "2024-09",
364 | "label": "li2024rethinkmcts"
365 | },
366 | {
367 | "paper": "Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search",
368 | "link": "https://arxiv.org/abs/2408.10635",
369 | "venue": "arXiv",
370 | "date": "2024-08",
371 | "label": "zhou2024strategist"
372 | },
373 | {
374 | "paper": "Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search",
375 | "link": "https://arxiv.org/abs/2405.15383",
376 | "venue": "arXiv",
377 | "date": "2024-05",
378 | "label": "chen2024codeworldmcts"
379 | },
380 | {
381 | "paper": "Uncertainty-Guided Optimization on Large Language Model Search Trees",
382 | "link": "https://arxiv.org/abs/2407.03951",
383 | "venue": "arXiv",
384 | "date": "2024-07",
385 | "label": "yang2024uncertaintymcts"
386 | },
387 | {
388 | "paper": "Tree Search for Language Model Agents",
389 | "link": "https://arxiv.org/abs/2407.01476",
390 | "venue": "arXiv",
391 | "date": "2024-07",
392 | "label": "wu2024treesearchlm"
393 | },
394 | {
395 | "paper": "LiteSearch: Efficacious Tree Search for LLM",
396 | "link": "https://arxiv.org/abs/2407.00320",
397 | "venue": "arXiv",
398 | "date": "2024-07",
399 | "label": "chen2024litesearch"
400 | },
401 | {
402 | "paper": "Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B",
403 | "link": "https://arxiv.org/abs/2406.07394",
404 | "venue": "arXiv",
405 | "date": "2024-06",
406 | "label": "liu2024accessing"
407 | },
408 | {
409 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search",
410 | "link": "https://arxiv.org/abs/2406.03816",
411 | "venue": "NeurIPS",
412 | "date": "2024-12",
413 | "label": "wang2024restmcts"
414 | },
415 | {
416 | "paper": "On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes",
417 | "link": "https://ieeexplore.ieee.org/abstract/document/10870057/",
418 | "venue": "IEEE TAC",
419 | "date": "2025-01",
420 | "label": "zhang2025mcts"
421 | },
422 | {
423 | "paper": "AlphaMath Almost Zero: process Supervision without process",
424 | "link": "https://arxiv.org/abs/2405.03553",
425 | "venue": "arXiv",
426 | "date": "2024-05",
427 | "label": "chen2024alphamath"
428 | },
429 | {
430 | "paper": "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning",
431 | "link": "https://arxiv.org/abs/2405.00451",
432 | "venue": "arXiv",
433 | "date": "2024-05",
434 | "label": "liu2024mctsboost"
435 | },
436 | {
437 | "paper": "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping",
438 | "link": "https://openreview.net/forum?id=rviGTsl0oy",
439 | "venue": "ICLR WorkShop",
440 | "date": "2024-05",
441 | "label": "wang2024beyonda"
442 | },
443 | {
444 | "paper": "Stream of Search (SoS): Learning to Search in Language",
445 | "link": "https://arxiv.org/abs/2404.03683",
446 | "venue": "arXiv",
447 | "date": "2024-04",
448 | "label": "yang2024sos"
449 | },
450 | {
451 | "paper": "LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models",
452 | "link": "https://openreview.net/forum?id=h1mvwbQiXR",
453 | "venue": "ICLR WorkShop",
454 | "date": "2024-05",
455 | "label": "zhang2024llmreasoners"
456 | },
457 | {
458 | "paper": "Search-o1: Agentic Search-Enhanced Large Reasoning Models",
459 | "link": "https://arxiv.org/abs/2501.05366",
460 | "venue": "arXiv",
461 | "date": "2025-01",
462 | "label": "li2025searcho1"
463 | },
464 | {
465 | "paper": "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking",
466 | "link": "https://arxiv.org/abs/2501.04519",
467 | "venue": "arXiv",
468 | "date": "2025-01",
469 | "label": "chen2025rstar"
470 | },
471 | {
472 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs",
473 | "link": "https://arxiv.org/abs/2412.18925",
474 | "venue": "arXiv",
475 | "date": "2024-12",
476 | "label": "zhang2024huatuo"
477 | },
478 | {
479 | "paper": "AFlow: Automating Agentic Workflow Generation",
480 | "link": "https://arxiv.org/abs/2410.10762",
481 | "venue": "arXiv",
482 | "date": "2024-10",
483 | "label": "wang2024aflow"
484 | },
485 | {
486 | "paper": "MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING",
487 | "link": "https://arxiv.org/abs/2309.15028",
488 | "venue": "arXiv",
489 | "date": "2023-09",
490 | "label": "liu2023ppo"
491 | },
492 | {
493 | "paper": "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning",
494 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html",
495 | "venue": "NeurIPS",
496 | "date": "2023-12",
497 | "label": "wang2023commonsense"
498 | },
499 | {
500 | "paper": "ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING",
501 | "link": "https://openreview.net/forum?id=PJfc4x2jXY",
502 | "venue": "NeurIPS WorkShop",
503 | "date": "2023-12",
504 | "label": "li2023alphazero"
505 | },
506 | {
507 | "paper": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing",
508 | "link": "https://arxiv.org/abs/2404.12253",
509 | "venue": "arXiv",
510 | "date": "2024-04",
511 | "label": "chen2024selfimprove"
512 | }
513 | ],
514 | "Part 5: Self-Training / Self-Improve": [
515 | {
516 | "paper": "STaR: Bootstrapping Reasoning With Reasoning",
517 | "link": "https://arxiv.org/abs/2203.14465",
518 | "venue": "NeurIPS2022",
519 | "date": "2022-05",
520 | "label": "zelikman2022star"
521 | },
522 | {
523 | "paper": "ReST: Reinforced Self-Training for Language Modeling",
524 | "link": "https://arxiv.org/abs/2308.08998",
525 | "venue": "arXiv",
526 | "date": "2023-08",
527 | "label": "gulcehre2023rest"
528 | },
529 | {
530 | "paper": "ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models",
531 | "link": "https://openreview.net/forum?id=lNAyUngGFK",
532 | "venue": "TMLR",
533 | "date": "2024-09",
534 | "label": "yang2024restem"
535 | },
536 | {
537 | "paper": "Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search",
538 | "link": "https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html",
539 | "venue": "NeurIPS",
540 | "date": "2017-12",
541 | "label": "anthony2017expert"
542 | },
543 | {
544 | "paper": "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking",
545 | "link": "https://arxiv.org/abs/2403.09629",
546 | "venue": "arXiv",
547 | "date": "2024-03",
548 | "label": "zelikman2024quietstar"
549 | },
550 | {
551 | "paper": "V-star: Training Verifiers for Self-Taught Reasoners",
552 | "link": "https://arxiv.org/abs/2402.06457",
553 | "venue": "arXiv",
554 | "date": "2024-02",
555 | "label": "liu2024vstar"
556 | },
557 | {
558 | "paper": "Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models",
559 | "link": "https://arxiv.org/abs/2406.11736",
560 | "venue": "arXiv",
561 | "date": "2024-06",
562 | "label": "wang2024interactive"
563 | },
564 | {
565 | "paper": "ReFT: Representation Finetuning for Language Models",
566 | "link": "https://aclanthology.org/2024.acl-long.410.pdf",
567 | "venue": "ACL",
568 | "date": "2024-08",
569 | "label": "wu2024reft"
570 | },
571 | {
572 | "paper": "ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search",
573 | "link": "https://arxiv.org/abs/2406.03816",
574 | "venue": "NeurIPS",
575 | "date": "2024-12",
576 | "label": "chen2024restmcts"
577 | },
578 | {
579 | "paper": "Recursive Introspection: Teaching Language Model Agents How to Self-Improve",
580 | "link": "https://openreview.net/forum?id=DRC9pZwBwR",
581 | "venue": "NeurIPS",
582 | "date": "2024-12",
583 | "label": "zhang2024recursive"
584 | },
585 | {
586 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner",
587 | "link": "https://arxiv.org/abs/2412.17256",
588 | "venue": "arXiv",
589 | "date": "2024-12",
590 | "label": "he2024bstar"
591 | },
592 | {
593 | "paper": "Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math)",
594 | "link": "https://arxiv.org/abs/2501.04519",
595 | "venue": "arXiv",
596 | "date": "2025-01",
597 | "label": "xu2025rstar"
598 | },
599 | {
600 | "paper": "Enhancing Large Vision Language Models with Self-Training on Image Comprehension",
601 | "link": "https://arxiv.org/abs/2405.19716",
602 | "venue": "arXiv",
603 | "date": "2024-05",
604 | "label": "li2024enhancing"
605 | },
606 | {
607 | "paper": "Self-Refine: Iterative Refinement with Self-Feedback",
608 | "link": "https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html",
609 | "venue": "NeurIPS",
610 | "date": "2023-12",
611 | "label": "madaan2023selfrefine"
612 | },
613 | {
614 | "paper": "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing",
615 | "link": "https://openreview.net/forum?id=Sx038qxjek",
616 | "venue": "ICLR",
617 | "date": "2024-05",
618 | "label": "zhang2024critic"
619 | }
620 | ],
621 | "Part 6: Reflection": [
622 | {
623 | "paper": "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers",
624 | "link": "https://arxiv.org/abs/2408.06195",
625 | "venue": "arXiv",
626 | "date": "2024-08",
627 | "label": "zheng2024mutual"
628 | },
629 | {
630 | "paper": "Reflection-Tuning: An Approach for Data Recycling",
631 | "link": "https://arxiv.org/abs/2310.11716",
632 | "venue": "arXiv",
633 | "date": "2023-10",
634 | "label": "li2024reflection"
635 | },
636 | {
637 | "paper": "Vision-Language Models Can Self-Improve Reasoning via Reflection",
638 | "link": "https://arxiv.org/abs/2411.00855",
639 | "venue": "arXiv",
640 | "date": "2024-11",
641 | "label": "cheng2024vision"
642 | },
643 | {
644 | "paper": "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs",
645 | "link": "https://arxiv.org/abs/2412.18925",
646 | "venue": "arXiv",
647 | "date": "2024-12",
648 | "label": "zhang2024huatuo"
649 | },
650 | {
651 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning",
652 | "link": "https://arxiv.org/abs/2411.11930",
653 | "venue": "arXiv",
654 | "date": "2024-11",
655 | "label": "liu2024atomthink"
656 | },
657 | {
658 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step",
659 | "link": "https://arxiv.org/abs/2411.10440",
660 | "venue": "arXiv",
661 | "date": "2024-11",
662 | "label": "xu2024llava"
663 | }
664 | ],
665 | "Part 7: Efficient System2": [
666 | {
667 | "paper": "Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking",
668 | "link": "https://arxiv.org/abs/2501.01306",
669 | "venue": "arXiv",
670 | "date": "2025-01",
671 | "label": "cheng2025think"
672 | },
673 | {
674 | "paper": "Token-Budget-Aware LLM Reasoning",
675 | "link": "https://arxiv.org/abs/2412.18547",
676 | "venue": "arXiv",
677 | "date": "2024-12",
678 | "label": "wang2024token"
679 | },
680 | {
681 | "paper": "B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner",
682 | "link": "https://arxiv.org/abs/2412.17256",
683 | "venue": "arXiv",
684 | "date": "2024-12",
685 | "label": "he2024bstar"
686 | },
687 | {
688 | "paper": "Guiding Language Model Reasoning with Planning Tokens",
689 | "link": "https://arxiv.org/abs/2310.05707",
690 | "venue": "CoLM",
691 | "date": "2024-10",
692 | "label": "wang2023guiding"
693 | },
694 | {
695 | "paper": "DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models",
696 | "link": "https://arxiv.org/abs/2407.01009",
697 | "venue": "EMNLP",
698 | "date": "2024-12",
699 | "label": "li2024dynathink"
700 | },
701 | {
702 | "paper": "Training Large Language Models to Reason in a Continuous Latent Space",
703 | "link": "https://arxiv.org/abs/2412.06769",
704 | "venue": "arXiv",
705 | "date": "2024-12",
706 | "label": "chen2024training"
707 | },
708 | {
709 | "paper": "O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning",
710 | "link": "https://arxiv.org/abs/2501.12570",
711 | "venue": "arXiv",
712 | "date": "2025-01",
713 | "label": "zhang2025o1pruner"
714 | }
715 | ],
716 | "Part 8: Explainability": [
717 | {
718 | "paper": "What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective",
719 | "link": "https://arxiv.org/abs/2410.23743",
720 | "venue": "arXiv",
721 | "date": "2024-10",
722 | "label": "li2024whathappened"
723 | },
724 | {
725 | "paper": "When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1",
726 | "link": "https://arxiv.org/abs/2410.01792",
727 | "venue": "arXiv",
728 | "date": "2024-10",
729 | "label": "zhang2024embers"
730 | },
731 | {
732 | "paper": "Agents Thinking Fast and Slow: A Talker-Reasoner Architecture",
733 | "link": "https://openreview.net/forum?id=xPhcP6rbI4",
734 | "venue": "NeurIPS WorkShop",
735 | "date": "2024-12",
736 | "label": "wang2024agents"
737 | },
738 | {
739 | "paper": "System 2 Attention (is something you might need too)",
740 | "link": "https://arxiv.org/abs/2311.11829",
741 | "venue": "arXiv",
742 | "date": "2023-11",
743 | "label": "chen2023system2"
744 | },
745 | {
746 | "paper": "Distilling System 2 into System 1",
747 | "link": "https://arxiv.org/abs/2407.06023",
748 | "venue": "arXiv",
749 | "date": "2024-07",
750 | "label": "liu2024distilling"
751 | },
752 | {
753 | "paper": "The Impact of Reasoning Step Length on Large Language Models",
754 | "link": "https://arxiv.org/abs/2401.04925",
755 | "venue": "ACL Findings",
756 | "date": "2024-08",
757 | "label": "sun2024impact"
758 | }
759 | ],
760 | "Part 9: Multimodal Agent related Slow-Fast System": [
761 | {
762 | "paper": "AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning",
763 | "link": "https://arxiv.org/abs/2411.11930",
764 | "venue": "arXiv",
765 | "date": "2024-11",
766 | "label": "liu2024atomthink"
767 | },
768 | {
769 | "paper": "LLaVA-o1: Let Vision Language Models Reason Step-by-Step",
770 | "link": "https://arxiv.org/abs/2411.10440",
771 | "venue": "arXiv",
772 | "date": "2024-11",
773 | "label": "xu2024llava"
774 | },
775 | {
776 | "paper": "Visual Agents as Fast and Slow Thinkers",
777 | "link": "https://openreview.net/forum?id=ncCuiD3KJQ",
778 | "venue": "ICLR",
779 | "date": "2025-01",
780 | "label": "gao2025visualagents"
781 | },
782 | {
783 | "paper": "Slow Perception: Let's Perceive Geometric Figures Step-by-Step",
784 | "link": "https://arxiv.org/abs/2412.20631",
785 | "venue": "arXiv",
786 | "date": "2024-12",
787 | "label": "wei2024slow"
788 | },
789 | {
790 | "paper": "Virgo: A Preliminary Exploration on Reproducing o1-like MLLM",
791 | "link": "https://arxiv.org/abs/2501.01904",
792 | "venue": "arXiv",
793 | "date": "2025-01",
794 | "label": "du2025virgo"
795 | },
796 | {
797 | "paper": "Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension",
798 | "link": "https://arxiv.org/pdf/2412.03704",
799 | "venue": "arXiv",
800 | "date": "2024-12",
801 | "label": "feng2024scaling"
802 | },
803 | {
804 | "paper": "Vision-Language Models Can Self-Improve Reasoning via Reflection",
805 | "link": "https://arxiv.org/abs/2411.00855",
806 | "venue": "arXiv",
807 | "date": "2024-11",
808 | "label": "cheng2024vision"
809 | },
810 | {
811 | "paper": "Diving into Self-Evolving Training for Multimodal Reasoning",
812 | "link": "https://arxiv.org/abs/2412.17451",
813 | "venue": "ICLR",
814 | "date": "2025-01",
815 | "label": "zhao2024selfevolving"
816 | }
817 | ],
818 | "Part 10: Benchmark and Datasets": [
819 | {
820 | "paper": "A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?",
821 | "link": "https://arxiv.org/abs/2409.15277",
822 | "venue": "arXiv",
823 | "date": "2024-09",
824 | "label": "tu2024preliminary"
825 | },
826 | {
827 | "paper": "MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs",
828 | "link": "https://openreview.net/forum?id=GN2qbxZlni",
829 | "venue": "NeurIPS",
830 | "date": "2024-12",
831 | "label": "li2024mrben"
832 | },
833 | {
834 | "paper": "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models",
835 | "link": "https://arxiv.org/abs/2501.03124",
836 | "venue": "arXiv",
837 | "date": "2025-01",
838 | "label": "song2025prmbench"
839 | },
840 | {
841 | "paper": "Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs",
842 | "link": "https://arxiv.org/abs/2412.21187",
843 | "venue": "arXiv",
844 | "date": "2024-12",
845 | "label": "huang2024overthinking"
846 | }
847 | ]
848 | }
--------------------------------------------------------------------------------
/assets/timeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.jpg
--------------------------------------------------------------------------------
/assets/timeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline.png
--------------------------------------------------------------------------------
/assets/timeline_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/assets/timeline_2.png
--------------------------------------------------------------------------------
/src/list.md:
--------------------------------------------------------------------------------
1 | ## Part 1: O1 Replication
2 | * Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [[Paper]](https://arxiv.org/abs/2412.09413) 
3 | * o1-Coder: an o1 Replication for Coding [[Paper]](https://arxiv.org/abs/2412.00154) 
4 | * Enhancing LLM Reasoning with Reward-guided Tree Search [[Paper]](https://arxiv.org/abs/2411.11694) 
5 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) 
6 | * O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [[Paper]](https://arxiv.org/abs/2411.16489) 
7 | * O1 Replication Journey: A Strategic Progress Report -- Part 1 [[Paper]](https://arxiv.org/abs/2410.18982) 
8 | ## Part 2: Process Reward Models
9 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [[Paper]](https://arxiv.org/abs/2501.03124) 
10 | * ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [[Paper]](https://arxiv.org/abs/2501.07861) 
11 | * The Lessons of Developing Process Reward Models in Mathematical Reasoning. [[Paper]](https://arxiv.org/abs/2501.07301) 
12 | * ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. [[Paper]](https://arxiv.org/abs/2501.01290) 
13 | * AutoPSV: Automated Process-Supervised Verifier [[Paper]](https://openreview.net/forum?id=eOAPWWOGs9) 
14 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://openreview.net/forum?id=8rcFOqEud5) 
15 | * Free Process Rewards without Process Labels. [[Paper]](https://arxiv.org/abs/2412.01981) 
16 | * Outcome-Refining Process Supervision for Code Generation [[Paper]](https://arxiv.org/abs/2412.15118) 
17 | * Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[Paper]](https://aclanthology.org/2024.acl-long.510/) 
18 | * OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [[Paper]](https://aclanthology.org/2024.findings-naacl.55/) 
19 | * Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [[Paper]](https://arxiv.org/abs/2406.18629) 
20 | * Let's Verify Step by Step. [[Paper]](https://arxiv.org/abs/2305.20050) 
21 | * Improve Mathematical Reasoning in Language Models by Automated Process Supervision [[Paper]](https://arxiv.org/abs/2306.05372) 
22 | * Making Large Language Models Better Reasoners with Step-Aware Verifier [[Paper]](https://arxiv.org/abs/2206.02336) 
23 | * Solving Math Word Problems with Process and Outcome-Based Feedback [[Paper]](https://arxiv.org/abs/2211.14275) 
24 | ## Part 3: Reinforcement Learning
25 | * Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [[Paper]](https://arxiv.org/abs/2502.02508) 
26 | * Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [[Paper]](https://arxiv.org/abs/2501.11651) 
27 | * Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [[Paper]](https://arxiv.org/abs/2501.17030) 
28 | * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[Paper]](https://arxiv.org/abs/2501.12948) 
29 | * Kimi k1.5: Scaling Reinforcement Learning with LLMs [[Paper]](https://arxiv.org/abs/2501.12599) 
30 | * Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [[Paper]](https://arxiv.org/abs/2412.06000) 
31 | * Offline Reinforcement Learning for LLM Multi-Step Reasoning [[Paper]](https://arxiv.org/abs/2412.16145) 
32 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) 
33 | * Deepseekmath: Pushing the limits of mathematical reasoning in open language models [[Paper]](https://arxiv.org/abs/2402.03300) 
34 | ## Part 4: MCTS/Tree Search
35 | * On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [[Paper]](https://ieeexplore.ieee.org/abstract/document/10870057/) 
36 | * Search-o1: Agentic Search-Enhanced Large Reasoning Models [[Paper]](https://arxiv.org/abs/2501.05366) 
37 | * rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [[Paper]](https://arxiv.org/abs/2501.04519) 
38 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) 
39 | * Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.09078) 
40 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) 
41 | * Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2412.18319) 
42 | * Proposing and solving olympiad geometry with guided tree search [[Paper]](https://arxiv.org/abs/2412.10673) 
43 | * SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [[Paper]](https://arxiv.org/abs/2412.11605) 
44 | * Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2412.17397) 
45 | * CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [[Paper]](https://arxiv.org/abs/2411.04329) 
46 | * GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [[Paper]](https://arxiv.org/abs/2411.04459) 
47 | * MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [[Paper]](https://arxiv.org/abs/2411.15645) 
48 | * Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [[Paper]](https://arxiv.org/abs/2411.14405) 
49 | * SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2411.11053) 
50 | * Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [[Paper]](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion) 
51 | * AFlow: Automating Agentic Workflow Generation [[Paper]](https://arxiv.org/abs/2410.10762) 
52 | * Interpretable Contrastive Monte Carlo Tree Search Reasoning [[Paper]](https://arxiv.org/abs/2410.01707) 
53 | * LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2410.02884) 
54 | * Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [[Paper]](https://arxiv.org/abs/2410.06508) 
55 | * TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [[Paper]](https://arxiv.org/abs/2410.16033) 
56 | * Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [[Paper]](https://arxiv.org/abs/2410.17820) 
57 | * RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [[Paper]](https://arxiv.org/abs/2409.09584) 
58 | * Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [[Paper]](https://arxiv.org/abs/2408.10635) 
59 | * LiteSearch: Efficacious Tree Search for LLM [[Paper]](https://arxiv.org/abs/2407.00320) 
60 | * Tree Search for Language Model Agents [[Paper]](https://arxiv.org/abs/2407.01476) 
61 | * Uncertainty-Guided Optimization on Large Language Model Search Trees [[Paper]](https://arxiv.org/abs/2407.03951) 
62 | * Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [[Paper]](https://arxiv.org/abs/2406.07394) 
63 | * Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [[Paper]](https://openreview.net/forum?id=rviGTsl0oy) 
64 | * LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [[Paper]](https://openreview.net/forum?id=h1mvwbQiXR) 
65 | * AlphaMath Almost Zero: process Supervision without process [[Paper]](https://arxiv.org/abs/2405.03553) 
66 | * Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [[Paper]](https://arxiv.org/abs/2405.15383) 
67 | * MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [[Paper]](https://arxiv.org/abs/2405.16265) 
68 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) 
69 | * Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [[Paper]](https://arxiv.org/abs/2405.00451) 
70 | * Stream of Search (SoS): Learning to Search in Language [[Paper]](https://arxiv.org/abs/2404.03683) 
71 | * Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [[Paper]](https://arxiv.org/abs/2404.12253) 
72 | * Reasoning with Language Model is Planning with World Model [[Paper]](https://aclanthology.org/2023.emnlp-main.507/) 
73 | * Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html) 
74 | * ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) 
75 | * Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [[Paper]](https://openreview.net/forum?id=PJfc4x2jXY) 
76 | * MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING [[Paper]](https://arxiv.org/abs/2309.15028) 
77 | ## Part 5: Self-Training / Self-Improve
78 | * Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [[Paper]](https://arxiv.org/abs/2501.04519) 
79 | * ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [[Paper]](https://arxiv.org/abs/2406.03816) 
80 | * Recursive Introspection: Teaching Language Model Agents How to Self-Improve [[Paper]](https://openreview.net/forum?id=DRC9pZwBwR) 
81 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) 
82 | * ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [[Paper]](https://openreview.net/forum?id=lNAyUngGFK) 
83 | * ReFT: Representation Finetuning for Language Models [[Paper]](https://aclanthology.org/2024.acl-long.410.pdf) 
84 | * Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2406.11736) 
85 | * CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [[Paper]](https://openreview.net/forum?id=Sx038qxjek) 
86 | * Enhancing Large Vision Language Models with Self-Training on Image Comprehension [[Paper]](https://arxiv.org/abs/2405.19716) 
87 | * Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[Paper]](https://arxiv.org/abs/2403.09629) 
88 | * V-star: Training Verifiers for Self-Taught Reasoners [[Paper]](https://arxiv.org/abs/2402.06457) 
89 | * Self-Refine: Iterative Refinement with Self-Feedback [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html) 
90 | * ReST: Reinforced Self-Training for Language Modeling [[Paper]](https://arxiv.org/abs/2308.08998) 
91 | * STaR: Bootstrapping Reasoning With Reasoning [[Paper]](https://arxiv.org/abs/2203.14465) 
92 | * Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [[Paper]](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html) 
93 | ## Part 6: Reflection
94 | * HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [[Paper]](https://arxiv.org/abs/2412.18925) 
95 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) 
96 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) 
97 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) 
98 | * Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [[Paper]](https://arxiv.org/abs/2408.06195) 
99 | * Reflection-Tuning: An Approach for Data Recycling [[Paper]](https://arxiv.org/abs/2310.11716) 
100 | ## Part 7: Efficient System2
101 | * O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [[Paper]](https://arxiv.org/abs/2501.12570) 
102 | * Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [[Paper]](https://arxiv.org/abs/2501.01306) 
103 | * DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [[Paper]](https://arxiv.org/abs/2407.01009) 
104 | * B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [[Paper]](https://arxiv.org/abs/2412.17256) 
105 | * Token-Budget-Aware LLM Reasoning [[Paper]](https://arxiv.org/abs/2412.18547) 
106 | * Training Large Language Models to Reason in a Continuous Latent Space [[Paper]](https://arxiv.org/abs/2412.06769) 
107 | * Guiding Language Model Reasoning with Planning Tokens [[Paper]](https://arxiv.org/abs/2310.05707) 
108 | ## Part 8: Explainability
109 | * Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [[Paper]](https://openreview.net/forum?id=xPhcP6rbI4) 
110 | * What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [[Paper]](https://arxiv.org/abs/2410.23743) 
111 | * When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [[Paper]](https://arxiv.org/abs/2410.01792) 
112 | * The Impact of Reasoning Step Length on Large Language Models [[Paper]](https://arxiv.org/abs/2401.04925) 
113 | * Distilling System 2 into System 1 [[Paper]](https://arxiv.org/abs/2407.06023) 
114 | * System 2 Attention (is something you might need too) [[Paper]](https://arxiv.org/abs/2311.11829) 
115 | ## Part 9: Multimodal Agent related Slow-Fast System
116 | * Diving into Self-Evolving Training for Multimodal Reasoning [[Paper]](https://arxiv.org/abs/2412.17451) 
117 | * Visual Agents as Fast and Slow Thinkers [[Paper]](https://openreview.net/forum?id=ncCuiD3KJQ) 
118 | * Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[Paper]](https://arxiv.org/abs/2501.01904) 
119 | * Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [[Paper]](https://arxiv.org/pdf/2412.03704) 
120 | * Slow Perception: Let's Perceive Geometric Figures Step-by-Step [[Paper]](https://arxiv.org/abs/2412.20631) 
121 | * AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [[Paper]](https://arxiv.org/abs/2411.11930) 
122 | * LLaVA-o1: Let Vision Language Models Reason Step-by-Step [[Paper]](https://arxiv.org/abs/2411.10440) 
123 | * Vision-Language Models Can Self-Improve Reasoning via Reflection [[Paper]](https://arxiv.org/abs/2411.00855) 
124 | ## Part 10: Benchmark and Datasets
125 | * PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [[Paper]](https://arxiv.org/abs/2501.03124) 
126 | * MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [[Paper]](https://openreview.net/forum?id=GN2qbxZlni) 
127 | * Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [[Paper]](https://arxiv.org/abs/2412.21187) 
128 | * A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [[Paper]](https://arxiv.org/abs/2409.15277) 
129 |
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from typing import Optional
4 |
5 |
6 | class PaperInformation:
7 | def __init__(
8 | self, paper: str, link: str, venue: str, date: str, label: Optional[str] = None
9 | ):
10 | self.paper = paper
11 | self.link = link
12 | self.venue = venue
13 | self.date = date
14 | self.label = label
15 |
16 | def __hash__(self):
17 | return hash(self.label)  # __hash__ must return an int; hash the same label used by __eq__
18 |
19 | def __lt__(self, other: "PaperInformation"):
20 | self_year, self_month = map(int, self.date.split("-"))
21 | other_year, other_month = map(int, other.date.split("-"))
22 |
23 | if self_year != other_year:
24 | return self_year > other_year # Reverse logic to sort DESCENDING
25 | if self_month != other_month:
26 | return self_month > other_month # Reverse logic to sort DESCENDING
27 | if self.venue != other.venue:
28 | return self.venue < other.venue # Break ties by venue in ascending order
29 | return self.paper < other.paper # Break ties by title in ascending order
30 |
31 | def __eq__(self, other: "PaperInformation"):
32 | return self.label == other.label
33 |
34 |
35 | class Utility:
36 | @staticmethod
37 | def get_paper_information(raw_paper_dict: dict) -> list[PaperInformation]:
38 | paper_information_list = []
39 | for key, value in raw_paper_dict.items():
40 | if isinstance(value, dict):
41 | paper_information_list.extend(Utility.get_paper_information(value))
42 | elif isinstance(value, list):
43 | for raw_paper_information in value:
44 | if len(raw_paper_information.keys()) == 1:
45 | assert "label" in raw_paper_information.keys()
46 | else:
47 | paper_label = raw_paper_information.get("label", None)
48 | paper_information = PaperInformation(
49 | paper=raw_paper_information["paper"],
50 | link=raw_paper_information["link"],
51 | venue=raw_paper_information["venue"],
52 | date=raw_paper_information["date"],
53 | label=paper_label,
54 | )
55 | assert (
56 | paper_label is not None
57 | or paper_information not in paper_information_list
58 | )
59 | paper_information_list.append(paper_information)
60 | else:
61 | raise TypeError(f"Unexpected type: {type(value)}")
62 | return paper_information_list
63 |
64 | @staticmethod
65 | def fill_paper_dict(
66 | raw_paper_dict: dict, paper_information_list: list[PaperInformation]
67 | ) -> dict:
68 | processed_paper_dict = {}
69 | for key, value in raw_paper_dict.items():
70 | if isinstance(value, dict):
71 | processed_paper_dict[key] = Utility.fill_paper_dict(
72 | value, paper_information_list
73 | )
74 | elif isinstance(value, list):
75 | processed_paper_dict[key] = []
76 | for raw_paper_information in value:
77 | if (
78 | len(raw_paper_information.keys()) == 1
79 | or "label" in raw_paper_information.keys()
80 | ):
81 | paper_label = raw_paper_information["label"]
82 | for paper_information in paper_information_list:
83 | if paper_information.label == paper_label:
84 | break
85 | else:
86 | raise ValueError(f"Paper label not found: {paper_label}")
87 | processed_paper_dict[key].append(paper_information)
88 | else:
89 | processed_paper_dict[key].append(
90 | PaperInformation(
91 | paper=raw_paper_information["paper"],
92 | link=raw_paper_information["link"],
93 | venue=raw_paper_information["venue"],
94 | date=raw_paper_information["date"],
95 | )
96 | )
97 | else:
98 | raise TypeError(f"Unexpected type: {type(value)}")
99 | return processed_paper_dict
100 |
101 | @staticmethod
102 | def generate_title_with_level(title: str, title_level: int) -> str:
103 | return f"{'#' * (title_level + 2)} {title}\n"
104 |
105 | @staticmethod
106 | def generate_readme_table_with_title(
107 | title: str, title_level: int, paper_information_list: list[PaperInformation]
108 | ) -> str:
109 | result_str = Utility.generate_title_with_level(title, title_level)
110 | result_str += "|Title|Venue|Date|\n"
111 | result_str += "|:---|:---|:---|\n"
112 | paper_information_list.sort()
113 | for paper_information in paper_information_list:
114 | result_str += (
115 | f"|[{paper_information.paper}]({paper_information.link})|"
116 | f"{paper_information.venue}|"
117 | f"{paper_information.date}|\n"
118 | )
119 | return result_str
120 |
121 | @staticmethod
122 | def generate_all_table(
123 | paper_dict: dict, topmost_table_level: int, current_table_str: str
124 | ) -> str:
125 | for key, value in paper_dict.items():
126 | if isinstance(value, dict):
127 | current_table_str += Utility.generate_title_with_level(
128 | key, topmost_table_level
129 | )
130 | current_table_str = Utility.generate_all_table(
131 | value, topmost_table_level + 1, current_table_str
132 | )
133 | elif isinstance(value, list):
134 | current_table_str += Utility.generate_readme_table_with_title(
135 | key, topmost_table_level, value
136 | )
137 | else:
138 | raise TypeError(f"Unexpected type: {type(value)}")
139 | return current_table_str
140 |
141 | @staticmethod
142 | def generate_list_with_title(
143 | title: str, title_level: int, paper_information_list: list[PaperInformation]
144 | ) -> str:
145 | result_str = Utility.generate_title_with_level(title, title_level)
146 | for paper_information in paper_information_list:
147 | badge_color = "blue"
148 | if "arxiv" in paper_information.link.lower():
149 | badge_color = "red"
150 | badge_text = f"arXiv-{paper_information.date.replace('-', '.')}"
151 | else:
152 | venue = paper_information.venue.replace(" ", "_")
153 | year = paper_information.date.split("-")[0]
154 | badge_text = f"{venue}-{year}"
155 | result_str += (
156 | f"* {paper_information.paper} [[Paper]]({paper_information.link}) "
157 | f"\n"
158 | )
159 | return result_str
160 |
161 | @staticmethod
162 | def generate_all_list(
163 | paper_dict: dict, topmost_list_level: int, current_list_str: str
164 | ) -> str:
165 | for key, value in paper_dict.items():
166 | if isinstance(value, dict):
167 | current_list_str += Utility.generate_title_with_level(
168 | key, topmost_list_level
169 | )
170 | current_list_str = Utility.generate_all_list(
171 | value, topmost_list_level + 1, current_list_str
172 | )
173 | elif isinstance(value, list):
174 | current_list_str += Utility.generate_list_with_title(
175 | key, topmost_list_level, value
176 | )
177 | else:
178 | raise TypeError(f"Unexpected type: {type(value)}")
179 | return current_list_str
180 |
181 |
182 | def main():
183 | raw_paper_dict = json.load(open("./assets/paper.json", "r"))
184 | paper_information_list = Utility.get_paper_information(raw_paper_dict)
185 | processed_paper_dict = Utility.fill_paper_dict(
186 | raw_paper_dict, paper_information_list
187 | )
188 | all_table_str = Utility.generate_all_table(processed_paper_dict, 0, "")
189 | with open("./src/table.md", "w") as f:
190 | f.write(all_table_str)
191 | all_list_str = Utility.generate_all_list(processed_paper_dict, 0, "")
192 | with open("./src/list.md", "w") as f:
193 | f.write(all_list_str)
194 |
195 |
196 | if __name__ == "__main__":
197 | main()
198 |
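199 | # Note: running `python src/main.py` from the repository root regenerates
200 | # src/table.md and src/list.md from assets/paper.json (paths are relative to the CWD).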
--------------------------------------------------------------------------------
/src/table.md:
--------------------------------------------------------------------------------
1 | ## Part 1: O1 Replication
2 | |Title|Venue|Date|
3 | |:---|:---|:---|
4 | |[Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems](https://arxiv.org/abs/2412.09413)|arXiv|2024-12|
5 | |[o1-Coder: an o1 Replication for Coding](https://arxiv.org/abs/2412.00154)|arXiv|2024-12|
6 | |[Enhancing LLM Reasoning with Reward-guided Tree Search](https://arxiv.org/abs/2411.11694)|arXiv|2024-11|
7 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11|
8 | |[O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?](https://arxiv.org/abs/2411.16489)|arXiv|2024-11|
9 | |[O1 Replication Journey: A Strategic Progress Report -- Part 1](https://arxiv.org/abs/2410.18982)|arXiv|2024-10|
10 | ## Part 2: Process Reward Models
11 | |Title|Venue|Date|
12 | |:---|:---|:---|
13 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models.](https://arxiv.org/abs/2501.03124)|arXiv|2025-01|
14 | |[ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding](https://arxiv.org/abs/2501.07861)|arXiv|2025-01|
15 | |[The Lessons of Developing Process Reward Models in Mathematical Reasoning.](https://arxiv.org/abs/2501.07301)|arXiv|2025-01|
16 | |[ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark.](https://arxiv.org/abs/2501.01290)|arXiv|2025-01|
17 | |[AutoPSV: Automated Process-Supervised Verifier](https://openreview.net/forum?id=eOAPWWOGs9)|NeurIPS|2024-12|
18 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://openreview.net/forum?id=8rcFOqEud5)|NeurIPS|2024-12|
19 | |[Free Process Rewards without Process Labels.](https://arxiv.org/abs/2412.01981)|arXiv|2024-12|
20 | |[Outcome-Refining Process Supervision for Code Generation](https://arxiv.org/abs/2412.15118)|arXiv|2024-12|
21 | |[Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations](https://aclanthology.org/2024.acl-long.510/)|ACL|2024-08|
22 | |[OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning](https://aclanthology.org/2024.findings-naacl.55/)|ACL Findings|2024-08|
23 | |[Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs](https://arxiv.org/abs/2406.18629)|arXiv|2024-06|
24 | |[Let's Verify Step by Step](https://arxiv.org/abs/2305.20050)|ICLR|2024-05|
25 | |[Improve Mathematical Reasoning in Language Models by Automated Process Supervision](https://arxiv.org/abs/2306.05372)|arXiv|2023-06|
26 | |[Making Large Language Models Better Reasoners with Step-Aware Verifier](https://arxiv.org/abs/2206.02336)|arXiv|2023-06|
27 | |[Solving Math Word Problems with Process and Outcome-Based Feedback](https://arxiv.org/abs/2211.14275)|arXiv|2022-11|
28 | ## Part 3: Reinforcement Learning
29 | |Title|Venue|Date|
30 | |:---|:---|:---|
31 | |[Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://arxiv.org/abs/2502.02508)|arXiv|2025-02|
32 | |[Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://arxiv.org/abs/2501.11651)|arXiv|2025-01|
33 | |[Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies](https://arxiv.org/abs/2501.17030)|arXiv|2025-01|
34 | |[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)|arXiv|2025-01|
35 | |[Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)|arXiv|2025-01|
36 | |[Does RLHF Scale? Exploring the Impacts From Data, Model, and Method](https://arxiv.org/abs/2412.06000)|arXiv|2024-12|
37 | |[Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)|arXiv|2024-12|
38 | |[ReFT: Reasoning with Reinforced Fine-Tuning](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08|
39 | |[DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)|arXiv|2024-02|
40 | ## Part 4: MCTS/Tree Search
41 | |Title|Venue|Date|
42 | |:---|:---|:---|
43 | |[On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes](https://ieeexplore.ieee.org/abstract/document/10870057/)|IEEE TAC|2025-01|
44 | |[Search-o1: Agentic Search-Enhanced Large Reasoning Models](https://arxiv.org/abs/2501.05366)|arXiv|2025-01|
45 | |[rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)|arXiv|2025-01|
46 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12|
47 | |[Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning](https://arxiv.org/abs/2412.09078)|arXiv|2024-12|
48 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12|
49 | |[Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search](https://arxiv.org/abs/2412.18319)|arXiv|2024-12|
50 | |[Proposing and solving olympiad geometry with guided tree search](https://arxiv.org/abs/2412.10673)|arXiv|2024-12|
51 | |[SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models](https://arxiv.org/abs/2412.11605)|arXiv|2024-12|
52 | |[Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2412.17397)|arXiv|2024-12|
53 | |[CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models](https://arxiv.org/abs/2411.04329)|arXiv|2024-11|
54 | |[GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection](https://arxiv.org/abs/2411.04459)|arXiv|2024-11|
55 | |[MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree](https://arxiv.org/abs/2411.15645)|arXiv|2024-11|
56 | |[Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions](https://arxiv.org/abs/2411.14405)|arXiv|2024-11|
57 | |[SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2411.11053)|arXiv|2024-11|
58 | |[Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding](https://openreview.net/forum?id=kh9Zt2Ldmn#discussion)|CoLM|2024-10|
59 | |[AFlow: Automating Agentic Workflow Generation](https://arxiv.org/abs/2410.10762)|arXiv|2024-10|
60 | |[Interpretable Contrastive Monte Carlo Tree Search Reasoning](https://arxiv.org/abs/2410.01707)|arXiv|2024-10|
61 | |[LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning](https://arxiv.org/abs/2410.02884)|arXiv|2024-10|
62 | |[Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning](https://arxiv.org/abs/2410.06508)|arXiv|2024-10|
63 | |[TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling](https://arxiv.org/abs/2410.16033)|arXiv|2024-10|
64 | |[Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination](https://arxiv.org/abs/2410.17820)|arXiv|2024-10|
65 | |[RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation](https://arxiv.org/abs/2409.09584)|arXiv|2024-09|
66 | |[Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search](https://arxiv.org/abs/2408.10635)|arXiv|2024-08|
67 | |[LiteSearch: Efficacious Tree Search for LLM](https://arxiv.org/abs/2407.00320)|arXiv|2024-07|
68 | |[Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476)|arXiv|2024-07|
69 | |[Uncertainty-Guided Optimization on Large Language Model Search Trees](https://arxiv.org/abs/2407.03951)|arXiv|2024-07|
70 | |[Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B](https://arxiv.org/abs/2406.07394)|arXiv|2024-06|
71 | |[Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping](https://openreview.net/forum?id=rviGTsl0oy)|ICLR Workshop|2024-05|
72 | |[LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models](https://openreview.net/forum?id=h1mvwbQiXR)|ICLR Workshop|2024-05|
73 | |[AlphaMath Almost Zero: Process Supervision without Process](https://arxiv.org/abs/2405.03553)|arXiv|2024-05|
74 | |[Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search](https://arxiv.org/abs/2405.15383)|arXiv|2024-05|
75 | |[MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time](https://arxiv.org/abs/2405.16265)|arXiv|2024-05|
76 | |[Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning](https://arxiv.org/abs/2405.00451)|arXiv|2024-05|
78 | |[Stream of Search (SoS): Learning to Search in Language](https://arxiv.org/abs/2404.03683)|arXiv|2024-04|
79 | |[Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing](https://arxiv.org/abs/2404.12253)|arXiv|2024-04|
80 | |[Reasoning with Language Model is Planning with World Model](https://aclanthology.org/2023.emnlp-main.507/)|EMNLP|2023-12|
81 | |[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html)|NeurIPS|2023-12|
83 | |[Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training](https://openreview.net/forum?id=PJfc4x2jXY)|NeurIPS Workshop|2023-12|
84 | |[Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding](https://arxiv.org/abs/2309.15028)|arXiv|2023-09|
85 | ## Part 5: Self-Training / Self-Improve
86 | |Title|Venue|Date|
87 | |:---|:---|:---|
88 | |[rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://arxiv.org/abs/2501.04519)|arXiv|2025-01|
89 | |[ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https://arxiv.org/abs/2406.03816)|NeurIPS|2024-12|
90 | |[Recursive Introspection: Teaching Language Model Agents How to Self-Improve](https://openreview.net/forum?id=DRC9pZwBwR)|NeurIPS|2024-12|
91 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12|
92 | |[ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://openreview.net/forum?id=lNAyUngGFK)|TMLR|2024-09|
93 | |[ReFT: Reasoning with Reinforced Fine-Tuning](https://aclanthology.org/2024.acl-long.410.pdf)|ACL|2024-08|
94 | |[Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models](https://arxiv.org/abs/2406.11736)|arXiv|2024-06|
95 | |[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://openreview.net/forum?id=Sx038qxjek)|ICLR|2024-05|
96 | |[Enhancing Large Vision Language Models with Self-Training on Image Comprehension](https://arxiv.org/abs/2405.19716)|arXiv|2024-05|
97 | |[Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](https://arxiv.org/abs/2403.09629)|arXiv|2024-03|
98 | |[V-STaR: Training Verifiers for Self-Taught Reasoners](https://arxiv.org/abs/2402.06457)|arXiv|2024-02|
99 | |[Self-Refine: Iterative Refinement with Self-Feedback](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)|NeurIPS|2023-12|
100 | |[ReST: Reinforced Self-Training for Language Modeling](https://arxiv.org/abs/2308.08998)|arXiv|2023-08|
101 | |[STaR: Bootstrapping Reasoning With Reasoning](https://arxiv.org/abs/2203.14465)|NeurIPS|2022-05|
102 | |[Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search](https://proceedings.neurips.cc/paper/2017/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html)|NeurIPS|2017-12|
103 | ## Part 6: Reflection
104 | |Title|Venue|Date|
105 | |:---|:---|:---|
106 | |[HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://arxiv.org/abs/2412.18925)|arXiv|2024-12|
107 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11|
108 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11|
109 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11|
110 | |[Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers](https://arxiv.org/abs/2408.06195)|arXiv|2024-08|
111 | |[Reflection-Tuning: An Approach for Data Recycling](https://arxiv.org/abs/2310.11716)|arXiv|2023-10|
112 | ## Part 7: Efficient System2
113 | |Title|Venue|Date|
114 | |:---|:---|:---|
115 | |[O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning](https://arxiv.org/abs/2501.12570)|arXiv|2025-01|
116 | |[Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking](https://arxiv.org/abs/2501.01306)|arXiv|2025-01|
117 | |[DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models](https://arxiv.org/abs/2407.01009)|EMNLP|2024-12|
118 | |[B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner](https://arxiv.org/abs/2412.17256)|arXiv|2024-12|
119 | |[Token-Budget-Aware LLM Reasoning](https://arxiv.org/abs/2412.18547)|arXiv|2024-12|
120 | |[Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/abs/2412.06769)|arXiv|2024-12|
121 | |[Guiding Language Model Reasoning with Planning Tokens](https://arxiv.org/abs/2310.05707)|CoLM|2024-10|
122 | ## Part 8: Explainability
123 | |Title|Venue|Date|
124 | |:---|:---|:---|
125 | |[Agents Thinking Fast and Slow: A Talker-Reasoner Architecture](https://openreview.net/forum?id=xPhcP6rbI4)|NeurIPS Workshop|2024-12|
126 | |[What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective](https://arxiv.org/abs/2410.23743)|arXiv|2024-10|
127 | |[When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1](https://arxiv.org/abs/2410.01792)|arXiv|2024-10|
128 | |[The Impact of Reasoning Step Length on Large Language Models](https://arxiv.org/abs/2401.04925)|ACL Findings|2024-08|
129 | |[Distilling System 2 into System 1](https://arxiv.org/abs/2407.06023)|arXiv|2024-07|
130 | |[System 2 Attention (is something you might need too)](https://arxiv.org/abs/2311.11829)|arXiv|2023-11|
131 | ## Part 9: Multimodal Agent related Slow-Fast System
132 | |Title|Venue|Date|
133 | |:---|:---|:---|
134 | |[Diving into Self-Evolving Training for Multimodal Reasoning](https://arxiv.org/abs/2412.17451)|ICLR|2025-01|
135 | |[Visual Agents as Fast and Slow Thinkers](https://openreview.net/forum?id=ncCuiD3KJQ)|ICLR|2025-01|
136 | |[Virgo: A Preliminary Exploration on Reproducing o1-like MLLM](https://arxiv.org/abs/2501.01904)|arXiv|2025-01|
137 | |[Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension](https://arxiv.org/pdf/2412.03704)|arXiv|2024-12|
138 | |[Slow Perception: Let's Perceive Geometric Figures Step-by-Step](https://arxiv.org/abs/2412.20631)|arXiv|2024-12|
139 | |[AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning](https://arxiv.org/abs/2411.11930)|arXiv|2024-11|
140 | |[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://arxiv.org/abs/2411.10440)|arXiv|2024-11|
141 | |[Vision-Language Models Can Self-Improve Reasoning via Reflection](https://arxiv.org/abs/2411.00855)|arXiv|2024-11|
142 | ## Part 10: Benchmark and Datasets
143 | |Title|Venue|Date|
144 | |:---|:---|:---|
145 | |[PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models](https://arxiv.org/abs/2501.03124)|arXiv|2025-01|
146 | |[MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs](https://openreview.net/forum?id=GN2qbxZlni)|NeurIPS|2024-12|
147 | |[Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs](https://arxiv.org/abs/2412.21187)|arXiv|2024-12|
148 | |[A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?](https://arxiv.org/abs/2409.15277)|arXiv|2024-09|
149 |
--------------------------------------------------------------------------------
/src/timeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzli2022/Awesome-System2-Reasoning-LLM/a2e912deb854a32a5dfca4fc1e08ea355ceae59c/src/timeline.png
--------------------------------------------------------------------------------